Skip to content

Update matrix-stats aggregation. #9434

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 7 commits into from
May 22, 2025

Conversation

dwelsch-esi
Copy link
Contributor

Description

Updated descriptions. Added explanation of result statistics.

This PR will probably require some major revisions before it's published. I have several questions. See line comments.

Issues Resolved

Version

Frontend features

Checklist

  • By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license and subject to the Developers Certificate of Origin.
    For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Copy link

Thank you for submitting your PR. The PR states are In progress (or Draft) -> Tech review -> Doc review -> Editorial review -> Merged.

Before you submit your PR for doc review, make sure the content is technically accurate. If you need help finding a tech reviewer, tag a maintainer.

When you're ready for doc review, tag the assignee of this PR. The doc reviewer may push edits to the PR directly or leave comments and editorial suggestions for you to address (let us know in a comment if you have a preference). The doc reviewer will arrange for an editorial review.

| Parameter | Required/Optional | Data type | Description |
| :-- | :-- | :-- | :-- |
| `fields` | Required | String | An array of fields for which the matrix stats are computed. |
| `mode` | Optional | String | Which value to use as a sample from a multi-valued or array field. Allowed values are `avg`, `min`, `max`, `sum`, and `median`. |
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mode did not seem to affect the result when I tested it with array data. Maybe it doesn't work when one field is a single value and another is an array? But that doesn't make much sense.

What is the default mode?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jainankitk Could you please answer this quiestion?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dwelsch-esi - Can you share the query that you ran?

Copy link
Contributor Author

@dwelsch-esi dwelsch-esi Apr 11, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jainankitk Try the following:

DELETE /students

POST _bulk
{ "create": { "_index": "students", "_id": "1" } }
{ "name": "John Doe", "gpa": 3.89, "class_grades": [3.0, 3.9, 4.0], "grad_year": 2022}
{ "create": { "_index": "students", "_id": "2" } }
{ "name": "Jonathan Powers", "gpa": 3.85, "class_grades": [4.1, 3.0, 4.0], "grad_year": 2025 }
{ "create": { "_index": "students", "_id": "3" } }
{ "name": "Jane Doe", "gpa": 3.52, "class_grades": [3.2, 2.1, 3.8], "grad_year": 2024 }

GET students/_search
{
  "size": 0,
  "aggs": {
    "matrix_stats_taxful_total_price": {
      "matrix_stats": {
        "fields": ["gpa", "class_grades"],
        "mode": "avg"
      }
    }
  }
}

I tried different modes (min, max) instead of avg. In all cases, the correlation between gpa and class_grades is the same, 0.9820867239098596. Surely it should be different if the value used for class_grades is computed differently for different mode values?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @dwelsch-esi for close observation. I noticed that the first response is cached resulting in this discrepancy. You can try using the GET students/_search?request_cache=false parameter to avoid this issue. Or alternatively, POST /_cache/clear before each search request

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we have PR for fixing this discrepancy - opensearch-project/OpenSearch#18254, in case anyone is curious

@@ -9,8 +9,23 @@ redirect_from:

# Matrix stats aggregations

The `matrix_stats` aggregation generates advanced stats for multiple fields in a matrix form.
The following example returns advanced stats in a matrix form for the `taxful_total_price` and `products.base_price` fields:
The `matrix_stats` metric is a multi-value metric that generates covariance statistics for two or more fields in matrix form.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. Does matrix-stats always sample every available document? If not, how is sampling done?

  2. What assumptions does matrix-stats make about independence of data samples? Is it sufficent to say that it assumes documents are independent of each other?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again will probably keep it as multi-value metric aggregation and maybe link to the metric aggregation page

Copy link
Contributor Author

@dwelsch-esi dwelsch-esi Mar 18, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jainankitk Please see my technical questions throughout this file. Can you provide any insights?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jainankitk Could you please answer this quiestion?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does matrix-stats always sample every available document? If not, how is sampling done?

I don't think there is any sampling, all the documents are processed

What assumptions does matrix-stats make about independence of data samples? Is it sufficent to say that it assumes documents are independent of each other?

Matrix-stats does not make any assumptions about the data. I am wondering why does it need to make any assumption?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jainankitk There are several assumptions that have to be met for a correlation to be valid: https://www.statology.org/pearson-correlation-assumptions/. However, that level of detail is probably beyond the scope of the documentation here. At some level, the user is assumed to know what they're doing if they use an aggregation.

@kolchfa-aws
Copy link
Collaborator

@jainankitk Could you please review this PR? Thanks!

@@ -9,8 +9,23 @@ redirect_from:

# Matrix stats aggregations

The `matrix_stats` aggregation generates advanced stats for multiple fields in a matrix form.
The following example returns advanced stats in a matrix form for the `taxful_total_price` and `products.base_price` fields:
The `matrix_stats` metric is a multi-value metric that generates covariance statistics for two or more fields in matrix form.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again will probably keep it as multi-value metric aggregation and maybe link to the metric aggregation page

Co-authored-by: kolchfa-aws <[email protected]>
Signed-off-by: Dave Welsch <[email protected]>
@kolchfa-aws
Copy link
Collaborator

@jainankitk Thanks for your feedback. I addressed your comments and added a section about the missing parameter.

Copy link
Collaborator

@natebower natebower left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kolchfa-aws One comment and a couple of changes. Thanks!

kolchfa-aws and others added 2 commits May 21, 2025 11:36
Co-authored-by: Nathan Bower <[email protected]>
Signed-off-by: kolchfa-aws <[email protected]>
@kolchfa-aws kolchfa-aws merged commit c8abc5e into opensearch-project:main May 22, 2025
8 checks passed
opensearch-trigger-bot bot pushed a commit that referenced this pull request May 22, 2025
* Update matrix-stats aggregation

Signed-off-by: Dave Welsch <[email protected]>

* Apply suggestions from code review

Co-authored-by: kolchfa-aws <[email protected]>
Signed-off-by: Dave Welsch <[email protected]>

* Doc review

Signed-off-by: Fanit Kolchina <[email protected]>

* Add missing parameter section

Signed-off-by: Fanit Kolchina <[email protected]>

* Apply suggestions from code review

Signed-off-by: kolchfa-aws <[email protected]>

* Update _aggregations/metric/matrix-stats.md

Co-authored-by: Nathan Bower <[email protected]>
Signed-off-by: kolchfa-aws <[email protected]>

---------

Signed-off-by: Dave Welsch <[email protected]>
Signed-off-by: Dave Welsch <[email protected]>
Signed-off-by: Fanit Kolchina <[email protected]>
Signed-off-by: kolchfa-aws <[email protected]>
Co-authored-by: kolchfa-aws <[email protected]>
Co-authored-by: Fanit Kolchina <[email protected]>
Co-authored-by: Nathan Bower <[email protected]>
(cherry picked from commit c8abc5e)
Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
github-actions bot pushed a commit that referenced this pull request May 22, 2025
KishoreKicha14 pushed a commit to KishoreKicha14/documentation-website that referenced this pull request Jun 13, 2025
* Update matrix-stats aggregation

Signed-off-by: Dave Welsch <[email protected]>

* Apply suggestions from code review

Co-authored-by: kolchfa-aws <[email protected]>
Signed-off-by: Dave Welsch <[email protected]>

* Doc review

Signed-off-by: Fanit Kolchina <[email protected]>

* Add missing parameter section

Signed-off-by: Fanit Kolchina <[email protected]>

* Apply suggestions from code review

Signed-off-by: kolchfa-aws <[email protected]>

* Update _aggregations/metric/matrix-stats.md

Co-authored-by: Nathan Bower <[email protected]>
Signed-off-by: kolchfa-aws <[email protected]>

---------

Signed-off-by: Dave Welsch <[email protected]>
Signed-off-by: Dave Welsch <[email protected]>
Signed-off-by: Fanit Kolchina <[email protected]>
Signed-off-by: kolchfa-aws <[email protected]>
Co-authored-by: kolchfa-aws <[email protected]>
Co-authored-by: Fanit Kolchina <[email protected]>
Co-authored-by: Nathan Bower <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants