-
Notifications
You must be signed in to change notification settings - Fork 575
Update matrix-stats aggregation. #9434
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Update matrix-stats aggregation. #9434
Conversation
Signed-off-by: Dave Welsch <[email protected]>
Thank you for submitting your PR. The PR states are In progress (or Draft) -> Tech review -> Doc review -> Editorial review -> Merged. Before you submit your PR for doc review, make sure the content is technically accurate. If you need help finding a tech reviewer, tag a maintainer. When you're ready for doc review, tag the assignee of this PR. The doc reviewer may push edits to the PR directly or leave comments and editorial suggestions for you to address (let us know in a comment if you have a preference). The doc reviewer will arrange for an editorial review. |
_aggregations/metric/matrix-stats.md
Outdated
| Parameter | Required/Optional | Data type | Description | | ||
| :-- | :-- | :-- | :-- | | ||
| `fields` | Required | String | An array of fields for which the matrix stats are computed. | | ||
| `mode` | Optional | String | Which value to use as a sample from a multi-valued or array field. Allowed values are `avg`, `min`, `max`, `sum`, and `median`. | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
mode
did not seem to affect the result when I tested it with array data. Maybe it doesn't work when one field is a single value and another is an array? But that doesn't make much sense.
What is the default mode?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jainankitk Could you please answer this quiestion?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@dwelsch-esi - Can you share the query that you ran?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jainankitk Try the following:
DELETE /students
POST _bulk
{ "create": { "_index": "students", "_id": "1" } }
{ "name": "John Doe", "gpa": 3.89, "class_grades": [3.0, 3.9, 4.0], "grad_year": 2022}
{ "create": { "_index": "students", "_id": "2" } }
{ "name": "Jonathan Powers", "gpa": 3.85, "class_grades": [4.1, 3.0, 4.0], "grad_year": 2025 }
{ "create": { "_index": "students", "_id": "3" } }
{ "name": "Jane Doe", "gpa": 3.52, "class_grades": [3.2, 2.1, 3.8], "grad_year": 2024 }
GET students/_search
{
"size": 0,
"aggs": {
"matrix_stats_taxful_total_price": {
"matrix_stats": {
"fields": ["gpa", "class_grades"],
"mode": "avg"
}
}
}
}
I tried different modes (min
, max
) instead of avg
. In all cases, the correlation between gpa
and class_grades
is the same, 0.9820867239098596. Surely it should be different if the value used for class_grades
is computed differently for different mode
values?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @dwelsch-esi for close observation. I noticed that the first response is cached resulting in this discrepancy. You can try using the GET students/_search?request_cache=false
parameter to avoid this issue. Or alternatively, POST /_cache/clear
before each search request
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we have PR for fixing this discrepancy - opensearch-project/OpenSearch#18254, in case anyone is curious
_aggregations/metric/matrix-stats.md
Outdated
@@ -9,8 +9,23 @@ redirect_from: | |||
|
|||
# Matrix stats aggregations | |||
|
|||
The `matrix_stats` aggregation generates advanced stats for multiple fields in a matrix form. | |||
The following example returns advanced stats in a matrix form for the `taxful_total_price` and `products.base_price` fields: | |||
The `matrix_stats` metric is a multi-value metric that generates covariance statistics for two or more fields in matrix form. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
-
Does
matrix-stats
always sample every available document? If not, how is sampling done? -
What assumptions does
matrix-stats
make about independence of data samples? Is it sufficent to say that it assumes documents are independent of each other?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Again will probably keep it as multi-value metric aggregation
and maybe link to the metric aggregation page
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jainankitk Please see my technical questions throughout this file. Can you provide any insights?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jainankitk Could you please answer this quiestion?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does matrix-stats always sample every available document? If not, how is sampling done?
I don't think there is any sampling, all the documents are processed
What assumptions does matrix-stats make about independence of data samples? Is it sufficent to say that it assumes documents are independent of each other?
Matrix-stats
does not make any assumptions about the data. I am wondering why does it need to make any assumption?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jainankitk There are several assumptions that have to be met for a correlation to be valid: https://www.statology.org/pearson-correlation-assumptions/. However, that level of detail is probably beyond the scope of the documentation here. At some level, the user is assumed to know what they're doing if they use an aggregation.
@jainankitk Could you please review this PR? Thanks! |
_aggregations/metric/matrix-stats.md
Outdated
@@ -9,8 +9,23 @@ redirect_from: | |||
|
|||
# Matrix stats aggregations | |||
|
|||
The `matrix_stats` aggregation generates advanced stats for multiple fields in a matrix form. | |||
The following example returns advanced stats in a matrix form for the `taxful_total_price` and `products.base_price` fields: | |||
The `matrix_stats` metric is a multi-value metric that generates covariance statistics for two or more fields in matrix form. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Again will probably keep it as multi-value metric aggregation
and maybe link to the metric aggregation page
Co-authored-by: kolchfa-aws <[email protected]> Signed-off-by: Dave Welsch <[email protected]>
Signed-off-by: Fanit Kolchina <[email protected]>
Signed-off-by: Fanit Kolchina <[email protected]>
@jainankitk Thanks for your feedback. I addressed your comments and added a section about the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@kolchfa-aws One comment and a couple of changes. Thanks!
Signed-off-by: kolchfa-aws <[email protected]>
Co-authored-by: Nathan Bower <[email protected]> Signed-off-by: kolchfa-aws <[email protected]>
* Update matrix-stats aggregation Signed-off-by: Dave Welsch <[email protected]> * Apply suggestions from code review Co-authored-by: kolchfa-aws <[email protected]> Signed-off-by: Dave Welsch <[email protected]> * Doc review Signed-off-by: Fanit Kolchina <[email protected]> * Add missing parameter section Signed-off-by: Fanit Kolchina <[email protected]> * Apply suggestions from code review Signed-off-by: kolchfa-aws <[email protected]> * Update _aggregations/metric/matrix-stats.md Co-authored-by: Nathan Bower <[email protected]> Signed-off-by: kolchfa-aws <[email protected]> --------- Signed-off-by: Dave Welsch <[email protected]> Signed-off-by: Dave Welsch <[email protected]> Signed-off-by: Fanit Kolchina <[email protected]> Signed-off-by: kolchfa-aws <[email protected]> Co-authored-by: kolchfa-aws <[email protected]> Co-authored-by: Fanit Kolchina <[email protected]> Co-authored-by: Nathan Bower <[email protected]> (cherry picked from commit c8abc5e) Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
* Update matrix-stats aggregation Signed-off-by: Dave Welsch <[email protected]> * Apply suggestions from code review Co-authored-by: kolchfa-aws <[email protected]> Signed-off-by: Dave Welsch <[email protected]> * Doc review Signed-off-by: Fanit Kolchina <[email protected]> * Add missing parameter section Signed-off-by: Fanit Kolchina <[email protected]> * Apply suggestions from code review Signed-off-by: kolchfa-aws <[email protected]> * Update _aggregations/metric/matrix-stats.md Co-authored-by: Nathan Bower <[email protected]> Signed-off-by: kolchfa-aws <[email protected]> --------- Signed-off-by: Dave Welsch <[email protected]> Signed-off-by: Dave Welsch <[email protected]> Signed-off-by: Fanit Kolchina <[email protected]> Signed-off-by: kolchfa-aws <[email protected]> Co-authored-by: kolchfa-aws <[email protected]> Co-authored-by: Fanit Kolchina <[email protected]> Co-authored-by: Nathan Bower <[email protected]>
Description
Updated descriptions. Added explanation of result statistics.
This PR will probably require some major revisions before it's published. I have several questions. See line comments.
Issues Resolved
Version
Frontend features
Checklist
For more information on following Developer Certificate of Origin and signing off your commits, please check here.