Skip to content

Optimizing Data Storage and Retrieval for Time Series data. #9568

Open
@nkumar04

Description

@nkumar04

Is your feature request related to a problem? Please describe.
In OpenSearch, documents are stored using three primary formats: indexed, stored, and docValues. In case of time series data, a document usually consists of dimensions, time point and quantitative measurements that are used monitor various aspects of a system, process, or phenomenon. In cases like these, the numeric, keyword and similar datatypes are stored as the "stored" field as well as docValues that serves specific purposes related to search performance and retrieval. DocValues are a columnar storage format used by Lucene to store indexed data in a way that facilitates efficient aggregation, sorting etc and stored fields, on the other hand, are used to store the actual values of fields as they were inserted into the index.
For example, lets look at a document consists of performance related data points of an ec2 instance.

{
	"hostName": "xyz",
	"hostIp": "x.x.x.x",
	"zone" : "us-east-1a"
	"@timestamp": 1693192062,
	"cpu": 80
	"jvm": 85
}

Here, hostName, hostIp and zone uses keyword as field type and timestamp/cpu/jvm uses numeric fields. Values for dimensions and mesurements are stored in docValue as well in stored fields as _source. The most common search query for such data set is aggregations like min/max/avg/sum. As we can see, data is stored twice here, we can possibly avoid storing data twice.

Describe the solution you'd like
Currently the _source field stores the original documents as stored field. We can possibly skip storing the _source field in such cases and retrieve the field values from docValue instead. This will help in reducing the storage cost significantly. Based on the nature of the query we can skip or fetch some or all of the fields from docValues to serve the search queries.

Metadata

Metadata

Assignees

No one assigned

    Labels

    IndexingIndexing, Bulk Indexing and anything related to indexingPerformanceThis is for any performance related enhancements or bugsdiscussIssues intended to help drive brainstorming and decision makingenhancementEnhancement or improvement to existing feature or request

    Type

    No type

    Projects

    Status

    Open

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions