Description
Is your feature request related to a problem? Please describe.
Today, in OpenSearch, if you run queries against the same data set at different times, you will likely get different results because the data is constantly changing. In real-world scenarios, however, such as analyzing data or providing a consistent experience to end users, you may want query results to stay the same while the context remains the same, and to control when changes appear in the result set. You want to be able to query the same data set and paginate through it with consistent results. This is not possible with the options currently available in OpenSearch.
OpenSearch currently supports the following pagination options, each with its own limitation:
- Scroll API: A scroll cannot share its point-in-time context with other queries. Moreover, the Scroll API only moves forward (to the next page) in the search. If the client requests a page but fails to receive the response, a subsequent retry skips the page that was retried and returns the next page in the scroll.
- Search After: The search_after mechanism does not preserve the state of the data as of when the search was issued. One can paginate using the search_after key and fetch subsequent pages, but as pagination progresses the pages reflect changes made since the search was first issued.
- From/Size: The from and size mechanism does not support deep pagination, because every page request requires the shard to process all previous results before filtering out the requested page, which becomes more expensive the deeper the pagination goes.
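The search_after limitation can be illustrated with a toy in-memory model. This is only a sketch: `page_after`, `live`, and `snapshot` are hypothetical names, the integers stand in for sort keys, and copying the list stands in for the fixed view that a point in time would provide. It does not call OpenSearch.

```python
def page_after(docs, after, size):
    """Return the next `size` sort keys strictly greater than `after`."""
    ordered = sorted(docs)
    remaining = [d for d in ordered if after is None or d > after]
    return remaining[:size]


live = [10, 20, 30, 40]   # stands in for the constantly changing index
snapshot = list(live)     # stands in for a fixed point-in-time view

first_live = page_after(live, None, 2)
first_snap = page_after(snapshot, None, 2)

# A write arrives between the first and second page requests.
live.append(25)

# Paging over live data picks up the new document mid-pagination,
# while paging over the frozen view continues as if nothing changed.
second_live = page_after(live, first_live[-1], 2)
second_snap = page_after(snapshot, first_snap[-1], 2)

print(second_live)  # [25, 30]
print(second_snap)  # [30, 40]
```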
Describe the solution you'd like
Point in Time lets users run different queries against the same data set, fixed at a moment in time. A point in time only takes into account data up to the moment it is created. None of the resources required to return the data from the initial request are modified or deleted: segments are retained even if they have already been merged away and are no longer needed for the live data set. In short, Point in Time Search lets a user maintain state that can be reused by different queries to achieve consistent results.
Key goals:
- Optimize resource consumption compared to a scroll by providing a consistent, shareable view of the data set across queries. Otherwise, each individual query retains the segments it needs, which means more file handles, more disk, and more heap to hold segment metadata.
- Resilient to:
  - Network failures: allows searches to move forward with a search_after parameter
  - Shard failures for read-only data: allows retries on other shard copies that share the same segments (Phase II)
- Replaces the Scroll API as a more comprehensive solution for deep pagination when used with search_after
- Point in Time will be supported by Asynchronous Search and Cross Cluster searches
APIs
Create Point In Time API
Unlike a scroll, a dedicated Point in Time decouples the context from a single query and makes it reusable across arbitrary search requests by passing the Point in Time ID. We create one with the Create Point in Time API.
POST <index>/_point_in_time?keep_alive=1m
{
"id" : "s9O9QAIFaW5kZXgWOFVaMXFTc3pTV3lLMGE4VU42dmo4dwAWekthUVBmYnRUWk9XVzh4WW56TG5lZwAAAAAAAAAAARZQd3JkNlE4WlJicXRuS0M1VzNDaHV3BWluZGV4FjhVWjFxU3N6U1d5SzBhOFVONnZqOHcBFnpLYVFQZmJ0VFpPV1c4eFluekxuZWcAAAAAAAAAAAIWUHdyZDZROFpSYnF0bktDNVczQ2h1dwEWOFVaMXFTc3pTV3lLMGE4VU42dmo4dwAA",
"created_time" : 1632727466283,
"end_time" : 1632727526283
}
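In the sample response above, end_time is exactly created_time plus the 1m keep_alive (60,000 ms). The sketch below builds the proposed request and checks that arithmetic. It assumes that end_time is always derived this way, which the issue does not state explicitly. The helper names are hypothetical, and only a subset of time units (ms, s, m, h, d) is handled.

```python
import re

# Milliseconds per unit; an illustrative subset of time units.
_UNIT_MS = {"ms": 1, "s": 1_000, "m": 60_000, "h": 3_600_000, "d": 86_400_000}


def keep_alive_to_millis(keep_alive: str) -> int:
    """Convert a keep_alive string such as '1m' or '30s' to milliseconds."""
    match = re.fullmatch(r"(\d+)(ms|s|m|h|d)", keep_alive)
    if match is None:
        raise ValueError(f"unsupported keep_alive value: {keep_alive!r}")
    value, unit = match.groups()
    return int(value) * _UNIT_MS[unit]


def build_create_pit_request(index: str, keep_alive: str) -> dict:
    """Describe the proposed Create Point in Time call without sending it."""
    return {
        "method": "POST",
        "path": f"/{index}/_point_in_time",
        "params": {"keep_alive": keep_alive},
    }


def expected_end_time(created_time: int, keep_alive: str) -> int:
    """Assumption: end_time = created_time + keep_alive (epoch millis)."""
    return created_time + keep_alive_to_millis(keep_alive)


request = build_create_pit_request("my-index", "1m")
print(request["path"])                           # /my-index/_point_in_time
print(expected_end_time(1632727466283, "1m"))    # 1632727526283
```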
Delete Point In Time API
Point in times are automatically closed when the keep_alive has elapsed. However, keeping a point in time open has a cost, so it should be closed as soon as it is no longer used in search requests. A Point in Time can also be deleted, freeing its resources before the keep-alive expires, using the Delete Point in Time API.
DELETE /_point_in_time/<id>
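One way a client could make sure a Point in Time is closed as soon as it is no longer needed is to wrap creation and deletion in a context manager. The following is a minimal sketch, not a client implementation: `send(method, path, params)` is a hypothetical transport callable supplied by the caller, and the paths are the ones proposed in this issue.

```python
from contextlib import contextmanager
from urllib.parse import quote


@contextmanager
def point_in_time(send, index, keep_alive="1m"):
    """Create a Point in Time and always delete it on exit.

    `send(method, path, params)` is a hypothetical callable that performs
    the HTTP request and returns the decoded JSON response as a dict.
    """
    response = send("POST", f"/{index}/_point_in_time", {"keep_alive": keep_alive})
    pit_id = response["id"]
    try:
        yield pit_id
    finally:
        # Runs on normal exit and on error, so resources are freed early
        # instead of waiting for the keep_alive to elapse.
        send("DELETE", f"/_point_in_time/{quote(pit_id, safe='')}", {})


# Usage with a fake transport that only records the calls it receives.
calls = []


def fake_send(method, path, params):
    calls.append((method, path))
    return {"id": "abc123"}


with point_in_time(fake_send, "logs") as pit_id:
    pass  # run searches that pass pit_id here

print(calls)
```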
List All Active Point In Time API
A useful admin API to have is to list all active Points in Time and their keep-alives.
GET /_point_in_time
[
{
"id" : "point_in_time_id_1",
"created_time" : 1632727466283,
"end_time" : 1632727526283
},
{
"id" : "point_in_time_id_2",
"created_time" : 1674662833272,
"end_time" : 1632727526283
}
...
...
]
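An admin tool could use the list response to spot Points in Time that are about to expire. The sketch below assumes each entry carries "id", "created_time", and "end_time" fields in epoch milliseconds. The "id" key and both function names are assumptions for illustration, not part of the proposal.

```python
def remaining_keep_alive_ms(pit: dict, now_ms: int) -> int:
    """Milliseconds left before the Point in Time expires (never negative)."""
    return max(0, pit["end_time"] - now_ms)


def pits_expiring_within(pits: list, now_ms: int, window_ms: int) -> list:
    """Return the IDs of Points in Time that expire within `window_ms`."""
    return [
        pit["id"]
        for pit in pits
        if remaining_keep_alive_ms(pit, now_ms) <= window_ms
    ]


# Made-up sample data, using small numbers so the arithmetic is easy to follow.
active = [
    {"id": "a", "created_time": 1_000, "end_time": 61_000},
    {"id": "b", "created_time": 1_000, "end_time": 301_000},
]

print(pits_expiring_within(active, now_ms=31_000, window_ms=60_000))  # ['a']
```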
Using a Point in Time in a search request:
In the search request we pass the Point in Time ID and, optionally, a keep_alive to extend the Point in Time. (Passing a PIT ID in a search request is supported in OpenSearch.)
A search request with a PIT ID does not accept indices, preference, routing, or indices options, because these are already fixed when the Point in Time is created.
GET /_search
{
"pit": {
"id": "ID_RETURNED_FROM_CREATE_POINT_IN_TIME_REQUEST",
"keep_alive": "1m" //optional to extend a Point In Time
},
"sort": [
{
"name.keyword": {
"order": "desc"
}
}
],
"search_after" : ["Opensearch", 1] //optional to fetch further results
}
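Paging through a Point in Time with search_after amounts to copying the sort values of the last hit on one page into the next request. Here is a rough sketch of that loop's building blocks. The function names are hypothetical, and the response is assumed to follow the usual search response shape, where each hit under hits.hits carries a "sort" array.

```python
def build_pit_search_body(pit_id, sort, keep_alive=None, search_after=None):
    """Build a search body that targets a Point in Time.

    No index, preference, or routing is included: those are fixed when the
    Point in Time is created.
    """
    pit = {"id": pit_id}
    if keep_alive is not None:
        pit["keep_alive"] = keep_alive  # optional: extends the Point in Time
    body = {"pit": pit, "sort": sort}
    if search_after is not None:
        body["search_after"] = search_after  # optional: fetch the next page
    return body


def next_search_after(response):
    """Return the sort values of the last hit, or None when the page is empty."""
    hits = response.get("hits", {}).get("hits", [])
    return hits[-1]["sort"] if hits else None


sort = [{"name.keyword": {"order": "desc"}}]
first_body = build_pit_search_body("PIT_ID", sort, keep_alive="1m")

# A made-up response fragment, standing in for what the cluster would return.
fake_response = {"hits": {"hits": [{"sort": ["Zed", 7]}, {"sort": ["Opensearch", 1]}]}}

cursor = next_search_after(fake_response)
second_body = build_pit_search_body("PIT_ID", sort, keep_alive="1m", search_after=cursor)
print(second_body["search_after"])  # ['Opensearch', 1]
```

Because the cursor is sent by the client with every request, a page whose response was lost can be requested again with the same search_after values, which is what makes this approach resilient to network failures.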