Description
Is your feature request related to a problem? Please describe.
Primary term validation with replicas
In reference to #3706, with segment replication and a remote store for the translog, storing the translog on replicas becomes obsolete. Not just that, the in-sync replication call to the replicas that happens during a write call also becomes obsolete. As we know, the replication call serves two use cases: 1) replicating the data for durability and 2) primary term validation. While the first use case is taken care of by using the remote store for the translog, the second still needs to be handled.
Challenges
Until now, we were planning to achieve the no-op by modifying the TransportBulk/WriteAction call and making it a no-op. While we would not store any data, one concern remained: these replication calls modify the replication tracker state on the replica shard. With the older approach, we would have needed a no-op replication tracker proxy and would have had to cut off all calls to replicas that update the replication tracker on the replicas. This is to make sure that we do not update any state on the replica - be it data (segment/translog) or the (global/local) checkpoints - during the replication call or the async calls. This is cumbersome to implement and makes the code heavily intertwined, with a lot of logic deciding when to update checkpoints (replication tracker) and when not to. The following things hinder or add to the complexity of the older approach -
- Global/Local checkpoint calculation - Whenever a shard is marked as inSync in the checkpoint state, it starts contributing to the calculation of the global checkpoint on the primary. Since the translog is not written on the replicas, the global checkpoint calculation on the primary should only consider its own local checkpoint and not concern itself with the replicas (see the sketch after this list). The local checkpoint has no meaning on replicas since the data is replicated using segment replication.
- Retention Leases - Whenever a shard is marked as tracked in the checkpoint state, there is an implicit expectation of a corresponding retention lease for the shard. Retention leases are useful for sequence-number-based peer recovery. With the translog on the remote store, retention leases do not make sense for recovering replicas. However, they are still required for primary-primary peer recovery.
- The assertInvariant() method inside ReplicationTracker expects a PRRL (peer recovery retention lease) to be present for each shard that is tracked in the CheckpointState in the ReplicationTracker. For replicas, since we do not have translogs present locally, a PRRL is not required anymore. But if a shard is tracked, it is implied that a PRRL exists.
- Every action that extends TransportReplicationAction is a replicated call which is first performed on the primary and then fanned out to the active replicas. These calls perform some activity on the primary and the same or a different activity on the replica. The most common ones are TransportWriteAction and TransportShardBulkAction, which no longer need to be fanned out as we do not need the translogs written on replicas anymore. Below are the other actions (which may or may not be required with remote store for translog) -
- GlobalCheckpointAction - This syncs the global checkpoint to replicas. This is not required for remote replicas.
- PublishCheckpointAction - This publishes the checkpoint for segment replication on account of refreshes. It is a no-op on the primary and is relevant for replicas to kick off segment replication asynchronously. This is required.
- RetentionLeaseBackgroundSyncAction - On the primary, it persists the retention lease state on disk. On the replica, this currently copies over all the retention leases from the primary and persists them on disk. This is a background job that is scheduled periodically. There is an open item on whether we should remove this action altogether. This might be required for replicas.
- TransportShardFlushAction - Executes a flush request against the engine. This is not required for remote replicas as NRTReplicationEngine already makes the flush method in its engine a no-op.
- TransportShardRefreshAction - This writes all indexing changes to disk and opens a new searcher. This is not required for remote replicas as NRTReplicationEngine already makes the refresh method in its engine a no-op.
- TransportVerifyShardBeforeCloseAction - Verifies that the max seq no and global checkpoint are the same on the primary and replica. The flush method is a no-op for replicas. This does not look to be required for replicas.
- TransportVerifyShardIndexBlockAction - This ensures that an Index block has been applied on primary and replicas. This might be required for replicas.
- TransportWriteAction -
- TransportShardBulkAction - Write request. Not required on replicas anymore.
- TransportResyncReplicationAction - When a replica shard is promoted to the new primary, the other replicas are reset up to the global checkpoint and this action is used to replay the Lucene operations to make sure that the other replica shards have the same consistent view as the new primary. With the remote store, the need to reset the engine and replay Lucene operations becomes obsolete. This is not required on replicas anymore.
- RetentionLeaseSyncAction - Same as RetentionLeaseBackgroundSyncAction. This is triggered when a new retention lease is created or an existing retention lease is removed on the primary.
- All the above transport actions are fanned out to the replication targets of the replication group. These are the shards that have tracked set to true in the CheckpointState within the IndexShard's ReplicationTracker. Implementing no-op replication (primary term validation) within TransportShardBulkAction would mean putting a lot of logic into each of the TransportReplicationAction actions. This would make the code full of if/else conditions and at the same time highly unreadable and unmaintainable over the long term.
- Within the ReplicationTracker class, there is a method invariant() which is used to check certain invariants with respect to the replication tracker. It ensures certain expected behaviour for retention leases, the global checkpoint, and the replication group for shards that are tracked.
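To make the first challenge concrete, here is a minimal sketch of why keeping remote-store replicas out of the in-sync set reduces the global checkpoint computation to the primary's own local checkpoint. The class and record names are illustrative assumptions, not the actual OpenSearch code:

```java
import java.util.Map;

class GlobalCheckpointSketch {

    // Simplified stand-in for the per-allocation CheckpointState kept by the ReplicationTracker.
    record CheckpointState(long localCheckpoint, boolean inSync, boolean tracked) {}

    static long computeGlobalCheckpoint(Map<String, CheckpointState> checkpoints) {
        long globalCheckpoint = Long.MAX_VALUE;
        for (CheckpointState state : checkpoints.values()) {
            // Only in-sync copies participate; remote-translog replicas stay inSync=false,
            // so in practice only the primary's own local checkpoint is considered.
            if (state.inSync()) {
                globalCheckpoint = Math.min(globalCheckpoint, state.localCheckpoint());
            }
        }
        return globalCheckpoint;
    }

    public static void main(String[] args) {
        Map<String, CheckpointState> checkpoints = Map.of(
            "primary-alloc-id", new CheckpointState(42, true, true),
            "replica-alloc-id", new CheckpointState(-2, false, false) // not tracked, not in sync
        );
        // Prints 42: the replica does not drag the global checkpoint down.
        System.out.println(computeGlobalCheckpoint(checkpoints));
    }
}
```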
Proposal
However, this can be handled with a separate call for primary term validation (let's call it the primary term validation call) alongside keeping tracked and inSync as false in the CheckpointState in the ReplicationTracker. Whenever the cluster manager publishes an updated cluster state, the primary term gets updated on the replicas. When primary term validation happens, the primary term supplied in the request is validated against it. On an incorrect primary term, the acknowledgement fails and the (presumably isolated) primary can fail itself. This also makes the approach and code change cleaner.
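A minimal sketch of the replica-side check, assuming illustrative class and method names (not the final API):

```java
class PrimaryTermValidationSketch {

    static final class StalePrimaryTermException extends RuntimeException {
        StalePrimaryTermException(String message) {
            super(message);
        }
    }

    // Latest primary term known to the replica, kept up to date via cluster state
    // updates published by the cluster manager.
    private volatile long latestKnownPrimaryTerm;

    PrimaryTermValidationSketch(long initialPrimaryTerm) {
        this.latestKnownPrimaryTerm = initialPrimaryTerm;
    }

    void onClusterStateUpdate(long primaryTermFromClusterState) {
        // The primary term only moves forward.
        latestKnownPrimaryTerm = Math.max(latestKnownPrimaryTerm, primaryTermFromClusterState);
    }

    // Called when a (possibly isolated) primary sends the primary term validation request.
    void validatePrimaryTerm(long requestPrimaryTerm) {
        if (requestPrimaryTerm < latestKnownPrimaryTerm) {
            // The acknowledgement fails; the stale primary can then fail itself.
            throw new StalePrimaryTermException("request term [" + requestPrimaryTerm
                + "] is older than current term [" + latestKnownPrimaryTerm + "]");
        }
    }

    public static void main(String[] args) {
        PrimaryTermValidationSketch replica = new PrimaryTermValidationSketch(1);
        replica.onClusterStateUpdate(2);    // cluster manager bumped the term
        replica.validatePrimaryTerm(2);     // current primary: passes
        try {
            replica.validatePrimaryTerm(1); // stale (isolated) primary: rejected
        } catch (StalePrimaryTermException e) {
            System.out.println(e.getMessage());
        }
    }
}
```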
With the new approach, we can do the following -
- Keep replicas of remote-store-enabled indices with CheckpointState.tracked and CheckpointState.inSync as false. - Can be Handled. tracked = false makes the existing replication calls not happen. inSync = false allows us to not tweak code in the ReplicationTracker class; instead we create another variable called validatePrimaryTerm. This can be used to know which replicas we can send the primary term validation call to (see the sketch after this list).
- The cluster manager will continue to send cluster state updates to this node (where the replica shard resides). Indirectly, the latest cluster state is always known to the replica shard in spite of it having tracked and inSync as false. - Can be Handled.
- The replica shard will get into the STARTED state in spite of tracked and inSync being false. ShardStateAction - this is used for informing the cluster manager that a particular shard has started. - Can be Handled.
- Segment replication will work without triggering shard.initiateTracking(..) and shard.markAllocationIdAsInSync(..). The request is fanned out to replicas by a transport action which is a no-op on the primary and publishes the checkpoint on replicas. If a shard is not tracked and not in sync, segment replication would stop working. - Can be Handled.
- The active replica shards will continue to serve search traffic. Adaptive replica selection - as long as we inform the cluster manager about a shard being started, the shard selection for replicas accounts for the shards as before. - No code change required.
- The cluster manager should be aware of all replicas that are available for primary promotion. Make the appropriate code change to allow the cluster manager to elect a new primary using validatePrimaryTerm = true. Currently, it probably uses the inSync allocations concept which it fetches from the cluster state. - No code change required on the Master side.
- We can have a wrapper over ReplicationTracker which disallows and throws an exception for all methods that update the state when that is not expected to happen. - Need to evaluate.
- Refactor Peer recovery and start the replica without doing initiate tracking and marking in sync. - Can be Handled.
- Primary term on replica/primary shard should get updated on account of shard state updates (cluster metadata updates). - Can be Handled.
- Segment replication currently uses replication calls to propagate the publish checkpoint to the replicas. The replication calls try to update the global checkpoint and maxSeqNoOfUpdatesOrDeletes (a no-op on the NRT engine). - Yet to Handle. It does not hurt if we keep it like this anyway.
- Create a separate call for primary term validation - Yet to Handle
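An illustrative sketch of the first item above: a validatePrimaryTerm flag alongside tracked/inSync, used to pick the replicas that receive the primary term validation call instead of the full replication call. All names here are assumptions for illustration only:

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

class ValidationTargetsSketch {

    record CheckpointState(boolean tracked, boolean inSync, boolean validatePrimaryTerm) {}

    // tracked=false and inSync=false keep the existing replication/global-checkpoint
    // machinery away from remote-store replicas; validatePrimaryTerm=true marks them
    // as targets of the lightweight primary term validation call.
    static List<String> primaryTermValidationTargets(Map<String, CheckpointState> checkpoints) {
        return checkpoints.entrySet().stream()
            .filter(e -> e.getValue().validatePrimaryTerm())
            .map(Map.Entry::getKey)
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        Map<String, CheckpointState> checkpoints = Map.of(
            "primary-alloc-id", new CheckpointState(true, true, false),
            "replica-alloc-id", new CheckpointState(false, false, true)
        );
        System.out.println(primaryTermValidationTargets(checkpoints)); // [replica-alloc-id]
    }
}
```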
The above approach requires a further deep dive and a POC -
- Master promoting another replica as primary
- How is the ShardRouting.state changed to ShardRoutingState.STARTED or ShardRoutingState.RELOCATING?
- Primary to replica - TransportWriteAction - On what basis does the request get fanned out?
- Segment replication - Since
- Any implication on hasAllPeerRecoveryRetentionLeases in the ReplicationTracker class.
- The active replica shards will continue to serve search traffic. Adaptive replica selection - this will be updated to allow search requests to fan out across the different active replicas.
- The ReplicationGroup class has a derived field called replicationTargets. This is used to fan out requests to replicas for general requests that need to be replicated.
- ReplicationTracker has a method called updateFromClusterManager. In this method, if an allocation id does not exist in the inSyncAllocationIds or the initializingAllocationIds (derived from the RoutingTable), and based on the primaryMode field value, the checkpoints are updated. This logic will be updated to account for primaryTermValidation (see the sketch after this list).
- The primary term on the replica/primary shard should get updated on account of shard state updates (cluster metadata updates).
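A rough sketch of the updateFromClusterManager change described above. This is an assumed shape, not the actual ReplicationTracker code: the idea is to retain entries for replicas that only need primary term validation even though they are neither in sync nor initializing.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

class UpdateFromClusterManagerSketch {

    record CheckpointState(boolean tracked, boolean inSync, boolean validatePrimaryTerm) {}

    static void updateFromClusterManager(Map<String, CheckpointState> checkpoints,
                                         Set<String> inSyncAllocationIds,
                                         Set<String> initializingAllocationIds,
                                         Set<String> primaryTermValidationAllocationIds) {
        // Existing behaviour: drop entries for allocation ids no longer in the routing table.
        // Proposed tweak: also retain ids that are primary-term-validation targets so the
        // validation fan-out still knows about them.
        checkpoints.keySet().removeIf(allocationId ->
            inSyncAllocationIds.contains(allocationId) == false
                && initializingAllocationIds.contains(allocationId) == false
                && primaryTermValidationAllocationIds.contains(allocationId) == false);

        // New entries for validation-only replicas stay tracked=false, inSync=false.
        for (String allocationId : primaryTermValidationAllocationIds) {
            checkpoints.putIfAbsent(allocationId, new CheckpointState(false, false, true));
        }
    }

    public static void main(String[] args) {
        Map<String, CheckpointState> checkpoints = new HashMap<>();
        checkpoints.put("old-replica", new CheckpointState(true, true, false));
        updateFromClusterManager(checkpoints,
            Set.of("primary-alloc-id"), Set.of(), Set.of("new-replica"));
        System.out.println(checkpoints.keySet()); // prints [new-replica]
    }
}
```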
Open Questions -
- Segment replication (PublishCheckpointAction) is built on top of TransportReplicationAction, which relies on the replication targets.
- Currently, RetentionLeaseBackgroundSyncAction on the primary persists the retention leases by writing the given state to the given directories, cleaning up old state files if the write succeeds, or the newly created state file if the write fails. On replicas, we copy the same retention leases and persist them on the replicas too. With replicas having tracked as false, the replication calls do not go through and there are no retention leases. What is the consequence of this?
Describe the solution you'd like
Mentioned above.
Describe alternatives you've considered
A clear and concise description of any alternative solutions or features you've considered.
Additional context
Add any other context or screenshots about the feature request here.