Description
Is your feature request related to a problem? Please describe.
Primary term validation with replicas
In reference to #3706, with segment replication and a remote store for the translog, storing the translog on replicas becomes obsolete. Not just that, the in-sync replication call to the replicas that happens during a write call also becomes obsolete. As we know, the replication call serves two use cases: 1) replicating the data for durability and 2) primary term validation. While the first use case is taken care of by using the remote store for the translog, the second still needs to be handled.
Challenges
Until now, we were planning to achieve the no-op by modifying the TransportBulk/WriteAction call and making it a no-op. While we would not store any data, one concern remained: these replication calls modify the replication tracker state on the replica shard. With the older approach, we would have needed a no-op replication tracker proxy and would have had to cut off all calls to replicas that update the replication tracker on the replicas. This is to make sure that we do not update any state on the replica - be it data (segment/translog) or the (global/local) checkpoints - during the replication call or the async calls. This is cumbersome to implement and makes the code heavily intertwined, with a lot of logic deciding when to update checkpoints (replication tracker) and when not to. The following things hinder or add to the complexity of the older approach -
- Global/Local checkpoint calculation - Whenever a shard is marked as inSync in the checkpoint state, it starts contributing to the calculation of the global checkpoint on the primary. Since the translog is not written on the replicas, the global checkpoint calculation on the primary should only consider its own local checkpoint and not concern itself with the replicas (see the sketch after this list). The local checkpoint has no meaning on replicas since the data is replicated using segment replication.
- Retention Leases - Whenever a shard is marked as tracked in the checkpoint state, there is an implicit expectation of a corresponding retention lease for the shard. Retention leases are useful for sequence-number-based peer recovery. With the translog on the remote store, retention leases do not make sense for recovering replicas. However, they are still required for primary-primary peer recovery.
- The assertInvariant() method inside ReplicationTracker expects a PRRL (peer recovery retention lease) to be present for each shard that is tracked in the CheckpointState in the ReplicationTracker. For replicas, since we do not have translogs present locally, a PRRL is not required anymore. But if a shard is tracked, it is implied that a PRRL exists.
- Every action that extends TransportReplicationAction is a replicated call which is first performed on the primary and then fanned out to the active replicas. These calls perform some activity on the primary and the same or a different activity on the replica. The most common ones are TransportWriteAction and TransportShardBulkAction, which no longer need to be fanned out as we do not need the translogs written on replicas anymore. Below are the other actions (which may or may not be required with remote store for translog) -
- GlobalCheckpointAction - This syncs the global checkpoint to replicas. This is not required for remote replicas.
- PublishCheckpointAction - This publishes the checkpoint for segment replication on account of refreshes. It is a no-op on the primary and is relevant for replicas to kick off segment replication asynchronously. This is required.
- RetentionLeaseBackgroundSyncAction - On the primary, it persists the retention lease state on disk. On the replica, this currently copies over all the retention leases from the primary and persists them on disk. This is a background job that is scheduled periodically. There is an open item on whether we should remove this action altogether. This might be required for replicas.
- TransportShardFlushAction - Executes a flush request against the engine. This is not required for remote replicas as NRTReplicationEngine already makes the flush method in its engine a no-op.
- TransportShardRefreshAction - This writes all indexing changes to disk and opens a new searcher. This is not required for remote replicas as NRTReplicationEngine already makes the refresh method in its engine a no-op.
- TransportVerifyShardBeforeCloseAction - Verifies that the max seq no and global checkpoint are the same on the primary and replica. The flush method is a no-op for replicas. This does not look to be required for replicas.
- TransportVerifyShardIndexBlockAction - This ensures that an Index block has been applied on primary and replicas. This might be required for replicas.
- TransportWriteAction -
- TransportShardBulkAction - Write request. Not required on replicas anymore.
- TransportResyncReplicationAction - When a replica shard is promoted to the new primary, the other replicas are reset up to the global checkpoint and this action is used to replay the Lucene operations to make sure that the other replica shards have the same consistent view as the new primary. With the remote store, the need to reset the engine and replay Lucene operations becomes obsolete. This is not required on replicas anymore.
- RetentionLeaseSyncAction - Same as RetentionLeaseBackgroundSyncAction. This is triggered when a new retention lease is created or an existing retention lease is removed on the primary.
- All the above transport actions are fanned out to the replication targets of the replication group. These are the shards that have tracked set to true in the CheckpointState within the IndexShard's ReplicationTracker. Implementing no-op replication (primary term validation) within TransportShardBulkAction would mean putting a lot of logic into each of the TransportReplicationAction actions. This would make the code full of if/else conditions and at the same time highly unreadable and unmaintainable over the long term.
- Within the ReplicationTracker class, there is a method invariant() which is used to check certain invariants with respect to the replication tracker. It ensures certain expected behaviour for retention leases, the global checkpoint, and the replication group for shards that are tracked.
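To make the first challenge concrete, here is a minimal sketch of why keeping remote-store replicas out of the in-sync set reduces the global checkpoint computation to the primary's own local checkpoint. The class and record names are illustrative assumptions, not the actual OpenSearch code:

```java
import java.util.Map;

class GlobalCheckpointSketch {

    // Simplified stand-in for the per-allocation CheckpointState kept by the ReplicationTracker.
    record CheckpointState(long localCheckpoint, boolean inSync, boolean tracked) {}

    static long computeGlobalCheckpoint(Map<String, CheckpointState> checkpoints) {
        long globalCheckpoint = Long.MAX_VALUE;
        for (CheckpointState state : checkpoints.values()) {
            // Only in-sync copies participate; remote-translog replicas stay inSync=false,
            // so in practice only the primary's own local checkpoint is considered.
            if (state.inSync()) {
                globalCheckpoint = Math.min(globalCheckpoint, state.localCheckpoint());
            }
        }
        return globalCheckpoint;
    }

    public static void main(String[] args) {
        Map<String, CheckpointState> checkpoints = Map.of(
            "primary-alloc-id", new CheckpointState(42, true, true),
            "replica-alloc-id", new CheckpointState(-2, false, false) // not tracked, not in sync
        );
        // Prints 42: the replica does not drag the global checkpoint down.
        System.out.println(computeGlobalCheckpoint(checkpoints));
    }
}
```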
Proposal
However, this can be handled with a separate call for primary term validation (let's call it the primary term validation call) alongside keeping tracked and inSync as false in the CheckpointState in the ReplicationTracker. Whenever the cluster manager publishes an updated cluster state, the primary term gets updated on the replicas. When primary term validation happens, the primary term supplied in the request is validated against it. On an incorrect primary term, the acknowledgement fails and the (presumably isolated) primary can fail itself. This also makes the approach and code change cleaner.
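A minimal sketch of the replica-side check, assuming illustrative class and method names (not the final API):

```java
class PrimaryTermValidationSketch {

    static final class StalePrimaryTermException extends RuntimeException {
        StalePrimaryTermException(String message) {
            super(message);
        }
    }

    // Latest primary term known to the replica, kept up to date via cluster state
    // updates published by the cluster manager.
    private volatile long latestKnownPrimaryTerm;

    PrimaryTermValidationSketch(long initialPrimaryTerm) {
        this.latestKnownPrimaryTerm = initialPrimaryTerm;
    }

    void onClusterStateUpdate(long primaryTermFromClusterState) {
        // The primary term only moves forward.
        latestKnownPrimaryTerm = Math.max(latestKnownPrimaryTerm, primaryTermFromClusterState);
    }

    // Called when a (possibly isolated) primary sends the primary term validation request.
    void validatePrimaryTerm(long requestPrimaryTerm) {
        if (requestPrimaryTerm < latestKnownPrimaryTerm) {
            // The acknowledgement fails; the stale primary can then fail itself.
            throw new StalePrimaryTermException("request term [" + requestPrimaryTerm
                + "] is older than current term [" + latestKnownPrimaryTerm + "]");
        }
    }

    public static void main(String[] args) {
        PrimaryTermValidationSketch replica = new PrimaryTermValidationSketch(1);
        replica.onClusterStateUpdate(2);    // cluster manager bumped the term
        replica.validatePrimaryTerm(2);     // current primary: passes
        try {
            replica.validatePrimaryTerm(1); // stale (isolated) primary: rejected
        } catch (StalePrimaryTermException e) {
            System.out.println(e.getMessage());
        }
    }
}
```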
With the new approach, we can do the following -
- Keep replicas of remote-store-enabled indices with CheckpointState.tracked and CheckpointState.inSync as false. - Can be Handled. tracked = false makes the existing replication calls not happen. inSync = false allows us to not tweak code in the ReplicationTracker class; instead we create another variable called validatePrimaryTerm. This can be used to know which replicas we can send the primary term validation call to (see the sketch after this list).
- The cluster manager will continue to send cluster state updates to this node (where the replica shard resides). Indirectly, the latest cluster state is always known to the replica shard in spite of it having tracked and inSync as false. - Can be Handled.
- The replica shard will get into the STARTED state in spite of tracked and inSync being false. ShardStateAction - this is used for informing the cluster manager that a particular shard has started. - Can be Handled.
- Segment replication will work without triggering shard.initiateTracking(..) and shard.markAllocationIdAsInSync(..). The request is fanned out to replicas by a transport action which is a no-op on the primary and publishes the checkpoint on replicas. If a shard is not tracked and not in sync, segment replication would stop working. - Can be Handled.
- The active replica shards will continue to serve search traffic. Adaptive replica selection - as long as we inform the cluster manager about a shard being started, the shard selection for replicas accounts for the shards as before. - No code change required.
- The cluster manager should be aware of all replicas that are available for primary promotion. Make the appropriate code change to allow the cluster manager to elect a new primary using validatePrimaryTerm = true. Currently, it probably uses the inSync allocations concept which it fetches from the cluster state. - No code change required on the Master side.
- We can have a wrapper over ReplicationTracker which disallows and throws an exception for all methods that update the state when that is not expected to happen. - Need to evaluate.
- Refactor Peer recovery and start the replica without doing initiate tracking and marking in sync. - Can be Handled.
- Primary term on replica/primary shard should get updated on account of shard state updates (cluster metadata updates). - Can be Handled.
- Segment replication currently uses replication calls to propagate the publish checkpoint to the replicas. The replication calls try to update the global checkpoint and maxSeqNoOfUpdatesOrDeletes (a no-op on the NRT engine). - Yet to Handle. It does not hurt if we keep it like this anyway.
- Create a separate call for primary term validation - Yet to Handle
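An illustrative sketch of the first item above: a validatePrimaryTerm flag alongside tracked/inSync, used to pick the replicas that receive the primary term validation call instead of the full replication call. All names here are assumptions for illustration only:

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

class ValidationTargetsSketch {

    record CheckpointState(boolean tracked, boolean inSync, boolean validatePrimaryTerm) {}

    // tracked=false and inSync=false keep the existing replication/global-checkpoint
    // machinery away from remote-store replicas; validatePrimaryTerm=true marks them
    // as targets of the lightweight primary term validation call.
    static List<String> primaryTermValidationTargets(Map<String, CheckpointState> checkpoints) {
        return checkpoints.entrySet().stream()
            .filter(e -> e.getValue().validatePrimaryTerm())
            .map(Map.Entry::getKey)
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        Map<String, CheckpointState> checkpoints = Map.of(
            "primary-alloc-id", new CheckpointState(true, true, false),
            "replica-alloc-id", new CheckpointState(false, false, true)
        );
        System.out.println(primaryTermValidationTargets(checkpoints)); // [replica-alloc-id]
    }
}
```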
The above approach requires a further deep dive and a POC -
- Master promoting another replica as primary
- How is the ShardRouting.state changed to ShardRoutingState.STARTED or ShardRoutingState.RELOCATING?
- Primary to replica - TransportWriteAction - On what basis does the request get fanned out?
- Segment replication - Since
- Any implication on hasAllPeerRecoveryRetentionLeases in the ReplicationTracker class.
- The active replica shards will continue to serve search traffic. Adaptive replica selection - this will be updated to allow search requests to fan out across the different active replicas.
- The ReplicationGroup class has a derived field called replicationTargets. This is used to fan out requests to replicas for general requests that need to be replicated.
- ReplicationTracker has a method called updateFromClusterManager. In this method, if an allocation id does not exist in the inSyncAllocationIds or the initializingAllocationIds (derived from the RoutingTable), and based on the primaryMode field value, the checkpoints are updated. This logic will be updated to account for primaryTermValidation (see the sketch after this list).
- The primary term on the replica/primary shard should get updated on account of shard state updates (cluster metadata updates).
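A rough sketch of the updateFromClusterManager change described above. This is an assumed shape, not the actual ReplicationTracker code: the idea is to retain entries for replicas that only need primary term validation even though they are neither in sync nor initializing.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

class UpdateFromClusterManagerSketch {

    record CheckpointState(boolean tracked, boolean inSync, boolean validatePrimaryTerm) {}

    static void updateFromClusterManager(Map<String, CheckpointState> checkpoints,
                                         Set<String> inSyncAllocationIds,
                                         Set<String> initializingAllocationIds,
                                         Set<String> primaryTermValidationAllocationIds) {
        // Existing behaviour: drop entries for allocation ids no longer in the routing table.
        // Proposed tweak: also retain ids that are primary-term-validation targets so the
        // validation fan-out still knows about them.
        checkpoints.keySet().removeIf(allocationId ->
            inSyncAllocationIds.contains(allocationId) == false
                && initializingAllocationIds.contains(allocationId) == false
                && primaryTermValidationAllocationIds.contains(allocationId) == false);

        // New entries for validation-only replicas stay tracked=false, inSync=false.
        for (String allocationId : primaryTermValidationAllocationIds) {
            checkpoints.putIfAbsent(allocationId, new CheckpointState(false, false, true));
        }
    }

    public static void main(String[] args) {
        Map<String, CheckpointState> checkpoints = new HashMap<>();
        checkpoints.put("old-replica", new CheckpointState(true, true, false));
        updateFromClusterManager(checkpoints,
            Set.of("primary-alloc-id"), Set.of(), Set.of("new-replica"));
        System.out.println(checkpoints.keySet()); // prints [new-replica]
    }
}
```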
Open Questions -
- Segment replication (PublishCheckpointAction) is built on top of TransportReplicationAction, which relies on the replication targets.
- Currently, RetentionLeaseBackgroundSyncAction on the primary persists the retention leases by writing the given state to the given directories, cleaning up old state files if the write succeeds, or the newly created state file if the write fails. On replicas, we copy the same retention leases and persist them on the replicas too. With replicas having tracked as false, the replication calls do not go through and there are no retention leases. What is the consequence of this?
Describe the solution you'd like
Mentioned above.
Describe alternatives you've considered
A clear and concise description of any alternative solutions or features you've considered.
Additional context
Add any other context or screenshots about the feature request here.