Reprioritize responses of GetReplicationMessagesResponse in frontend #6696

arzonus
Contributor

@arzonus arzonus commented Mar 3, 2025

What changed?
When the aggregated response to a GetReplicationMessages call has to be shrunk to fit the size limit, the frontend service now gives priority to the peer responses from history services whose replication tasks have the oldest creation time.

Why?

A replication request is sent from the passive side to the active side and aggregates requests for multiple shard IDs. The frontend service on the active side calls GetReplicationMessages on the history services once per peer (a peer hosts multiple shards). The results are aggregated into one response and sent back to the passive side.

In some cases the aggregated response may exceed the max payload size (4MB by default). In that case, the frontend service used to pick peer responses in random order until the limit was reached. This could cause replication lag: a peer response large enough to take up the whole payload could repeatedly fail to be selected because of the randomized ordering of peer responses.

This PR changes the sorting logic to sort by the CreationTime of replication tasks. Instead of using a randomized order, the frontend puts responses with older replication task creation times first in the final response and responses with newer creation times later. This should prevent replication lag for shards whose replication tasks are large.

This PR should remove the need for an on-call engineer to manually re-run replication for stuck shards.

How did you test it?

  • Unit tests
  • Manual tests on staging envs

Potential risks
Shards without large replication tasks may experience somewhat longer delays, because responses with older replication tasks are prioritized.

Release notes

Documentation Changes

// if earliestCreationTime is equal, it will compare the size of the response
func cmpGetReplicationMessagesWithSize(a, b *getReplicationMessagesWithSize) int {
	// a > b
	if a == nil || a.earliestCreationTime == nil {
Member

In what case is earliestCreationTime nil?
If this case is common, shouldn't we sort randomly instead?

Contributor Author

Good question! As far as I understand from the code, CreationTime was originally a non-pointer field, but it was later changed to a pointer internally.

Member

So, in general, we do not expect nils and it's more of a precaution?

Contributor Author

Yes, we don't expect nils there, but we want to be sure that a nil pointer dereference can never happen.

Comment on lines +310 to +312
if v == nil {
	return nil
}
Member

It is very unusual to check whether the receiver pointer is nil. Is there an expected code path that continues when *GetReplicationMessagesResponse is nil and operates on it?

Contributor Author

All other methods of the struct do the same check, so I assume the pointer can be nil in this case.

Member

Got it. Makes total sense for consistency.

Comment on lines 640 to 645
if earliestTime == nil {
	return nil
}

result := *earliestTime
return &result
Member

Why not just return earliestTime?

Contributor Author

It's not mandatory, but it protects the underlying value (which was obtained indirectly, through a pointer into another structure) from being modified by mistake.
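
A small sketch of why the copy matters, using a hypothetical `task` type and `earliestCreationTime` helper rather than the code under review: returning the internal pointer directly would let a caller overwrite a task's creation time, while returning a pointer to a copy does not.

```go
package main

import "fmt"

// task is a hypothetical type; creationTime is a pointer as in the thread above.
type task struct {
	creationTime *int64
}

// earliestCreationTime returns a pointer to a copy of the smallest creation
// time, or nil if no task has one. Callers cannot mutate the tasks through
// the returned pointer.
func earliestCreationTime(tasks []*task) *int64 {
	var earliest *int64
	for _, t := range tasks {
		if t == nil || t.creationTime == nil {
			continue
		}
		if earliest == nil || *t.creationTime < *earliest {
			earliest = t.creationTime // still aliases the task's field
		}
	}
	if earliest == nil {
		return nil
	}

	result := *earliest // defensive copy
	return &result
}

func main() {
	a, b := int64(50), int64(30)
	tasks := []*task{{creationTime: &a}, {creationTime: &b}}

	got := earliestCreationTime(tasks)
	*got = 999 // only changes the copy

	fmt.Println(*got, *tasks[1].creationTime)
}
```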

@arzonus arzonus changed the title from "Reprioritize responses of GetReplicationMessagesResponse" to "Reprioritize responses of GetReplicationMessagesResponse in frontend" Mar 4, 2025
@arzonus arzonus merged commit fafe9b6 into cadence-workflow:master Mar 5, 2025
22 checks passed
@arzonus arzonus deleted the fix-replication-messages-not-fit-in-response branch March 5, 2025 14:06