-
Notifications
You must be signed in to change notification settings - Fork 838
Reprioritize responses of GetReplicationMessagesResponse in frontend #6696
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Reprioritize responses of GetReplicationMessagesResponse in frontend #6696
Conversation
// if earliestCreationTime is equal, it will compare the size of the response | ||
func cmpGetReplicationMessagesWithSize(a, b *getReplicationMessagesWithSize) int { | ||
// a > b | ||
if a == nil || a.earliestCreationTime == nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what's the case when earliestCreationTime is nil?
If we often have this case, then shouldn't we sort randomly instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good question! As I understand from the code, CreationTime
is initially non-pointer, but internally it changed to pointer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So, in general, we do not expect nil-s and its more like precaution?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, we don't expect nil-s there, but want to be sure that a NPE will never happen
if v == nil { | ||
return nil | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is very unusual to check for reciver-ptr == nil. Is there an expected code-path which continues if * GetReplicationMessagesResponse is nil and operates with it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All other methods of the structure do the same check. I assume that the structure can be nil in this case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got it. Makes total sense for consistency.
common/types/replicator.go
Outdated
if earliestTime == nil { | ||
return nil | ||
} | ||
|
||
result := *earliestTime | ||
return &result |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why not just return earliestTime
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's not mandatory, however it's a protection of the variable (that got non directly) from being changed wrongly.
// if earliestCreationTime is equal, it will compare the size of the response | ||
func cmpGetReplicationMessagesWithSize(a, b *getReplicationMessagesWithSize) int { | ||
// a > b | ||
if a == nil || a.earliestCreationTime == nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So, in general, we do not expect nil-s and its more like precaution?
if v == nil { | ||
return nil | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got it. Makes total sense for consistency.
What changed?
In case of a shrunk response to the
GetReplicationMessages
call, the frontend service will give priority to responses from peers with the older creation time of replication tasks in responses from history services.Why?
A replication request comes from a passive side to an active side. A request aggregates requests over multiple shardIDs. Frontend service on the active side calls GetReplicationMessages per each peer (a peer contains multiple shards) to history services. The result is aggregated into one response and sent back to the passive side.
In some cases the result response may exceed max payload size (4MB by default). In this case, frontend service is used to choose randomly some responses. A replication lag could be caused if some child responses took the whole response, but could not be selected for response due to randomized sorting of peer response.
The PR changes the sorting logic and use sorting by CreationTime of replications tasks. Instead of randomized order, frontend will put responses with the older creation time of replication tasks first to the final response, and put with newer time later. It should prevent replication lag for shards which replication tasks are quite big.
The PR will allow to avoidance of manual actions from an on-call engineer to re-run the replication for stuck shards.
How did you test it?
Potential risks
Potentially other shards without big replication tasks may experience some bigger delays due to giving prioritization to older shards.
Release notes
Documentation Changes