Batch DescribeLogGroups calls #1717
Conversation
m.logger.Errorf("failed to describe log group retention for target %v: %v", target, err)
time.Sleep(m.calculateBackoff(attempt))
continue
t := time.NewTicker(5 * time.Second)
I'm sure it's fine, but what's the reasoning behind the ticker vs. timer? Are we anticipating DescribeLogGroups calls minutes after startup?
I think it is possible, but maybe this covers the scenario I'm thinking of:
amazon-cloudwatch-agent/logs/logs.go
Line 174 in 47683ec
func (l *LogAgent) checkRetentionAlreadyAttempted(retention int, logGroup string) int {
Another scenario: my thinking is that it's not too safe to assume this will only need to run once, which is what a one-shot timer implies. The system could be slow to initialize the targets, so having this on a timer could miss those log groups.
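For illustration, here is a minimal sketch of the ticker-driven batching pattern being discussed, under stated assumptions: the `batcher` type, its `describeCh` and `flush` fields, and `processBatches` are hypothetical names, not the agent's actual identifiers. The point is that a ticker keeps flushing the buffer as late-arriving targets trickle in, whereas a one-shot timer would only fire once.

```go
package retention

import "time"

const logGroupIdentifierLimit = 50

// batcher is a stand-in for the retention manager under discussion; the field
// and method names are illustrative, not the agent's actual identifiers.
type batcher struct {
	describeCh chan string
	flush      func(groups []string)
}

// processBatches drains the channel into a buffer and flushes it when the
// buffer reaches the API limit or when the 5-second ticker fires. A one-shot
// timer would stop firing after its first tick, so log groups whose targets
// initialize late would never get described.
func (m *batcher) processBatches() {
	t := time.NewTicker(5 * time.Second)
	defer t.Stop()

	var batch []string
	for {
		select {
		case group, ok := <-m.describeCh:
			if !ok {
				if len(batch) > 0 {
					m.flush(batch)
				}
				return
			}
			batch = append(batch, group)
			if len(batch) >= logGroupIdentifierLimit {
				m.flush(batch)
				batch = nil
			}
		case <-t.C:
			if len(batch) > 0 {
				m.flush(batch)
				batch = nil
			}
		}
	}
}
```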
retentionChannelSize = 100
cacheTTL = 5 * time.Second
logGroupIdentifierLimit = 50
Why 50? I'm guessing this is the API limit?
Yep, it's the API limit :/
Description of the issue
Currently, DescribeLogGroups (DLG) is used to determine the retention policy for a log group, and it is called once per log group. If a customer has a large number of agent deployments and/or many log groups configured, the calls can be throttled.
CloudWatch Logs has updated the DLG operation to allow batching, so the agent should use the updated operation to help mitigate DLG throttling.
Description of changes
The AWS SDK has already been updated.
Modified the existing goroutine that processes the DLG channel.
The updated routine now reads from the DLG channel and stores the entries in a buffer that later becomes the batch.
The batch is processed and reset when it reaches 50 items (the maximum number of log groups DLG accepts) or when a 5-second ticker fires.
Batch processing calls DLG and checks whether the configured retention policy matches the current one. If it does not, the group is placed into the existing PutRetentionPolicy (PRP) channel, and the existing logic then updates the retention policy (see the sketch below).
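A minimal sketch of the flush step described above, under stated assumptions: `retentionTarget`, `describeAPI`, `flushBatch`, and `prpCh` are illustrative names, not the agent's real types, and the concrete describe call would hand the batched identifiers (up to 50 per request) to the updated DescribeLogGroups operation through the AWS SDK.

```go
package retention

// retentionTarget is a hypothetical stand-in for a log group plus its
// configured retention setting.
type retentionTarget struct {
	group     string
	retention int64 // configured retention, in days
}

// describeAPI abstracts the batched DescribeLogGroups call and returns the
// current retention setting per log group.
type describeAPI interface {
	currentRetention(groups []string) (map[string]int64, error)
}

// flushBatch describes the buffered log groups in a single call, then forwards
// every group whose configured retention differs from the returned value onto
// the existing PutRetentionPolicy channel for the unchanged update logic.
func flushBatch(api describeAPI, batch []retentionTarget, prpCh chan<- retentionTarget) error {
	groups := make([]string, 0, len(batch))
	for _, t := range batch {
		groups = append(groups, t.group)
	}
	current, err := api.currentRetention(groups)
	if err != nil {
		return err
	}
	for _, t := range batch {
		if got, ok := current[t.group]; !ok || got != t.retention {
			prpCh <- t
		}
	}
	return nil
}
```

Keeping the PRP channel untouched means the batching only changes the describe side; retention updates still flow through the existing path.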
License
By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.
Tests
Requirements
Before committing the code, please do the following steps:
make fmt and make fmt-sh
make lint