Skip to content

Batch DescribeLogGroups calls #1717

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 17 commits into from
Jun 13, 2025
Merged

Batch DescribeLogGroups calls #1717

merged 17 commits into from
Jun 13, 2025

Conversation

duhminick
Copy link
Contributor

@duhminick duhminick commented Jun 4, 2025

Description of the issue

Currently, DescribeLogGroups (DLG) is used to determine the retention policy for a log group. This is done per log group. If a customer has a high number of agent deployments and/or a lot of log groups configured, throttling can be experienced.

Cloudwatch Logs has updated the DLG operation to allow for batching. As such, the agent should use the updated operation to help mitigate DLG throttling.

Description of changes

  • The AWS SDK has already been updated.

  • Modified the already existing go routine that processes the DLG channel

  • The updated routine will now read from the DLG channel then store it into a buffer that will later be the batch.

  • The batch will be processed and reset when a 50 item (max number of log groups that DLG can accept) limit is reached or when a 5 second timer ticks

    • The timer is an unfortunate necessity since at startup, the agent is not aware if the log groups are already ready. So we cannot rely on the agent configuration
  • The batch processing calls DLG then checks to see if the configured retention policy does not match. If it does not match, then the group is placed into the already existing PutRetentionPolicy (PRP) channel. The existing logic will then update the retention policy.

License

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

Tests

  1. Unit tests
  2. Integration tests

Requirements

Before commit the code, please do the following steps.

  1. Run make fmt and make fmt-sh
  2. Run make lint

@duhminick duhminick force-pushed the dominic-batch-dlg-pr branch from aa7b3c2 to fd94cc0 Compare June 4, 2025 16:36
@duhminick duhminick added the ready for testing Indicates this PR is ready for integration tests to run label Jun 4, 2025
@duhminick duhminick marked this pull request as ready for review June 4, 2025 20:42
@duhminick duhminick requested a review from a team as a code owner June 4, 2025 20:42
@duhminick duhminick force-pushed the dominic-batch-dlg-pr branch from 4ab4530 to 2be444c Compare June 12, 2025 23:26
@duhminick duhminick force-pushed the dominic-batch-dlg-pr branch from 962dd0b to c3019aa Compare June 13, 2025 15:24
m.logger.Errorf("failed to describe log group retention for target %v: %v", target, err)
time.Sleep(m.calculateBackoff(attempt))
continue
t := time.NewTicker(5 * time.Second)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm sure it's fine, but what's the reasoning behind the ticker vs timer. Are we anticipatingDescribeLogGroup calls minutes after start up?

Copy link
Contributor Author

@duhminick duhminick Jun 13, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is possible, but maybe this covers the scenario I'm thinking of:

func (l *LogAgent) checkRetentionAlreadyAttempted(retention int, logGroup string) int {

Another scenario - my thinking is that it's not too safe to assume that it will only need to be called once with the timer. The system could be slow to initialize the targets so having this on a timer could potentially miss those log groups

retentionChannelSize = 100
retentionChannelSize = 100
cacheTTL = 5 * time.Second
logGroupIdentifierLimit = 50
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why 50? I'm guessing this is the api limi?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, it's the API limit :/

@sky333999 sky333999 merged commit 89fed6f into main Jun 13, 2025
1 check passed
@sky333999 sky333999 deleted the dominic-batch-dlg-pr branch June 13, 2025 21:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ready for testing Indicates this PR is ready for integration tests to run
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants