CI Caching #664

Merged
merged 12 commits into from Dec 19, 2023

Conversation

@JustinStitt (Contributor) commented Dec 2, 2023

This PR aims to implement a caching system to reduce the total number of Tuxsuite build minutes used. A historical audit of our workflows showed that 17% of builds were duplicated and may have been cacheable.

Initially, it seemed a dispatcher architecture like the one KernelCI uses was the right idea. However, there are some problems with that approach:

  • Choosing a dispatch frequency that properly suits all jobs in the matrix is not feasible. Say we choose "once a day": some workflows aren't worth running that frequently (even if some piece of their state has changed -- toolchain or kernel sha). We want more control over frequency per job.

  • There may be a bandwidth issue with both GitHub and Tuxsuite. If the dispatcher has a lot of cache misses, it will spin up loads of jobs all at once because, again, we're missing control over job frequency. A large number of queued jobs may cause more jobs to sporadically error out on Tuxsuite's end and may even cause delays on GitHub's end, as only so many workflows can run at once.

  • It's a bit more difficult to propagate job statuses back to the dispatcher, as we have to wait until Tuxsuite does its thing. So the dispatcher has to periodically check back in the hope of getting a status update. With a self-contained workflow, by contrast, everything runs linearly and job status is easier to manage in the cache.

To remedy all of these problems, a different approach is required. Looking back at #522, the main problem was the sheer number of branches that we had to spin up to use as our caching system. However, with the introduction of GitHub Repository Variables we have a sleek key:value store that is shared across workflows. There is now no need to create a bunch of branches or a separate repo.
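
For illustration, here is a minimal sketch of how a Repository Variable can serve as a small key:value store through the GitHub REST API. The requests-based helpers and function names below are assumptions for this example, not the actual code in this PR:

import json
import requests

API = "https://api.github.com"

def _headers(token):
    return {"Authorization": f"Bearer {token}",
            "Accept": "application/vnd.github+json"}

def get_cache(owner, repo, name, token):
    """Read one repository variable; returns None if it does not exist yet."""
    r = requests.get(f"{API}/repos/{owner}/{repo}/actions/variables/{name}",
                     headers=_headers(token))
    if r.status_code == 404:
        return None
    r.raise_for_status()
    # The variable's value is a plain string, so the cache entry is stored as JSON.
    return json.loads(r.json()["value"])

def set_cache(owner, repo, name, entry, token):
    """Create or update a repository variable holding a small JSON blob."""
    body = {"name": name, "value": json.dumps(entry)}
    r = requests.patch(f"{API}/repos/{owner}/{repo}/actions/variables/{name}",
                       headers=_headers(token), json=body)
    if r.status_code == 404:
        # Variable does not exist yet, so create it instead.
        r = requests.post(f"{API}/repos/{owner}/{repo}/actions/variables",
                          headers=_headers(token), json=body)
    r.raise_for_status()

A cache entry such as {"linux_sha": ..., "clang_version": ..., "build_status": "presuite"} can then live entirely in one variable per workflow.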

This PR is inspired by #522 but adds an interface for these Repository Variables and makes some other opinionated changes.

For starters, I implemented this with an understanding of what Nathan was doing in #522, but without following each implementation detail. Because of this, there are many differences, and this PR deserves its own round of code review.

A current CI workflow run diagram may look like this:
(diagram)

A cache-implemented CI workflow run diagram will look something like this:
(diagram)

Notice how future jobs neither pass nor fail but rather are stopped upon a cache hit plus validation. This makes it easy to see when caching stepped in.

There is now one lightweight step before Tuxsuite does its thing. We spin up a container and run check_cache.py. This script's whole purpose in life is to determine whether or not a Tuxsuite job should run. Here are the criteria (a minimal sketch of the decision logic follows this list):

(lingo: a cache hit means the cached linux sha and compiler version match the current ones)

  • If we get a cache hit, we check the status against some allowed statuses like pass or fail. These statuses can only be set by a completed Tuxsuite job and won't be set by random Tuxsuite timeouts or failures; in those cases, a status of presuite is assigned. If the status of our cache hit is not explicitly pass or fail, we continue on to run the whole Tuxsuite plan. This prevents us from caching random one-off failures.

  • If we get a cache hit and the cached status is "cache-able" (pass or fail), then we do not run the Tuxsuite plan and instead return an exit code so that GitHub knows what color to set in the matrix. To be clear: if a build has a cache hit and the cached status was pass, the workflow will not run the Tuxsuite plan, but it will still show a green square in our matrix. Likewise for red and fail. This means that even with caching, our matrix will still be colored correctly.

  • If we get a cache miss, we update our cache with the most recent versions and shas and assign a presuite status. The Tuxsuite plan continues as normal. After the Tuxsuite plan finishes, some tooling in check_logs.py kicks in and analyzes status.json to update our cached status for this particular workflow. This ensures that future runs of this workflow have accurate statuses to validate their cache against.

  • Now, if there is not yet a cache entry for a workflow run (it has yet to be run under this new system), then a cache entry will be created for it and the run will continue as normal (as described above).

check_cache.py can fail and will fail (sometimes due to GitHub rate-limiting or cosmic bit flips). However, all the workflow logic is designed to simply ignore its failures and continue on as if we had a cache miss. This entirely eliminates the false positives we might otherwise have seen; at worst, we lower our cache efficiency -- a far better outcome.
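
To make the flow above concrete, here is a minimal sketch of the decision logic in check_cache.py. The exit-code values and the hard-coded sample cache entry are assumptions for illustration only; the real script reads its cache entry from a Repository Variable, and its exit codes are whatever the workflow expects:

import sys

# Hypothetical exit codes; the workflow maps these back to matrix colors.
EXIT_RUN_PLAN, EXIT_CACHED_PASS, EXIT_CACHED_FAIL = 0, 166, 167

CACHEABLE_STATUSES = ("pass", "fail")

def decide(cache, linux_sha, clang_version):
    """Apply the criteria listed above and return one of the exit codes."""
    if cache is None:
        return EXIT_RUN_PLAN   # no entry yet: seed the cache and run as normal
    hit = (cache.get("linux_sha") == linux_sha
           and cache.get("clang_version") == clang_version)
    if not hit:
        return EXIT_RUN_PLAN   # cache miss: refresh the entry with status "presuite"
    status = cache.get("build_status")
    if status == "pass":
        return EXIT_CACHED_PASS    # skip Tuxsuite, still report a green square
    if status == "fail":
        return EXIT_CACHED_FAIL    # skip Tuxsuite, still report a red square
    return EXIT_RUN_PLAN       # "presuite" or anything else: don't trust the hit

if __name__ == "__main__":
    try:
        # Stand-in for reading the cache entry from a Repository Variable.
        cache = {"linux_sha": "abc123", "clang_version": "17.0.6",
                 "build_status": "pass"}
        code = decide(cache, "abc123", "17.0.6")
    except Exception:
        # Any failure in the caching layer is treated as a cache miss so CI
        # still runs; we only lose cache efficiency, never correctness.
        code = EXIT_RUN_PLAN
    sys.exit(code)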

Questions you may have:

  • What is REPO_SCOPED_PAT?

It's a GitHub token with the minimal permission scope needed to manage Repository Variables.

  • What about BOOTing? Can the CI cache BOOT failures and successes?

I think so. I implemented it, but we need a full end-to-end test on main.

check_cache.py Outdated
"linux_sha": linux_sha,
"clang_version": clang_version,
"build_status": "presuite",
}
Member

do we need to store any info about the patches, so that we can detect when the patches to be applied change?

Contributor Author

Yes, totally. I considered this but didn't want to add more layers before testing the core functionality. At the very least, there is a workflow_dispatch backdoor, which means that when a patchset changes we can manually dispatch a new run and caching will step out of the way.

tl;dr this feature is wip

Member

A simple canary for patches would be getting the hash of the series file if there is one present, as that will have to be updated any time that we change patch files. I wouldn't be as concerned with updates to the patch files themselves; those are usually for metadata changes, which shouldn't affect the final result.
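
For illustration, such a canary could be as small as the sketch below. The patches directory layout and the quilt-style series file name are assumptions based on this comment, not verified details of the repository:

import hashlib
from pathlib import Path

def series_hash(patches_dir):
    """Hash the quilt-style series file so the cache key changes whenever
    the set of patches to apply changes (hypothetical helper)."""
    series = Path(patches_dir) / "series"
    if not series.is_file():
        return None  # no patches for this tree, nothing to track
    return hashlib.sha256(series.read_bytes()).hexdigest()

The resulting hash could then be stored alongside linux_sha and clang_version in the cache entry and compared on every run.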

Signed-off-by: Justin Stitt <[email protected]>
@nathanchance (Member)

Justin and I chatted in person about this implementation today and I think it is looking solid so far. However, one major issue we realized with the current implementation is that calling the GitHub API in check_logs.py can potentially run into both

  • rate limit issues, although I think if we supply our own token rather than using GitHub's, which is what is being done in this PR through REPO_SCOPED_PAT, we get 5000 requests per hour, which should be fine.
  • synchronization issues, as one build may update the build_status variable from pass to fail and then another might undo that.

To fix both of these issues, I think we should have a step between the tuxsuite_ jobs that opens all the builds.json files generated by tuxsuite, iterates through all builds, and updates the status atomically for the whole workflow (if any build failed, it is a fail, otherwise it is a pass). It is probably worth populating the cache at this point with all three variables (linux revision, clang version, and build result), rather than doing it in check_cache.py, to be slightly more accurate. While removing the caching from check_logs.py means that we won't be able to cache boot results, I think that is fine since it simplifies the implementation. In general, I re-run jobs that have spurious boot failures and they do not happen all too often anyways.
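
For illustration, the aggregation step described here could look something like the sketch below. The builds.json layout assumed (a top-level "builds" mapping with a per-build "result" field) and the file names in the usage comment are assumptions for this example, not a verified description of tuxsuite's output:

import json
from pathlib import Path

def aggregate_build_status(builds_json_paths):
    """Collapse every builds.json from the tuxsuite_ jobs into one
    workflow-level status: "fail" if any build failed, otherwise "pass"."""
    for path in builds_json_paths:
        data = json.loads(Path(path).read_text())
        for build in data.get("builds", {}).values():
            if build.get("result") != "pass":
                return "fail"
    return "pass"

# One step after the tuxsuite_ jobs writes this single value into the
# repository-variable cache, avoiding per-build API calls and write races.
# status = aggregate_build_status(["builds-defconfigs.json", "builds-allconfigs.json"])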

Signed-off-by: Justin Stitt <[email protected]>
Signed-off-by: Justin Stitt <[email protected]>
@JustinStitt (Contributor Author)

Added a new batch of changes to address some issues brought to my attention after meeting with @nathanchance. Now, each kick_tuxsuite_foo job will have an additional step where it updates the cached build status for the workflow. This is preferable to relying on each individual check_logs.py job to update the cache, since that runs into rate-limiting and concurrency issues. With the new approach there is absolutely no way in heck that all three kick_tuxsuite_foo jobs end at the same time and write garbage to the cache.

As always, if any part of the caching infra fails it should step out of the way and let the Tuxsuite plan and GitHub action continue as normal. This caching system is simply an add-on and shouldn't drastically modify the existing behavior that you would expect.

@JustinStitt (Contributor Author)

Side note: I'll install ruff locally and address all these linting concerns right before we merge -- but while things are still in motion I'm not going to constantly try to appease the linter gods.

@JustinStitt (Contributor Author) commented Dec 6, 2023

And here's a diagram:

(diagram)
Not shown here is that workflow_dispatch operates as a backdoor straight to the Tuxsuite plan, and that technically a "Cache Miss" also encompasses a "Cache Hit" that did not have a cacheable build_status like pass or fail.

@nathanchance (Member)

Yeah, don't bother with the linting stuff until we settle on everything. As a matter of fact, could you remove all but one of the workflow changes, so that we can still see what the workflow changes will look like without getting drowned by them when reviewing the whole PR?

I won't be able to fully review this as I am going to be out the next two days plus the weekend, but I will try to make it a priority to take a closer look on Monday so we can talk through any major issues or concerns on Tuesday. That diagram is massively helpful for following what is going on here, thanks for making it!

@JustinStitt (Contributor Author)

... As a matter of fact, could you remove all the workflow changes but just one so that we can still see what the workflow changes will look like but not get drowned by them when reviewing the whole PR?

Sure, I could do this.

But have you seen the new file tree in the GitHub review tool? If you click "files changed" and then look on the left, you should be able to expand a file tree. Doing this, you can hide the hundreds of .yml changes and get right to what matters.

If you knew about this already and still want me to undo all but one of the .yml changes let me know 👍

@nathanchance (Member) left a comment

Some superficial comments below. I like the overall structure of this more than the first iteration. I am going to think overnight and tomorrow morning about various edge cases, but I think it is potentially worth shipping this and seeing what explodes :) Great work so far!

Comment on lines 108 to 111
"run": "python check_cache.py -w '${{github.workflow}}' "
"-g ${{secrets.REPO_SCOPED_PAT}} "
"-r ${{env.GIT_REF}} "
"-o ${{env.GIT_REPO}}",
Member

Consider adding spaces between the ${{/}} and the name of the variable to match the style of the rest of the workflow.
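
For reference, a sketch of the same entry with that spacing applied, assuming the snippet above is a string entry in the Python workflow generator (only the whitespace inside the expressions changes):

# Hypothetical reformatted step entry with spaces inside the ${{ }} expressions.
step = {
    "run": "python check_cache.py -w '${{ github.workflow }}' "
           "-g ${{ secrets.REPO_SCOPED_PAT }} "
           "-r ${{ env.GIT_REF }} "
           "-o ${{ env.GIT_REPO }}",
}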

@JustinStitt (Contributor Author)

The linter gods have been appeased.

@nathanchance (Member) left a comment

I think this is in a good enough spot to merge now and follow up with issues and individual changes.

My plan is to squash and merge, change one job's cron string manually, then push to main to start a cycle with this added in, to try to catch any major issues before a bunch of other jobs kick off.

If there are any issues, we can either forward fix or revert, depending on the nature of the issue.

@nathanchance merged commit 277ae6d into main on Dec 19, 2023
@nathanchance deleted the ci-caching branch on Dec 19, 2023 at 22:35
@nathanchance mentioned this pull request on Dec 20, 2023