CI Caching #664
Conversation
check_cache.py (outdated):
"linux_sha": linux_sha, | ||
"clang_version": clang_version, | ||
"build_status": "presuite", | ||
} |
do we need to store any info about the patches, so that we can detect when the patches to be applied change?
Yes totally. I considered this but didn't want to add more layers before testing the core functionality. At the very least, there is a workflow_dispatch backdoor that means when a patchset is changed we can manually dispatch a new run and caching will step out of the way.
tl;dr this feature is wip
A simple canary for patches would be getting the hash of the `series` file, if there is one present, as that will have to be updated any time that we change the patch files. I wouldn't be as concerned with updates to the patch files themselves; those are usually metadata updates, which shouldn't affect the final result.
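A minimal sketch of that canary, assuming a `series` file living alongside the patches and a hypothetical `patches_hash` key in the cache entry (neither is part of the actual implementation):

```python
# Sketch only: hash the patch series file (if present) so that a changed
# patchset invalidates the cache. The directory, key name, and placeholder
# values are illustrative, not the real check_cache.py code.
import hashlib
from pathlib import Path


def series_hash(patches_dir: str) -> str | None:
    """Return the sha256 of the series file, or None if there is none."""
    series = Path(patches_dir) / "series"
    if not series.is_file():
        return None
    return hashlib.sha256(series.read_bytes()).hexdigest()


cache_entry = {
    "linux_sha": "deadbeefcafe",              # placeholder for the real linux_sha
    "clang_version": "17.0.0",                # placeholder for the real clang_version
    "patches_hash": series_hash("patches"),   # hypothetical key and path
    "build_status": "presuite",
}
```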
Signed-off-by: Justin Stitt <[email protected]>
Justin and I chatted in person about this implementation today and I think it is looking solid so far. However, one major issue that we realized with the current implementation is calling the GitHub API in
To fix both of these issues, I think we should have a step between the
Signed-off-by: Justin Stitt <[email protected]>
Signed-off-by: Justin Stitt <[email protected]>
Added a new batch of changes to address some issues brought to my attention after meeting with @nathanchance. As always, if any part of the caching infra fails, it should step out of the way and let the Tuxsuite plan and GitHub action continue as normal. This caching system is simply an add-on and shouldn't drastically modify the existing behavior that you would expect.
Side note: I'll install
Yeah don't bother with the linting stuff until we settle on everything. As a matter of fact, could you remove all the workflow changes but just one so that we can still see what the workflow changes will look like but not get drowned by them when reviewing the whole PR? I won't be able to fully review this as I am going to be out the next two days plus the weekend, but I will try to make it a priority to take a closer look on Monday so we can talk through any major issues or concerns on Tuesday. That diagram is massively helpful for following what is going on here, thanks for making it!
Sure, I could do this. But have you seen the new file tree in the GitHub review tool? If you click "files changed" then look on the left, you should be able to expand a file tree. Doing this, you can hide the hundreds of .yml changes and get right to what matters. If you knew about this already and still want me to undo all but one of the .yml changes, let me know 👍
Some superficial comments below. I like the overall structure of this more than the first iteration. I am going to think overnight and tomorrow morning about various edge cases, but I think it is potentially worth shipping this and seeing what explodes :) Great work so far!
generate_workflow.py (outdated):
"run": "python check_cache.py -w '${{github.workflow}}' " | ||
"-g ${{secrets.REPO_SCOPED_PAT}} " | ||
"-r ${{env.GIT_REF}} " | ||
"-o ${{env.GIT_REPO}}", |
Consider adding spaces between the `${{` / `}}` and the name of the variable to match the style of the rest of the workflow.
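For illustration only, the spaced style being suggested would look like this in the generated step dict (same command, purely a formatting change):

```python
# Same run command with spaces inside the expression braces, matching the
# style used elsewhere in the generated workflows. Illustrative sketch only.
step = {
    "run": "python check_cache.py -w '${{ github.workflow }}' "
           "-g ${{ secrets.REPO_SCOPED_PAT }} "
           "-r ${{ env.GIT_REF }} "
           "-o ${{ env.GIT_REPO }}",
}
```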
The linter gods have been appeased.
I think this is in a good enough spot to merge now and follow up with issues and individual changes.
My plan is to squash and merge, change one job's cron string manually, then push to `main` to start a cycle with this added in, to try and catch any major issues before a bunch of other jobs kick off.
If there are any issues, we can either forward fix or revert, depending on the nature of the issue.
This PR aims to implement a caching system to reduce the total amount of Tuxsuite build minutes used. A previous historical audit of our workflows showed that 17% of builds were duplicated and may have been cacheable.

Initially, it seemed a dispatcher architecture like the one KernelCI uses was the right idea. However, there are some problems with that approach:
1. Choosing a dispatch frequency that properly suits all jobs in the matrix is not feasible. Say we choose "once a day": some workflows aren't worth running that frequently (even if some piece of their state has changed -- toolchain or kernel SHA). We want more control over frequency per job.
2. There may be a bandwidth issue with both GitHub and Tuxsuite. If the dispatcher has a lot of cache misses, it will spin up loads of jobs all at once because, again, we're missing control over job frequency. A large number of queued jobs may cause more jobs to sporadically error out on Tuxsuite's end and may even cause delays on GitHub's end, as only so many workflows can run at once.
3. It's a bit more difficult to propagate job statuses back to the dispatcher, as we have to wait until Tuxsuite does its thing. So the dispatcher has to periodically check back in the hope that it gets a status update. With a self-contained workflow, however, everything runs linearly and job status is easier to manage in the cache.
To remedy all of these problems, a different approach is required. Looking back at #522, the main problem was the sheer number of branches that we had to spin up to use as our caching system. However, with the introduction of GitHub Repository Variables, we have a sleek key:value store that is shared across workflows. There is now no need to create a bunch of branches or a separate repo.
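For context, Repository Variables are exposed through GitHub's REST API, so a cache entry can be stored as a JSON blob in a single variable. The sketch below shows one way that could look; the helper names, variable layout, and error handling are assumptions for illustration, not necessarily what `check_cache.py` does:

```python
# Sketch: using a GitHub repository variable as a key:value cache store via
# the REST API. Helper names and the JSON layout are illustrative.
import json

import requests

API = "https://api.github.com"


def _headers(token: str) -> dict:
    return {
        "Authorization": f"Bearer {token}",
        "Accept": "application/vnd.github+json",
    }


def get_cache_entry(owner: str, repo: str, name: str, token: str) -> dict | None:
    """Read the cache entry stored in a repository variable, if any."""
    r = requests.get(f"{API}/repos/{owner}/{repo}/actions/variables/{name}",
                     headers=_headers(token))
    if r.status_code == 404:
        return None  # no cache entry yet for this workflow
    r.raise_for_status()
    return json.loads(r.json()["value"])


def set_cache_entry(owner: str, repo: str, name: str, entry: dict, token: str) -> None:
    """Create or update the repository variable holding the cache entry."""
    body = {"name": name, "value": json.dumps(entry)}
    # PATCH updates an existing variable; fall back to POST to create it.
    r = requests.patch(f"{API}/repos/{owner}/{repo}/actions/variables/{name}",
                       headers=_headers(token), json=body)
    if r.status_code == 404:
        r = requests.post(f"{API}/repos/{owner}/{repo}/actions/variables",
                          headers=_headers(token), json=body)
    r.raise_for_status()
```

Writing repository variables needs a token with write access to the repo's Actions settings, which is presumably where the `REPO_SCOPED_PAT` secret mentioned later comes in.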
This PR is inspired by #522 but with an interface to use these Repository Variables and some other opinionated changes.
For starters, I implemented this with an understanding of what Nathan was doing in #522, but without following each implementation detail. Because of this, there are many differences and this PR deserves its own round of code review.
A current CI workflow run diagram may look like this:

[diagram: CI]

A cache-implemented CI workflow run diagram will look something like this:

[diagram: CI caching]
Notice how future jobs neither pass nor fail but rather are stopped upon cache hit + validation. This makes it easy to see when caching stepped in.
There is now one lightweight step before Tuxsuite does its thing. We spin up a container and run `check_cache.py`. This script's whole purpose in life is to determine whether or not a Tuxsuite job should run. Here are the criteria (lingo: a cache hit means the cached linux SHA and compiler version match the current ones); a sketch of the decision logic follows the list.
- If we cache hit, we check the status against some allowed statuses like `pass` or `fail`. These statuses can only be set by a completed Tuxsuite job and won't be set by random Tuxsuite timeouts or failures; for those cases, a status of `presuite` is assigned. If our cache hit is not explicitly `pass` or `fail`, we continue on to run the whole Tuxsuite plan. This prevents us from caching random one-off failures.
- If we cache hit and it is a "cache-able" status like `pass` or `fail`, then we do not run the Tuxsuite plan and instead return an exit code such that GitHub knows what color to set the matrix. To be clear, if a build has a cache hit and the cached status was `pass`, the workflow will not run the Tuxsuite plan but it will still return a green square for our matrix. Likewise for red and `fail`. This means that even with caching, our matrix will still be colored correctly.
- If we cache miss, we update our cache to the most recent versions and SHAs and assign a `presuite` status. The Tuxsuite plan continues as normal. After the Tuxsuite plan finishes, some tooling in `check_logs.py` kicks in and analyzes `status.json` to properly update our cached status for this particular workflow. This ensures that future runs of this workflow have accurate statuses to validate their cache against.
- If there is not yet a cache entry for a workflow (it has yet to run under this new system), a cache entry will be created for it and the run continues as normal (as described above).
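Putting those criteria together, the decision logic looks roughly like the sketch below. The exit-code values and function shape are assumptions for illustration; only the statuses and the hit/miss rules come from the description above:

```python
# Rough sketch of the cache decision described above. The exit-code values
# and helper signature are illustrative, not the real check_cache.py interface.
EXIT_RUN_TUXSUITE = 0   # cache miss or non-cacheable status: run the full plan
EXIT_CACHED_PASS = 100  # cache hit with "pass": skip the plan, report green
EXIT_CACHED_FAIL = 101  # cache hit with "fail": skip the plan, report red

CACHEABLE_STATUSES = {"pass", "fail"}


def decide(cached: dict | None, linux_sha: str, clang_version: str) -> int:
    if cached is None:
        # No entry yet: one gets created with a "presuite" status and the
        # Tuxsuite plan runs as normal.
        return EXIT_RUN_TUXSUITE

    hit = (cached.get("linux_sha") == linux_sha
           and cached.get("clang_version") == clang_version)
    if not hit:
        # Cache miss: refresh the entry to the current SHA/version, mark it
        # "presuite", and run the plan.
        return EXIT_RUN_TUXSUITE

    if cached.get("build_status") not in CACHEABLE_STATUSES:
        # Hit, but the previous run never completed cleanly ("presuite"):
        # don't trust it, run the plan again.
        return EXIT_RUN_TUXSUITE

    # Hit with a completed status: skip Tuxsuite and color the matrix square.
    return EXIT_CACHED_PASS if cached["build_status"] == "pass" else EXIT_CACHED_FAIL
```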
`check_cache.py` can fail and will fail (sometimes due to GitHub rate-limiting or cosmic bit flips). However, all the workflow logic is designed to just ignore its failures and continue on as if we had a cache miss. This entirely eliminates false positives that we may have seen; we are just lowering our cache efficiency, which is a far better outcome.
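One possible way to wire up that failure tolerance in the generated workflow (using a step output rather than the exit-code convention sketched above) is shown below; the step names, `continue-on-error` usage, and `cache_hit` output are assumptions, not necessarily how generate_workflow.py does it:

```python
# Hypothetical step definitions in the dict style generate_workflow.py uses.
check_cache_step = {
    "name": "Check build cache",
    "id": "check_cache",
    # A failure here (rate limit, flake, ...) must never fail the job.
    "continue-on-error": True,
    "run": "python check_cache.py ...",  # abbreviated; full flags shown earlier
}

tuxsuite_step = {
    "name": "Run Tuxsuite plan",
    # Only skip the plan when the cache check both succeeded and reported a
    # hit; any error in check_cache.py falls through to a normal run.
    "if": "steps.check_cache.outcome != 'success' || "
          "steps.check_cache.outputs.cache_hit != 'true'",
    "run": "tuxsuite plan ...",  # placeholder for the real invocation
}
```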
Questions you may have:

- `REPO_SCOPED_PAT`?