CI Caching #664

Merged
merged 12 commits into from Dec 19, 2023

Conversation

@JustinStitt (Contributor) commented Dec 2, 2023

This PR aims to implement a caching system to reduce the total number of Tuxsuite build minutes used. A historical audit of our workflows showed that 17% of builds were duplicated and may have been cacheable.

Initially, it seemed a dispatcher architecture like the one KernelCI uses was the right idea. However, there are some problems with that approach:

  • Choosing a dispatch frequency that properly suits all jobs in the matrix is not feasible. Say we choose "once a day": some workflows aren't worth running that frequently (even if some piece of their state has changed -- toolchain or kernel sha). We want more control over frequency per job.

  • There may be a bandwidth issue with both GitHub and Tuxsuite. If the dispatcher has a lot of cache misses, it will spin up loads of jobs all at once because, again, we're missing control over job frequency. A large number of queued jobs may cause more jobs to sporadically error out on Tuxsuite's end and may even cause delays on GitHub's end, as only so many workflows can run at once.

  • It's a bit more difficult to propagate job statuses back to the dispatcher, as we have to wait until Tuxsuite does its thing. So the dispatcher has to periodically check back in the hope of getting a status update. With a self-contained workflow, by contrast, everything runs linearly and job status is easier to manage in the cache.

To remedy all of these problems, a different approach is required. Looking back at #522, the main problem was the sheer number of branches that we had to spin up to use as our caching system. However, with the introduction of GitHub Repository Variables we have a sleek key:value store that is shared across workflows. There is now no need to create a bunch of branches or a separate repo.
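
For illustration, here is a minimal sketch of how a Repository Variable can serve as a small key:value store through the GitHub REST API. The requests-based helpers and function names below are assumptions for this example, not the actual code in this PR:

import json
import requests

API = "https://api.github.com"

def _headers(token):
    return {"Authorization": f"Bearer {token}",
            "Accept": "application/vnd.github+json"}

def get_cache(owner, repo, name, token):
    """Read one repository variable; returns None if it does not exist yet."""
    r = requests.get(f"{API}/repos/{owner}/{repo}/actions/variables/{name}",
                     headers=_headers(token))
    if r.status_code == 404:
        return None
    r.raise_for_status()
    # The variable's value is a plain string, so the cache entry is stored as JSON.
    return json.loads(r.json()["value"])

def set_cache(owner, repo, name, entry, token):
    """Create or update a repository variable holding a small JSON blob."""
    body = {"name": name, "value": json.dumps(entry)}
    r = requests.patch(f"{API}/repos/{owner}/{repo}/actions/variables/{name}",
                       headers=_headers(token), json=body)
    if r.status_code == 404:
        # Variable does not exist yet, so create it instead.
        r = requests.post(f"{API}/repos/{owner}/{repo}/actions/variables",
                          headers=_headers(token), json=body)
    r.raise_for_status()

A cache entry such as {"linux_sha": ..., "clang_version": ..., "build_status": "presuite"} can then live entirely in one variable per workflow.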

This PR is inspired by #522 but adds an interface for these Repository Variables and makes some other opinionated changes.

For starters, I implemented this with an understanding of what Nathan was doing in #522, but without following each implementation detail. Because of this, there are many differences, and this PR deserves its own round of code review.

A current CI workflow run diagram may look like this:
(diagram)

A cache-implemented CI workflow run diagram will look something like this:
(diagram)

Notice how future jobs neither pass nor fail but rather are stopped upon a cache hit plus validation. This makes it easy to see when caching stepped in.

There is now one lightweight step before Tuxsuite does its thing. We spin up a container and run check_cache.py. This script's whole purpose in life is to determine whether or not a Tuxsuite job should run. Here are the criteria (a minimal sketch of the decision logic follows this list):

(lingo: a cache hit means the cached linux sha and compiler version match the current ones)

  • If we get a cache hit, we check the status against some allowed statuses like pass or fail. These statuses can only be set by a completed Tuxsuite job and won't be set by random Tuxsuite timeouts or failures; in those cases, a status of presuite is assigned. If the status of our cache hit is not explicitly pass or fail, we continue on to run the whole Tuxsuite plan. This prevents us from caching random one-off failures.

  • If we get a cache hit and the cached status is "cache-able" (pass or fail), then we do not run the Tuxsuite plan and instead return an exit code so that GitHub knows what color to set in the matrix. To be clear: if a build has a cache hit and the cached status was pass, the workflow will not run the Tuxsuite plan, but it will still show a green square in our matrix. Likewise for red and fail. This means that even with caching, our matrix will still be colored correctly.

  • If we get a cache miss, we update our cache with the most recent versions and shas and assign a presuite status. The Tuxsuite plan continues as normal. After the Tuxsuite plan finishes, some tooling in check_logs.py kicks in and analyzes status.json to update our cached status for this particular workflow. This ensures that future runs of this workflow have accurate statuses to validate their cache against.

  • Now, if there is not yet a cache entry for a workflow run (it has yet to be run under this new system), then a cache entry will be created for it and the run will continue as normal (as described above).

check_cache.py can fail and will fail (sometimes due to GitHub rate-limiting or cosmic bit flips). However, all the workflow logic is designed to simply ignore its failures and continue on as if we had a cache miss. This entirely eliminates the false positives we might otherwise have seen; at worst, we lower our cache efficiency -- a far better outcome.
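
To make the flow above concrete, here is a minimal sketch of the decision logic in check_cache.py. The exit-code values and the hard-coded sample cache entry are assumptions for illustration only; the real script reads its cache entry from a Repository Variable, and its exit codes are whatever the workflow expects:

import sys

# Hypothetical exit codes; the workflow maps these back to matrix colors.
EXIT_RUN_PLAN, EXIT_CACHED_PASS, EXIT_CACHED_FAIL = 0, 166, 167

CACHEABLE_STATUSES = ("pass", "fail")

def decide(cache, linux_sha, clang_version):
    """Apply the criteria listed above and return one of the exit codes."""
    if cache is None:
        return EXIT_RUN_PLAN   # no entry yet: seed the cache and run as normal
    hit = (cache.get("linux_sha") == linux_sha
           and cache.get("clang_version") == clang_version)
    if not hit:
        return EXIT_RUN_PLAN   # cache miss: refresh the entry with status "presuite"
    status = cache.get("build_status")
    if status == "pass":
        return EXIT_CACHED_PASS    # skip Tuxsuite, still report a green square
    if status == "fail":
        return EXIT_CACHED_FAIL    # skip Tuxsuite, still report a red square
    return EXIT_RUN_PLAN       # "presuite" or anything else: don't trust the hit

if __name__ == "__main__":
    try:
        # Stand-in for reading the cache entry from a Repository Variable.
        cache = {"linux_sha": "abc123", "clang_version": "17.0.6",
                 "build_status": "pass"}
        code = decide(cache, "abc123", "17.0.6")
    except Exception:
        # Any failure in the caching layer is treated as a cache miss so CI
        # still runs; we only lose cache efficiency, never correctness.
        code = EXIT_RUN_PLAN
    sys.exit(code)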

Questions you may have:

  • What is REPO_SCOPED_PAT?

It's a GitHub token with the minimal permission scope needed to manage Repository Variables.

  • What about BOOTing? Can the CI cache BOOT failures and successes?

I think so. I implemented it, but we need a full end-to-end test on main.

check_cache.py Outdated
"linux_sha": linux_sha,
"clang_version": clang_version,
"build_status": "presuite",
}
Member

do we need to store any info about the patches, so that we can detect when the patches to be applied change?

Contributor Author

Yes, totally. I considered this but didn't want to add more layers before testing the core functionality. At the very least, there is a workflow_dispatch backdoor, which means that when a patchset changes we can manually dispatch a new run and caching will step out of the way.

tl;dr this feature is wip

Member

A simple canary for patches would be getting the hash of the series file if there is one present, as that will have to be updated any time that we change patch files. I wouldn't be as concerned with updates to the patch files themselves; those are usually for metadata changes, which shouldn't affect the final result.
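
For illustration, such a canary could be as small as the sketch below. The patches directory layout and the quilt-style series file name are assumptions based on this comment, not verified details of the repository:

import hashlib
from pathlib import Path

def series_hash(patches_dir):
    """Hash the quilt-style series file so the cache key changes whenever
    the set of patches to apply changes (hypothetical helper)."""
    series = Path(patches_dir) / "series"
    if not series.is_file():
        return None  # no patches for this tree, nothing to track
    return hashlib.sha256(series.read_bytes()).hexdigest()

The resulting hash could then be stored alongside linux_sha and clang_version in the cache entry and compared on every run.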

Signed-off-by: Justin Stitt <[email protected]>
@nathanchance (Member)

Justin and I chatted in person about this implementation today and I think it is looking solid so far. However, one major issue we realized with the current implementation is that calling the GitHub API in check_logs.py can potentially run into both

  • rate limit issues, although I think if we supply our own token rather than using GitHub's, which is what is being done in this PR through REPO_SCOPED_PAT, we get 5000 requests per hour, which should be fine.
  • synchronization issues, as one build may update the build_status variable from pass to fail and then another might undo that.

To fix both of these issues, I think we should have a step between the tuxsuite_ jobs that opens all the builds.json files generated by tuxsuite, iterates through all builds, and updates the status atomically for the whole workflow (if any build failed, it is a fail, otherwise it is a pass). It is probably worth populating the cache at this point with all three variables (linux revision, clang version, and build result), rather than doing it in check_cache.py, to be slightly more accurate. While removing the caching from check_logs.py means that we won't be able to cache boot results, I think that is fine since it simplifies the implementation. In general, I re-run jobs that have spurious boot failures and they do not happen all too often anyways.
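
For illustration, the aggregation step described here could look something like the sketch below. The builds.json layout assumed (a top-level "builds" mapping with a per-build "result" field) and the file names in the usage comment are assumptions for this example, not a verified description of tuxsuite's output:

import json
from pathlib import Path

def aggregate_build_status(builds_json_paths):
    """Collapse every builds.json from the tuxsuite_ jobs into one
    workflow-level status: "fail" if any build failed, otherwise "pass"."""
    for path in builds_json_paths:
        data = json.loads(Path(path).read_text())
        for build in data.get("builds", {}).values():
            if build.get("result") != "pass":
                return "fail"
    return "pass"

# One step after the tuxsuite_ jobs writes this single value into the
# repository-variable cache, avoiding per-build API calls and write races.
# status = aggregate_build_status(["builds-defconfigs.json", "builds-allconfigs.json"])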

Signed-off-by: Justin Stitt <[email protected]>
Signed-off-by: Justin Stitt <[email protected]>
@JustinStitt (Contributor Author)

Added a new batch of changes to address some issues brought to my attention after meeting with @nathanchance. Now, each kick_tuxsuite_foo job will have an additional step where it updates the cached build status for the workflow. This is preferable to relying on each individual check_logs.py job to update the cache, since that runs into rate-limiting and concurrency issues. With the new approach there is absolutely no way in heck that all three kick_tuxsuite_foo jobs end at the same time and write garbage to the cache.

As always, if any part of the caching infra fails it should step out of the way and let the Tuxsuite plan and GitHub action continue as normal. This caching system is simply an add-on and shouldn't drastically modify the existing behavior that you would expect.

@JustinStitt (Contributor Author)

Side note: I'll install ruff locally and address all these linting concerns right before we merge -- but while things are still in motion I'm not going to constantly try to appease the linter gods.

@JustinStitt (Contributor Author) commented Dec 6, 2023

And here's a diagram:

(diagram)
Not shown here is that workflow_dispatch operates as a backdoor straight to the Tuxsuite plan, and that technically a "Cache Miss" also encompasses a "Cache Hit" that did not have a cacheable build_status like pass or fail.

@nathanchance (Member)

Yeah, don't bother with the linting stuff until we settle on everything. As a matter of fact, could you remove all but one of the workflow changes, so that we can still see what the workflow changes will look like without getting drowned by them when reviewing the whole PR?

I won't be able to fully review this as I am going to be out the next two days plus the weekend, but I will try to make it a priority to take a closer look on Monday so we can talk through any major issues or concerns on Tuesday. That diagram is massively helpful for following what is going on here, thanks for making it!

@JustinStitt (Contributor Author)

... As a matter of fact, could you remove all the workflow changes but just one so that we can still see what the workflow changes will look like but not get drowned by them when reviewing the whole PR?

Sure, I could do this.

But have you seen the new file tree in the GitHub review tool? If you click "files changed" and then look on the left, you should be able to expand a file tree. Doing this, you can hide the hundreds of .yml changes and get right to what matters.

If you knew about this already and still want me to undo all but one of the .yml changes let me know 👍

@nathanchance (Member) left a comment

Some superficial comments below. I like the overall structure of this more than the first iteration. I am going to think overnight and tomorrow morning about various edge cases, but I think it is potentially worth shipping this and seeing what explodes :) Great work so far!

Comment on lines 108 to 111
"run": "python check_cache.py -w '${{github.workflow}}' "
"-g ${{secrets.REPO_SCOPED_PAT}} "
"-r ${{env.GIT_REF}} "
"-o ${{env.GIT_REPO}}",
Member

Consider adding spaces between the ${{/}} and the name of the variable to match the style of the rest of the workflow.
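
For reference, a sketch of the same entry with that spacing applied, assuming the snippet above is a string entry in the Python workflow generator (only the whitespace inside the expressions changes):

# Hypothetical reformatted step entry with spaces inside the ${{ }} expressions.
step = {
    "run": "python check_cache.py -w '${{ github.workflow }}' "
           "-g ${{ secrets.REPO_SCOPED_PAT }} "
           "-r ${{ env.GIT_REF }} "
           "-o ${{ env.GIT_REPO }}",
}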

@JustinStitt (Contributor Author)

The linter gods have been appeased.

@nathanchance (Member) left a comment

I think this is in a good enough spot to merge now and follow up with issues and individual changes.

My plan is to squash and merge, change one job's cron string manually, then push to main to start a cycle with this added in, to try to catch any major issues before a bunch of other jobs kick off.

If there are any issues, we can either forward fix or revert, depending on the nature of the issue.

@nathanchance merged commit 277ae6d into main on Dec 19, 2023
@nathanchance deleted the ci-caching branch on Dec 19, 2023 at 22:35
@nathanchance mentioned this pull request on Dec 20, 2023