Add artifact cache to track artifacts for faster restart #1632
Conversation
I added an artifact cache (stored in ~/.skaffold/cache) which maps a hash of the artifact's dependencies to an image name and image tag. I used kaniko's snapshotting library, which does exactly this for kaniko caching. Before building an artifact, we check the cache to see if an image with the same hash and image name has already been built. If it has, we skip rebuilding that image. If either the hash or the image name differs, skaffold will rebuild the image. The cache can be disabled with the --cache-artifacts=false flag. Future work could include optimization around retagging cached images (if we find the hash for an image but the image name is different, perhaps we can retag the image instead of rebuilding).
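A minimal sketch of the cache shape and lookup described here; the type names, YAML on-disk format, and yaml.v2 dependency are illustrative assumptions, not skaffold's actual implementation:

```go
package cache

import (
	"io/ioutil"
	"os"
	"path/filepath"

	yaml "gopkg.in/yaml.v2"
)

// ArtifactCache maps a hash of an artifact's dependencies to details
// about the image built from them. (Illustrative names.)
type ArtifactCache map[string]ImageDetails

type ImageDetails struct {
	Name string `yaml:"name"`
	Tag  string `yaml:"tag"`
}

// retrieveCache reads ~/.skaffold/cache, returning an empty cache if
// the file doesn't exist yet.
func retrieveCache() (ArtifactCache, error) {
	home, err := os.UserHomeDir()
	if err != nil {
		return nil, err
	}
	contents, err := ioutil.ReadFile(filepath.Join(home, ".skaffold", "cache"))
	if os.IsNotExist(err) {
		return ArtifactCache{}, nil
	}
	if err != nil {
		return nil, err
	}
	var c ArtifactCache
	if err := yaml.Unmarshal(contents, &c); err != nil {
		return nil, err
	}
	return c, nil
}
```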
Fixed linting errors. Also explicitly hid the --azure-container-registry-config flag, which is coming in through a dependency that uses pflag.String directly. See GoogleContainerTools/kaniko#445 for more details.
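For reference, hiding a flag that a dependency registered on pflag's global FlagSet uses pflag's MarkHidden; a sketch (the wrapper function name is illustrative):

```go
package cmd

import (
	"github.com/spf13/pflag"
)

// hideInheritedFlags hides flags that transitive dependencies register
// directly on pflag.CommandLine so they don't clutter skaffold's --help.
func hideInheritedFlags() {
	// Registered by the Azure credential helper pulled in through kaniko's
	// dependencies; see GoogleContainerTools/kaniko#445.
	if err := pflag.CommandLine.MarkHidden("azure-container-registry-config"); err != nil {
		// The flag may disappear if the dependency changes; ignoring is fine.
		_ = err
	}
}
```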
Slightly modified the hasher from kaniko to use with skaffold. I couldn't use kaniko's hasher directly because kaniko doesn't support building Windows containers, so the Windows skaffold build was failing since syscall.Stat_t is used.
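A platform-independent file hasher in the spirit of that change might look like the following; it sticks to portable os.FileInfo fields instead of syscall.Stat_t. This is a sketch, not the modified kaniko hasher itself:

```go
package hash

import (
	"crypto/md5"
	"encoding/hex"
	"io"
	"os"
	"strconv"
)

// fileHash hashes a file's contents plus portable metadata (mode and
// size), avoiding Linux-only syscall.Stat_t fields such as uid/gid.
func fileHash(path string) (string, error) {
	fi, err := os.Lstat(path)
	if err != nil {
		return "", err
	}
	h := md5.New()
	h.Write([]byte(fi.Mode().String()))
	h.Write([]byte(strconv.FormatInt(fi.Size(), 10)))
	if fi.Mode().IsRegular() {
		f, err := os.Open(path)
		if err != nil {
			return "", err
		}
		defer f.Close()
		if _, err := io.Copy(h, f); err != nil {
			return "", err
		}
	}
	return hex.EncodeToString(h.Sum(nil)), nil
}
```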
Cache is now a map of workspace-hash : image ID. When there is a cache hit, skaffold will check whether an image with the same ID already exists locally. If it exists locally but not remotely, skaffold will retag and push the image, and then continue with the deploy. Also added a unit test to check this logic.
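The retag-and-push step on a cache hit could look roughly like this using the Docker client API; the function name and auth plumbing are illustrative assumptions:

```go
package build

import (
	"context"
	"io"
	"io/ioutil"

	"github.com/docker/docker/api/types"
	"github.com/docker/docker/client"
)

// retagAndPush tags the locally found image with the expected name and
// pushes it so a remote cluster can pull it.
func retagAndPush(ctx context.Context, cli client.APIClient, localImage, target, registryAuth string) error {
	if err := cli.ImageTag(ctx, localImage, target); err != nil {
		return err
	}
	rc, err := cli.ImagePush(ctx, target, types.ImagePushOptions{RegistryAuth: registryAuth})
	if err != nil {
		return err
	}
	defer rc.Close()
	// Drain the progress stream; the push isn't complete until EOF.
	_, err = io.Copy(ioutil.Discard, rc)
	return err
}
```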
Codecov Report
@@ Coverage Diff @@
## master #1632 +/- ##
==========================================
+ Coverage 46.54% 46.76% +0.22%
==========================================
Files 125 126 +1
Lines 5657 5919 +262
==========================================
+ Hits 2633 2768 +135
- Misses 2751 2860 +109
- Partials 273 291 +18
Continue to review full report at Codecov.
The code looks good to me, but it doesn't seem to work for local builds. I get this error:
DEBU[0000] error getting id for gcr.io/k8s-skaffold/skaffold-example:d8651272889c8ba6206df06411238ec9bb12ed6962e7975dade1c6b5e4a1d633: a digest must contain exactly one '@' separator (e.g. registry/repository@digest) saw: , skipping caching
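That error string comes from parsing an image reference as a digest reference when the digest is empty, which happens for local-only images that were never pushed. A sketch of the failure mode, assuming the go-containerregistry name package is the parser involved (the error text suggests it, and API details vary by library version):

```go
package main

import (
	"fmt"

	"github.com/google/go-containerregistry/pkg/name"
)

func main() {
	// A local-only image has no digest yet, so the digest string is "".
	// Parsing it as a digest reference fails with exactly the error above:
	// a digest must contain exactly one '@' separator (...) saw:
	_, err := name.NewDigest("")
	fmt.Println(err)
}
```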
Also, it always pushes images that are already pushed.
@dgageot could you provide the skaffold.yaml you're using? I haven't seen either of these issues yet, but I've only tested the getting-started and microservices examples.
As discussed, I had the same issue using minikube - we have a repro.
Do we need to do another hash of the workspace here? The workspace hash is already calculated in the (git) tag |
Store image ID as well as digest. When using the cache with minikube, there were no image digests available, since the images had not been pushed to and pulled from a registry. To get around this, I store the image ID as well and search for a local match to both the digest and the ID. I also check if we're running against a local cluster, and only push cached images if not.
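The local-cluster check could be as simple as inspecting the current kube-context; a heuristic sketch with common local context names (illustrative, not an exhaustive or authoritative list):

```go
package config

// isLocalCluster reports whether the active kube-context points at a
// cluster that shares the local Docker daemon, in which case pushing
// cached images is unnecessary. (Illustrative heuristic.)
func isLocalCluster(kubeContext string) bool {
	switch kubeContext {
	case "minikube", "docker-for-desktop", "docker-desktop":
		return true
	}
	return false
}
```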
It's a bit confusing to see contradicting messages like this in the logs at the default log level:
...
Checking cache...
Found gcr.io/k8s-skaffold/leeroy-web locally...
gcr.io/k8s-skaffold/leeroy-web:b24f9a316d1640623810de5b822a49bdd99386897459bb9e3e967b816a3f5703 ready, skipping rebuild
Found gcr.io/k8s-skaffold/leeroy-app locally...
gcr.io/k8s-skaffold/leeroy-app:34ac91a8021972e0b025fe0c641ac21f0f2bbc2e3852be86b20626abdef0d758 ready, skipping rebuild
Starting build...
Found [minikube] context, using local docker daemon.
Build complete in 107.123µs
Retagging cached images...
...
We should make those logs conditional on whether there is any build happening or all artifacts are cached.
Plus, I added some nits.
I'm still testing on Windows now... but the minikube issue did go away - it is really fast, I like it a lot!
Works on Windows + GKE.
Build logs should only appear if there is anything to build now, and similarly cache retag logs should only appear if there is anything to retag!
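A sketch of that gating, assuming the artifacts have already been split into to-build and to-retag sets (names illustrative):

```go
package build

import (
	"fmt"
	"io"
)

// logPhases prints phase banners only when the phase has work to do,
// so fully cached runs don't show "Starting build..." at all.
func logPhases(out io.Writer, toBuild, toRetag []string) {
	if len(toBuild) > 0 {
		fmt.Fprintln(out, "Starting build...")
	}
	if len(toRetag) > 0 {
		fmt.Fprintln(out, "Retagging cached images...")
	}
}
```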
pkg/skaffold/build/cache.go (outdated)
}
hashTag := fmt.Sprintf("%s:%s", a.ImageName, hash)
// Check if tagged image exists remotely with the same digest
existsRemotely := imageExistsRemotely(hashTag, imageDetails.Digest)
It looks to me like this is being called for each artifact. With n artifacts defined in the skaffold.yaml, this results in n calls to remote.Image(), and possibly an image push if we're on a remote cluster, which can take quite a while (especially with a slow network connection).
I think we should at the very least retrieve each artifact in parallel so these calls aren't blocking each other, but I also think that we should rethink this logic: if users are building their images locally, we probably shouldn't be checking a remote registry at all.
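A sketch of the suggested parallelization using errgroup, reusing the imageExistsRemotely helper visible in the diff above; the surrounding names are illustrative:

```go
package build

import (
	"golang.org/x/sync/errgroup"
)

type artifactDetails struct {
	HashTag string
	Digest  string
}

// checkRemotesInParallel runs the remote-existence check for all
// artifacts concurrently instead of one blocking call per artifact.
func checkRemotesInParallel(artifacts []artifactDetails) ([]bool, error) {
	exists := make([]bool, len(artifacts))
	var g errgroup.Group
	for i, a := range artifacts {
		i, a := i, a // capture loop variables (pre-Go 1.22 semantics)
		g.Go(func() error {
			exists[i] = imageExistsRemotely(a.HashTag, a.Digest)
			return nil
		})
	}
	if err := g.Wait(); err != nil {
		return nil, err
	}
	return exists, nil
}
```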
So I've changed this logic to only check for a remote image when using a remote cluster.
Parallelizing this could definitely make it faster, but I do think we need the check, because it can save a retag/push later on if the image already exists.
Also, only check if image exists remotely when using a remote cluster.
Get details about an artifact first (if it needs to be rebuilt, retagged or pushed), and then execute. Also added a unit test for this.
@priyawadhwa Here's some feedback:
- I find the logs in case of a cache hit too verbose. We say "retagging and pushing if necessary". Could we print that only if it's necessary?
- Also, I think a cache hit should be in blue.
- Checking the cache sometimes takes a few seconds. Do we list images? Could we avoid listing all images and instead look up the image by digest, image ID, or tag?
- When there's a cache hit, we miss the build timings. It would be nice to see that it's actually faster than without the cache.
- It seems to take some time after the "Resolved ..., skipping rebuild" message. What is it doing?
- Reading the logs, it seems Skaffold computes the list of dependencies twice on a skaffold build. Is that expected?
- When changing the name of the artifact's image, Skaffold thinks the image is already pushed. But it's not.
@dgageot -- updated the logging, colors, and added timing to the cache.
We do have to list all images; AFAICT there isn't a filter for image ID, and the digest filter isn't reliable 👎 I'll keep looking for a faster way, but right now I think this is the best option.
If it's taking some time, it's probably moved on to checking whether the next artifact is in the cache.
Yes, since we need to get the dependencies for an artifact to determine its associated hash.
I haven't been able to repro this; could you provide more details about your build?
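For context, the list-then-match lookup described in that first answer might look like this with the Docker client API (a sketch; the function name is illustrative):

```go
package build

import (
	"context"

	"github.com/docker/docker/api/types"
	"github.com/docker/docker/client"
)

// findLocalImage lists all local images and matches by image ID or
// repo digest, since the daemon offers no reliable server-side filter
// for either.
func findLocalImage(ctx context.Context, cli client.APIClient, id, digest string) (*types.ImageSummary, error) {
	images, err := cli.ImageList(ctx, types.ImageListOptions{All: true})
	if err != nil {
		return nil, err
	}
	for i, img := range images {
		if img.ID == id {
			return &images[i], nil
		}
		for _, d := range img.RepoDigests {
			if d == digest {
				return &images[i], nil
			}
		}
	}
	return nil, nil // not found
}
```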
@priyawadhwa It seems to work better now. Maybe I did something wrong or maybe it's because I have fewer images. Could you rebase so that I can run another round of tests? Thanks!
@dgageot done!
Old PR for this: #1532
I added an artifact cache, which stores artifacts at ~/.skaffold/cache in the format [workspace hash : image digest].
Before building an artifact, skaffold checks if the associated workspace hash is in the cache. If it isn't, the artifact will be rebuilt. If it is, then skaffold will check if the image imageName:workspaceHash (with the same digest as in the cache) exists. If it does, skaffold will push it if it doesn't exist remotely. If imageName:workspaceHash doesn't exist locally, skaffold will see if another image with the same digest exists locally. If it finds one, it will retag that image as imageName:workspaceHash and push it.
TODO (I'll probably implement this in another PR): a way to add an image in imageName:workspaceHash format to the local cache file. This way, if one dev builds the image, another dev can easily add it to their cache.
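Putting that description together, the per-artifact decision flow reads roughly like this (a sketch with illustrative names, not the PR's actual code):

```go
package build

type action int

const (
	skipBuild    action = iota // cached image exists locally and remotely
	pushOnly                   // right tag exists locally, missing remotely
	retagAndPush               // another local image has the digest; retag it
	rebuild                    // no usable cache entry
)

// resolve follows the decision flow in the description above; the four
// inputs correspond to the checks skaffold performs. (Illustrative.)
func resolve(hashInCache, tagExistsLocally, existsRemotely, digestFoundLocally bool) action {
	switch {
	case !hashInCache:
		return rebuild
	case tagExistsLocally && existsRemotely:
		return skipBuild
	case tagExistsLocally:
		return pushOnly
	case digestFoundLocally:
		return retagAndPush
	default:
		return rebuild
	}
}
```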