
fix: Wait for context cancel in k8s pod watcher #6643


Merged

Conversation

@halvards (Contributor) commented Sep 28, 2021

Description
Make the Kubernetes pod watcher context-aware and select on `context.Done()`. This means we can stop waiting for, and acting on, pod events when the context has been cancelled.

Remove waiting on `context.Done()` in the Kubernetes log aggregator, container manager, and pod port forwarder. This eliminates the chance that the pod watcher sends a `PodEvent` on a channel without a waiting receiver.

This should help avoid deadlocks that can occur when pod watcher event receivers stop reading from the channel that they've registered with the pod watcher.

We still close the channels on the receiver side, which could increase the chance of a regression and recurrence of this issue.

Also uses an `RWMutex` in the pod watcher, though we could move this change to a separate PR.

Fixes: #6424
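
As a rough illustration of the pattern described above (a minimal sketch, not the actual skaffold implementation; `podWatcher`, `PodEvent`, `Register`, `send`, and `drain` are simplified stand-ins), the watcher guards its receiver list with an `RWMutex` and selects on `ctx.Done()` whenever it sends, while receivers simply drain their channel:

```go
package watcher

import (
	"context"
	"sync"

	v1 "k8s.io/api/core/v1"
)

// PodEvent is a simplified stand-in for the event type the pod watcher
// delivers to its receivers.
type PodEvent struct {
	Type string
	Pod  *v1.Pod
}

// podWatcher fans pod events out to registered receiver channels.
type podWatcher struct {
	mu        sync.RWMutex // reads (sends) are frequent, registrations are rare
	receivers []chan<- PodEvent
}

// Register adds a receiver channel under the write lock.
func (w *podWatcher) Register(ch chan<- PodEvent) {
	w.mu.Lock()
	defer w.mu.Unlock()
	w.receivers = append(w.receivers, ch)
}

// send delivers an event to every receiver, but gives up as soon as the
// context is cancelled, so a receiver that has stopped reading can no
// longer deadlock the watcher.
func (w *podWatcher) send(ctx context.Context, evt PodEvent) {
	w.mu.RLock()
	defer w.mu.RUnlock()
	for _, ch := range w.receivers {
		select {
		case ch <- evt:
		case <-ctx.Done():
			return
		}
	}
}

// Receivers no longer select on ctx.Done() themselves; they drain their
// channel until it is closed, leaving cancellation to the sender.
func drain(events <-chan PodEvent, handle func(PodEvent)) {
	for evt := range events {
		handle(evt)
	}
}
```

Moving the `select` to the sending side means only the watcher needs to react to cancellation, and a receiver that returns early can no longer leave the watcher blocked mid-send.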

@halvards (Contributor, Author) commented

Marking as Draft so I can see the behavior on kokoro.

@codecov (bot) commented Sep 28, 2021

Codecov Report

Merging #6643 (f27a579) into main (290280e) will decrease coverage by 0.49%.
The diff coverage is 52.94%.

Impacted file tree graph

@@            Coverage Diff             @@
##             main    #6643      +/-   ##
==========================================
- Coverage   70.48%   69.98%   -0.50%     
==========================================
  Files         515      522       +7     
  Lines       23150    23732     +582     
==========================================
+ Hits        16317    16609     +292     
- Misses       5776     6038     +262     
- Partials     1057     1085      +28     
| Impacted Files | Coverage Δ | |
|---|---|---|
| cmd/skaffold/app/cmd/flags.go | 89.00% <0.00%> (-1.82%) | ⬇️ |
| cmd/skaffold/skaffold.go | 0.00% <ø> (ø) | |
| pkg/diag/recommender/container_errors.go | 0.00% <0.00%> (ø) | |
| pkg/diag/validator/pod.go | 1.32% <0.00%> (ø) | |
| pkg/skaffold/build/buildpacks/logger.go | 0.00% <ø> (ø) | |
| pkg/skaffold/build/cluster/logs.go | 0.00% <ø> (-16.67%) | ⬇️ |
| pkg/skaffold/build/jib/errors.go | 48.93% <50.00%> (ø) | |
| cmd/skaffold/app/cmd/cmd.go | 70.32% <66.66%> (-0.73%) | ⬇️ |
| pkg/diag/validator/resource.go | 47.05% <66.66%> (ø) | |
| pkg/skaffold/build/docker/docker.go | 86.44% <66.66%> (-2.85%) | ⬇️ |

... and 78 more

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 151b8d3...f27a579.

@halvards halvards force-pushed the pod-watcher-deadlock branch from 182e2a3 to 01212be on September 28, 2021 08:34
@halvards halvards added kokoro:run runs the kokoro jobs on a PR and removed pr/blocked labels Sep 28, 2021
@kokoro-team kokoro-team removed the kokoro:run runs the kokoro jobs on a PR label Sep 28, 2021
@halvards (Contributor, Author) commented

The integration test fails on kokoro at a different point:

`state, _ := p.Wait()`

Full logs: https://source.cloud.google.com/results/invocations/22b8290e-9bab-4aaf-8f68-533a32ed0ca9/targets/skaffold%2Fpresubmit/log

There was a similar failure in the integration tests running on GitHub Actions for a previous commit: https://github.com/GoogleContainerTools/skaffold/pull/6643/checks?check_run_id=3729355176
That commit had additional logging enabled for `TestDevGracefulCancel()`, which may have slowed it down sufficiently to expose the issue.

Is there more than one issue?

@halvards halvards force-pushed the pod-watcher-deadlock branch from 01212be to a89db39 on September 29, 2021 01:30
@halvards halvards added the kokoro:force-run forces a kokoro re-run on a PR label Sep 29, 2021
@kokoro-team kokoro-team removed the kokoro:force-run forces a kokoro re-run on a PR label Sep 29, 2021
@halvards halvards marked this pull request as ready for review September 29, 2021 04:50
@halvards halvards requested a review from a team as a code owner September 29, 2021 04:50
@halvards halvards requested a review from tejal29 September 29, 2021 04:50
@halvards halvards force-pushed the pod-watcher-deadlock branch from a89db39 to f27a579 on September 29, 2021 04:58
@halvards halvards added the kokoro:run runs the kokoro jobs on a PR label Sep 29, 2021
@kokoro-team kokoro-team removed the kokoro:run runs the kokoro jobs on a PR label Sep 29, 2021
@MarlonGamez (Contributor) left a comment

Looks good to me 👍🏼 Thanks @halvards!

@MarlonGamez MarlonGamez merged commit 3df44e3 into GoogleContainerTools:main Sep 29, 2021
@ahmetb (Contributor) commented Sep 29, 2021

mad props! thanks for getting to the bottom of this.

halvards added a commit to halvards/skaffold that referenced this pull request Oct 6, 2021
Adding log statements to better understand what's happening in kokoro.
Unfortunately I haven't been able to reproduce the problem locally (on
Linux) so far.

Related: GoogleContainerTools#6424, GoogleContainerTools#6643, GoogleContainerTools#6662
halvards added a commit to halvards/skaffold that referenced this pull request Oct 7, 2021
Debugging the flaky `TestDevGracefulCancel/multi-config-microservices`
integration test showed that the kubectl port forwarder was stuck with
goroutines waiting on channels (one per resource).

Search for `goroutine 235` and `goroutine 234` in this Kokoro log:
https://source.cloud.google.com/results/invocations/a9749ab5-8762-4319-a2be-f67c7440f7a2/targets/skaffold%2Fpresubmit/log

This change means that the forwarder also listens for context cancellation.

**Related**: GoogleContainerTools#6424, GoogleContainerTools#6643, GoogleContainerTools#6662, GoogleContainerTools#6685
briandealwis pushed a commit that referenced this pull request Oct 7, 2021
Debugging the flaky `TestDevGracefulCancel/multi-config-microservices`
integration test showed that the kubectl port forwarder was stuck with
goroutines waiting on channels (one per resource).

Search for `goroutine 235` and `goroutine 234` in this Kokoro log:
https://source.cloud.google.com/results/invocations/a9749ab5-8762-4319-a2be-f67c7440f7a2/targets/skaffold%2Fpresubmit/log

This change means that the forwarder also listens for context cancellation.

**Related**: #6424, #6643, #6662, #6685
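
The forwarder fix has the same shape as the pod-watcher change. A minimal sketch of the idea (names such as `resourceEvent` and `waitForResource` are illustrative, not the actual forwarder code):

```go
package forward

import "context"

// resourceEvent stands in for whatever the port forwarder's per-resource
// goroutines wait for on their channels.
type resourceEvent struct{}

// waitForResource waits on a per-resource channel but also watches the
// context, so a cancelled `skaffold dev` session no longer leaves the
// goroutine blocked on a channel that may never be written to again.
func waitForResource(ctx context.Context, events <-chan resourceEvent, handle func(resourceEvent)) {
	for {
		select {
		case evt, ok := <-events:
			if !ok {
				return // channel closed by the sender
			}
			handle(evt)
		case <-ctx.Done():
			return // context cancelled: stop waiting
		}
	}
}
```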

Successfully merging this pull request may close these issues.

[Flake] TestDevGracefulCancel no output for 10 mins