
fix: Wait for context cancel in k8s pod watcher #6643


Merged

Conversation

@halvards (Contributor) commented Sep 28, 2021

Description
Make the Kubernetes pod watcher context-aware and select on `context.Done()`. This means we can stop waiting for, and acting on, pod events when the context has been cancelled.

Remove waiting on `context.Done()` in the Kubernetes log aggregator, container manager, and pod port forwarder. This eliminates the chance that the pod watcher sends a `PodEvent` on a channel without a waiting receiver.

This should help avoid deadlocks that can occur when pod watcher event receivers stop reading from the channel that they've registered with the pod watcher.

We still close the channels on the receiver side, which could increase the chance of a regression and recurrence of this issue.

Also uses an `RWMutex` in the pod watcher, though we could move this change to a separate PR.

Fixes: #6424
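
As a rough illustration of the pattern described above (a minimal sketch, not the actual skaffold implementation; `podWatcher`, `PodEvent`, `Register`, `send`, and `drain` are simplified stand-ins), the watcher guards its receiver list with an `RWMutex` and selects on `ctx.Done()` whenever it sends, while receivers simply drain their channel:

```go
package watcher

import (
	"context"
	"sync"

	v1 "k8s.io/api/core/v1"
)

// PodEvent is a simplified stand-in for the event type the pod watcher
// delivers to its receivers.
type PodEvent struct {
	Type string
	Pod  *v1.Pod
}

// podWatcher fans pod events out to registered receiver channels.
type podWatcher struct {
	mu        sync.RWMutex // reads (sends) are frequent, registrations are rare
	receivers []chan<- PodEvent
}

// Register adds a receiver channel under the write lock.
func (w *podWatcher) Register(ch chan<- PodEvent) {
	w.mu.Lock()
	defer w.mu.Unlock()
	w.receivers = append(w.receivers, ch)
}

// send delivers an event to every receiver, but gives up as soon as the
// context is cancelled, so a receiver that has stopped reading can no
// longer deadlock the watcher.
func (w *podWatcher) send(ctx context.Context, evt PodEvent) {
	w.mu.RLock()
	defer w.mu.RUnlock()
	for _, ch := range w.receivers {
		select {
		case ch <- evt:
		case <-ctx.Done():
			return
		}
	}
}

// Receivers no longer select on ctx.Done() themselves; they drain their
// channel until it is closed, leaving cancellation to the sender.
func drain(events <-chan PodEvent, handle func(PodEvent)) {
	for evt := range events {
		handle(evt)
	}
}
```

Moving the `select` to the sending side means only the watcher needs to react to cancellation, and a receiver that returns early can no longer leave the watcher blocked mid-send.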

@halvards (Contributor, Author) commented

Marking as Draft so I can see the behavior on kokoro.

@codecov (bot) commented Sep 28, 2021

Codecov Report

Merging #6643 (f27a579) into main (290280e) will decrease coverage by 0.49%.
The diff coverage is 52.94%.

Impacted file tree graph

@@            Coverage Diff             @@
##             main    #6643      +/-   ##
==========================================
- Coverage   70.48%   69.98%   -0.50%     
==========================================
  Files         515      522       +7     
  Lines       23150    23732     +582     
==========================================
+ Hits        16317    16609     +292     
- Misses       5776     6038     +262     
- Partials     1057     1085      +28     
| Impacted Files | Coverage Δ | |
|---|---|---|
| cmd/skaffold/app/cmd/flags.go | 89.00% <0.00%> (-1.82%) | ⬇️ |
| cmd/skaffold/skaffold.go | 0.00% <ø> (ø) | |
| pkg/diag/recommender/container_errors.go | 0.00% <0.00%> (ø) | |
| pkg/diag/validator/pod.go | 1.32% <0.00%> (ø) | |
| pkg/skaffold/build/buildpacks/logger.go | 0.00% <ø> (ø) | |
| pkg/skaffold/build/cluster/logs.go | 0.00% <ø> (-16.67%) | ⬇️ |
| pkg/skaffold/build/jib/errors.go | 48.93% <50.00%> (ø) | |
| cmd/skaffold/app/cmd/cmd.go | 70.32% <66.66%> (-0.73%) | ⬇️ |
| pkg/diag/validator/resource.go | 47.05% <66.66%> (ø) | |
| pkg/skaffold/build/docker/docker.go | 86.44% <66.66%> (-2.85%) | ⬇️ |

... and 78 more

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 151b8d3...f27a579.

@halvards halvards force-pushed the pod-watcher-deadlock branch from 182e2a3 to 01212be on September 28, 2021 08:34
@halvards halvards added kokoro:run runs the kokoro jobs on a PR and removed pr/blocked labels Sep 28, 2021
@kokoro-team kokoro-team removed the kokoro:run runs the kokoro jobs on a PR label Sep 28, 2021
@halvards (Contributor, Author) commented

The integration test fails on kokoro at a different point:

`state, _ := p.Wait()`

Full logs: https://source.cloud.google.com/results/invocations/22b8290e-9bab-4aaf-8f68-533a32ed0ca9/targets/skaffold%2Fpresubmit/log

There was a similar failure in the integration tests running on GitHub Actions for a previous commit: https://github.com/GoogleContainerTools/skaffold/pull/6643/checks?check_run_id=3729355176
That commit had additional logging enabled for `TestDevGracefulCancel()`, which may have slowed it down sufficiently to expose the issue.

Is there more than one issue?

@halvards halvards force-pushed the pod-watcher-deadlock branch from 01212be to a89db39 on September 29, 2021 01:30
@halvards halvards added the kokoro:force-run forces a kokoro re-run on a PR label Sep 29, 2021
@kokoro-team kokoro-team removed the kokoro:force-run forces a kokoro re-run on a PR label Sep 29, 2021
@halvards halvards marked this pull request as ready for review September 29, 2021 04:50
@halvards halvards requested a review from a team as a code owner September 29, 2021 04:50
@halvards halvards requested a review from tejal29 September 29, 2021 04:50
@halvards halvards force-pushed the pod-watcher-deadlock branch from a89db39 to f27a579 on September 29, 2021 04:58
@halvards halvards added the kokoro:run runs the kokoro jobs on a PR label Sep 29, 2021
@kokoro-team kokoro-team removed the kokoro:run runs the kokoro jobs on a PR label Sep 29, 2021
@MarlonGamez (Contributor) left a comment

Looks good to me 👍🏼 Thanks @halvards!

@MarlonGamez MarlonGamez merged commit 3df44e3 into GoogleContainerTools:main Sep 29, 2021
@ahmetb (Contributor) commented Sep 29, 2021

mad props! thanks for getting to the bottom of this.

halvards added a commit to halvards/skaffold that referenced this pull request Oct 6, 2021
Adding log statements to better understand what's happening in kokoro.
Unfortunately I haven't been able to reproduce the problem locally (on
Linux) so far.

Related: GoogleContainerTools#6424, GoogleContainerTools#6643, GoogleContainerTools#6662
halvards added a commit to halvards/skaffold that referenced this pull request Oct 7, 2021
Debugging the flaky `TestDevGracefulCancel/multi-config-microservices`
integration test showed that the kubectl port forwarder was stuck with
goroutines waiting on channels (one per resource).

Search for `goroutine 235` and `goroutine 234` in this Kokoro log:
https://source.cloud.google.com/results/invocations/a9749ab5-8762-4319-a2be-f67c7440f7a2/targets/skaffold%2Fpresubmit/log

This change means that the forwarder also listens for context cancellation.

**Related**: GoogleContainerTools#6424, GoogleContainerTools#6643, GoogleContainerTools#6662, GoogleContainerTools#6685
briandealwis pushed a commit that referenced this pull request Oct 7, 2021
Debugging the flaky `TestDevGracefulCancel/multi-config-microservices`
integration test showed that the kubectl port forwarder was stuck with
goroutines waiting on channels (one per resource).

Search for `goroutine 235` and `goroutine 234` in this Kokoro log:
https://source.cloud.google.com/results/invocations/a9749ab5-8762-4319-a2be-f67c7440f7a2/targets/skaffold%2Fpresubmit/log

This change means that the forwarder also listens for context cancellation.

**Related**: #6424, #6643, #6662, #6685
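
The forwarder fix has the same shape as the pod-watcher change. A minimal sketch of the idea (names such as `resourceEvent` and `waitForResource` are illustrative, not the actual forwarder code):

```go
package forward

import "context"

// resourceEvent stands in for whatever the port forwarder's per-resource
// goroutines wait for on their channels.
type resourceEvent struct{}

// waitForResource waits on a per-resource channel but also watches the
// context, so a cancelled `skaffold dev` session no longer leaves the
// goroutine blocked on a channel that may never be written to again.
func waitForResource(ctx context.Context, events <-chan resourceEvent, handle func(resourceEvent)) {
	for {
		select {
		case evt, ok := <-events:
			if !ok {
				return // channel closed by the sender
			}
			handle(evt)
		case <-ctx.Done():
			return // context cancelled: stop waiting
		}
	}
}
```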

Successfully merging this pull request may close these issues.

[Flake] TestDevGracefulCancel no output for 10 mins