fix: Wait for context cancel in k8s pod watcher #6643
Conversation
Marking as Draft so I can see the behavior on kokoro.
Codecov Report

| Coverage Diff | main | #6643 | +/- |
|---|---|---|---|
| Coverage | 70.48% | 69.98% | -0.50% |
| Files | 515 | 522 | +7 |
| Lines | 23150 | 23732 | +582 |
| Hits | 16317 | 16609 | +292 |
| Misses | 5776 | 6038 | +262 |
| Partials | 1057 | 1085 | +28 |

Continue to review full report at Codecov.
Force-pushed from 182e2a3 to 01212be (compare).
Integration test fails on kokoro at a different point: `skaffold/integration/dev_test.go`, line 123 in 151b8d3.
There was a similar failure in the integration tests running on GitHub Actions for a previous commit: https://github.com/GoogleContainerTools/skaffold/pull/6643/checks?check_run_id=3729355176
Is there more than one issue?
Force-pushed from 01212be to a89db39 (compare).
Make the Kubernetes pod watcher context-aware, and select on `context.Done()`. This means we can stop waiting on, and acting on, pod events when the context has been cancelled.

Remove waiting on `context.Done()` in the Kubernetes log aggregator, container manager, and pod port forwarder. This eliminates the chance that the pod watcher sends a `PodEvent` on a channel without a waiting receiver, and should help avoid deadlocks that can occur when pod watcher event receivers stop reading from the channel they've registered with the pod watcher.

We still close the channels on the receiver side, which could increase the chances of regression and recurrence of this issue.

Also use an RWMutex in the pod watcher, though we could move this change to a separate commit.

Fixes: GoogleContainerTools#6424
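A minimal sketch of the idea (the package, type, and function names here are illustrative, not skaffold's actual implementation): every receive and every send in the watcher selects on `ctx.Done()`, so a receiver that has stopped reading can no longer block the watcher forever.

```go
package watcher

import "context"

// podEvent stands in for the event type the watcher broadcasts; the fields
// are hypothetical.
type podEvent struct {
	action string
	pod    string
}

// broadcast forwards pod events to registered receiver channels. Both the
// receive from the upstream events channel and every send to a receiver
// select on ctx.Done(), so cancellation always unblocks this goroutine.
func broadcast(ctx context.Context, events <-chan podEvent, receivers []chan<- podEvent) {
	for {
		select {
		case <-ctx.Done():
			return // stop waiting on, and acting on, pod events once cancelled
		case evt, ok := <-events:
			if !ok {
				return
			}
			for _, r := range receivers {
				select {
				case r <- evt:
				case <-ctx.Done():
					return // never block on a receiver with no waiting reader
				}
			}
		}
	}
}
```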
Force-pushed from a89db39 to f27a579 (compare).
Looks good to me 👍🏼 Thanks @halvards !
mad props! thanks for getting to the bottom of this.
Adding log statements to better understand what's happening in kokoro. Unfortunately I haven't been able to reproduce the problem locally (on Linux) so far. Related: GoogleContainerTools#6424, GoogleContainerTools#6643, GoogleContainerTools#6662
Debugging the flaky `TestDevGracefulCancel/multi-config-microservices` integration test showed that the kubectl port forwarder was stuck, with goroutines waiting on channels (one per resource). Search for `goroutine 235` and `goroutine 234` in this Kokoro log: https://source.cloud.google.com/results/invocations/a9749ab5-8762-4319-a2be-f67c7440f7a2/targets/skaffold%2Fpresubmit/log

This change means that the forwarder also listens for context cancellation. **Related**: GoogleContainerTools#6424, GoogleContainerTools#6643, GoogleContainerTools#6662, GoogleContainerTools#6685
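For illustration only (the function and channel names are hypothetical, not the actual skaffold forwarder API), the fix amounts to waiting on the per-resource channel and the context at the same time:

```go
package forwarder

import "context"

// waitForResource blocks until the per-resource channel is signalled or the
// context is cancelled, so a cancelled dev loop no longer leaves the
// forwarder goroutine stuck.
func waitForResource(ctx context.Context, resourceReady <-chan struct{}) error {
	select {
	case <-resourceReady:
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}
```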
Description
Make the Kubernetes pod watcher context-aware, and select on `context.Done()`. This means we can stop waiting on, and acting on, pod events when the context has been cancelled.

Remove waiting on `context.Done()` in the Kubernetes log aggregator, container manager, and pod port forwarder. This eliminates the chance that the pod watcher sends a `PodEvent` on a channel without a waiting receiver.

This should help avoid deadlocks that can occur when pod watcher event receivers stop reading from the channel they've registered with the pod watcher.

We still close the channels on the receiver side, which could increase the chances of regression and recurrence of this issue.

Also use an RWMutex in the pod watcher, though we could move this change to a separate PR.
Fixes: #6424
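A hedged sketch of the RWMutex point above (the type and method names are made up for illustration, and `podEvent` is the same illustrative type as in the earlier sketch): registration takes the write lock, while broadcasting only needs the read lock, so concurrent sends do not serialize on receiver registration.

```go
package watcher

import (
	"context"
	"sync"
)

// podWatcher guards its receiver list with an RWMutex: Register takes the
// write lock, send takes only the read lock.
type podWatcher struct {
	mu        sync.RWMutex
	receivers []chan<- podEvent
}

func (w *podWatcher) Register(r chan<- podEvent) {
	w.mu.Lock()
	defer w.mu.Unlock()
	w.receivers = append(w.receivers, r)
}

func (w *podWatcher) send(ctx context.Context, evt podEvent) {
	w.mu.RLock()
	defer w.mu.RUnlock()
	for _, r := range w.receivers {
		select {
		case r <- evt:
		case <-ctx.Done():
			return // cancellation unblocks the watcher even if a receiver is gone
		}
	}
}
```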