fix: daemon: state of stopped container visible to other queries when container is stopped #50136
Conversation
(force-pushed from ead31cc to bdc5cac)
daemon/stop.go (outdated):
if err := ctr.CommitInMemory(daemon.containersReplica); err != nil {
    log.G(ctx).WithError(err).Warn("could not make state of container visible to other queries")
}
The CommitInMemory function was added specifically for health checks, as the healthcheck status is ephemeral and doesn't have to be persisted to disk.
The other function to persist container state, both to disk and in memory, is container.CheckpointTo:
Lines 223 to 236 in ee69d8a
// CheckpointTo makes the Container's current state visible to queries, and persists state.
// Callers must hold a Container lock.
func (container *Container) CheckpointTo(ctx context.Context, store *ViewDB) error {
    ctx, span := otel.Tracer("").Start(ctx, "container.CheckpointTo", trace.WithAttributes(
        attribute.String("container.ID", container.ID),
        attribute.String("container.Name", container.Name)))
    defer span.End()
    deepCopy, err := container.toDisk()
    if err != nil {
        return err
    }
    return store.Save(deepCopy)
}
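As an aside for readers of this thread, here is a rough, self-contained sketch of the difference between the two calls. The types and names below are invented for illustration (they are not moby's actual Container or ViewDB); only the split in behaviour is the point: a CommitInMemory-style call refreshes the queryable view only, while a CheckpointTo-style call writes state to disk and then refreshes the view.

// Illustrative toy only: invented types mirroring the split between an
// in-memory commit (update the queryable view) and a checkpoint (persist to
// disk, then update the view).
package main

import (
    "encoding/json"
    "fmt"
    "os"
    "path/filepath"
    "sync"
)

type snapshot struct {
    ID    string `json:"id"`
    State string `json:"state"`
}

type viewDB struct {
    mu   sync.RWMutex
    byID map[string]snapshot
}

func (v *viewDB) save(s snapshot) {
    v.mu.Lock()
    defer v.mu.Unlock()
    v.byID[s.ID] = s
}

type toyContainer struct {
    snapshot
    root string // directory holding the on-disk config
}

// commitInMemory only refreshes the view; nothing touches the disk.
func (c *toyContainer) commitInMemory(db *viewDB) {
    db.save(c.snapshot)
}

// checkpointTo writes state to disk first, then refreshes the view.
func (c *toyContainer) checkpointTo(db *viewDB) error {
    data, err := json.Marshal(c.snapshot)
    if err != nil {
        return err
    }
    if err := os.WriteFile(filepath.Join(c.root, "config.json"), data, 0o600); err != nil {
        return err
    }
    db.save(c.snapshot)
    return nil
}

func main() {
    db := &viewDB{byID: map[string]snapshot{}}
    c := &toyContainer{snapshot: snapshot{ID: "abc", State: "running"}, root: os.TempDir()}

    c.State = "unhealthy" // ephemeral (healthcheck-like) change: updating the view is enough
    c.commitInMemory(db)

    c.State = "exited" // durable change: must survive a daemon restart
    if err := c.checkpointTo(db); err != nil {
        fmt.Println("checkpoint failed:", err)
    }
    fmt.Println("view now sees:", db.byID["abc"].State)
}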
I see that CheckpointTo is called in various places, such as on container start, pause/unpause, update, and some operations such as connecting to and disconnecting from networks (for pause and unpause, I'm actually wondering if that's state that needs to be persisted on disk, but that's a separate discussion).
For container stop it looks like the handling is a bit more involved; container.CheckpointTo is called in daemon.handleContainerExit (which is called from daemon.killWithSignal and daemon.ProcessEvent), and in daemon.ProcessEvent itself:
Line 32 in ee69d8a
func (daemon *Daemon) handleContainerExit(c *container.Container, e *libcontainerdtypes.EventInfo) error {
Within daemon.ProcessEvent, that code is handled conditionally, based on whether the event we received from containerd was for the container's main process (not for exec processes inside the container):
Lines 190 to 193 in ee69d8a
case libcontainerdtypes.EventExit:
    if ei.ProcessID == ei.ContainerID {
        return daemon.handleContainerExit(c, &ei)
    }
I wonder (but would have to look at traces, events or logs) if there's anything in the events processing that could explain why none of those paths has run by the time the function returns, or if there's possibly some concurrency where multiple calls to container.CheckpointTo are racing (possibly overwriting state with the wrong state?).
Yes, the current state of this PR is quick & dirty.
I'm curious whether there is a chance that daemon.Daemon.ContainerStop completes (and the respective API endpoint replies, so that the subsequent request is sent by the client) before the call to (more precisely, before the return from) container.CheckpointTo in daemon.Daemon.handleContainerExit - refer to
Line 120 in 01f442b
c.SetStopped(&ctrExitStatus)
and
Line 128 in 01f442b
checkpointErr := c.CheckpointTo(context.TODO(), daemon.containersReplica)
where the first call includes:
Line 322 in 01f442b
s.notifyAndClear(&s.stopWaiters)
which should trigger:
Line 95 in 01f442b
if status := <-ctr.Wait(subCtx, container.WaitConditionNotRunning); status.Err() == nil {
The lock acquired at daemon/stop.go:97 just helps to enforce ordering between daemon.Daemon.handleContainerExit (which holds a lock on the same mutex from the beginning of the method) and daemon.Daemon.containerStop.
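To make the suspected ordering concrete, here is a self-contained toy model (all names invented; it only mimics the notify-before-commit pattern): the "exit handler" wakes the waiters before the replica is updated, so a reader that consults the replica right after its wait returns can still observe the old state.

// Toy model of notify-before-commit: waiters are woken before the replica is
// updated, so a read from the replica right after "Wait" returns may be stale.
// All names are invented for this sketch.
package main

import (
    "fmt"
    "sync"
    "time"
)

type replica struct {
    mu    sync.RWMutex
    state string
}

func (r *replica) save(s string) { r.mu.Lock(); r.state = s; r.mu.Unlock() }

func (r *replica) load() string { r.mu.RLock(); defer r.mu.RUnlock(); return r.state }

func main() {
    rep := &replica{state: "running"}
    stopped := make(chan struct{}) // stand-in for the channel returned by State.Wait

    // Plays the role of handleContainerExit: mark stopped, notify, then checkpoint.
    go func() {
        close(stopped)                    // ~ s.notifyAndClear(&s.stopWaiters) inside SetStopped
        time.Sleep(10 * time.Millisecond) // work done before the checkpoint
        rep.save("exited")                // ~ c.CheckpointTo(..., daemon.containersReplica)
    }()

    // Plays the role of containerStop plus the follow-up list request.
    <-stopped // ~ <-ctr.Wait(ctx, WaitConditionNotRunning)
    fmt.Println("state right after the wait returns:", rep.load()) // usually still "running"

    time.Sleep(20 * time.Millisecond)
    fmt.Println("state a moment later:", rep.load()) // "exited"
}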
... I can remove daemon/stop.go:99-101 and check if it still works, to test my hypothesis (note that a fix for a race condition cannot be 100% proven that way 😄).
I removed CommitInMemory from this PR because it does not seem to be needed (refer to the previous comments).
Here are the results of the test (the "Reproduce" section from the description of #50133) on the master branch.
Debugger console output (details about the breakpoints where these messages are logged to the debugger console are provided at the end of this comment):
Before Container.SetStop in Daemon.handleContainerExit
After container stopped in Daemon.containerStop
Start of containerRouter.getContainersJSON
After Daemon.containersReplica in Daemon.Containers
After Container.CheckpointTo in Daemon.handleContainerExit
Start of containerRouter.getContainersJSON
After Daemon.containersReplica in Daemon.Containers
Test output (test failed):
* Trying /var/run/docker.sock:0...
* Connected to localhost (/var/run/docker.sock) port 80 (#0)
> POST /containers/3bd6f9fa938bbac06b546d6a7e8513736a0a1be66ef7a481d71d24f9a5f9075a/stop HTTP/1.1
> Host: localhost
> User-Agent: curl/7.88.1
> Accept: */*
>
< HTTP/1.1 204 No Content
< Api-Version: 1.50
< Docker-Experimental: false
< Ostype: linux
< Server: Docker/dev (linux)
< Date: Tue, 03 Jun 2025 16:58:27 GMT
<
* Connection #0 to host localhost left intact
* Found bundle for host: 0x5636e4b76440 [serially]
* Can not multiplex, even if we wanted to
* Re-using existing connection #0 with host localhost
> GET /containers/json?all=1&filters=%7B%22name%22%3A%7B%22%5Emy-nginx%24%22%3Atrue%7D%7D HTTP/1.1
> Host: localhost
> User-Agent: curl/7.88.1
> Accept: */*
>
< HTTP/1.1 200 OK
< Api-Version: 1.50
< Content-Type: application/json
< Docker-Experimental: false
< Ostype: linux
< Server: Docker/dev (linux)
< Date: Tue, 03 Jun 2025 16:58:27 GMT
< Content-Length: 954
<
[{"Id":"3bd6f9fa938bbac06b546d6a7e8513736a0a1be66ef7a481d71d24f9a5f9075a","Names":["/my-nginx"],"Image":"nginx:alpine","ImageID":"sha256:6769dc3a703c719c1d2756bda113659be28ae16cf0da58dd5fd823d6b9a050ea","Command":"/docker-entrypoint.sh nginx -g 'daemon off;'","Created":1748969897,"Ports":[{"PrivatePort":80,"Type":"tcp"}],"Labels":{"maintainer":"NGINX Docker Maintainers <[email protected]>"},"State":"running","Status":"Up 10 seconds","HostConfig":{"NetworkMode":"bridge"},"NetworkSettings":{"Networks":{"bridge":{"IPAMConfig":null,"Links":null,"Aliases":null,"MacAddress":"72:d0:71:04:d1:e9","DriverOpts":null,"GwPriority":0,"NetworkID":"b563c5c9b89c462cf82a852a7075743a80e491b15d5c6fc270cd73cd9ef97245","EndpointID":"391be0e2c6ef0fdaaa60af41cfb13937f66c8700c7d10e19313162b1b9639864","Gateway":"172.18.0.1","IPAddress":"172.18.0.2","IPPrefixLen":16,"IPv6Gateway":"","GlobalIPv6Address":"","GlobalIPv6PrefixLen":0,"DNSNames":null}}},"Mounts":[]}]
* Connection #0 to host localhost left intact
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
3bd6f9fa938b nginx:alpine "/docker-entrypoint.…" 10 seconds ago Exited (0) Less than a second ago my-nginx
3bd6f9fa938bbac06b546d6a7e8513736a0a1be66ef7a481d71d24f9a5f9075a
Here are the results of the test (the "Reproduce" section from the description of #50133) on the source branch of this pull request.
Debugger console output (details about the breakpoints where these messages are logged to the debugger console are provided at the end of this comment):
Before Container.SetStop in Daemon.handleContainerExit
After Container.CheckpointTo in Daemon.handleContainerExit
After container stopped in Daemon.containerStop
Start of containerRouter.getContainersJSON
After Daemon.containersReplica in Daemon.Containers
Start of containerRouter.getContainersJSON
After Daemon.containersReplica in Daemon.Containers
Test output (test succeeded):
* Trying /var/run/docker.sock:0...
* Connected to localhost (/var/run/docker.sock) port 80 (#0)
> POST /containers/d18b4b0aff9fe6b577a34050492782f41267ad1d36cf78e96ba7b197842d5968/stop HTTP/1.1
> Host: localhost
> User-Agent: curl/7.88.1
> Accept: */*
>
< HTTP/1.1 204 No Content
< Api-Version: 1.50
< Docker-Experimental: false
< Ostype: linux
< Server: Docker/dev (linux)
< Date: Tue, 03 Jun 2025 17:01:52 GMT
<
* Connection #0 to host localhost left intact
* Found bundle for host: 0x5621f5ac4440 [serially]
* Can not multiplex, even if we wanted to
* Re-using existing connection #0 with host localhost
> GET /containers/json?all=1&filters=%7B%22name%22%3A%7B%22%5Emy-nginx%24%22%3Atrue%7D%7D HTTP/1.1
> Host: localhost
> User-Agent: curl/7.88.1
> Accept: */*
>
< HTTP/1.1 200 OK
< Api-Version: 1.50
< Content-Type: application/json
< Docker-Experimental: false
< Ostype: linux
< Server: Docker/dev (linux)
< Date: Tue, 03 Jun 2025 17:01:52 GMT
< Content-Length: 840
<
[{"Id":"d18b4b0aff9fe6b577a34050492782f41267ad1d36cf78e96ba7b197842d5968","Names":["/my-nginx"],"Image":"nginx:alpine","ImageID":"sha256:6769dc3a703c719c1d2756bda113659be28ae16cf0da58dd5fd823d6b9a050ea","Command":"/docker-entrypoint.sh nginx -g 'daemon off;'","Created":1748970102,"Ports":[],"Labels":{"maintainer":"NGINX Docker Maintainers <[email protected]>"},"State":"exited","Status":"Exited (0) Less than a second ago","HostConfig":{"NetworkMode":"bridge"},"NetworkSettings":{"Networks":{"bridge":{"IPAMConfig":null,"Links":null,"Aliases":null,"MacAddress":"","DriverOpts":null,"GwPriority":0,"NetworkID":"e6072ab17d3e68d3b47d9c6935a69b0133e47c9fb61f2864646c899de1ba29d8","EndpointID":"","Gateway":"","IPAddress":"","IPPrefixLen":0,"IPv6Gateway":"","GlobalIPv6Address":"","GlobalIPv6PrefixLen":0,"DNSNames":null}}},"Mounts":[]}]
* Connection #0 to host localhost left intact
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
d18b4b0aff9f nginx:alpine "/docker-entrypoint.…" 10 seconds ago Exited (0) Less than a second ago my-nginx
d18b4b0aff9fe6b577a34050492782f41267ad1d36cf78e96ba7b197842d5968
Breakpoints (without suspend, just output to debugger console):
| Tested branch | Location of breakpoint | Message written to debugger console |
|---|---|---|
| master | daemon/monitor.go:120 | Before Container.SetStop in Daemon.handleContainerExit |
| master | daemon/monitor.go:130 | After Container.CheckpointTo in Daemon.handleContainerExit |
| master | daemon/stop.go:97 | After container stopped in Daemon.containerStop |
| master | api/server/router/container/container_routes.go:77 | Start of containerRouter.getContainersJSON |
| master | daemon/list.go:111 | After Daemon.containersReplica in Daemon.Containers |
| 50133-container_stop_state_sync | daemon/monitor.go:120 | Before Container.SetStop in Daemon.handleContainerExit |
| 50133-container_stop_state_sync | daemon/monitor.go:130 | After Container.CheckpointTo in Daemon.handleContainerExit |
| 50133-container_stop_state_sync | daemon/stop.go:100 | After container stopped in Daemon.containerStop |
| 50133-container_stop_state_sync | api/server/router/container/container_routes.go:77 | Start of containerRouter.getContainersJSON |
| 50133-container_stop_state_sync | daemon/list.go:111 | After Daemon.containersReplica in Daemon.Containers |
I understand that this kind of test is far from the best one, but it is the one I can do quickly.
Yes, feel free to use this PR for experimentation and to push commits for testing.
I know there have historically been quite a few changes around these things; unfortunately some handling has become more complex than desirable (not surprising with 10+ years of development). Possibly some bits should be rewritten from scratch, but there are many corner cases that may have found their way into the existing code, so changing it is risky.
Some of the refactoring resulted in the "in memory" store, which was added to prevent some gnarly situations with locking (which could cause, e.g., "docker ps" to get blocked by "docker info" calls), so we need to take care with such cases.
(marked as unresolved to keep the discussion visible)
(force-pushed from bdc5cac to b71bac4)
(force-pushed from 7c6fcac to e7ba4cb)
(force-pushed from e7ba4cb to 3d44846)
Hi @thaJeztah, could you please take a look at this PR? Feel free to check #50136 (comment) for details about the root cause of the issue which this PR fixes (#50133). I'm not sure why some CI tests fail; if possible, I would appreciate it if someone could trigger them one more time. In my local environment (I use a development container) some unit and integration tests fail even on the master branch (and the results differ from run to run), so I cannot rely on them for regression testing. Thank you.
(force-pushed from 3d44846 to ed75954)
if status := <-ctr.Wait(subCtx, containertypes.WaitConditionNotRunning); status.Err() == nil {
    // Ensure container status changes are committed by handler of container exit before returning control to the caller
    ctr.Lock()
    defer ctr.Unlock()
Just a quick blurb - I started to dig into this PR again to check code paths, but didn't finish, so let me jot down some thoughts I had while looking.
I started to wonder if there's a bug in ctr.Wait, or at least it looks odd that we have a function that returns a channel (likely with the purpose of handling synchronisation) but whose contract appears to be "broken" (i.e., it returns before it's fully done).
Perhaps it's "ok" to use the ctr.Lock() -> Unlock() to work around that issue as a short-term solution; I'm mostly wondering if we're still papering over the underlying issue, and if this is something ideally fixed in Wait (or elsewhere).
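Just to illustrate the alternative hinted at here (purely a thought experiment with invented names, not a patch against the actual Wait implementation): if the state were committed to the replica before the waiters are notified, whatever is unblocked by the channel would already see the final state.

// Toy counterpart to the earlier sketch: commit before notifying, so anything
// unblocked by the Wait-equivalent channel already observes the final state.
package main

import (
    "fmt"
    "sync"
)

func main() {
    var mu sync.RWMutex
    state := "running"
    stopped := make(chan struct{})

    go func() {
        mu.Lock()
        state = "exited" // commit/checkpoint first...
        mu.Unlock()
        close(stopped) // ...then wake the waiters
    }()

    <-stopped
    mu.RLock()
    fmt.Println("state after the wait returns:", state) // always "exited"
    mu.RUnlock()
}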
Hi @thaJeztah,
I was thinking about the same thing, but I didn't find a better solution for container.State.Wait.
Ideally, all code that wants to access (e.g. read) the state of a container should use ctr.Lock() + ctr.Unlock(). That's a pretty clear requirement which is already met (even before this PR). The daemon.Daemon.containerStop method (in its current implementation) doesn't need to access the state of the container directly - all it needs is to wait for the container to be stopped, and that requirement was in place even before this PR (please note that with this PR we make the /containers/{id}/stop Docker Engine API slower by introducing additional synchronization - that's the reason I'm worried about a regression).
IMHO, the issue is located not in container.State.Wait, but in the daemon.Daemon.Containers method and in the daemon.Daemon.containersReplica field (the "read-only view" used by the daemon.Daemon.Containers method). I understand the need for the daemon.Daemon.containersReplica field, but it looks like this read-only view introduced some cases which lack synchronization with the daemon.Daemon.containers field. #50133 is just one of these cases; others can exist in the /containers/{id}/kill and /containers/{id}/wait APIs (just replace /containers/{id}/stop with these APIs in #50133). Please take a look at testcontainers/testcontainers-go#3197, which uses the /containers/{id}/json API (which reads the state of the container from daemon.Daemon.containers and not from daemon.Daemon.containersReplica) - this workaround demonstrates that there is no issue (with container.State.Wait) when using the daemon.Daemon.containers field with ctr.Lock() + ctr.Unlock().
Maybe we should accept that every Docker Engine API which uses the daemon.Daemon.containersReplica field can return outdated (delayed) state of containers, and highlight that in the API documentation (e.g. split read-only APIs from read-write APIs, then describe the synchronization between API calls). Otherwise, even without concurrent access to the Docker Engine API, we leave users of this API without an understanding of:
- Synchronization points (i.e. where slowness / locking can happen).
- Visibility of state changes.
The alternative solution (the way this PR goes, but only for the /containers/{id}/stop API endpoint) is to ensure that daemon.Daemon.containersReplica is updated before the completion of any API call which modifies the state of container(s).
Thank you.
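For completeness, a client-side workaround in the spirit of the testcontainers change mentioned above: after stopping, read the single-container /containers/{id}/json endpoint (which, as noted above, reads from daemon.Daemon.containers rather than the replica) instead of relying on /containers/json. Below is a minimal sketch using only the Go standard library over the Unix socket; the socket path and the unversioned API URL are assumptions of this sketch and may need adjusting.

// Sketch of reading container state via /containers/{id}/json over the Docker
// Unix socket. The socket path and unversioned API URL are assumptions here.
package main

import (
    "context"
    "encoding/json"
    "fmt"
    "net"
    "net/http"
)

func containerState(ctx context.Context, nameOrID string) (string, error) {
    client := &http.Client{
        Transport: &http.Transport{
            DialContext: func(ctx context.Context, _, _ string) (net.Conn, error) {
                var d net.Dialer
                return d.DialContext(ctx, "unix", "/var/run/docker.sock")
            },
        },
    }
    req, err := http.NewRequestWithContext(ctx, http.MethodGet,
        "http://localhost/containers/"+nameOrID+"/json", nil)
    if err != nil {
        return "", err
    }
    resp, err := client.Do(req)
    if err != nil {
        return "", err
    }
    defer resp.Body.Close()
    if resp.StatusCode != http.StatusOK {
        return "", fmt.Errorf("unexpected status: %s", resp.Status)
    }

    var body struct {
        State struct {
            Status string `json:"Status"`
        } `json:"State"`
    }
    if err := json.NewDecoder(resp.Body).Decode(&body); err != nil {
        return "", err
    }
    return body.State.Status, nil
}

func main() {
    state, err := containerState(context.Background(), "my-nginx")
    if err != nil {
        fmt.Println("inspect failed:", err)
        return
    }
    fmt.Println("state:", state) // expected "exited" right after a successful stop
}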
…es when container is stopped and before API response is sent (fix for moby#50133). Signed-off-by: Marat Abrarov <[email protected]>
(force-pushed from ed75954 to 8e6cd44)
It's good to have a fix for that but... we should really take a closer look at that because this is horrifying 🙈 😅
Thanks for digging and investigating that!
LGTM, thanks!
We discussed this one in the maintainers call yesterday; looks like the PR should be OK to bring in for now, but we should probably create a tracking ticket to dig deeper into the flow(s) and see where improvements / fixes can be applied.
// Ensure container status changes are committed by handler of container exit before returning control to the caller
ctr.Lock()
defer ctr.Unlock()
Looks like this caused a regression; see docker/compose#12950 (comment).
The ctr.Lock() makes this function wait for the container to die, but because it produces the stop event in a defer, the stop event now gets produced AFTER the die event:
Lines 74 to 78 in 8e6cd44
defer func() {
    if retErr == nil {
        daemon.LogContainerEvent(ctr, events.ActionStop)
    }
}()
Looks like the defer was introduced by me in 952902e - but that change doesn't look critical; mostly cleanup; we can probably inline it.
daemon: ensure the state of a stopped container is visible to other queries when the container is stopped using the /containers/{id}/stop API, before the API response is sent back to the client. Fixes #50133.
- What I did
Made the state of a stopped container visible to other queries when the container is stopped using the /containers/{id}/stop Docker Engine API, before the response is sent back to the API client.
- How I did it
Wait until the state of the stopped container is committed to the daemon database (only when the container is stopped successfully), i.e. until the container-stop handler of the container monitor completes, before replying to the client of the /containers/{id}/stop Docker Engine API.
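In essence, per the diff quoted earlier in this thread, the change in the stop path boils down to the following; the comments are added here for explanation, and the surrounding context is elided:

if status := <-ctr.Wait(subCtx, containertypes.WaitConditionNotRunning); status.Err() == nil {
    // Ensure container status changes are committed by the handler of the
    // container exit before returning control to the caller: handleContainerExit
    // holds this same lock while it checkpoints state to containersReplica,
    // so acquiring it here guarantees the checkpoint has completed.
    ctr.Lock()
    defer ctr.Unlock()
    // ... success path continues unchanged
}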
- How to verify it
Refer to the "Reproduce" section of #50133.
- Human readable description for the release notes