moby: Make Docker ps more robust to problem containers
As it stands there are quite a few long-standing issues with `docker ps` hanging:
https://github.com/docker/docker/issues/25993
https://github.com/docker/docker/issues/12606
and so on.
At the root, one of the biggest issues is that if a single container is blocked, you can no longer iterate through the container list. Can `reducePsContainer` be made more robust, so that it skips a container if it stays locked for more than N seconds and prints the name of the problem container? This would make it significantly easier to debug this issue and isolate the underlying cause.
About this issue
- State: closed
- Created 8 years ago
- Comments: 28 (18 by maintainers)
Indeed I am. Still some issues to work out but it’s coming along. https://github.com/cpuguy83/docker/compare/container_memdb
Basic idea is to strip the container object down to configuration only and move runtime state (e.g. IO streams, log driver, etc.) to a separate object. The container object will live in a transactional in-memory store where, on request, the daemon hands out copies of the container object rather than a pointer reference, which means there is no longer a need to lock this object. It can be used for reporting at any time without risk of blocking.
State manipulation actions (such as start/stop) will still require a lock, but not on the container object itself. This ensures the daemon can still report status, and potentially even warn the user of a problem.
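The two comments above can be sketched together: a store that returns copies for reads, plus a separate per-container operation lock for state changes. This is a minimal illustration of the design, not the code in the linked `container_memdb` branch; the `Store`, `Container`, and `WithOpLock` names are invented for the example.

```go
package main

import (
	"fmt"
	"sync"
)

// Container holds configuration only; runtime state would live elsewhere.
type Container struct {
	ID    string
	State string
}

// Store is a copy-on-read in-memory store: readers get copies, so
// reporting (e.g. docker ps) never blocks on a per-container lock.
type Store struct {
	mu    sync.RWMutex
	items map[string]Container

	// opMu serializes state-changing operations (start/stop) per ID,
	// without locking the stored container object itself.
	opMu sync.Map // id -> *sync.Mutex
}

func NewStore() *Store {
	return &Store{items: make(map[string]Container)}
}

// Get returns a copy; callers cannot block writers by holding the result.
func (s *Store) Get(id string) (Container, bool) {
	s.mu.RLock()
	defer s.mu.RUnlock()
	c, ok := s.items[id]
	return c, ok
}

func (s *Store) put(c Container) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.items[c.ID] = c
}

// WithOpLock runs fn under the per-container operation lock, then
// commits the updated copy back. Reads proceed throughout, so the
// daemon can still report status while an operation is in flight.
func (s *Store) WithOpLock(id string, fn func(Container) Container) {
	m, _ := s.opMu.LoadOrStore(id, &sync.Mutex{})
	mu := m.(*sync.Mutex)
	mu.Lock()
	defer mu.Unlock()
	c, _ := s.Get(id)
	s.put(fn(c))
}

func main() {
	s := NewStore()
	s.put(Container{ID: "c1", State: "created"})
	s.WithOpLock("c1", func(c Container) Container {
		c.State = "running"
		return c
	})
	c, _ := s.Get("c1")
	fmt.Println(c.ID, c.State) // c1 running
}
```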
@SamSaffron we are on `v1.12.6`, but the fix for the bug causing the lockup during starts in containerd hasn't been released yet (it's on the `v1.13.x` tree, but not in `v1.13.1-rc1`): docker/containerd#436

On the docker side, it seems valid to hold a lock during `(*Daemon).containerStart()`. I'm not sure that releasing the lock on a timeout would be safe (a container could be orphaned in containerd in a weird state?), so it looks like the fix on the docker daemon side is to not require locks for each container in `(*Daemon).reducePsContainer()`.

I think we should pull in the "don't hold container lock on exec exit" fix for a hypothetical 1.12.x at the very least.
On master I don’t think we are even taking the container lock here, which seems to make sense.