moby: Make Docker ps more robust to problem containers

As it stands there are quite a few long standing issues with docker ps hanging.

https://github.com/docker/docker/issues/25993 https://github.com/docker/docker/issues/12606

and so on.

At the root one of the biggest issues is that if a single container is blocking, you can no longer iterate through the container list:

can reducePSContainer be made more robust so it can skip a container if it is locked for more than N seconds and prints out the name of the problem container.

This would make it significantly easier to debug this issue and isolate the underlying cause.

About this issue

  • Original URL
  • State: closed
  • Created 8 years ago
  • Comments: 28 (18 by maintainers)

Most upvoted comments

Indeed I am. Still some issues to work out but it’s coming along. https://github.com/cpuguy83/docker/compare/container_memdb

Basic idea is to strip the container object down to configuration only and move runtime state (e.g. IO streams, log driver, etc) to a separate object. The container object will be on a transactional in-memory store where when requested the daemon gives copies of the container object rather than a pointer reference, which means there is no longer a need to lock this object. It can be used for reporting at any time without risk of blocking.

State manipulation actions (such as start/stop) will still require a lock, but not on the container object itself. This ensures the daemon can still report status, and potentially even warn the user of a problem.

@SamSaffron we are on v1.12.6, but the bug causing the lockup during starts in containerd hasn’t been released yet (it’s on the v1.13.x tree, but not in v1.13.1-rc1): docker/containerd#436

On the docker side, it seems valid to hold a lock during (*Daemon).containerStart(). I’m not sure that releasing the lock on a timeout would be safe (a container could be orphaned in containerd in a weird state?), so it looks like the fix on the docker daemon side is not requiring locks for each container in (*Daemon).reducePsContainer().

I think we should pull in the “don’t hold container lock on exec exit” fix for a hypothetical 1.12.x at the very least.

On master I don’t think we are even taking the container lock here, which seems to make sense.