moby: Docker does not free up disk space after container, volume and image removal
Similar to #21925 (but it didn't look like I should post there).
Description
I have some Docker hosts running CI builds. All Docker data is removed from them nightly, but /var/lib/docker/overlay2 keeps consuming more space.
If I remove all Docker data, e.g. I just did:

```
docker rm -vf $(docker ps -aq)
docker rmi -f $(docker images -aq)
docker volume prune -f
docker system prune -a -f
```

there are still a few GB tied up in /var/lib/docker/overlay2:

```
[root@*** docker]# du -sh /var/lib/docker/overlay2/
5.7G	/var/lib/docker/overlay2/
```
These files are not left over from a prior upgrade, as I upgraded and ran rm -rf /var/lib/docker/* yesterday.
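For reference, this is roughly what the nightly cleanup amounts to as a script — a sketch built only from the commands above, with du calls added to make the leftover space visible:

```
#!/bin/sh
# Sketch of the nightly CI cleanup described above.
# Records overlay2 usage before and after so the leak shows up.
du -sh /var/lib/docker/overlay2/

docker rm -vf $(docker ps -aq)       # remove all containers (and their volumes)
docker rmi -f $(docker images -aq)   # remove all images
docker volume prune -f               # remove remaining unused volumes
docker system prune -a -f            # remove anything else Docker considers unused

du -sh /var/lib/docker/overlay2/     # several GB typically remain here
```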
Steps to reproduce the issue:
Unfortunately I don't have a simple, fast, shareable set of steps to reproduce this. Fortunately our CI nodes are reliably in this state each morning, so with some help we can probably get to a repro case.
Describe the results you received:
More space is consumed by /var/lib/docker/overlay2 over time, despite all attempts to clean up Docker using its built-in commands.
Describe the results you expected:
Some way to clean out image, container and volume data.
Additional information you deem important (e.g. issue happens only occasionally):
There's obviously some reference between /var/lib/docker/image and /var/lib/docker/overlay2, but I don't understand exactly what it is.
With Docker reporting no images:

```
[root@*** docker]# docker images -aq
[root@*** docker]#
```

I can see an ID for one of the base images we built a lot of stuff on top of:

```
[root@*** docker]# find image/ | grep 89afeb2e357b
image/overlay2/distribution/diffid-by-digest/sha256/89afeb2e357b60b596df9a1eeec0b32369fddc03bf5f54ce246d52f97fa0996c
```
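For what it's worth, here is a sketch of how the two trees relate, based on the standard overlay2 graph-driver layout rather than anything shown in this report: each layer's metadata under image/overlay2/layerdb has a cache-id file naming the matching data directory under overlay2/.

```
# Sketch, assuming the standard overlay2 graph-driver layout: walk the layer
# database and print which overlay2/ directory each known layer points at.
cd /var/lib/docker
for layer in image/overlay2/layerdb/sha256/*; do
    printf 'chain id : %s\n'  "$(basename "$layer")"
    printf 'diff id  : %s\n'  "$(cat "$layer/diff")"
    printf 'data dir : overlay2/%s\n\n' "$(cat "$layer/cache-id")"
done
```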
If I run something in that image, the output is weird:

```
[root@*** docker]# time docker run -it --rm ringo/scientific:6.8 true
Unable to find image 'ringo/scientific:6.8' locally
6.8: Pulling from ringo/scientific
89afeb2e357b: Already exists
Digest: sha256:cb016e92a510334582303b9904d85a0266b4ecdb176b68ccb331a8afe136daf4
Status: Downloaded newer image for ringo/scientific:6.8

real	0m3.305s
user	0m0.026s
sys	0m0.022s
```

Weird things about that output:
- it says the image is not local
- but then 89afeb2e357b "Already exists"
- it says "Downloaded newer image", but then runs it a lot faster than it could have if it had actually downloaded the image
If I then delete all images again:

```
[root@*** docker]# docker rmi -f $(docker images -qa)
Untagged: ringo/scientific:6.8
Untagged: ringo/scientific@sha256:cb016e92a510334582303b9904d85a0266b4ecdb176b68ccb331a8afe136daf4
Deleted: sha256:dfb081d8a404885996ba1b2db4cff7652f8f8d18acab02f9b001fb17a4f71603
[root@*** docker]#
```
and then disable the current overlay2 dir (move it aside) with Docker stopped:

```
[root@*** docker]# systemctl stop docker
[root@*** docker]# mv /var/lib/docker/overlay2{,.disabled}
[root@*** docker]# systemctl start docker
```
It does indeed error out looking for the overlay2 counterpart:

```
[root@*** docker]# time docker run -it --rm ringo/scientific:6.8 true
Unable to find image 'ringo/scientific:6.8' locally
6.8: Pulling from ringo/scientific
89afeb2e357b: Already exists
Digest: sha256:cb016e92a510334582303b9904d85a0266b4ecdb176b68ccb331a8afe136daf4
Status: Downloaded newer image for ringo/scientific:6.8
docker: Error response from daemon: lstat /var/lib/docker/overlay2/bb184df27a8fc64cb5a00a42cfe106961cc5152e6d0aba88b491e3b56315fbac: no such file or directory.
See 'docker run --help'.

real	0m3.053s
user	0m0.021s
sys	0m0.020s
```
Output of docker version:

```
[root@*** docker]# docker version
Client:
 Version:      17.03.1-ce
 API version:  1.27
 Go version:   go1.7.5
 Git commit:   c6d412e
 Built:        Mon Mar 27 17:05:44 2017
 OS/Arch:      linux/amd64

Server:
 Version:      17.03.1-ce
 API version:  1.27 (minimum version 1.12)
 Go version:   go1.7.5
 Git commit:   c6d412e
 Built:        Mon Mar 27 17:05:44 2017
 OS/Arch:      linux/amd64
 Experimental: false
```
Output of docker info:

```
[root@*** docker]# docker info
Containers: 0
 Running: 0
 Paused: 0
 Stopped: 0
Images: 0
Server Version: 17.03.1-ce
Storage Driver: overlay2
 Backing Filesystem: extfs
 Supports d_type: true
 Native Overlay Diff: false
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: bridge host macvlan null overlay
Swarm: inactive
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 4ab9917febca54791c5f071a9d1f404867857fcc
runc version: 54296cf40ad8143b62dbcaa1d90e520a2136ddfe
init version: 949e6fa
Security Options:
 seccomp
  Profile: default
Kernel Version: 3.10.0-514.10.2.el7.x86_64
Operating System: CentOS Linux 7 (Core)
OSType: linux
Architecture: x86_64
CPUs: 4
Total Memory: 7.638 GiB
Name: ***
ID: FXGS:5RTR:ASN7:KKB3:TVTN:PFWV:RHDY:XYMG:7RWK:CPG4:YNVB:TBIC
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
Experimental: false
Insecure Registries:
 127.0.0.0/8
Live Restore Enabled: false
```
Additional environment details (AWS, VirtualBox, physical, etc.):
oVirt VM in a company cloud running stock CentOS 7 and SELinux. Docker installed from docker.com packages.
About this issue
- State: closed
- Created 7 years ago
- Reactions: 57
- Comments: 59 (15 by maintainers)
I don't know if everyone is gone, but here are some tips and tricks. Just make the Docker system cleanup a cron job: https://nickjanetakis.com/blog/docker-tip-32-automatically-clean-up-after-docker-daily
Start by finding which directory is the culprit:

```
du -hx --max-depth=1 /
df -h
```

For me the culprit was Docker: /var/lib/docker/overlay2/. Short term, this works:

```
docker system prune -a -f
```

Long term, run crontab -e and insert this:

```
0 3 * * * /usr/bin/docker system prune -f
```
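If you'd rather script that than edit the crontab interactively, a hypothetical one-liner (untested on the hosts in this thread) would be:

```
# Append the daily 03:00 prune to the existing crontab, keeping current entries.
( crontab -l 2>/dev/null; echo '0 3 * * * /usr/bin/docker system prune -f' ) | crontab -
```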
@Kenji-K Not exactly. I stop the docker service nightly and rm -rf /var/lib/docker now, so at least it's "stable".

IMO this should be reopened. "Only having to nuke /var/lib/docker every few weeks" is not a satisfactory resolution for me. Docker should stop leaking disk space if you regularly run docker system prune.
@wannymiarelli that's expected; images, containers, and volumes take up space. Have a look at docker system prune for easier cleanup than the command you showed.

I found a lot of unused images causing this issue; resolved it by running docker rmi $(docker images -q).
We are running into this issue too:
I think this issue should be reopened. Let me know if there is any other info I can provide to debug this issue.
Thanks @neerolyte, let me go ahead and close this one 👍
@cpuguy83 @thaJeztah So just to clarify: is there actually some known safe way to clean up container and image data in any version of Docker at the moment?

Because right now I'm stopping the service and just rm'ing stuff under the hood, but even with that I end up with dangling overlay mounts every now and then and have to actually reboot the box.
@thaJeztah it feels like the issue should be reopened 😄
I was able to recover 32 million inodes and 500 GB of storage in /var/lib/docker/overlay2/ by removing unused GitLab Docker images and containers: docker system prune -fa
I don't want to reopen this issue, because it became somewhat of a "kitchen sink" of "possibly related, but could be different issues". Some issues were fixed, and other reports really need more details to investigate whether there are still issues to address. It's better to start a fresh ticket (but feel free to link to this issue).
Looking at your earlier comment
First of all, I really don't recommend manually removing directories from under /var/lib/docker, as those are managed by the docker daemon, and removing files could easily mess up its state. If a directory belonged to a running container, then removing it won't actually remove the files until they're no longer in use (and/or unmounted). See https://unix.stackexchange.com/a/68532

I see you're mentioning you're running node-exporter (https://github.com/prometheus/node_exporter), which can bind-mount the whole system's filesystem into the container. Possibly depending on mount-propagation settings, this can be problematic. If you bind-mount /var/lib/docker into a container, that bind-mount can "pull in" all mounts from other containers into that container's mount namespace, which means that none of those containers' filesystems can be removed until the node-exporter container is removed (files unmounted). I believe there were also some kernel bugs that could result in those mounts never being released.

As to differences between df and du, I'd have to do testing, but the combination of mounts being in different namespaces together with overlay filesystems could easily lead to inconsistencies in reporting the actual space used (e.g. multiple containers sharing the same image would each mount the same image layers, but tools could traverse those mounts and count their size multiple times).

I see some mention of snaps and LXC in your comment; this could be unrelated, but if you installed Docker using the snap packages, those packages are maintained by Canonical, and I've seen many reports of those being problematic; I'd recommend (if possible) testing whether it also reproduces with the official packages (https://docs.docker.com/engine/install/ubuntu/).
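As an illustration of the mount-propagation point above (a hypothetical example, not taken from this thread): node-exporter-style containers that need to see the host filesystem are commonly run with a read-only, recursive-slave bind mount, so mount and unmount events on the host propagate into the container rather than being held open inside it.

```
# Hypothetical example: expose the host filesystem read-only with "rslave"
# propagation. Unmounts performed on the host then propagate into the
# container's mount namespace as well, instead of the container keeping
# stale copies of other containers' mounts.
docker run -d --name node-exporter \
  -v "/:/host:ro,rslave" \
  prom/node-exporter \
  --path.rootfs=/host
```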
(Per earlier comments above) it's possible that files are still in use (which could be by stopped containers or untagged images); in that case, the disk use may be legitimate. If possible, verify whether disk space does go down after removing all images and containers.
If you think there’s a bug at hand, and have ways to reproduce the issue so that it can be looked into, feel free to open a new ticket, but when doing so;
The only possible solution is to stop the service, then delete /var/lib/docker/* manually. Seriously, this "product" never works correctly…

@thaJeztah Could you please let me know of any actual bug numbers relating to this?
I should also mention that based on https://github.com/docker/docker/issues/24023 I’ve switched to running overlay2 (instead of the default of overlay 1 on CentOS 7). The issue exists against both overlay and overlay2, so I think it’s docker internals and not storage driver specific.
I've configured overlay2 by modifying daemon.json, and yes, I cleaned Docker before starting it up with the new driver by running rm -rf /var/lib/docker/*.
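The actual daemon.json contents weren't captured above; a minimal hypothetical example that just selects the storage driver might look like this:

```
# Hypothetical minimal /etc/docker/daemon.json selecting the overlay2 driver
# (not the reporter's actual file, which isn't shown above).
cat <<'EOF' > /etc/docker/daemon.json
{
  "storage-driver": "overlay2"
}
EOF
systemctl restart docker
```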
I'm a bit "hazy" on the exact details (I know @kolyshkin and @cpuguy83 dove more deeply into this when debugging some "nasty" situations), but my "layman explanation" is that "container A" has mounts in its own namespace (and thus only visible within that namespace); now if "container B" mounts those paths, those paths can only be unmounted if both "container B" and "container A" unmount them. But things can become more tricky than that: if "container A" has mounts with mount-propagation set (slave? shared?), the mounts of "container B" will also propagate to "container A", and now there's an "infinite loop" (container B's mounts cannot be unmounted until container A's mounts are unmounted, which cannot be unmounted until container B's mounts are unmounted).
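A rough way to see who is holding such a mount (an illustration, not something from the thread): scan every process's mount table for references to a leftover overlay2 directory.

```
#!/bin/sh
# Hypothetical helper: list processes whose mount namespace still references
# a given directory name under /var/lib/docker/overlay2 (pass the directory
# name as the first argument). Whatever shows up here is typically what keeps
# the layer mounted and its space in use.
DIR="$1"
for pid in /proc/[0-9]*; do
    if grep -q "overlay2/$DIR" "$pid/mountinfo" 2>/dev/null; then
        echo "$(basename "$pid") $(cat "$pid/comm" 2>/dev/null)"
    fi
done
```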
We have the same issue on Docker, and du thinks differently about the usage: it doesn't see the 82 GB in use.

But even after deleting the folder that shows up in df (I didn't care to recreate node-exporter, it was its folder), the space didn't clean up, and the mount for /var shows this:

Docker version 19.03.5, build 633a0ea838

Docker was restarted and the container recreated (it had only node-exporter running!); lsof can't find any deleted open files, and the container's PID 1 doesn't show any open files. df was still showing the space as used. Any ideas why df calculates used space in a way other tools can't see these invisible files/ghosts? Should I inspect the filesystem somehow with advanced tools to understand what's going on?
@ripcurld0 I don't know. We don't use CentOS or RHEL (Ubuntu 16.04 with the latest patches). It happens with both AUFS and devicemapper (overlay I don't know) every time the partition runs out of space. We never use -f (and anyway, if it is not safe for some reason, it should not be available). So yeah, nuking Docker from time to time is currently our only option. But recently, in another project with Docker Swarm, we had to restart Docker because the nodes were no longer able to communicate (even after rm / create of the service). So I guess we're getting used to downtime…

@ripcurld0 We're running Ubuntu 16.04 w/ the 4.9.0-040900-generic kernel, still seeing this issue.
I have the problem as well. It seems that when Docker encounters a "no space left on device" error, it is no longer able to reclaim space.
Just like @arturopie, I am running into the same issue:
I’m also not running k8s and I don’t have any loopback devices.
What seems to be keeping this zombie disk space in use is that I have about 1500 mounts from elements in Docker's overlay filesystem. Here's an excerpt:

Unfortunately, even after cleaning up the mounts, no disk space is being freed and I still have to do a manual cleanup. Rebooting doesn't help.
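For anyone in the same spot, a sketch (my suggestion, in line with the maintainer reply just below) of how one might list and unmount leftover overlay mounts while the daemon is stopped:

```
# Sketch: with the docker service stopped, list leftover overlay2 mounts under
# /var/lib/docker and unmount them. Never do this while Docker is running;
# unmounting layers of running containers will break them.
systemctl stop docker
grep -E ' /var/lib/docker/overlay2?/' /proc/mounts | awk '{print $2}' |
while read -r mnt; do
    echo "unmounting $mnt"
    umount "$mnt"
done
```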
@chr0n1x The only mounts that should exist in /var/lib/docker are mounts for running containers. If Docker is not running, then you should be able to safely unmount them.

@frankh Sorry, I don't think I was clear enough; I'm no longer running docker system prune. We only clear containers and untagged images regularly now, not everything else. When the stacks change under our builds I think we're just keeping some tags that no longer make sense, but the effort to "improve that" is higher than just leaving the existing (working) nuking process in place that kicks in when disk space gets low.

TL;DR: if there is a bug still here, I no longer have a reliable process for reproducing it, and even if I did, it's not high on our list of issues.
rm -rf /var/lib/docker/tmp/* helped a bit (18 GB of files with names like GetImageBlob537029535).

@jostyee FYI Ubuntu 16.04.3 LTS with the 4.4.0-1030-aws kernel and aufs instead of overlay/overlay2 seems to run stable.
Don't use docker rm -f… the -f causes Docker to ignore errors and remove the container from memory anyway. This should be better in 17.06 (fewer errors, and -f no longer ignores errors).

Some quick questions: are you sharing the /var/lib/docker directory between the "docker in docker" container and the host?

Also note that removing containers with -f can result in layers being left behind; when using -f, Docker currently ignores "filesystem in use" errors and removes the container even if it failed to remove the actual layers. This is a known issue, and changing that behavior is being looked into.
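A rough illustration of the gentler cleanup this implies (a sketch, not an official recommendation from the thread): stop containers first, then remove them without -f so "filesystem in use" errors surface instead of being ignored.

```
# Sketch: stop containers first, then remove them without -f so that any
# "filesystem in use" error is reported rather than silently ignored,
# which is what can leave orphaned layers behind under overlay2/.
docker ps -aq | xargs -r docker stop
docker ps -aq | xargs -r docker rm -v
docker image prune -a -f    # then remove images no container references
docker volume prune -f      # and any unused volumes
```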