moby: docker commit slow on btrfs storage driver
Description of problem: I am using btrfs as the storage driver because my containers can contain up to 60 GB of new data. I assumed that btrfs, being a copy-on-write file system, would handle 60 GB easily and quickly. In reality, Docker needs up to 40 minutes to commit a container with 60 GB of new data.
In the debug log we can see that committing a container on the btrfs storage driver actually runs a "Start untar layer" step:
[debug] server.go:999 Calling POST /commit
2015/01/13 09:06:39 POST /v1.12/commit?author=&comment=&container=b7b6d2056c71&repo=day13_log&tag=
[0ace70fe] +job commit(b7b6d2056c71)
[debug] image.go:90 Start untar layer
[debug] archive.go:75 [tar autodetect] n: [111 117 116 0 0 0 0 0 0 0]
[debug] image.go:94 Untar time: 2400.110357412s
[0ace70fe] -job commit(b7b6d2056c71) = OK (0)
But all modifications during the container's lifetime already happen in the btrfs subvolume, so the commit should be as fast as taking a btrfs snapshot.
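To illustrate the expected gap, here is a rough, hedged comparison one could run; the subvolume path assumes the btrfs graph driver's default layout under the Docker root dir (/docker per the docker info output below), and the container/snapshot names are placeholders:
# Snapshotting the container's subvolume directly vs. committing it:
time sudo btrfs subvolume snapshot -r \
    /docker/btrfs/subvolumes/<container-id> /docker/snapshot-test    # typically well under a second
time docker commit <container-id> timing-test                        # ~40 minutes for ~60 GB, per the log above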
How reproducible:
- docker run -itd ubuntu:14.04 /bin/bash
- make big changes in the container (tens of GB of new files; see the example below)
- docker commit CONTAINER IMAGE_NAME
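For instance, one hedged way to generate the "big changes" (the size and path are arbitrary, only meant to reproduce the slow commit):
# Write ~60 GB of new data into the container's writable layer:
docker exec CONTAINER dd if=/dev/zero of=/bigfile bs=1M count=61440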
docker version:
Client version: 1.4.1
Client API version: 1.16
Go version (client): go1.3.3
Git commit (client): 5bc2ff8
OS/Arch (client): linux/amd64
Server version: 1.4.1
Server API version: 1.16
Go version (server): go1.3.3
Git commit (server): 5bc2ff8
docker info:
Containers: 477
Images: 649
Storage Driver: btrfs
Execution Driver: native-0.2
Kernel Version: 3.13.0-43-generic
Operating System: Ubuntu 14.04.1 LTS
CPUs: 8
Total Memory: 31.1 GiB
ID:
Debug mode (server): true
Debug mode (client): false
Fds: 24
Goroutines: 30
EventsListeners: 0
Init Path: /usr/bin/docker
Docker Root Dir: /docker
WARNING: No swap limit support
uname -a:
Linux lx119-tpo 3.13.0-43-generic #72-Ubuntu SMP Mon Dec 8 19:35:06 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
lsb_release -a:
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 14.04.1 LTS
Release: 14.04
Codename: trusty
About this issue
- State: open
- Created 9 years ago
- Comments: 27 (9 by maintainers)
No, I don’t want volume snapshots, I just want #41109. 😃 I never use “push” at all, so this “optimization” is completely wasted on me. I want no space or performance overhead for tars - only btrfs snapshots.
I would assume that:
(1) When you commit a (usually already stopped) container, usually right before removing it, you probably won’t want to see its diff anymore, and you don’t need that diff for layer management because btrfs takes care of that for you.
(2) “Creating a layer” is as simple as taking a snapshot - and with btrfs, that layer also includes all the previous layers, so you don’t have to worry about diffs. I thought this was the reason for using btrfs for managing layers! Btrfs was literally made for this!
(3) Compressing the layer is unnecessary (or rather harmful) on btrfs, because it defeats the purpose of using btrfs in the first place; the archive will often be [much] bigger than the copy-on-write difference, and if you want compression, you use btrfs compression.
I understand that the developers’ thinking was probably different, and most likely for a good reason, but I can’t imagine a good reason to tar btrfs subvolumes instead of relying on snapshots to do all the work for you (usually in less than a second). That’s the whole point of btrfs.
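To make points (2) and (3) above concrete, here is a hedged sketch of what “creating a layer” could look like with plain btrfs tooling (the paths are illustrative, not Docker’s actual code path):
# A "layer" as a read-only snapshot of the parent subvolume; the CoW metadata
# already encodes the difference, so nothing needs to be tarred or untarred:
sudo btrfs subvolume snapshot -r /docker/btrfs/subvolumes/parent \
    /docker/btrfs/subvolumes/new-layer          # completes in well under a second
# And if compression is wanted, it can live at the filesystem level instead
# (option availability depends on the kernel):
sudo mount -o remount,compress=lzo /docker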
Searching for any information on this topic yielded this: https://docs.docker.com/storage/storagedriver/btrfs-driver/#how-the-btrfs-storage-driver-works . According to that page, it is supposed to work the way I imagined: it explains how everything is stored as subvolumes and snapshots, and is very efficient. But according to you (and our experience with this issue), it is not. And I couldn’t find the place in the sources where this happens…
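For what it’s worth, the subvolume/snapshot layout that page describes is easy to see on a btrfs-backed daemon (hedged example; the root dir here is /docker as in the docker info output above):
# Each image layer and each container's writable layer shows up as a
# subvolume under btrfs/subvolumes/ in the listing:
sudo btrfs subvolume list /docker | head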
Then I found #443 (apparently, you were actually the author of the btrfs backend implementation?). Why insist on file-based layers when that basically reimplements, very inefficiently, what btrfs already does very efficiently? Not only that, but apparently it is made even worse by duplicating all the CoW files: unpacking tars instead of taking snapshots…
@umatomba, this is not about the issue itself, but I think you are misusing Docker. You should NOT store data inside the container. It is very weird/wrong to see a 60 GB container. Data should go to volumes, outside the container. The container should hold ONLY the libraries and programs, not the data files, even if they are important to the application. For reference: https://developers.redhat.com/blog/2016/02/24/10-things-to-avoid-in-docker-containers/
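For example, a hedged illustration of that suggestion (the host path is arbitrary):
# Keep the ~60 GB of data on a bind-mounted volume instead of in the
# container's writable layer, so it never enters the image/commit path:
docker run -itd -v /srv/mydata:/data ubuntu:14.04 /bin/bash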
I understand that you’re not supposed to keep your changes inside the container and commit it often, but yes, during docker build, starting and stopping containers takes more time than actually building them. Which is why people are forced to merge multiple commands into one long multi-line command, which defeats the reason for committing the intermediate containers in the first place…

It would certainly make more sense to generate a tar when you do docker push. That happens much less often than committing, and it’s not a big deal if it takes a little more time - but it is a big problem if we have to slow down most other operations and take additional space just to always be ready to marginally improve the speed of docker push.
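As a hedged sketch of that “diff only at push time” idea (not existing Docker behaviour): btrfs can already compute the difference between two read-only snapshots on demand, which is the expensive part the commit path currently front-loads; turning such a stream into a registry-acceptable tar layer would be an extra, hypothetical step.
# Both snapshots must be read-only for send to work; paths are illustrative:
sudo btrfs send -p /docker/btrfs/subvolumes/parent-ro \
    /docker/btrfs/subvolumes/child-ro > /tmp/layer.btrfs-stream
# The stream contains only the changed data, computed at send time.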
I took a look at alternative solutions. To my knowledge, Podman is known to be “Docker that’s less stubborn” - they “fix” things Docker refuses to fix, allegedly for ideological reasons (like init system support). But no, Podman does the same; even worse, it builds the commit tar in /tmp, which I think is a terrible idea (/tmp is often not large enough, and it means thrashing not only the Docker drive but also the tmp drive).
One solution that, to my surprise, I really like so far is systemd-container. I don’t think it adheres to any standard - systemd is not known for that - but I think it is much easier to manage the way you want. It supports an init system (though it really prefers that init system to be systemd), it is meant for services that are autostarted and managed by systemd, and it can run from any folder you want. For example, mount a tmpfs somewhere, extract your whole image there, use it at blazing speed, and then write it back to /var/lib/ or wherever you want. Even with systemd inside of it, it starts and stops several times faster than Docker, and it actually uses btrfs snapshots by default. I think I’ll write my own “snapshotter” script and a set of tools for it, and just use that (my use case is primarily not CI).
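A hedged sketch of the tmpfs workflow described above, using systemd-nspawn from the systemd-container package (the paths, size and tarball name are made up):
# Run a rootfs from tmpfs for speed, then persist it again afterwards:
sudo mkdir -p /mnt/fastroot
sudo mount -t tmpfs -o size=70G tmpfs /mnt/fastroot
sudo tar -xf my-image-rootfs.tar -C /mnt/fastroot
sudo systemd-nspawn -D /mnt/fastroot /bin/bash        # or --boot to start its init
sudo tar -cf my-image-rootfs.tar -C /mnt/fastroot .
sudo umount /mnt/fastroot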