moby: docker commit slow on btrfs storage driver
Description of problem: I am using btrfs as the storage driver because my containers can contain up to 60 GB of new data. I assumed that btrfs, being a copy-on-write file system, would handle 60 GB easily and quickly. In reality, Docker needs up to 40 minutes to commit a container with 60 GB of new data.
In the debug log we can see that committing a container on the btrfs storage driver actually runs a "Start untar layer" step:
[debug] server.go:999 Calling POST /commit
2015/01/13 09:06:39 POST /v1.12/commit?author=&comment=&container=b7b6d2056c71&repo=day13_log&tag=
[0ace70fe] +job commit(b7b6d2056c71)
[debug] image.go:90 Start untar layer
[debug] archive.go:75 [tar autodetect] n: [111 117 116 0 0 0 0 0 0 0]
[debug] image.go:94 Untar time: 2400.110357412s
[0ace70fe] -job commit(b7b6d2056c71) = OK (0)
But all modifications during the container's lifetime already happen in the btrfs subvolume, so the commit should be as fast as taking a btrfs snapshot.
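To illustrate the expected gap, here is a rough, hedged comparison one could run; the subvolume path assumes the btrfs graph driver's default layout under the Docker root dir (/docker per the docker info output below), and the container/snapshot names are placeholders:
# Snapshotting the container's subvolume directly vs. committing it:
time sudo btrfs subvolume snapshot -r \
    /docker/btrfs/subvolumes/<container-id> /docker/snapshot-test    # typically well under a second
time docker commit <container-id> timing-test                        # ~40 minutes for ~60 GB, per the log above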
How reproducible:
- docker run -itd ubuntu:14.04 /bin/bash
- make big changes in the container (tens of GB of new files; see the example below)
- docker commit CONTAINER IMAGE_NAME
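For instance, one hedged way to generate the "big changes" (the size and path are arbitrary, only meant to reproduce the slow commit):
# Write ~60 GB of new data into the container's writable layer:
docker exec CONTAINER dd if=/dev/zero of=/bigfile bs=1M count=61440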
docker version:
Client version: 1.4.1
Client API version: 1.16
Go version (client): go1.3.3
Git commit (client): 5bc2ff8
OS/Arch (client): linux/amd64
Server version: 1.4.1
Server API version: 1.16
Go version (server): go1.3.3
Git commit (server): 5bc2ff8
docker info:
Containers: 477
Images: 649
Storage Driver: btrfs
Execution Driver: native-0.2
Kernel Version: 3.13.0-43-generic
Operating System: Ubuntu 14.04.1 LTS
CPUs: 8
Total Memory: 31.1 GiB
ID:
Debug mode (server): true
Debug mode (client): false
Fds: 24
Goroutines: 30
EventsListeners: 0
Init Path: /usr/bin/docker
Docker Root Dir: /docker
WARNING: No swap limit support
uname -a:
Linux lx119-tpo 3.13.0-43-generic #72-Ubuntu SMP Mon Dec 8 19:35:06 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
lsb_release -a:
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 14.04.1 LTS
Release: 14.04
Codename: trusty
About this issue
- State: open
- Created 9 years ago
- Comments: 27 (9 by maintainers)
No, I don’t want volume snapshots, I just want #41109. 😃 I never use “push” at all, so this “optimization” is completely wasted on me. I want no space or performance overhead for tars - only btrfs snapshots.
I would assume that:
(1) When you commit a (usually already stopped) container, usually right before removing it, you probably won’t want to see its diff anymore, and you don’t need that diff for layer management because btrfs takes care of that for you.
(2) “Creating a layer” is as simple as taking a snapshot - and with btrfs, that layer also includes all the previous layers, so you don’t have to worry about diffs. I thought this was the reason for using btrfs for managing layers! Btrfs was literally made for this!
(3) Compressing the layer is unnecessary (or rather harmful) on btrfs, because it defeats the purpose of using btrfs in the first place; the archive will often be [much] bigger than the copy-on-write difference, and if you want compression, you use btrfs compression.
I understand that the developers’ thinking was probably different, and most likely for a good reason, but I can’t imagine a good reason to tar btrfs subvolumes instead of relying on snapshots to do all the work for you (usually in less than a second). That’s the whole point of btrfs.
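To make points (2) and (3) above concrete, here is a hedged sketch of what “creating a layer” could look like with plain btrfs tooling (the paths are illustrative, not Docker’s actual code path):
# A "layer" as a read-only snapshot of the parent subvolume; the CoW metadata
# already encodes the difference, so nothing needs to be tarred or untarred:
sudo btrfs subvolume snapshot -r /docker/btrfs/subvolumes/parent \
    /docker/btrfs/subvolumes/new-layer          # completes in well under a second
# And if compression is wanted, it can live at the filesystem level instead
# (option availability depends on the kernel):
sudo mount -o remount,compress=lzo /docker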
Searching for any information on this topic yielded this: https://docs.docker.com/storage/storagedriver/btrfs-driver/#how-the-btrfs-storage-driver-works . According to that page, it is supposed to work the way I imagined: it explains how everything is stored as subvolumes and snapshots, and is very efficient. But according to you (and our experience with this issue), it is not. And I couldn’t find the place in the sources where this happens…
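For what it’s worth, the subvolume/snapshot layout that page describes is easy to see on a btrfs-backed daemon (hedged example; the root dir here is /docker as in the docker info output above):
# Each image layer and each container's writable layer shows up as a
# subvolume under btrfs/subvolumes/ in the listing:
sudo btrfs subvolume list /docker | head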
Then I found #443 (apparently, you were actually the author of the btrfs backend implementation?). Why insist on file-based layers when that basically reimplements, very inefficiently, what btrfs already does very efficiently? Not only that, but apparently it is made even worse by duplicating all the CoW files: unpacking tars instead of taking snapshots…
@umatomba, this is not about the issue itself, but I think you are misusing Docker. You should NOT store data inside the container. It is very weird/wrong to see a 60 GB container. Data should go to volumes, outside the container. The container should hold ONLY the libraries and programs, not the data files, even if they are important to the application. For reference: https://developers.redhat.com/blog/2016/02/24/10-things-to-avoid-in-docker-containers/
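For example, a hedged illustration of that suggestion (the host path is arbitrary):
# Keep the ~60 GB of data on a bind-mounted volume instead of in the
# container's writable layer, so it never enters the image/commit path:
docker run -itd -v /srv/mydata:/data ubuntu:14.04 /bin/bash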
I understand that you’re not supposed to keep your changes inside the container and commit it often, but yes, during docker build, starting and stopping containers takes more time than actually building them. Which is why people are forced to merge multiple commands into one long multi-line command, which defeats the reason for committing the intermediate containers in the first place…

It would certainly make more sense to generate a tar when you do docker push. That happens much less often than committing, and it’s not a big deal if it takes a little more time - but it is a big problem if we have to slow down most other operations and take additional space just to always be ready to marginally improve the speed of docker push.
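As a hedged sketch of that “diff only at push time” idea (not existing Docker behaviour): btrfs can already compute the difference between two read-only snapshots on demand, which is the expensive part the commit path currently front-loads; turning such a stream into a registry-acceptable tar layer would be an extra, hypothetical step.
# Both snapshots must be read-only for send to work; paths are illustrative:
sudo btrfs send -p /docker/btrfs/subvolumes/parent-ro \
    /docker/btrfs/subvolumes/child-ro > /tmp/layer.btrfs-stream
# The stream contains only the changed data, computed at send time.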
I took a look at alternative solutions. To my knowledge, Podman is known to be “Docker that’s less stubborn” - they “fix” things Docker refuses to fix, allegedly for ideological reasons (like init system support). But no, Podman does the same; even worse, it builds the commit tar in /tmp, which I think is a terrible idea (/tmp is often not large enough, and it means thrashing not only the Docker drive but also the tmp drive).
One solution that, to my surprise, I really like so far is systemd-container. I don’t think it adheres to any standard - systemd is not known for that - but I think it is much easier to manage the way you want. It supports an init system (though it really prefers that init system to be systemd), it is meant for services that are autostarted and managed by systemd, and it can run from any folder you want. For example, mount a tmpfs somewhere, extract your whole image there, use it at blazing speed, and then write it back to /var/lib/ or wherever you want. Even with systemd inside of it, it starts and stops several times faster than Docker, and it actually uses btrfs snapshots by default. I think I’ll write my own “snapshotter” script and a set of tools for it, and just use that (my use case is primarily not CI).
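A hedged sketch of the tmpfs workflow described above, using systemd-nspawn from the systemd-container package (the paths, size and tarball name are made up):
# Run a rootfs from tmpfs for speed, then persist it again afterwards:
sudo mkdir -p /mnt/fastroot
sudo mount -t tmpfs -o size=70G tmpfs /mnt/fastroot
sudo tar -xf my-image-rootfs.tar -C /mnt/fastroot
sudo systemd-nspawn -D /mnt/fastroot /bin/bash        # or --boot to start its init
sudo tar -cf my-image-rootfs.tar -C /mnt/fastroot .
sudo umount /mnt/fastroot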