moby: Error restoring container from checkpoint: "No parent found for mountpoint"

Description

Occasionally, restoring a container from a checkpoint fails with an error like the following:

“Error (criu/mount.c:360): mnt: No parent found for mountpoint 574 (@./var/lib/docker/checkpoints/test_cp/criu.work/restore-2017-03-08T15:12:42Z/.criu.cgyard.fnEP8f/systemd)”

The full criu restore.log is attached.

restore.txt

It seems to happen sporadically when I restore several containers from the same checkpoint at the same time. The issue reproduced in 8 of 720 rounds when I ran the following script, which restores and stops a batch of simple Ubuntu containers over and over (renamed to satisfy GitHub):

stress_cr_simple_start_stop.txt

I ran the script as ./stress_cr_simple_start_stop.sh 10 720, which goes through 720 rounds, each restoring 10 containers at a time from the same checkpoint.

I wonder if this has anything to do with the way Docker assigns the criu work dir for each restore. Docker creates a separate work directory for each restore, and the directory names are distinguished only by a timestamp with one-second precision. This means that if Docker makes two or more restore calls to criu within the same second, those restores can end up with the same work dir. As an example, when I ran the script above for 720 rounds with 10 concurrent restores at a time (7200 restores total), only 1087 distinct work directories were created, or roughly 6.6 restores per directory on average. The path that the criu error cites appears to be an ephemeral path that is created during the restore and sits inside the work dir. If multiple concurrent restores of the same checkpoint share a work dir, it seems possible that they are stepping on each other’s toes.
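
To make the suspected collision concrete, here is a minimal sketch in Python (not Docker's actual code). It assumes the work dir name is "restore-" followed by a UTC timestamp with second precision, which is what the path in the error message suggests; the function names and the launch gap between restores are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def work_dir_name(ts: datetime) -> str:
    """Build a work dir name with second precision (assumed format,
    inferred from the path in the criu error message)."""
    return "restore-" + ts.strftime("%Y-%m-%dT%H:%M:%SZ")


def distinct_names(concurrent: int, gap_ms: int) -> int:
    """Count the distinct names that `concurrent` restores would get
    if they were launched `gap_ms` milliseconds apart."""
    start = datetime(2017, 3, 8, 15, 12, 42, tzinfo=timezone.utc)
    names = {
        work_dir_name(start + timedelta(milliseconds=i * gap_ms))
        for i in range(concurrent)
    }
    return len(names)


if __name__ == "__main__":
    # Ten restores launched 50 ms apart all fall within the same second.
    print(distinct_names(10, 50))    # 1
    # Only when launches are a full second apart does each get its own name.
    print(distinct_names(10, 1000))  # 10
```

Under these assumptions, any batch of restores issued within one second maps to a single directory name, which is consistent with seeing far fewer work directories than restores.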

Steps to reproduce the issue:

  1. ./stress_cr_simple_start_stop.sh 10 720 (can be a smaller number than 720, but should probably be at least 100)
  2. Observe restore failures during script execution, or search for them recursively afterward: grep -rl "Error" /var/lib/docker/checkpoints/test_cp/criu.work/
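
The search in step 2 can also be sketched in Python when a per-file count of error lines is more useful than a list of file names. This is only an illustration: the "Error (" marker is taken from the log lines quoted in this issue, and the function name and the assumption that logs are plain text files somewhere under criu.work are mine.

```python
import os
from collections import Counter


def scan_for_errors(work_root: str, needle: str = "Error (") -> Counter:
    """Walk work_root and count, per file, the lines that contain needle."""
    hits = Counter()
    for dirpath, _dirnames, filenames in os.walk(work_root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                # errors="replace" keeps the scan going on odd bytes in a log.
                with open(path, errors="replace") as fh:
                    count = sum(1 for line in fh if needle in line)
            except OSError:
                continue  # unreadable file, skip it
            if count:
                hits[path] = count
    return hits


if __name__ == "__main__":
    root = "/var/lib/docker/checkpoints/test_cp/criu.work"
    for path, count in sorted(scan_for_errors(root).items()):
        print(f"{count:4d}  {path}")
```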

Describe the results you received:

The restore fails with a message like “Error (criu/mount.c:360): mnt: No parent found for mountpoint 574 (@./var/lib/docker/checkpoints/test_cp/criu.work/restore-2017-03-08T15:12:42Z/.criu.cgyard.fnEP8f/systemd)”

Describe the results you expected:

Restores should always succeed, especially for a basic Ubuntu container.

Output of docker version:

Client:
 Version:      17.03.0-ce
 API version:  1.26
 Go version:   go1.7.5
 Git commit:   3a232c8
 Built:        Tue Feb 28 07:57:58 2017
 OS/Arch:      linux/amd64

Server:
 Version:      17.03.0-ce
 API version:  1.26 (minimum version 1.12)
 Go version:   go1.7.5
 Git commit:   3a232c8
 Built:        Tue Feb 28 07:57:58 2017
 OS/Arch:      linux/amd64
 Experimental: true

Output of docker info:

Containers: 0
 Running: 0
 Paused: 0
 Stopped: 0
Images: 27
Server Version: 17.03.0-ce
Storage Driver: overlay
 Backing Filesystem: extfs
 Supports d_type: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: bridge host ipvlan macvlan null overlay
Swarm: inactive
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 977c511eda0925a723debdc94d09459af49d082a
runc version: a01dafd48bc1c7cc12bdb01206f9fea7dd6feb70
init version: 949e6fa
Security Options:
 apparmor
Kernel Version: 4.10.1-041001-generic
Operating System: Ubuntu 14.04.5 LTS
OSType: linux
Architecture: x86_64
CPUs: 16
Total Memory: 120.1 GiB
Name: ip-10-97-0-35
ID: PBSJ:KR3H:XP7F:KQMN:CFJS:J75A:ZGUM:ZSC3:5DUR:MCZJ:5ACD:CXM7
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
Experimental: true
Insecure Registries:
 127.0.0.0/8
Live Restore Enabled: false

Additional environment details (AWS, VirtualBox, physical, etc.):

Using AWS and criu 2.11.

About this issue

  • State: closed
  • Created 7 years ago
  • Comments: 23 (20 by maintainers)

Most upvoted comments

Should be fixed in 2.12.1 (released today)

@boucher, @tswift242 I managed to replicate it, so I will try to troubleshoot and submit a pull request.

We should probably change the naming strategy to eliminate the concurrency issue, regardless of whatever else might be causing this.
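
One possible shape for such a naming strategy, sketched in Python rather than in Docker's own code: keep the readable timestamp as a prefix and let the OS-backed temp-dir helper append a unique suffix, so that the directory is created atomically and never shared. The function names here are hypothetical, and this is not necessarily the approach the eventual fix takes.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor
from datetime import datetime, timezone


def make_work_dir(base: str) -> str:
    """Create and return a fresh work dir under base.

    tempfile.mkdtemp appends a random suffix and retries on name clashes,
    so two calls never return the same path, even within the same second.
    """
    ts = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    return tempfile.mkdtemp(prefix=f"restore-{ts}-", dir=base)


def make_many(base: str, n: int) -> list:
    """Create n work dirs concurrently, mimicking n simultaneous restores."""
    with ThreadPoolExecutor(max_workers=n) as pool:
        return list(pool.map(lambda _: make_work_dir(base), range(n)))


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as base:
        dirs = make_many(base, 10)
        # Every concurrent "restore" gets its own directory.
        print(len(set(dirs)) == len(dirs))
        print(all(os.path.isdir(d) for d in dirs))
```

Keeping the timestamp in the prefix preserves the ability to sort and eyeball work dirs by time, while the suffix removes the dependence on clock resolution.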

I also noticed two other failures in that log file; I’m not sure which are the most relevant:

(00.972837) Error (criu/mount.c:2798): mnt: Can't remount root with MS_PRIVATE: Invalid argument
(00.972861) Error (criu/mount.c:2808): mnt: Can't unmount /tmp/.criu.mntns.eFYx0x: Invalid argument

As an aside, I see a higher rate of failure than this in my current production setup, but haven’t spent much time digging into the issues.