containerd: content store corrupted by gc on restart
A user reported in https://github.com/linuxkit/linuxkit/issues/2632 that rebooting a cri-containerd based LinuxKit/Kube system did not work.
I booted a kube system with a fresh persistent disk and ran kubeadm-init.sh, waited until kubectl --namespace=kube-system get -o wide pods reported that all pods were running, then recorded the current state with find /var/lib/containerd/ > /var/lib/containerd.find (nb: /var/lib is the persistent disk). I then rebooted.
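For reference, the pre-reboot steps above amount to roughly the following (the kubeadm-init.sh helper and the paths come from the LinuxKit kube image):
kubeadm-init.sh
kubectl --namespace=kube-system get -o wide pods      # repeat until all pods report Running
find /var/lib/containerd/ > /var/lib/containerd.find
reboot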
On reboot no pods are running (and the kube apiserver is not available). In /var/log/cri-containerd.err.log I see lots of:
W1023 15:42:13.358177 550 restart.go:324] Failed to get image config for "gcr.io/google_containers/etcd-amd64:3.0.17": content digest sha256:243830dae7dd6ff78859fa1d66098a89e2951a9e95af4ef4d4d2c03d97975771: not found
W1023 15:42:13.364271 550 restart.go:324] Failed to get image config for "gcr.io/google_containers/etcd-amd64@sha256:d83d3545e06fb035db8512e33bd44afb55dea007a3abd7b17742d3ac6d235940": content digest sha256:243830dae7dd6ff78859fa1d66098a89e2951a9e95af4ef4d4d2c03d97975771: not found
This happens for lots of (I think all) the images.
Running ctr -n k8s.io images ls
produces a similar message:
ERRO[0000] failed resolving platform for image gcr.io/google_containers/etcd-amd64:3.0.17 error="content digest sha256:243830dae7dd6ff78859fa1d66098a89e2951a9e95af4ef4d4d2c03d97975771: not found"
ERRO[0000] failed resolving platform for image gcr.io/google_containers/etcd-amd64@sha256:d83d3545e06fb035db8512e33bd44afb55dea007a3abd7b17742d3ac6d235940 error="content digest sha256:243830dae7dd6ff78859fa1d66098a89e2951a9e95af4ef4d4d2c03d97975771: not found"
It does also produce the expected output alongside these errors.
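To double check that a blob really is gone from the content store on disk (rather than just being unreferenced in the metadata), the digest from the errors can be checked directly with something like the following (the ctr content ls check assumes that subcommand lists blob digests, which is my recollection):
d=243830dae7dd6ff78859fa1d66098a89e2951a9e95af4ef4d4d2c03d97975771
ls -l /var/lib/containerd/io.containerd.content.v1.content/blobs/sha256/$d   # "No such file or directory" if the blob was collected
ctr -n k8s.io content ls | grep $d                                           # likewise prints nothing if it is gone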
Redoing the find above and comparing, I see a bunch of stuff has been nuked:
# diff -u /var/lib/containerd.find /var/lib/containerd.find2
--- /var/lib/containerd.find
+++ /var/lib/containerd.find2
@@ -8515,59 +8515,17 @@
/var/lib/containerd/io.containerd.content.v1.content
/var/lib/containerd/io.containerd.content.v1.content/blobs
/var/lib/containerd/io.containerd.content.v1.content/blobs/sha256
-/var/lib/containerd/io.containerd.content.v1.content/blobs/sha256/4adaf48a499402abde39dd4c216c8a5a725a5b6c210b8a68022d4382d9f35f09
-/var/lib/containerd/io.containerd.content.v1.content/blobs/sha256/1ad426d140e5e15b6584f3dbad0e98084993f02ad95e3cba2086b0b054a04033
-/var/lib/containerd/io.containerd.content.v1.content/blobs/sha256/44cb8dd9693bd1ce8f8b624399a1a6e940f48e40791a419d73a7fd9ec88a62fb
-/var/lib/containerd/io.containerd.content.v1.content/blobs/sha256/bef496bb33474aa55ae37c8523e7e2ab279370c381668d245b923594f3346953
-/var/lib/containerd/io.containerd.content.v1.content/blobs/sha256/b3e4f4e87c35f4992f7b3c333716749547aa18d5a52a82212d7d863591081f1d
-/var/lib/containerd/io.containerd.content.v1.content/blobs/sha256/e24ccd46aab5467efa0e6f855309b198d02107b1c7e532d1c9318284129547eb
-/var/lib/containerd/io.containerd.content.v1.content/blobs/sha256/5062407518d77ea05225c5a4b73d1ebf8dbb7706da66257fa1da42388b8d70d1
/var/lib/containerd/io.containerd.content.v1.content/blobs/sha256/d9a1ab41ba39939511077ce2da44774daafaaecc6460a62358c6e0428831b7d2
-/var/lib/containerd/io.containerd.content.v1.content/blobs/sha256/280aca6ddce2bc2b0904fb525b14e119111c11ca06c3c4d9e4258723da52aecd
/var/lib/containerd/io.containerd.content.v1.content/blobs/sha256/46b933bb70270c8a02fa6b6f87d440f6f1fce1a5a2a719e164f83f7b109f7544
Looks like 12/54 blobs have gone.
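For anyone double checking, the 12/54 comes from counting blob entries in the two find snapshots, roughly:
diff -u /var/lib/containerd.find /var/lib/containerd.find2 | grep -c '^-.*blobs/sha256/'   # blobs removed across the reboot
grep -c 'blobs/sha256/.' /var/lib/containerd.find                                          # blobs present before the reboot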
I tried:
ctr -n k8s.io pull gcr.io/google_containers/etcd-amd64@sha256:d83d3545e06fb035db8512e33bd44afb55dea007a3abd7b17742d3ac6d235940
and a few of the blobs came back. I did it for all images and ctr -n k8s.io images ls is now clean and happy again; cri-containerd seems to still be stuck in the bad state, though.
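For "all images" a loop over the refs containerd knows about does the job, something like this (assuming ctr images ls -q prints bare refs):
for ref in $(ctr -n k8s.io images ls -q); do
    ctr -n k8s.io pull "$ref"     # re-fetch so the missing blobs are restored
done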
As an experiment I tried:
diff --git a/metadata/db.go b/metadata/db.go
index 510d14a2..34e3d35b 100644
--- a/metadata/db.go
+++ b/metadata/db.go
@@ -198,6 +198,7 @@ func (m *DB) GarbageCollect(ctx context.Context) error {
 		log.G(ctx).WithField("d", time.Now().Sub(lt1)).Debug("metadata garbage collected")
 	}()
+	return nil
 	var marked map[gc.Node]struct{}
 	if err := m.db.View(func(tx *bolt.Tx) error {
and the issue was resolved (I could init and reboot without issues).
I didn’t manage to trigger the issue with a local start, pull, stop, restart (roughly the sequence sketched below), so I think it must be more complex to trigger. Happy to give pointers on the LinuxKit kube stuff if that seems like the best repro.
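The start/pull/stop/restart attempt amounted to something like the following (the alpine image and log path are just placeholders; none of this reproduced the corruption):
containerd > /tmp/containerd.log 2>&1 &    # start containerd by hand
ctr -n k8s.io pull docker.io/library/alpine:latest
kill %1; wait                              # stop containerd
containerd > /tmp/containerd.log 2>&1 &    # restart, letting the startup GC run
ctr -n k8s.io images ls                    # images and blobs still intact after the restart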
/cc @dmcgowan
@Random-Liu cri-containerd HEAD is not working either, but it is also still missing this commit. I also tried the containerd-v1.0.0-beta.2-only disk, removing/adding containers, images and tasks, and couldn't reproduce.