containerd: content store corrupted by gc on restart
A user reported in https://github.com/linuxkit/linuxkit/issues/2632 that rebooting a cri-containerd based LinuxKit/Kube system did not work.
I booted a kube system with a fresh persistent disk and ran kubeadm-init.sh, waited until kubectl --namespace=kube-system get -o wide pods reported that all pods were running, then recorded the current state with find /var/lib/containerd/ > /var/lib/containerd.find (nb: /var/lib is the persistent disk). I then rebooted.
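For reference, the pre-reboot steps above amount to roughly the following (the kubeadm-init.sh helper and the paths come from the LinuxKit kube image):
kubeadm-init.sh
kubectl --namespace=kube-system get -o wide pods      # repeat until all pods report Running
find /var/lib/containerd/ > /var/lib/containerd.find
reboot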
On reboot no pods are running (and the kube apiserver is not available). In /var/log/cri-containerd.err.log I see lots of:
W1023 15:42:13.358177 550 restart.go:324] Failed to get image config for "gcr.io/google_containers/etcd-amd64:3.0.17": content digest sha256:243830dae7dd6ff78859fa1d66098a89e2951a9e95af4ef4d4d2c03d97975771: not found
W1023 15:42:13.364271 550 restart.go:324] Failed to get image config for "gcr.io/google_containers/etcd-amd64@sha256:d83d3545e06fb035db8512e33bd44afb55dea007a3abd7b17742d3ac6d235940": content digest sha256:243830dae7dd6ff78859fa1d66098a89e2951a9e95af4ef4d4d2c03d97975771: not found
This happens for lots of (I think all) the images.
Running ctr -n k8s.io images ls
produces a similar message:
ERRO[0000] failed resolving platform for image gcr.io/google_containers/etcd-amd64:3.0.17 error="content digest sha256:243830dae7dd6ff78859fa1d66098a89e2951a9e95af4ef4d4d2c03d97975771: not found"
ERRO[0000] failed resolving platform for image gcr.io/google_containers/etcd-amd64@sha256:d83d3545e06fb035db8512e33bd44afb55dea007a3abd7b17742d3ac6d235940 error="content digest sha256:243830dae7dd6ff78859fa1d66098a89e2951a9e95af4ef4d4d2c03d97975771: not found"
It does also produce the expected output alongside these errors.
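To double check that a blob really is gone from the content store on disk (rather than just being unreferenced in the metadata), the digest from the errors can be checked directly with something like the following (the ctr content ls check assumes that subcommand lists blob digests, which is my recollection):
d=243830dae7dd6ff78859fa1d66098a89e2951a9e95af4ef4d4d2c03d97975771
ls -l /var/lib/containerd/io.containerd.content.v1.content/blobs/sha256/$d   # "No such file or directory" if the blob was collected
ctr -n k8s.io content ls | grep $d                                           # likewise prints nothing if it is gone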
Redoing the find above and comparing, I see a bunch of stuff has been nuked:
# diff -u /var/lib/containerd.find /var/lib/containerd.find2
--- /var/lib/containerd.find
+++ /var/lib/containerd.find2
@@ -8515,59 +8515,17 @@
/var/lib/containerd/io.containerd.content.v1.content
/var/lib/containerd/io.containerd.content.v1.content/blobs
/var/lib/containerd/io.containerd.content.v1.content/blobs/sha256
-/var/lib/containerd/io.containerd.content.v1.content/blobs/sha256/4adaf48a499402abde39dd4c216c8a5a725a5b6c210b8a68022d4382d9f35f09
-/var/lib/containerd/io.containerd.content.v1.content/blobs/sha256/1ad426d140e5e15b6584f3dbad0e98084993f02ad95e3cba2086b0b054a04033
-/var/lib/containerd/io.containerd.content.v1.content/blobs/sha256/44cb8dd9693bd1ce8f8b624399a1a6e940f48e40791a419d73a7fd9ec88a62fb
-/var/lib/containerd/io.containerd.content.v1.content/blobs/sha256/bef496bb33474aa55ae37c8523e7e2ab279370c381668d245b923594f3346953
-/var/lib/containerd/io.containerd.content.v1.content/blobs/sha256/b3e4f4e87c35f4992f7b3c333716749547aa18d5a52a82212d7d863591081f1d
-/var/lib/containerd/io.containerd.content.v1.content/blobs/sha256/e24ccd46aab5467efa0e6f855309b198d02107b1c7e532d1c9318284129547eb
-/var/lib/containerd/io.containerd.content.v1.content/blobs/sha256/5062407518d77ea05225c5a4b73d1ebf8dbb7706da66257fa1da42388b8d70d1
/var/lib/containerd/io.containerd.content.v1.content/blobs/sha256/d9a1ab41ba39939511077ce2da44774daafaaecc6460a62358c6e0428831b7d2
-/var/lib/containerd/io.containerd.content.v1.content/blobs/sha256/280aca6ddce2bc2b0904fb525b14e119111c11ca06c3c4d9e4258723da52aecd
/var/lib/containerd/io.containerd.content.v1.content/blobs/sha256/46b933bb70270c8a02fa6b6f87d440f6f1fce1a5a2a719e164f83f7b109f7544
Looks like 12/54 blobs have gone.
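For anyone double checking, the 12/54 comes from counting blob entries in the two find snapshots, roughly:
diff -u /var/lib/containerd.find /var/lib/containerd.find2 | grep -c '^-.*blobs/sha256/'   # blobs removed across the reboot
grep -c 'blobs/sha256/.' /var/lib/containerd.find                                          # blobs present before the reboot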
I tried:
ctr -n k8s.io pull gcr.io/google_containers/etcd-amd64@sha256:d83d3545e06fb035db8512e33bd44afb55dea007a3abd7b17742d3ac6d235940
and a few of the blobs came back. I did it for all images and ctr -n k8s.io images ls is now clean and happy again; cri-containerd seems to still be stuck in the bad state, though.
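For "all images" a loop over the refs containerd knows about does the job, something like this (assuming ctr images ls -q prints bare refs):
for ref in $(ctr -n k8s.io images ls -q); do
    ctr -n k8s.io pull "$ref"     # re-fetch so the missing blobs are restored
done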
As an experiment I tried:
diff --git a/metadata/db.go b/metadata/db.go
index 510d14a2..34e3d35b 100644
--- a/metadata/db.go
+++ b/metadata/db.go
@@ -198,6 +198,7 @@ func (m *DB) GarbageCollect(ctx context.Context) error {
 		log.G(ctx).WithField("d", time.Now().Sub(lt1)).Debug("metadata garbage collected")
 	}()
+	return nil
 	var marked map[gc.Node]struct{}
 	if err := m.db.View(func(tx *bolt.Tx) error {
and the issue was resolved (I could init and reboot without issues).
I didn’t manage to trigger the issue with a local start, pull, stop, restart (roughly the sequence sketched below), so I think it must be more complex to trigger. Happy to give pointers on the LinuxKit kube stuff if that seems like the best repro.
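The start/pull/stop/restart attempt amounted to something like the following (the alpine image and log path are just placeholders; none of this reproduced the corruption):
containerd > /tmp/containerd.log 2>&1 &    # start containerd by hand
ctr -n k8s.io pull docker.io/library/alpine:latest
kill %1; wait                              # stop containerd
containerd > /tmp/containerd.log 2>&1 &    # restart, letting the startup GC run
ctr -n k8s.io images ls                    # images and blobs still intact after the restart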
/cc @dmcgowan
@Random-Liu cri-containerd HEAD is not working either, but it is also still missing this commit. I also tried the containerd-v1.0.0-beta.2-only disk, removing/adding containers, images and tasks, and couldn't reproduce.