kubernetes: Panic in kubelet Run does not exit, creates multiple parallel Run goroutines

If the kubelet Run method panics (e.g. https://github.com/kubernetes/kubernetes/pull/88915), the kubelet does not exit. Instead, Run gets called again, possibly multiple times:

https://github.com/kubernetes/kubernetes/blob/cb3856042212fcc85ea12c95e5d17ab7f6c286c8/cmd/kubelet/app/server.go#L1131-L1134

Each additional Run call would start another goroutine calling relist(), so multiple relist() calls run concurrently:

https://github.com/kubernetes/kubernetes/blob/cb3856042212fcc85ea12c95e5d17ab7f6c286c8/pkg/kubelet/kubelet.go#L1450-L1451

https://github.com/kubernetes/kubernetes/blob/cb3856042212fcc85ea12c95e5d17ab7f6c286c8/pkg/kubelet/pleg/generic.go#L129-L132
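The failure mode described above can be reproduced in miniature. This is only a sketch: it does not use the real kubelet or the apimachinery wait package, and `fakeKubelet`, `run`, `untilRecover`, and `simulate` are hypothetical names. It assumes a retry loop that recovers a panic and calls the function again, and, unlike a loop that never stops, it ends once the function returns normally so that the example terminates.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// fakeKubelet is a hypothetical stand-in for the kubelet. It only counts
// how many background loops have been started.
type fakeKubelet struct {
	loops int32          // number of relist-style goroutines started
	wg    sync.WaitGroup // lets the caller wait for those goroutines
}

// run mimics the shape of Run: it starts a background goroutine first and
// then panics for the first `failures` calls.
func (k *fakeKubelet) run(call, failures int) {
	k.wg.Add(1)
	go func() {
		defer k.wg.Done()
		atomic.AddInt32(&k.loops, 1) // stands in for a periodic relist() loop
	}()
	if call <= failures {
		panic("simulated panic in run")
	}
}

// untilRecover calls f again and again, recovering any panic, until f
// returns normally. It returns the number of calls made.
func untilRecover(f func(call int)) int {
	calls := 0
	for {
		calls++
		finished := func() (ok bool) {
			defer func() {
				if r := recover(); r != nil {
					ok = false // the panic is swallowed; the caller retries
				}
			}()
			f(calls)
			return true
		}()
		if finished {
			return calls
		}
	}
}

// simulate reports how often run was called and how many background
// goroutines were started along the way.
func simulate(failures int) (int, int32) {
	k := &fakeKubelet{}
	calls := untilRecover(func(call int) { k.run(call, failures) })
	k.wg.Wait()
	return calls, atomic.LoadInt32(&k.loops)
}

func main() {
	calls, loops := simulate(2)
	fmt.Printf("run called %d times, %d background loops started\n", calls, loops)
}
```

Every call to `run` starts its goroutine before it panics, so each recovered panic leaves one more loop behind. In the real kubelet those leftover loops would all operate on the same PLEG state.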

A PR that doesn't touch the PLEG (but might stress it?) hit a panic in the kubelet, originating in the PLEG:

https://prow.k8s.io/view/gcs/kubernetes-jenkins/pr-logs/pull/88440/pull-kubernetes-e2e-kind/1234952132935815168

Mar 03 19:21:03 kind-worker2 kubelet[88624]: fatal error: concurrent map read and map write
Mar 03 19:21:03 kind-worker2 kubelet[88624]: goroutine 1420 [running]:
Mar 03 19:21:03 kind-worker2 kubelet[88624]: runtime.throw(0x43641ed, 0x21)
Mar 03 19:21:03 kind-worker2 kubelet[88624]:         GOROOT/src/runtime/panic.go:774 +0x72 fp=0xc0011f57e8 sp=0xc0011f57b8 pc=0x432682
Mar 03 19:21:03 kind-worker2 kubelet[88624]: runtime.mapaccess2_faststr(0x3ddf9e0, 0xc000706690, 0xc00084f1a0, 0x24, 0xc0004b93c8, 0x1)
Mar 03 19:21:03 kind-worker2 kubelet[88624]:         GOROOT/src/runtime/map_faststr.go:116 +0x48f fp=0xc0011f5858 sp=0xc0011f57e8 pc=0x4160cf
Mar 03 19:21:03 kind-worker2 kubelet[88624]: k8s.io/kubernetes/pkg/kubelet/pleg.podRecords.setCurrent(0xc000706690, 0xc001e44600, 0x1d, 0x20)
Mar 03 19:21:03 kind-worker2 kubelet[88624]:         pkg/kubelet/pleg/generic.go:464 +0x149 fp=0xc0011f5910 sp=0xc0011f5858 pc=0x3318999
Mar 03 19:21:03 kind-worker2 kubelet[88624]: k8s.io/kubernetes/pkg/kubelet/pleg.(*GenericPLEG).relist(0xc000788960)
Mar 03 19:21:03 kind-worker2 kubelet[88624]:         pkg/kubelet/pleg/generic.go:214 +0x2c9 fp=0xc0011f5e20 sp=0xc0011f5910 pc=0x3316a89
Mar 03 19:21:03 kind-worker2 kubelet[88624]: k8s.io/kubernetes/pkg/kubelet/pleg.(*GenericPLEG).relist-fm()
Mar 03 19:21:03 kind-worker2 kubelet[88624]:         pkg/kubelet/pleg/generic.go:190 +0x2a fp=0xc0011f5e38 sp=0xc0011f5e20 pc=0x3318cca
Mar 03 19:21:03 kind-worker2 kubelet[88624]: k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1(0xc00016fbc0

https://storage.googleapis.com/kubernetes-jenkins/pr-logs/pull/88440/pull-kubernetes-e2e-kind/1234916134344462336/artifacts/logs/kind-worker2/kubelet.log
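The trace above comes from `runtime.throw`, which means it is a fatal runtime error rather than an ordinary panic, and `recover` cannot catch it. As a rough illustration of the kind of access involved, the sketch below has several goroutines read and write one shared map, which is only safe because every access holds a mutex. The `records` type and `relistMany` function are hypothetical simplifications, not the PLEG's actual data structures, and this is not presented as the fix for this issue.

```go
package main

import (
	"fmt"
	"sync"
)

// records is a hypothetical, simplified stand-in for a map keyed by pod ID.
type records struct {
	mu sync.Mutex
	m  map[string]int
}

// setCurrent writes one entry while holding the lock.
func (r *records) setCurrent(id string, v int) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.m[id] = v
}

// get reads one entry while holding the lock.
func (r *records) get(id string) (int, bool) {
	r.mu.Lock()
	defer r.mu.Unlock()
	v, ok := r.m[id]
	return v, ok
}

// relistMany runs n goroutines that each write and read the same keys,
// and returns the number of distinct keys stored at the end.
func relistMany(n, rounds int) int {
	r := &records{m: make(map[string]int)}
	var wg sync.WaitGroup
	for g := 0; g < n; g++ {
		wg.Add(1)
		go func(g int) {
			defer wg.Done()
			for i := 0; i < rounds; i++ {
				id := fmt.Sprintf("pod-%d", i)
				r.setCurrent(id, g)
				r.get(id)
			}
		}(g)
	}
	wg.Wait()
	r.mu.Lock()
	defer r.mu.Unlock()
	return len(r.m)
}

func main() {
	fmt.Println("keys stored:", relistMany(4, 100))
}
```

Without the lock in `setCurrent` and `get`, the same program can be stopped by the runtime with one of the "concurrent map" fatal errors, and `go run -race` would report the data race.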

About this issue

  • State: closed
  • Created 4 years ago
  • Comments: 26 (26 by maintainers)

Most upvoted comments

Working on a patch for option #1 now.

looks like the particular panic was an actual bug in that PR (ContainerStatuses indexed in a range of InitContainerStatuses): https://github.com/kubernetes/kubernetes/compare/782bf3341b7c9a60d151c237626d654b08976b91..5719d3138c3e3c27b3b511a8754343594fa95b46#diff-ea8c11a933d6d6c22c0c12b6a38c4b46R338

the bug was fixed before merge

Looking at the logs, three distinct fatal errors occur, four occurrences in total: concurrent map iteration and map write, concurrent map read and map write (2x), and concurrent map writes. All are caused by concurrent relist() calls.

$ grep -e "fatal error" -e "relist(" kubelet.log 
Mar 03 19:18:06 kind-worker2 kubelet[670]: fatal error: concurrent map read and map write
Mar 03 19:18:06 kind-worker2 kubelet[670]: k8s.io/kubernetes/pkg/kubelet/pleg.(*GenericPLEG).relist(0xc000c23aa0)
Mar 03 19:18:07 kind-worker2 kubelet[670]: k8s.io/kubernetes/pkg/kubelet/pleg.(*GenericPLEG).relist(0xc000c23aa0)
Mar 03 19:21:03 kind-worker2 kubelet[88624]: fatal error: concurrent map read and map write
Mar 03 19:21:03 kind-worker2 kubelet[88624]: k8s.io/kubernetes/pkg/kubelet/pleg.(*GenericPLEG).relist(0xc000788960)
Mar 03 19:21:03 kind-worker2 kubelet[88624]: k8s.io/kubernetes/pkg/kubelet/pleg.(*GenericPLEG).relist(0xc000788960)
Mar 03 19:21:55 kind-worker2 kubelet[107570]: fatal error: concurrent map writes
Mar 03 19:21:55 kind-worker2 kubelet[107570]: k8s.io/kubernetes/pkg/kubelet/pleg.(*GenericPLEG).relist(0xc0002a1e00)
Mar 03 19:21:55 kind-worker2 kubelet[107570]: k8s.io/kubernetes/pkg/kubelet/pleg.(*GenericPLEG).relist(0xc0002a1e00)
Mar 03 19:27:26 kind-worker2 kubelet[111302]: fatal error: concurrent map iteration and map write
Mar 03 19:27:26 kind-worker2 kubelet[111302]: k8s.io/kubernetes/pkg/kubelet/pleg.(*GenericPLEG).relist(0xc0005dd500)
Mar 03 19:27:26 kind-worker2 kubelet[111302]: k8s.io/kubernetes/pkg/kubelet/pleg.(*GenericPLEG).relist(0xc0005dd500)
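The grep output above can also be tallied by error message. The following is a small sketch, with `countFatalErrors` and `sampleLines` as hypothetical helper names; the sample lines are copied from the output above rather than read from the full kubelet.log.

```go
package main

import (
	"fmt"
	"strings"
)

// marker is the text that precedes the error message in each log line.
const marker = "fatal error: "

// countFatalErrors counts log lines per fatal error message.
// Lines that do not contain the marker are ignored.
func countFatalErrors(lines []string) map[string]int {
	counts := make(map[string]int)
	for _, line := range lines {
		if i := strings.Index(line, marker); i >= 0 {
			msg := strings.TrimSpace(line[i+len(marker):])
			counts[msg]++
		}
	}
	return counts
}

// sampleLines returns the fatal error lines from the grep output above,
// plus one relist( line to show that non-matching lines are skipped.
func sampleLines() []string {
	return []string{
		"Mar 03 19:18:06 kind-worker2 kubelet[670]: fatal error: concurrent map read and map write",
		"Mar 03 19:18:06 kind-worker2 kubelet[670]: k8s.io/kubernetes/pkg/kubelet/pleg.(*GenericPLEG).relist(0xc000c23aa0)",
		"Mar 03 19:21:03 kind-worker2 kubelet[88624]: fatal error: concurrent map read and map write",
		"Mar 03 19:21:55 kind-worker2 kubelet[107570]: fatal error: concurrent map writes",
		"Mar 03 19:27:26 kind-worker2 kubelet[111302]: fatal error: concurrent map iteration and map write",
	}
}

func main() {
	for msg, n := range countFatalErrors(sampleLines()) {
		fmt.Printf("%dx %s\n", n, msg)
	}
}
```

Each fatal error comes from a different kubelet PID, which fits a kubelet that is restarted after every crash.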