kubernetes: Deadlock in PLEG relist() for health check and Kubelet syncLoop()
What happened:
During stress testing, we found a deadlock between PLEG relist() (used for the runtime health check) and the Kubelet syncLoop().
The goroutine stack traces are as follows:

```
goroutine 386 [chan send, 1140 minutes]:
k8s.io/kubernetes/pkg/kubelet/pleg.(*GenericPLEG).relist(0xc42069ea20)
/workspace/anago-v1.11.5-beta.0.24+753b2dbc622f5c/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/pkg/kubelet/pleg/generic.go:261 +0x74e
k8s.io/kubernetes/pkg/kubelet/pleg.(*GenericPLEG).(k8s.io/kubernetes/pkg/kubelet/pleg.relist)-fm()
/workspace/anago-v1.11.5-beta.0.24+753b2dbc622f5c/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/pkg/kubelet/pleg/generic.go:130 +0x2a
k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/wait.JitterUntil.func1(0xc4212ee520)
/workspace/anago-v1.11.5-beta.0.24+753b2dbc622f5c/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133 +0x54
k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc4212ee520, 0x3b9aca00, 0x0, 0x1, 0xc420056540)
/workspace/anago-v1.11.5-beta.0.24+753b2dbc622f5c/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:134 +0xbd
k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/wait.Until(0xc4212ee520, 0x3b9aca00, 0xc420056540)
/workspace/anago-v1.11.5-beta.0.24+753b2dbc622f5c/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:88 +0x4d
created by k8s.io/kubernetes/pkg/kubelet/pleg.(*GenericPLEG).Start
/workspace/anago-v1.11.5-beta.0.24+753b2dbc622f5c/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/pkg/kubelet/pleg/generic.go:130 +0x88
...
goroutine 309 [sleep]:
time.Sleep(0x12a05f200)
/usr/local/go/src/runtime/time.go:102 +0x166
k8s.io/kubernetes/pkg/kubelet.(*Kubelet).syncLoop(0xc4205e3b00, 0xc420ff2780, 0x3e56a60, 0xc4205e3b00)
/workspace/anago-v1.11.5-beta.0.24+753b2dbc622f5c/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/pkg/kubelet/kubelet.go:1777 +0x1e7
k8s.io/kubernetes/pkg/kubelet.(*Kubelet).Run(0xc4205e3b00, 0xc420ff2780)
/workspace/anago-v1.11.5-beta.0.24+753b2dbc622f5c/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/pkg/kubelet/kubelet.go:1396 +0x27f
k8s.io/kubernetes/cmd/kubelet/app.startKubelet.func1()
/workspace/anago-v1.11.5-beta.0.24+753b2dbc622f5c/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/cmd/kubelet/app/server.go:998 +0x67
k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/wait.JitterUntil.func1(0xc42105dfb0)
/workspace/anago-v1.11.5-beta.0.24+753b2dbc622f5c/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133 +0x54
k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc42105dfb0, 0x0, 0x0, 0x1, 0xc420056540)
/workspace/anago-v1.11.5-beta.0.24+753b2dbc622f5c/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:134 +0xbd
k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/wait.Until(0xc42105dfb0, 0x0, 0xc420056540)
/workspace/anago-v1.11.5-beta.0.24+753b2dbc622f5c/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:88 +0x4d
created by k8s.io/kubernetes/cmd/kubelet/app.startKubelet
/workspace/anago-v1.11.5-beta.0.24+753b2dbc622f5c/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/cmd/kubelet/app/server.go:996 +0xea
...
```
We did some investigation and found the following issue:
```go
...
func (kl *Kubelet) syncLoop(updates <-chan kubetypes.PodUpdate, handler SyncHandler) {
...
for {
if rs := kl.runtimeState.runtimeErrors(); len(rs) != 0 {
glog.Infof("skipping pod synchronization - %v", rs)
// exponential backoff
time.Sleep(duration)
duration = time.Duration(math.Min(float64(max), factor*float64(duration)))
continue
}
// reset backoff if we have a success
duration = base
kl.syncLoopMonitor.Store(kl.clock.Now())
if !kl.syncLoopIteration(updates, handler, syncTicker.C, housekeepingTicker.C, plegCh) {
break
}
kl.syncLoopMonitor.Store(kl.clock.Now())
}
}
...
func (kl *Kubelet) syncLoopIteration(configCh <-chan kubetypes.PodUpdate, handler SyncHandler,
syncCh <-chan time.Time, housekeepingCh <-chan time.Time, plegCh <-chan *pleg.PodLifecycleEvent) bool {
select {
...
case e := <-plegCh:
if isSyncPodWorthy(e) {
// PLEG event for a pod; sync it.
if pod, ok := kl.podManager.GetPodByUID(e.ID); ok {
glog.V(2).Infof("SyncLoop (PLEG): %q, event: %#v", format.Pod(pod), e)
handler.HandlePodSyncs([]*v1.Pod{pod})
} else {
// If the pod no longer exists, ignore the event.
glog.V(4).Infof("SyncLoop (PLEG): ignore irrelevant event: %#v", e)
}
}
if e.Type == pleg.ContainerDied {
if containerID, ok := e.Data.(string); ok {
kl.cleanUpContainersInPod(e.ID, containerID)
}
}
...
}
return true
}
```
The Kubelet's main event-processing loop first checks whether there are any runtime errors: if there are, it waits until the runtime is back to normal; if not, it calls syncLoopIteration() to read and process events.
The runtime health check, in turn, depends on GenericPLEG.relist(). But the relist() implementation sends pod lifecycle events to eventChannel, which can deadlock once eventChannel is full:
```go
func (g *GenericPLEG) relist() {
...
// If there are events associated with a pod, we should update the
// podCache.
for pid, events := range eventsByPodID {
...
for i := range events {
// Filter out events that are not reliable and no other components use yet.
if events[i].Type == ContainerChanged {
continue
}
g.eventChannel <- events[i]
}
}
...
}
```
In the code above, eventChannel is not consumed while the runtime is unhealthy. In that situation the channel can fill up, relist() blocks while sending the next pod lifecycle event, and the runtime status is never able to recover.
This is related to issue #45419.
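One possible way to break the cycle (a sketch of the idea only, not necessarily what was merged upstream) is to make the send in relist() non-blocking and drop the event when the channel is full, so relist() keeps making progress and the PLEG health timestamp keeps advancing:

```go
// Sketch: non-blocking send inside relist(). If eventChannel is full, the
// event is dropped (with an error log) instead of blocking the relist
// goroutine forever. The log wording here is illustrative only.
for i := range events {
	// Filter out events that are not reliable and no other components use yet.
	if events[i].Type == ContainerChanged {
		continue
	}
	select {
	case g.eventChannel <- events[i]:
	default:
		glog.Errorf("PLEG event channel is full, discarding event: %#v", events[i])
	}
}
```

Dropping events is lossy, but the kubelet also resyncs pods periodically (the syncCh case in syncLoopIteration), which should limit the impact of a missed event.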
What you expected to happen:
No such issue
How to reproduce it (as minimally and precisely as possible):
It is not easy to reproduce; it happens after a few days of stress testing (scaling a deployment repeatedly under high load).
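That said, the circular dependency itself can be illustrated with a small standalone Go program. This is a toy model only: the tiny channel capacity, the health timestamps, and the timings are made up and do not correspond to the kubelet's real values.

```go
// Toy model of the failure mode: a transient "runtime error" stops the
// consumer, the producer fills the buffered channel and blocks, and the
// health check the consumer relies on can then never recover because it
// depends on the blocked producer making progress.
package main

import (
	"fmt"
	"sync/atomic"
	"time"
)

func main() {
	events := make(chan int, 3) // stands in for GenericPLEG.eventChannel (much larger in the kubelet)
	var lastRelist atomic.Value // stands in for the PLEG "last relist" health timestamp
	lastRelist.Store(time.Now())

	var runtimeBroken int32 = 1 // transient runtime error; clears after 3 seconds
	time.AfterFunc(3*time.Second, func() { atomic.StoreInt32(&runtimeBroken, 0) })

	// Producer: the "relist" loop. It refreshes its health timestamp and then
	// sends an event; the send blocks for good once the buffer is full.
	go func() {
		for i := 0; ; i++ {
			lastRelist.Store(time.Now())
			events <- i
		}
	}()

	// Consumer: the "syncLoop". It skips event processing while the runtime is
	// unhealthy, which includes the case where relist looks stale.
	for {
		plegStale := time.Since(lastRelist.Load().(time.Time)) > 2*time.Second
		if atomic.LoadInt32(&runtimeBroken) == 1 || plegStale {
			fmt.Println("skipping pod synchronization - runtime/PLEG not healthy")
			time.Sleep(time.Second)
			continue // events is never drained here, so the producer never unblocks
		}
		fmt.Println("processed event:", <-events)
	}
}
```

Once the transient error clears, the loop keeps printing the "skipping pod synchronization" line forever, mirroring the stuck kubelet.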
Anything else we need to know?:
Environment:
- Kubernetes version (use `kubectl version`): 1.11.5
- Cloud provider or hardware configuration: Bare metal
- OS (e.g. from /etc/os-release): CentOS 7.4
- Kernel (e.g. `uname -a`): Linux bjc-qtt-k8s-node-010 3.10.0-693.2.2.el7.x86_64 #1 SMP Tue Sep 12 22:26:13 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
- Install tools: kubeadm
- Others:
/kind bug
@Random-Liu @yujuhong Sure, our product team (@denverdino) will provide the stress test data, after removing sensitive information, in follow-up comments.
@BSWANG Yeah, I can see how frequent container state changes can cause this issue. The numbers are very useful, thanks a lot!
@resouer @Random-Liu @yujuhong Thanks for the reply. Here is our test case to reproduce this issue: the test cluster has 5 nodes on Aliyun ECS (similar to EC2), with three 4C8G machines as master nodes and two 96C384G machines as worker nodes. First, we create 2 deployments running on the worker nodes: one is HTTP server pods, the other is siege pods that generate network stress against the HTTP server deployment. These stress-test deployments produce about 50 CPU cores of load on the worker nodes. Then we create a job that creates/deletes 20 pods every 5 seconds. After about 1 day of stress testing, the worker node becomes NotReady. Digging into the pprof goroutine stack trace, we found the channel blocked in `relist`.

When PLEG does relist, can we invoke the `kl.runtimeState.runtimeErrors` func to judge whether the runtime is OK, and skip relist if it is not?
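For reference, a rough sketch of that suggestion (the runtimeErrors callback on GenericPLEG is hypothetical; the real struct has no such field, so this only illustrates the idea):

```go
// Hypothetical sketch: bail out of relist() while the runtime is reporting
// errors, so relist() never blocks on an event channel that nobody drains.
// g.runtimeErrors is an assumed callback wired in from the kubelet's
// runtimeState; it does not exist in the upstream GenericPLEG struct.
func (g *GenericPLEG) relist() {
	if errs := g.runtimeErrors(); len(errs) != 0 {
		glog.Warningf("skipping relist - runtime is not healthy: %v", errs)
		return
	}
	// ... existing relist logic ...
}
```

One caveat: skipping relist also stops the PLEG health timestamp from advancing, so on its own this could still leave PLEG reported as unhealthy; combining it with a non-blocking send (as sketched earlier) seems safer.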