containerd: Containers failing to terminate or complete in containerd 1.1.4 and 1.2

The main issue is that containers get stuck in Terminating, or, in the case of Kubernetes init containers, the container completes but the pod stays in the Init phase. When that happens the pod never transitions to its main container (effectively as if the init container ran forever). For the Terminating problem, the pod just hangs around until it is force deleted.

An example from the kubelet logs of a pod that was stuck in Terminating:

Oct 18 13:06:04 kube-dal12-cree61ddf4b2934227a8b166434d7403f8-w2 kubelet.service[1747]: I1018 13:06:04.146167    1747 kubelet.go:1853] SyncLoop (DELETE, "api"): "hubperf-0_default(e6166efd-d2b0-11e8-b9e2-261fec5456d5)"
Oct 18 13:06:04 kube-dal12-cree61ddf4b2934227a8b166434d7403f8-w2 kubelet.service[1747]: I1018 13:06:04.146415    1747 kuberuntime_container.go:553] Killing container "containerd://fb3c811af15cf95ed631baca4b07960c223994449584f75468e80bf1b4f012f4" with 30 second grace period
Oct 18 13:06:04 kube-dal12-cree61ddf4b2934227a8b166434d7403f8-w2 systemd[1]: Starting Logrotate Cronjob...
Oct 18 13:06:04 kube-dal12-cree61ddf4b2934227a8b166434d7403f8-w2 systemd[1]: Started Logrotate Cronjob.
Oct 18 13:06:04 kube-dal12-cree61ddf4b2934227a8b166434d7403f8-w2 kubelet.service[1747]: E1018 13:06:04.373608    1747 upgradeaware.go:310] Error proxying data from client to backend: readfrom tcp 10.185.18.132:31409->10.185.18.132:10010: write tcp 10.185.18.132:31409->10.185.18.132:10010: write: broken pipe
Oct 18 13:06:38 kube-dal12-cree61ddf4b2934227a8b166434d7403f8-w2 kubelet.service[1747]: I1018 13:06:38.523728    1747 kubelet.go:1853] SyncLoop (DELETE, "api"): "hubperf-0_default(e6166efd-d2b0-11e8-b9e2-261fec5456d5)"

It looks like the kubelet may be having trouble talking to containerd, which might be what leads to this:

Oct 18 13:06:04 kube-dal12-cree61ddf4b2934227a8b166434d7403f8-w2 kubelet.service[1747]: E1018 13:06:04.373608    1747 upgradeaware.go:310] Error proxying data from client to backend: readfrom tcp 10.185.18.132:31409->10.185.18.132:10010: write tcp 10.185.18.132:31409->10.185.18.132:10010: write: broken pipe
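Not part of the original logs, but for anyone debugging this: below is a minimal sketch of checking, directly against containerd, what state it believes that task is in. It assumes the default containerd socket path and the CRI plugin's "k8s.io" namespace; the container ID is the one kubelet printed above.

// taskstatus.go: minimal sketch for checking what containerd itself reports
// for a container kubelet thinks it asked to stop. Assumes the default
// containerd socket and the CRI plugin's "k8s.io" namespace.
package main

import (
	"context"
	"fmt"
	"log"

	"github.com/containerd/containerd"
	"github.com/containerd/containerd/namespaces"
)

func main() {
	client, err := containerd.New("/run/containerd/containerd.sock")
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()

	// CRI-managed containers live in the "k8s.io" namespace.
	ctx := namespaces.WithNamespace(context.Background(), "k8s.io")

	// ID taken from the kubelet log above.
	id := "fb3c811af15cf95ed631baca4b07960c223994449584f75468e80bf1b4f012f4"

	container, err := client.LoadContainer(ctx, id)
	if err != nil {
		log.Fatalf("load container: %v", err)
	}

	task, err := container.Task(ctx, nil)
	if err != nil {
		// No task at all means containerd already cleaned the process up.
		log.Fatalf("load task: %v", err)
	}

	status, err := task.Status(ctx)
	if err != nil {
		log.Fatalf("task status: %v", err)
	}

	// If this prints "stopped" while the pod is still Terminating, the exit
	// apparently never made it back to kubelet; if it prints "running", the
	// kill from the 30-second grace period never took effect.
	fmt.Printf("task status: %s (exit code %d)\n", status.Status, status.ExitStatus)
}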

The other case we saw was an init container that ran to completion but stayed stuck in Init:0/1. I suspect containerd never fully reaped the init container even though the container process exited, and therefore the pod stayed in this state.

kube-system     ibm-master-proxy-kcq4x                                            0/1       Init:0/1            0          15h       10.116.131.16    10.116.131.16   <none>
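To check that suspicion, and to unstick the pod by hand while debugging, something along these lines can be used. Again, this is only a sketch: the container ID is a placeholder to be filled in from kubelet/crictl output, and the socket path and namespace are the containerd/CRI defaults.

// cleanup.go: minimal sketch for confirming that an init container's task has
// exited but was never deleted, and for removing it by hand so the pod can
// move past Init. The container ID is a placeholder.
package main

import (
	"context"
	"fmt"
	"log"

	"github.com/containerd/containerd"
	"github.com/containerd/containerd/namespaces"
)

func main() {
	client, err := containerd.New("/run/containerd/containerd.sock")
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()

	ctx := namespaces.WithNamespace(context.Background(), "k8s.io")

	// Placeholder: the init container's ID as reported by kubelet/crictl.
	container, err := client.LoadContainer(ctx, "<init-container-id>")
	if err != nil {
		log.Fatalf("load container: %v", err)
	}

	task, err := container.Task(ctx, nil)
	if err != nil {
		log.Fatalf("load task: %v", err)
	}

	status, err := task.Status(ctx)
	if err != nil {
		log.Fatalf("task status: %v", err)
	}
	fmt.Printf("task status before cleanup: %s\n", status.Status)

	// Delete the leftover task, killing the process first if it is somehow
	// still alive. After this, kubelet should be able to make progress with
	// the pod again.
	exit, err := task.Delete(ctx, containerd.WithProcessKill)
	if err != nil {
		log.Fatalf("delete task: %v", err)
	}
	fmt.Printf("task deleted, exit code %d\n", exit.ExitCode())
}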

The YAML for the pod's DaemonSet is the following:

apiVersion: extensions/v1beta1
kind: DaemonSet
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"apps/v1","kind":"DaemonSet","metadata":{"annotations":{},"labels":{"k8s-app":"ibm-master-proxy"},"name":"ibm-master-proxy","namespace":"kube-system"},"spec":{"selector":{"matchLabels":{"k8s-app":"ibm-master-proxy"}},"template":{"metadata":{"annotations":{"scheduler.alpha.kubernetes.io/critical-pod":"","scheduler.alpha.kubernetes.io/tolerations":"[{\"key\":\"CriticalAddonsOnly\", \"operator\":\"Exists\"}]"},"labels":{"k8s-app":"ibm-master-proxy"}},"spec":{"affinity":{"nodeAffinity":{"requiredDuringSchedulingIgnoredDuringExecution":{"nodeSelectorTerms":[{"matchExpressions":[{"key":"ibm-cloud.kubernetes.io/ha-worker","operator":"DoesNotExist"}]}]}}},"containers":[{"command":["/bin/sh","-c","mkdir /cache \u0026\u0026 cp /scripts/applyinterfaces.sh /cache \u0026\u0026 chmod +x /cache/applyinterfaces.sh \u0026\u0026 /cache/applyinterfaces.sh \u0026\u0026 /sbin/syslogd -O /proc/1/fd/1  \u0026\u0026 haproxy -f /usr/local/etc/haproxy/haproxy.cfg -V -dR"],"image":"registry.ng.bluemix.net/armada-master/haproxy:1.8.12-alpine","imagePullPolicy":"IfNotPresent","livenessProbe":{"failureThreshold":6,"httpGet":{"host":"172.20.0.1","path":"/healthz","port":2040,"scheme":"HTTPS"},"initialDelaySeconds":10,"periodSeconds":60,"successThreshold":1,"timeoutSeconds":10},"name":"ibm-master-proxy","ports":[{"containerPort":2040,"hostPort":2040,"name":"apiserver","protocol":"TCP"},{"containerPort":2041,"hostPort":2041,"name":"etcd","protocol":"TCP"}],"resources":{"limits":{"cpu":"300m","memory":"512M"},"requests":{"cpu":"25m","memory":"32M"}},"securityContext":{"privileged":true},"volumeMounts":[{"mountPath":"/usr/local/etc/haproxy","name":"etc-config","readOnly":true},{"mountPath":"/scripts","name":"ibm-network-interfaces"},{"mountPath":"/host","name":"host-path"}]}],"hostNetwork":true,"priorityClassName":"system-cluster-critical","tolerations":[{"operator":"Exists"}],"volumes":[{"configMap":{"name":"ibm-master-proxy-config"},"name":"etc-config"},{"configMap":{"name":"ibm-network-interfaces"},"name":"ibm-network-interfaces"},{"hostPath":{"path":"/"},"name":"host-path"}]}}}}
  creationTimestamp: 2018-10-24T18:26:10Z
  generation: 1
  labels:
    k8s-app: ibm-master-proxy
  name: ibm-master-proxy
  namespace: kube-system
  resourceVersion: "79560"
  selfLink: /apis/extensions/v1beta1/namespaces/kube-system/daemonsets/ibm-master-proxy
  uid: 47a5e5ae-d7ba-11e8-9128-ca4f5830ba32
spec:
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      k8s-app: ibm-master-proxy
  template:
    metadata:
      annotations:
        scheduler.alpha.kubernetes.io/critical-pod: ""
        scheduler.alpha.kubernetes.io/tolerations: '[{"key":"CriticalAddonsOnly",
          "operator":"Exists"}]'
      creationTimestamp: null
      labels:
        k8s-app: ibm-master-proxy
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: ibm-cloud.kubernetes.io/ha-worker
                operator: DoesNotExist
      containers:
      - command:
        - /bin/sh
        - -c
        - mkdir /cache && cp /scripts/applyinterfaces.sh /cache && chmod +x /cache/applyinterfaces.sh
          && /cache/applyinterfaces.sh && /sbin/syslogd -O /proc/1/fd/1  && haproxy
          -f /usr/local/etc/haproxy/haproxy.cfg -V -dR
        image: registry.ng.bluemix.net/armada-master/haproxy:1.8.12-alpine
        imagePullPolicy: IfNotPresent
        livenessProbe:
          failureThreshold: 6
          httpGet:
            host: 172.20.0.1
            path: /healthz
            port: 2040
            scheme: HTTPS
          initialDelaySeconds: 10
          periodSeconds: 60
          successThreshold: 1
          timeoutSeconds: 10
        name: ibm-master-proxy
        ports:
        - containerPort: 2040
          hostPort: 2040
          name: apiserver
          protocol: TCP
        - containerPort: 2041
          hostPort: 2041
          name: etcd
          protocol: TCP
        resources:
          limits:
            cpu: 300m
            memory: 512M
          requests:
            cpu: 25m
            memory: 32M
        securityContext:
          privileged: true
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /usr/local/etc/haproxy
          name: etc-config
          readOnly: true
        - mountPath: /scripts
          name: ibm-network-interfaces
        - mountPath: /host
          name: host-path
      dnsPolicy: ClusterFirst
      hostNetwork: true
      priorityClassName: system-cluster-critical
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      terminationGracePeriodSeconds: 30
      tolerations:
      - operator: Exists
      volumes:
      - configMap:
          defaultMode: 420
          name: ibm-master-proxy-config
        name: etc-config
      - configMap:
          defaultMode: 420
          name: ibm-network-interfaces
        name: ibm-network-interfaces
      - hostPath:
          path: /
          type: ""
        name: host-path
  templateGeneration: 1
  updateStrategy:
    rollingUpdate:
      maxUnavailable: 1
    type: RollingUpdate

About this issue

  • State: closed
  • Created 6 years ago
  • Comments: 58 (22 by maintainers)

Most upvoted comments

I confirm that Virtlet works just fine after the changes with the pod terminating correctly when Virtlet is removed from a node. Verified with 1.2.1 release.

Thanks folks! Will try containerd master / release-1.2 with Virtlet when the changes land.

@Random-Liu wow…that was a pretty important and embarrassing miss in my work 😃 @relyt0925 I can generate a new build with that included so you can test again.

I’ll also create a cherry-pick PR for the release/1.2 branch.

@estesp Have you cherry-picked https://github.com/containerd/containerd/pull/2743? That is one of the most important fixes.