containerd: Containers failing to terminate and complete in containerd 1.1.4 and 1.2
The main issue is that containers get stuck in Terminating, or, in the case of init containers in Kubernetes, they complete but the pod stays in the Init phase. When that happens the pod never transitions to its main container (essentially as if the init container ran forever). For the Terminating problem, the pod just hangs around until it is force deleted.
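For reference, here is a minimal sketch of the kill → wait → delete task lifecycle that containerd's CRI integration drives on pod deletion; this is not the actual kubelet/CRI-plugin code, and the socket path, the "k8s.io" namespace, and the container ID (copied from the kubelet log below) are assumptions for illustration. The point is that if the exit is never observed, the final Delete never happens and the pod sits in Terminating.

// Rough sketch of the kill -> wait -> delete lifecycle driven on pod deletion.
// Not the actual kubelet/CRI-plugin code; socket, namespace, and container ID
// are assumptions for illustration (the ID is taken from the kubelet log below).
package main

import (
	"context"
	"log"
	"syscall"
	"time"

	"github.com/containerd/containerd"
	"github.com/containerd/containerd/namespaces"
)

func main() {
	client, err := containerd.New("/run/containerd/containerd.sock")
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()
	ctx := namespaces.WithNamespace(context.Background(), "k8s.io")

	container, err := client.LoadContainer(ctx, "fb3c811af15cf95ed631baca4b07960c223994449584f75468e80bf1b4f012f4")
	if err != nil {
		log.Fatal(err)
	}
	task, err := container.Task(ctx, nil)
	if err != nil {
		log.Fatal(err)
	}

	// Register for the exit before sending the signal so it cannot be missed.
	exitCh, err := task.Wait(ctx)
	if err != nil {
		log.Fatal(err)
	}
	if err := task.Kill(ctx, syscall.SIGTERM); err != nil {
		log.Fatal(err)
	}
	select {
	case status := <-exitCh:
		log.Printf("exited with status %d", status.ExitCode())
	case <-time.After(30 * time.Second): // the 30-second grace period from the kubelet log
		log.Print("grace period expired, sending SIGKILL")
		_ = task.Kill(ctx, syscall.SIGKILL)
		<-exitCh
	}
	// If this Delete never runs (or the exit above is never delivered), the task
	// lingers in containerd and the pod never leaves Terminating.
	if _, err := task.Delete(ctx); err != nil {
		log.Fatal(err)
	}
}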
An example of a pod that was stuck in Terminating:
Oct 18 13:06:04 kube-dal12-cree61ddf4b2934227a8b166434d7403f8-w2 kubelet.service[1747]: I1018 13:06:04.146167 1747 kubelet.go:1853] SyncLoop (DELETE, "api"): "hubperf-0_default(e6166efd-d2b0-11e8-b9e2-261fec5456d5)"
Oct 18 13:06:04 kube-dal12-cree61ddf4b2934227a8b166434d7403f8-w2 kubelet.service[1747]: I1018 13:06:04.146415 1747 kuberuntime_container.go:553] Killing container "containerd://fb3c811af15cf95ed631baca4b07960c223994449584f75468e80bf1b4f012f4" with 30 second grace period
Oct 18 13:06:04 kube-dal12-cree61ddf4b2934227a8b166434d7403f8-w2 systemd[1]: Starting Logrotate Cronjob...
Oct 18 13:06:04 kube-dal12-cree61ddf4b2934227a8b166434d7403f8-w2 systemd[1]: Started Logrotate Cronjob.
Oct 18 13:06:04 kube-dal12-cree61ddf4b2934227a8b166434d7403f8-w2 kubelet.service[1747]: E1018 13:06:04.373608 1747 upgradeaware.go:310] Error proxying data from client to backend: readfrom tcp 10.185.18.132:31409->10.185.18.132:10010: write tcp 10.185.18.132:31409->10.185.18.132:10010: write: broken pipe
Oct 18 13:06:38 kube-dal12-cree61ddf4b2934227a8b166434d7403f8-w2 kubelet.service[1747]: I1018 13:06:38.523728 1747 kubelet.go:1853] SyncLoop (DELETE, "api"): "hubperf-0_default(e6166efd-d2b0-11e8-b9e2-261fec5456d5)"
It looks like the kubelet can potentially have problems talking to containerd, which might result in this:
Oct 18 13:06:04 kube-dal12-cree61ddf4b2934227a8b166434d7403f8-w2 kubelet.service[1747]: E1018 13:06:04.373608 1747 upgradeaware.go:310] Error proxying data from client to backend: readfrom tcp 10.185.18.132:31409->10.185.18.132:10010: write tcp 10.185.18.132:31409->10.185.18.132:10010: write: broken pipe
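One quick sanity check from the node is a health check against containerd's main gRPC socket. Below is a minimal sketch, assuming the default /run/containerd/containerd.sock path; note the broken pipe above appears to involve the CRI streaming endpoint, so this only rules out the main socket being wedged.

// Minimal health check against containerd's main gRPC socket, assuming the
// default socket path. Only verifies the main socket, not the CRI stream
// server the broken-pipe error above appears to involve.
package main

import (
	"context"
	"fmt"
	"log"
	"time"

	"github.com/containerd/containerd"
)

func main() {
	client, err := containerd.New("/run/containerd/containerd.sock")
	if err != nil {
		log.Fatalf("cannot dial containerd socket: %v", err)
	}
	defer client.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	serving, err := client.IsServing(ctx)
	if err != nil {
		log.Fatalf("health check failed: %v", err)
	}
	fmt.Printf("containerd serving: %v\n", serving)
}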
The other case we saw was an init container that ran fully but stayed stuck in Init:0/1. I suspect containerd never fully cleaned up the init container even though it exited, and therefore the pod stayed in this state.
kube-system ibm-master-proxy-kcq4x 0/1 Init:0/1 0 15h 10.116.131.16 10.116.131.16 <none>
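Here is a minimal sketch, assuming the default containerd socket and the "k8s.io" CRI namespace, that lists containerd containers and their task states; an init container that has exited but whose task is still present (e.g. reported as stopped) would be consistent with the suspicion above.

// Lists containerd containers in the CRI ("k8s.io") namespace with their task
// state, assuming the default socket path. An init container that exited but
// whose task is still present/stopped would match the suspicion above.
package main

import (
	"context"
	"fmt"
	"log"

	"github.com/containerd/containerd"
	"github.com/containerd/containerd/errdefs"
	"github.com/containerd/containerd/namespaces"
)

func main() {
	client, err := containerd.New("/run/containerd/containerd.sock")
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()
	ctx := namespaces.WithNamespace(context.Background(), "k8s.io")

	containers, err := client.Containers(ctx)
	if err != nil {
		log.Fatal(err)
	}
	for _, c := range containers {
		task, err := c.Task(ctx, nil)
		if errdefs.IsNotFound(err) {
			fmt.Printf("%s: no task\n", c.ID())
			continue
		} else if err != nil {
			log.Fatal(err)
		}
		status, err := task.Status(ctx)
		if err != nil {
			log.Fatal(err)
		}
		fmt.Printf("%s: %s (exit status %d)\n", c.ID(), status.Status, status.ExitStatus)
	}
}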
The YAML for the pod is the following:
apiVersion: extensions/v1beta1
kind: DaemonSet
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"apps/v1","kind":"DaemonSet","metadata":{"annotations":{},"labels":{"k8s-app":"ibm-master-proxy"},"name":"ibm-master-proxy","namespace":"kube-system"},"spec":{"selector":{"matchLabels":{"k8s-app":"ibm-master-proxy"}},"template":{"metadata":{"annotations":{"scheduler.alpha.kubernetes.io/critical-pod":"","scheduler.alpha.kubernetes.io/tolerations":"[{\"key\":\"CriticalAddonsOnly\", \"operator\":\"Exists\"}]"},"labels":{"k8s-app":"ibm-master-proxy"}},"spec":{"affinity":{"nodeAffinity":{"requiredDuringSchedulingIgnoredDuringExecution":{"nodeSelectorTerms":[{"matchExpressions":[{"key":"ibm-cloud.kubernetes.io/ha-worker","operator":"DoesNotExist"}]}]}}},"containers":[{"command":["/bin/sh","-c","mkdir /cache \u0026\u0026 cp /scripts/applyinterfaces.sh /cache \u0026\u0026 chmod +x /cache/applyinterfaces.sh \u0026\u0026 /cache/applyinterfaces.sh \u0026\u0026 /sbin/syslogd -O /proc/1/fd/1 \u0026\u0026 haproxy -f /usr/local/etc/haproxy/haproxy.cfg -V -dR"],"image":"registry.ng.bluemix.net/armada-master/haproxy:1.8.12-alpine","imagePullPolicy":"IfNotPresent","livenessProbe":{"failureThreshold":6,"httpGet":{"host":"172.20.0.1","path":"/healthz","port":2040,"scheme":"HTTPS"},"initialDelaySeconds":10,"periodSeconds":60,"successThreshold":1,"timeoutSeconds":10},"name":"ibm-master-proxy","ports":[{"containerPort":2040,"hostPort":2040,"name":"apiserver","protocol":"TCP"},{"containerPort":2041,"hostPort":2041,"name":"etcd","protocol":"TCP"}],"resources":{"limits":{"cpu":"300m","memory":"512M"},"requests":{"cpu":"25m","memory":"32M"}},"securityContext":{"privileged":true},"volumeMounts":[{"mountPath":"/usr/local/etc/haproxy","name":"etc-config","readOnly":true},{"mountPath":"/scripts","name":"ibm-network-interfaces"},{"mountPath":"/host","name":"host-path"}]}],"hostNetwork":true,"priorityClassName":"system-cluster-critical","tolerations":[{"operator":"Exists"}],"volumes":[{"configMap":{"name":"ibm-master-proxy-config"},"name":"etc-config"},{"configMap":{"name":"ibm-network-interfaces"},"name":"ibm-network-interfaces"},{"hostPath":{"path":"/"},"name":"host-path"}]}}}}
  creationTimestamp: 2018-10-24T18:26:10Z
  generation: 1
  labels:
    k8s-app: ibm-master-proxy
  name: ibm-master-proxy
  namespace: kube-system
  resourceVersion: "79560"
  selfLink: /apis/extensions/v1beta1/namespaces/kube-system/daemonsets/ibm-master-proxy
  uid: 47a5e5ae-d7ba-11e8-9128-ca4f5830ba32
spec:
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      k8s-app: ibm-master-proxy
  template:
    metadata:
      annotations:
        scheduler.alpha.kubernetes.io/critical-pod: ""
        scheduler.alpha.kubernetes.io/tolerations: '[{"key":"CriticalAddonsOnly", "operator":"Exists"}]'
      creationTimestamp: null
      labels:
        k8s-app: ibm-master-proxy
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: ibm-cloud.kubernetes.io/ha-worker
                operator: DoesNotExist
      containers:
      - command:
        - /bin/sh
        - -c
        - mkdir /cache && cp /scripts/applyinterfaces.sh /cache && chmod +x /cache/applyinterfaces.sh && /cache/applyinterfaces.sh && /sbin/syslogd -O /proc/1/fd/1 && haproxy -f /usr/local/etc/haproxy/haproxy.cfg -V -dR
        image: registry.ng.bluemix.net/armada-master/haproxy:1.8.12-alpine
        imagePullPolicy: IfNotPresent
        livenessProbe:
          failureThreshold: 6
          httpGet:
            host: 172.20.0.1
            path: /healthz
            port: 2040
            scheme: HTTPS
          initialDelaySeconds: 10
          periodSeconds: 60
          successThreshold: 1
          timeoutSeconds: 10
        name: ibm-master-proxy
        ports:
        - containerPort: 2040
          hostPort: 2040
          name: apiserver
          protocol: TCP
        - containerPort: 2041
          hostPort: 2041
          name: etcd
          protocol: TCP
        resources:
          limits:
            cpu: 300m
            memory: 512M
          requests:
            cpu: 25m
            memory: 32M
        securityContext:
          privileged: true
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /usr/local/etc/haproxy
          name: etc-config
          readOnly: true
        - mountPath: /scripts
          name: ibm-network-interfaces
        - mountPath: /host
          name: host-path
      dnsPolicy: ClusterFirst
      hostNetwork: true
      priorityClassName: system-cluster-critical
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      terminationGracePeriodSeconds: 30
      tolerations:
      - operator: Exists
      volumes:
      - configMap:
          defaultMode: 420
          name: ibm-master-proxy-config
        name: etc-config
      - configMap:
          defaultMode: 420
          name: ibm-network-interfaces
        name: ibm-network-interfaces
      - hostPath:
          path: /
          type: ""
        name: host-path
  templateGeneration: 1
  updateStrategy:
    rollingUpdate:
      maxUnavailable: 1
    type: RollingUpdate
About this issue
- Original URL
- State: closed
- Created 6 years ago
- Comments: 58 (22 by maintainers)
Commits related to this issue
- Use containerd-1.2.1 This is temporary change till a new docker-ce/containerd.io package versions comes out. There's critical bug in containerd 1.2.0: containerd/containerd#2744 — committed to ivan4th/kubeadm-dind-cluster by deleted user 6 years ago
I confirm that Virtlet works just fine after the changes with the pod terminating correctly when Virtlet is removed from a node. Verified with 1.2.1 release.
Thanks folks! Will try containerd master / release-1.2 with Virtlet when the changes land.
@Random-Liu wow…that was a pretty important and embarrassing miss in my work 😃 @relyt0925 I can generate a new build with that included so you can test again.
I’ll also create a cherrypick PR for the release/1.2 branch.
@estesp Have you cherry-picked https://github.com/containerd/containerd/pull/2743? That is one of the most important fixes.