kubernetes: Delete StatefulSet and re-create immediately, pod stuck in "ContainerCreating" because detach/attach fails
What happened: I found that a StatefulSet cannot be re-created successfully in our cluster when its pods are scheduled on the same node.
To verify, I tested in my own single-node cluster. I created a StatefulSet from this yaml:
apiVersion: apps/v1beta1
kind: StatefulSet
metadata:
  name: web
spec:
  serviceName: "nginx"
  replicas: 3
  template:
    metadata:
      labels:
        app: nginx
    spec:
      terminationGracePeriodSeconds: 10
      containers:
      - name: nginx
        image: nginx:latest
        ports:
        - containerPort: 80
          name: web
        volumeMounts:
        - name: www
          mountPath: /usr/share/nginx/html
        - name: www2
          mountPath: /mnt
  updateStrategy:
    rollingUpdate:
      partition: 0
    type: RollingUpdate
  volumeClaimTemplates:
  - metadata:
      creationTimestamp: null
      name: www
    spec:
      accessModes:
      - ReadWriteOnce
      resources:
        requests:
          storage: 200Gi
      storageClassName: cbs
  - metadata:
      creationTimestamp: null
      name: www2
    spec:
      accessModes:
      - ReadWriteOnce
      resources:
        requests:
          storage: 200Gi
      storageClassName: cbs
I created this StatefulSet, waited until all pods were Running, deleted it, and then re-created it immediately.
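The steps, roughly, as kubectl commands (a sketch; web-statefulset.yaml is just a placeholder name for the manifest above):

# Create the StatefulSet and wait until web-0..web-2 are Running.
kubectl apply -f web-statefulset.yaml
kubectl get pods -l app=nginx -w

# Delete it and re-create it immediately, before the volumes are detached from the node.
kubectl delete statefulset web
kubectl apply -f web-statefulset.yaml

# web-0 then stays in ContainerCreating.
kubectl get pods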
What you expected to happen: The re-created StatefulSet pods attach their volumes and become Running.
How to reproduce it (as minimally and precisely as possible): Like described above.
Anything else we need to know?:
kubelet logs:
-- Logs begin at Tue 2018-12-18 10:55:32 CST. --
Dec 19 10:26:12 VM-0-10-ubuntu kubelet[1263]: E1219 10:26:12.512035 1263 nestedpendingoperations.go:264] Operation for "\"kubernetes.io/qcloud-cbs/disk-83vrk4qu\"" failed. No retries permitted until 2018-12-19 10:26:44.51200496 +0800 CST (durationBeforeRetry 32s). Error: Volume not attached according to node status for volume "pvc-e6f8e924-0272-11e9-80d6-5254000a6bf1" (UniqueName: "kubernetes.io/qcloud-cbs/disk-83vrk4qu") pod "web-0" (UID: "6107fdfb-0335-11e9-9f33-5254000a6bf1")
Dec 19 10:26:33 VM-0-10-ubuntu kubelet[1263]: I1219 10:26:33.863387 1263 server.go:779] GET /healthz: (15.459µs) 200 [[curl/7.47.0] 127.0.0.1:40514]
Dec 19 10:26:44 VM-0-10-ubuntu kubelet[1263]: I1219 10:26:44.566638 1263 reconciler.go:212] operationExecutor.VerifyControllerAttachedVolume started for volume "pvc-e6f8e924-0272-11e9-80d6-5254000a6bf1" (UniqueName: "kubernetes.io/qcloud-cbs/disk-83vrk4qu") pod "web-0" (UID: "6107fdfb-0335-11e9-9f33-5254000a6bf1")
Dec 19 10:26:44 VM-0-10-ubuntu kubelet[1263]: I1219 10:26:44.566701 1263 reconciler.go:212] operationExecutor.VerifyControllerAttachedVolume started for volume "pvc-412b3b8d-0280-11e9-80d6-5254000a6bf1" (UniqueName: "kubernetes.io/qcloud-cbs/disk-3xvpivsy") pod "web-0" (UID: "6107fdfb-0335-11e9-9f33-5254000a6bf1")
Dec 19 10:26:44 VM-0-10-ubuntu kubelet[1263]: E1219 10:26:44.571522 1263 nestedpendingoperations.go:264] Operation for "\"kubernetes.io/qcloud-cbs/disk-3xvpivsy\"" failed. No retries permitted until 2018-12-19 10:27:48.571497481 +0800 CST (durationBeforeRetry 1m4s). Error: Volume not attached according to node status for volume "pvc-412b3b8d-0280-11e9-80d6-5254000a6bf1" (UniqueName: "kubernetes.io/qcloud-cbs/disk-3xvpivsy") pod "web-0" (UID: "6107fdfb-0335-11e9-9f33-5254000a6bf1")
Dec 19 10:26:44 VM-0-10-ubuntu kubelet[1263]: E1219 10:26:44.571680 1263 nestedpendingoperations.go:264] Operation for "\"kubernetes.io/qcloud-cbs/disk-83vrk4qu\"" failed. No retries permitted until 2018-12-19 10:27:48.571661193 +0800 CST (durationBeforeRetry 1m4s). Error: Volume not attached according to node status for volume "pvc-e6f8e924-0272-11e9-80d6-5254000a6bf1" (UniqueName: "kubernetes.io/qcloud-cbs/disk-83vrk4qu") pod "web-0" (UID: "6107fdfb-0335-11e9-9f33-5254000a6bf1")
Dec 19 10:27:03 VM-0-10-ubuntu kubelet[1263]: I1219 10:27:03.869685 1263 server.go:779] GET /healthz: (16.66µs) 200 [[curl/7.47.0] 127.0.0.1:40590]
Dec 19 10:27:12 VM-0-10-ubuntu kubelet[1263]: I1219 10:27:12.400031 1263 qos_container_manager_linux.go:320] [ContainerManager]: Updated QoS cgroup configuration
Dec 19 10:27:33 VM-0-10-ubuntu kubelet[1263]: I1219 10:27:33.471777 1263 fs.go:603] killing cmd [find /var/lib/docker/aufs/diff/f65500eaf2b7d0abc7e9e511640d7ad8cc43b3c151894c0ee3a877a4554a2287 -xdev -printf .] due to timeout(2m0s)
Dec 19 10:27:33 VM-0-10-ubuntu kubelet[1263]: I1219 10:27:33.875472 1263 server.go:779] GET /healthz: (17.043µs) 200 [[curl/7.47.0] 127.0.0.1:40690]
Dec 19 10:27:43 VM-0-10-ubuntu kubelet[1263]: E1219 10:27:43.764296 1263 kubelet.go:1628] Unable to mount volumes for pod "web-0_default(6107fdfb-0335-11e9-9f33-5254000a6bf1)": timeout expired waiting for volumes to attach/mount for pod "default"/"web-0". list of unattached/unmounted volumes=[www www2]; skipping pod
Dec 19 10:27:43 VM-0-10-ubuntu kubelet[1263]: E1219 10:27:43.764458 1263 pod_workers.go:182] Error syncing pod 6107fdfb-0335-11e9-9f33-5254000a6bf1 ("web-0_default(6107fdfb-0335-11e9-9f33-5254000a6bf1)"), skipping: timeout expired waiting for volumes to attach/mount for pod "default"/"web-0". list of unattached/unmounted volumes=[www www2]
Dec 19 10:27:48 VM-0-10-ubuntu kubelet[1263]: I1219 10:27:48.610555 1263 reconciler.go:212] operationExecutor.VerifyControllerAttachedVolume started for volume "pvc-412b3b8d-0280-11e9-80d6-5254000a6bf1" (UniqueName: "kubernetes.io/qcloud-cbs/disk-3xvpivsy") pod "web-0" (UID: "6107fdfb-0335-11e9-9f33-5254000a6bf1")
Dec 19 10:27:48 VM-0-10-ubuntu kubelet[1263]: I1219 10:27:48.610596 1263 reconciler.go:212] operationExecutor.VerifyControllerAttachedVolume started for volume "pvc-e6f8e924-0272-11e9-80d6-5254000a6bf1" (UniqueName: "kubernetes.io/qcloud-cbs/disk-83vrk4qu") pod "web-0" (UID: "6107fdfb-0335-11e9-9f33-5254000a6bf1")
Dec 19 10:27:48 VM-0-10-ubuntu kubelet[1263]: E1219 10:27:48.615388 1263 nestedpendingoperations.go:264] Operation for "\"kubernetes.io/qcloud-cbs/disk-83vrk4qu\"" failed. No retries permitted until 2018-12-19 10:29:50.615369456 +0800 CST (durationBeforeRetry 2m2s). Error: Volume not attached according to node status for volume "pvc-e6f8e924-0272-11e9-80d6-5254000a6bf1" (UniqueName: "kubernetes.io/qcloud-cbs/disk-83vrk4qu") pod "web-0" (UID: "6107fdfb-0335-11e9-9f33-5254000a6bf1")
Dec 19 10:27:48 VM-0-10-ubuntu kubelet[1263]: E1219 10:27:48.615693 1263 nestedpendingoperations.go:264] Operation for "\"kubernetes.io/qcloud-cbs/disk-3xvpivsy\"" failed. No retries permitted until 2018-12-19 10:29:50.615678746 +0800 CST (durationBeforeRetry 2m2s). Error: Volume not attached according to node status for volume "pvc-412b3b8d-0280-11e9-80d6-5254000a6bf1" (UniqueName: "kubernetes.io/qcloud-cbs/disk-3xvpivsy") pod "web-0" (UID: "6107fdfb-0335-11e9-9f33-5254000a6bf1")
Some kube-controller-manager logs:
[root@VM_128_5_centos ~]# journalctl -r -u kube-controller-manager | grep reconciler.go | head -20
Dec 19 10:25:38 VM_128_5_centos kube-controller-manager[20402]: I1219 10:25:38.861740 20402 reconciler.go:231] attacherDetacher.DetachVolume started for volume "pvc-412b3b8d-0280-11e9-80d6-5254000a6bf1" (UniqueName: "kubernetes.io/qcloud-cbs/disk-3xvpivsy") on node "10.0.0.10"
Dec 19 10:25:38 VM_128_5_centos kube-controller-manager[20402]: I1219 10:25:38.861700 20402 reconciler.go:226] Starting attacherDetacher.DetachVolume for volume "pvc-412b3b8d-0280-11e9-80d6-5254000a6bf1" (UniqueName: "kubernetes.io/qcloud-cbs/disk-3xvpivsy") on node "10.0.0.10"
Dec 19 10:25:38 VM_128_5_centos kube-controller-manager[20402]: I1219 10:25:38.861672 20402 reconciler.go:213] RemoveVolumeFromReportAsAttached failed while removing volume "kubernetes.io/qcloud-cbs/disk-3xvpivsy" from node "10.0.0.10" with: volume "kubernetes.io/qcloud-cbs/disk-3xvpivsy" does not exist in volumesToReportAsAttached list or node "10.0.0.10" does not exist in nodesToUpdateStatusFor list
Dec 19 10:25:38 VM_128_5_centos kube-controller-manager[20402]: I1219 10:25:38.861655 20402 reconciler.go:231] attacherDetacher.DetachVolume started for volume "pvc-e6f8e924-0272-11e9-80d6-5254000a6bf1" (UniqueName: "kubernetes.io/qcloud-cbs/disk-83vrk4qu") on node "10.0.0.10"
Dec 19 10:25:38 VM_128_5_centos kube-controller-manager[20402]: I1219 10:25:38.861604 20402 reconciler.go:226] Starting attacherDetacher.DetachVolume for volume "pvc-e6f8e924-0272-11e9-80d6-5254000a6bf1" (UniqueName: "kubernetes.io/qcloud-cbs/disk-83vrk4qu") on node "10.0.0.10"
Dec 19 10:25:38 VM_128_5_centos kube-controller-manager[20402]: I1219 10:25:38.861546 20402 reconciler.go:213] RemoveVolumeFromReportAsAttached failed while removing volume "kubernetes.io/qcloud-cbs/disk-83vrk4qu" from node "10.0.0.10" with: volume "kubernetes.io/qcloud-cbs/disk-83vrk4qu" does not exist in volumesToReportAsAttached list or node "10.0.0.10" does not exist in nodesToUpdateStatusFor list
Dec 19 10:25:38 VM_128_5_centos kube-controller-manager[20402]: I1219 10:25:38.761372 20402 reconciler.go:226] Starting attacherDetacher.DetachVolume for volume "pvc-e6f8e924-0272-11e9-80d6-5254000a6bf1" (UniqueName: "kubernetes.io/qcloud-cbs/disk-83vrk4qu") on node "10.0.0.10"
Dec 19 10:25:38 VM_128_5_centos kube-controller-manager[20402]: I1219 10:25:38.761361 20402 reconciler.go:213] RemoveVolumeFromReportAsAttached failed while removing volume "kubernetes.io/qcloud-cbs/disk-83vrk4qu" from node "10.0.0.10" with: volume "kubernetes.io/qcloud-cbs/disk-83vrk4qu" does not exist in volumesToReportAsAttached list or node "10.0.0.10" does not exist in nodesToUpdateStatusFor list
Dec 19 10:25:38 VM_128_5_centos kube-controller-manager[20402]: I1219 10:25:38.761350 20402 reconciler.go:231] attacherDetacher.DetachVolume started for volume "pvc-f8d6fb09-0272-11e9-80d6-5254000a6bf1" (UniqueName: "kubernetes.io/qcloud-cbs/disk-2dl526jy") on node "10.0.0.10"
Dec 19 10:25:38 VM_128_5_centos kube-controller-manager[20402]: I1219 10:25:38.761334 20402 reconciler.go:226] Starting attacherDetacher.DetachVolume for volume "pvc-f8d6fb09-0272-11e9-80d6-5254000a6bf1" (UniqueName: "kubernetes.io/qcloud-cbs/disk-2dl526jy") on node "10.0.0.10"
Dec 19 10:25:38 VM_128_5_centos kube-controller-manager[20402]: I1219 10:25:38.761320 20402 reconciler.go:213] RemoveVolumeFromReportAsAttached failed while removing volume "kubernetes.io/qcloud-cbs/disk-2dl526jy" from node "10.0.0.10" with: volume "kubernetes.io/qcloud-cbs/disk-2dl526jy" does not exist in volumesToReportAsAttached list or node "10.0.0.10" does not exist in nodesToUpdateStatusFor list
Dec 19 10:25:38 VM_128_5_centos kube-controller-manager[20402]: I1219 10:25:38.761284 20402 reconciler.go:226] Starting attacherDetacher.DetachVolume for volume "pvc-412b3b8d-0280-11e9-80d6-5254000a6bf1" (UniqueName: "kubernetes.io/qcloud-cbs/disk-3xvpivsy") on node "10.0.0.10"
Dec 19 10:25:38 VM_128_5_centos kube-controller-manager[20402]: I1219 10:25:38.761237 20402 reconciler.go:213] RemoveVolumeFromReportAsAttached failed while removing volume "kubernetes.io/qcloud-cbs/disk-3xvpivsy" from node "10.0.0.10" with: volume "kubernetes.io/qcloud-cbs/disk-3xvpivsy" does not exist in volumesToReportAsAttached list or node "10.0.0.10" does not exist in nodesToUpdateStatusFor list
Dec 19 10:25:38 VM_128_5_centos kube-controller-manager[20402]: I1219 10:25:38.661091 20402 reconciler.go:226] Starting attacherDetacher.DetachVolume for volume "pvc-f8d6fb09-0272-11e9-80d6-5254000a6bf1" (UniqueName: "kubernetes.io/qcloud-cbs/disk-2dl526jy") on node "10.0.0.10"
Dec 19 10:25:38 VM_128_5_centos kube-controller-manager[20402]: I1219 10:25:38.661074 20402 reconciler.go:213] RemoveVolumeFromReportAsAttached failed while removing volume "kubernetes.io/qcloud-cbs/disk-2dl526jy" from node "10.0.0.10" with: volume "kubernetes.io/qcloud-cbs/disk-2dl526jy" does not exist in volumesToReportAsAttached list or node "10.0.0.10" does not exist in nodesToUpdateStatusFor list
Dec 19 10:25:38 VM_128_5_centos kube-controller-manager[20402]: I1219 10:25:38.661052 20402 reconciler.go:226] Starting attacherDetacher.DetachVolume for volume "pvc-412b3b8d-0280-11e9-80d6-5254000a6bf1" (UniqueName: "kubernetes.io/qcloud-cbs/disk-3xvpivsy") on node "10.0.0.10"
Dec 19 10:25:38 VM_128_5_centos kube-controller-manager[20402]: I1219 10:25:38.661031 20402 reconciler.go:213] RemoveVolumeFromReportAsAttached failed while removing volume "kubernetes.io/qcloud-cbs/disk-3xvpivsy" from node "10.0.0.10" with: volume "kubernetes.io/qcloud-cbs/disk-3xvpivsy" does not exist in volumesToReportAsAttached list or node "10.0.0.10" does not exist in nodesToUpdateStatusFor list
Dec 19 10:25:38 VM_128_5_centos kube-controller-manager[20402]: I1219 10:25:38.660992 20402 reconciler.go:226] Starting attacherDetacher.DetachVolume for volume "pvc-e6f8e924-0272-11e9-80d6-5254000a6bf1" (UniqueName: "kubernetes.io/qcloud-cbs/disk-83vrk4qu") on node "10.0.0.10"
Dec 19 10:25:38 VM_128_5_centos kube-controller-manager[20402]: I1219 10:25:38.660945 20402 reconciler.go:213] RemoveVolumeFromReportAsAttached failed while removing volume "kubernetes.io/qcloud-cbs/disk-83vrk4qu" from node "10.0.0.10" with: volume "kubernetes.io/qcloud-cbs/disk-83vrk4qu" does not exist in volumesToReportAsAttached list or node "10.0.0.10" does not exist in nodesToUpdateStatusFor list
Dec 19 10:25:38 VM_128_5_centos kube-controller-manager[20402]: I1219 10:25:38.560798 20402 reconciler.go:226] Starting attacherDetacher.DetachVolume for volume "pvc-e6f8e924-0272-11e9-80d6-5254000a6bf1" (UniqueName: "kubernetes.io/qcloud-cbs/disk-83vrk4qu") on node "10.0.0.10"
node yaml:
nodeInfo:
  architecture: amd64
  bootID: 68cf830e-bcfc-4d9e-b4b7-ec47dfaebbd9
  containerRuntimeVersion: docker://17.12.1-ce
  kernelVersion: 4.4.0-104-generic
  kubeProxyVersion: v1.8.13-tke.4
  kubeletVersion: v1.8.13-tke.4
  machineID: c5f311a0edd169fa9fb1933c58105309
  operatingSystem: linux
  osImage: Ubuntu 16.04.1 LTS
  systemUUID: AE5BF58D-32B0-4960-8D93-00AE76C3CD48
volumesInUse:
- kubernetes.io/qcloud-cbs/disk-3xvpivsy
- kubernetes.io/qcloud-cbs/disk-83vrk4qu
Only volumesInUse is present in the node status; there is no volumesAttached.
ubuntu@VM-0-10-ubuntu:~$ kubectl get pod
NAME    READY   STATUS              RESTARTS   AGE
web-0   0/1     ContainerCreating   0          22m
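A quick way to confirm this mismatch from the node and pod status (a sketch; the node name 10.0.0.10 is the one from the controller logs above):

# The volumes are still listed as in use, but no longer reported as attached.
kubectl get node 10.0.0.10 -o jsonpath='{.status.volumesInUse}'
kubectl get node 10.0.0.10 -o jsonpath='{.status.volumesAttached}'

# The pod events show the attach/mount timeout reported by kubelet.
kubectl describe pod web-0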
Environment:
- Kubernetes version (use kubectl version): 1.10.5
- Cloud provider or hardware configuration: Tencent Cloud
- OS (e.g. from /etc/os-release): Ubuntu 16.04.1 LTS
- Kernel (e.g. uname -a): Linux VM-0-10-ubuntu 4.4.0-104-generic #127-Ubuntu SMP Mon Dec 11 12:16:42 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
- Install tools:
- Others:
/kind bug
About this issue
- State: closed
- Created 6 years ago
- Comments: 42 (19 by maintainers)
@Nebulazhang Thanks a lot for the debugging information. As you mentioned, I think there is a race condition. Before the reconciler starts a goroutine to detach the volume, it marks the volume as not attached (https://github.com/kubernetes/kubernetes/blob/master/pkg/controller/volume/attachdetach/reconciler/reconciler.go#L207). The problem is that if the detach operation cannot be started, because another operation on the volume is still pending or because of an exponential backoff error, the volume never gets a chance to be added back.
So I think the fix could be to add the volume back as attached if the DetachVolume operation fails to start. Will work on a PR to fix it.