kubernetes: Problem rescheduling POD with GCE PD disk attached
Hello,
I’m using a GKE (1.0.6) cluster. Today, for a reason that is still unknown, a node rebooted. This node had a pod with a GCE PD attached; the pod is managed by an RC with a single replica. When the node rebooted, the pod was rescheduled onto another node. However, for some reason the PD was not detached from the old node, so Kubernetes tried multiple times to attach the disk to the new node. I got a lot of errors in the GCE Operations Dashboard:
RESOURCE_IN_USE_BY_ANOTHER_RESOURCE: The disk resource 'projects/projectid/zones/europe-west1-c/disks/diskname' is already being used by 'projects/projectid/zones/europe-west1-c/instances/gke-nodename'
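One way to confirm the stale attachment (a sketch using the disk name and zone from this report) is to look at the disk’s users field with gcloud:

# Lists the instances the PD is currently attached to; the old node should still show up here
gcloud compute disks describe diskname \
    --zone europe-west1-c \
    --format="value(users)"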
In the end, the pod is stuck in the Waiting state with this reason:
Image: gcr.io/projectid/imagename:imagetag is not ready on the node
(which, IMHO, is not the right error message)
And events:
Mon, 28 Sep 2015 07:09:34 +0200 Mon, 28 Sep 2015 10:50:26 +0200 124 {kubelet nodename} failedMount Unable to mount volumes for pod "podname": Could not attach GCE PD "diskname". Timeout waiting for mount paths to be created.
Mon, 28 Sep 2015 07:09:34 +0200 Mon, 28 Sep 2015 10:50:26 +0200 124 {kubelet nodename} failedSync Error syncing pod, skipping: Could not attach GCE PD "diskname". Timeout waiting for mount paths to be created.
As this is not a critical service, I’m happy to leave it in this state for a few days if that can help with debugging. Is there anything else I could provide to help understand the problem?
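If it helps, here is the kind of output that is usually asked for in this sort of issue (standard kubectl commands, using the placeholder pod and node names from the events above):

# Full pod description, including volume info and the failedMount events
kubectl describe pod podname
# Status and events of the node the pod was rescheduled to
kubectl describe node nodename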
About this issue
- Original URL
- State: closed
- Created 9 years ago
- Reactions: 3
- Comments: 37 (25 by maintainers)
+1
gcloud compute instances detach-disk NODE --disk DISK
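For example, with the node, disk, and zone from the original report:

# Manually detach the PD from the old node so it can be attached to the new one
gcloud compute instances detach-disk gke-nodename --disk diskname --zone europe-west1-c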
It is indeed my friend https://github.com/kubernetes/kubernetes/issues/25457 Hang tight =)
We’re working on a fix. It is targeted for the next minor release v1.3.0.
@ScyDev That’s https://github.com/kubernetes/kubernetes/issues/19953 Should be fixed at the same time as this.
I think I ran into the same issue trying to deploy a pod with a GCE PD and a replication controller set to replicas=2.
According to the documentation, that should be possible:
So I set up a new cluster with one minion and Kubernetes 1.2.2. Then I created a PD, the PersistentVolume and PersistentVolumeClaim, and my ReplicationController:
apiVersion: v1
kind: PersistentVolume
metadata:
  name: volume-nginx-data-disk
spec:
  capacity:
    storage: 1Gi
  accessModes:
    - ReadOnlyMany
  gcePersistentDisk:
    pdName: nginx-data-disk
    fsType: ext4
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: claim-nginx-data-disk
spec:
  accessModes:
    - ReadOnlyMany
  resources:
    requests:
      storage: 1Gi
apiVersion: v1
kind: ReplicationController
metadata:
  name: sslproxy-rc
  labels:
    name: sslproxy-rc
spec:
  replicas: 2
  selector:
    name: sslproxy-rc
  template:
    metadata:
      labels:
        name: sslproxy-rc
    spec:
      containers:
        - name: nginx
          image: nginx
          ports:
            - name: nginx-ssl
              containerPort: 443
          volumeMounts:
            - name: nginx-data-disk
              mountPath: /etc/nginx/conf.d
      volumes:
        - name: nginx-data-disk
          persistentVolumeClaim:
            claimName: claim-nginx-data-disk
            readOnly: true
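Assuming the three manifests above are saved as pv.yaml, pvc.yaml and rc.yaml (file names are just for illustration), they are created like this:

kubectl create -f pv.yaml
kubectl create -f pvc.yaml
kubectl create -f rc.yaml
# Only one of the two pods becomes Running
kubectl get pods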
So everything should be mounted in “readOnly” mode. When I try to deploy my RC, only the first pod is created; the second is always stuck with status “Pending”.
kubectl describe pod xyz:
Unable to mount volumes for pod “sslproxy-rc-zxe52_default(a532eb8f-07f2-11e6-8707-42010af00108)”: Could not attach GCE PD “nginx-data-disk”. Timeout waiting for mount paths to be created.
{kubelet gke-sslproxy-cluster1-default-pool-4091b8f3-wys3} Warning FailedSync Error syncing pod, skipping: Could not attach GCE PD “nginx-data-disk”. Timeout waiting for mount paths to be created.
kubelet.log:
GCE operation failed: googleapi: Error 400: The disk resource ‘nginx-data-disk’ is already being used by ‘gke-sslproxy-cluster1-default-pool-4091b8f3-wys3’
gce_util.go:187] Error attaching PD “nginx-data-disk”: googleapi: Error 400: The disk resource ‘nginx-data-disk’ is already being used by ‘gke-sslproxy-cluster1-default-pool-4091b8f3-wys3’
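One thing worth checking here (just a guess on my side): a GCE PD can be attached read-only to many instances, but read-write to only one, so if the first node holds the disk in READ_WRITE mode the second attach will be rejected. The attachment mode is visible on the instance (zone is a placeholder, adjust to your cluster):

# Shows each attached disk with its deviceName, mode (READ_WRITE / READ_ONLY) and source
gcloud compute instances describe gke-sslproxy-cluster1-default-pool-4091b8f3-wys3 \
    --zone <zone> \
    --format="yaml(disks)"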
I think that is because of (https://github.com/kubernetes/kubernetes/issues/21931)
Kilian
@saad-ali In my case the node came back up and the kubelet did not detach the PD even though the pod was rescheduled onto another node. I had to detach it manually.
While we work on a permanent fix, for folks running into this issue, here are a couple workarounds: