kubernetes: Deployments with GCE PD fail with "...is already being used by..."

Is this a BUG REPORT or FEATURE REQUEST?:

/kind bug

What happened:

From user report on reddit:

I have a deployment with a persistent volume claim in Google Cloud. One pod is using this volume. The deployment is of the "Recreate" type. But each time a node is feeling under the weather and this pod gets rescheduled to another one, it fails to start with:

googleapi: Error 400: The disk resource 'projects/...-pvc-...' is already being used by 'projects/.../instances/..node-bla-bla'     

I've stumbled across some issues on GitHub, but don't see a definitive solution. Due to the nature of the problem, I cannot reliably recreate it manually; an artificial overload needs to be created.

What I considered doing:
1. Create some sort of gluster/ceph/whatever-fs cluster and use it as a PV. Con: an additional point of failure that needs setup and maintenance of its own.
2. Create a separate node pool with 1 node in it and schedule the deployment strictly to that pool. Con: doesn't scale either up or down; at this point there is no need for a whole node just for that deployment, but if it grows then the problem starts all over.

I've upgraded the cluster and nodes to 1.6.7, but don't know if it will matter. Any help is appreciated.

Other reports here:

What you expected to happen: The volume should attach to the new node without issue.

How to reproduce it (as minimally and precisely as possible):

Anything else we need to know?:

Environment:

  • Kubernetes version (use kubectl version):
  • Cloud provider or hardware configuration:
  • OS (e.g. from /etc/os-release):
  • Kernel (e.g. uname -a):
  • Install tools:
  • Others:

About this issue

  • Original URL
  • State: closed
  • Created 7 years ago
  • Comments: 75 (43 by maintainers)

Most upvoted comments

This report mentioned Deployments along with GCE PDs. This can get tricky because in some cases it can result in multiple pods (scheduled to different nodes) referencing the same (read-write-once) volume, which will cause the second pod to fail to start.

To prevent this from happening, the general recommendations for using Deployments with GCE PDs are:

  • Set the deployment replicas count to 1 – because GCE PDs only support read-write attachment to a single node at a time, and if you have more than 1 replica, pods may be scheduled to different nodes.
  • When doing rolling updates either:
    1. Use the “Recreate” strategy, which ensures that old pods are killed before new pods are created (there was a bug https://github.com/kubernetes/kubernetes/issues/27362 where this didn’t work correctly in some cases, but that was apparently fixed a long time ago).
    2. Use the “RollingUpdate” strategy with maxSurge=0 and maxUnavailable=1.
      • If a strategy is not specified for a deployment, the default is RollingUpdate. The rolling update strategy has two parameters, maxUnavailable and maxSurge; when not specified they default to 1 and 1 respectively. This means that during a rolling update at least one pod from the old deployment must remain, and an extra new pod (beyond the requested number of replicas) may be created. When this happens, if the new pod lands on a different node, the new pod will fail to start because the old pod still has the disk attached read-write. A minimal example manifest combining these recommendations follows this list.
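To make the above concrete, here is a minimal sketch of a Deployment that follows both recommendations (replicas: 1 plus RollingUpdate with maxSurge: 0 and maxUnavailable: 1). The names, image, and claim (web, nginx, my-gce-pvc) are illustrative placeholders, not anything from this issue, and the apiVersion may need adjusting on older clusters:

    # Illustrative sketch only; names, image, and PVC are placeholders.
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: web
    spec:
      replicas: 1                 # GCE PD attaches read-write to one node at a time, so keep a single replica
      strategy:
        type: RollingUpdate
        rollingUpdate:
          maxSurge: 0             # never create the new pod before the old one is gone
          maxUnavailable: 1       # allow the single replica to go down during the update
      selector:
        matchLabels:
          app: web
      template:
        metadata:
          labels:
            app: web
        spec:
          containers:
            - name: web
              image: nginx
              volumeMounts:
                - name: data
                  mountPath: /data
          volumes:
            - name: data
              persistentVolumeClaim:
                claimName: my-gce-pvc

With maxSurge: 0 the old pod is always torn down before its replacement is created, which gives the disk a chance to detach from the old node before the new pod tries to attach it.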

However, the reporter mentioned they used the “Recreate” strategy, which means that there must be a bug here.

To help us debug, if you run into this issue, please:

  1. Verify that you are adhering to the guidance provided above.
  2. Grab and share the following with me (either post here or email directly if you don’t want to share publicly):
    • Your kube-controller-manager logs from your master (if you’re on GKE, contact customer support, reference this issue, and ask them to grab the logs for you).
    • Your deployment YAML.
    • A description of what commands you ran and when.

Let’s figure this out!

CC @kubernetes/sig-storage-bugs

I have looked into different failure scenarios with our recommended settings and have the following conclusions:

Note: This may not be an exhaustive list of failure scenarios. Please contact me if you are experiencing issues with a different scenario.

Note: All scenarios were tested with “replicas: 1” and a deployment with a PVC referencing a gce-pd (which only supports single-node attach).

| Failure scenario | No deployment strategy | Deployment strategy: Recreate | Deployment strategy: Rolling update (maxSurge: 0, maxUnavailable: 1) |
| --- | --- | --- | --- |
| Deleting pod manually | Successful new pod | Successful new pod | Successful new pod |
| Updating deployment | Expected Error: multi-attach | Successful new pod | Successful new pod |
| Tainting node to evict pods | Successful new pod | Successful new pod | Successful new pod |
| Node is killed | Successful new pod | Successful new pod | Successful new pod |
| Stressing node to evict pods | https://github.com/kubernetes/kubernetes/issues/57531 | https://github.com/kubernetes/kubernetes/issues/57531 | https://github.com/kubernetes/kubernetes/issues/57531 |

I can confirm that the following works:

  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1
      maxSurge: 0

While type: Recreate does not.

It still feels clumsy: it does not prevent the situation from happening, but if I understand right, it just recreates the pod until it lands on the “right” node, where the volume was mounted. I can see the pod fail after a deployment due to the volume being stuck on another node, but after a restart or two the pod starts working.

We also see this issue happening. The most annoying thing about GCE is that it does “silent” migration of VMs, after which PVs are not released. So, when Kubernetes wants to start a pod on that node, it fails with a resource-already-in-use type of error.

cc: @sadlil

/remove-lifecycle rotten

Hi, we experienced the same problem with kubernetes 1.9.6:

Kubernetes version (use kubectl version): Client Version: version.Info{Major:"1", Minor:"10", GitVersion:"v1.10.0", GitCommit:"fc32d2f3698e36b93322a3465f63a14e9f0eaead", GitTreeState:"clean", BuildDate:"2018-03-27T00:13:02Z", GoVersion:"go1.9.4", Compiler:"gc", Platform:"darwin/amd64"}

Server Version: version.Info{Major:"1", Minor:"9", GitVersion:"v1.9.6", GitCommit:"9f8ebd171479bec0ada837d7ee641dec2f8c6dd1", GitTreeState:"clean", BuildDate:"2018-03-21T15:13:31Z", GoVersion:"go1.9.3", Compiler:"gc", Platform:"linux/amd64"}

Cloud provider or hardware configuration: GCE

Some more details: we use a StatefulSet for our pod, configured with the RollingUpdate strategy.

The pod was rescheduled from node1 to node2 and got stuck in the ContainerCreating state due to:

AttachVolume.Attach failed for volume "pvc-id" : googleapi: Error 400: The disk resource 'projects/blabla/zones/blablazone/disks/gcp-dynam-pvc-id' is already being used by 'projects/blabla/zones/blablazone/instances/node1'

controller-manager logs:

I0515 18:04:22.207470       1 reconciler.go:287] attacherDetacher.AttachVolume started for volume "pvc-id" (UniqueName: "kubernetes.io/gce-pd/gcp-dynam-pvc-id) from node "node2"

E0515 18:04:26.778812       1 gce_op.go:88] GCE operation failed: googleapi: Error 400: The disk resource 'projects/blabla/zones/blablazone/disks/gcp-dynam-pvc-id' is already being used by 'projects/blabla/zones/blablazone/instances/node1'

E0515 18:04:26.778879       1 attacher.go:92] Error attaching PD "gcp-dynam-pvc-id" to node "node2": googleapi: Error 400: The disk resource 'projects/blabla/zones/blablazone/disks/gcp-dynam-pvc-id' is already being used by 'projects/blabla/zones/blablazone/instances/node1'

E0515 18:04:26.778969       1 nestedpendingoperations.go:263] Operation for "\"kubernetes.io/gcp-dynam-pvc-id\"" failed. No retries permitted until 2018-05-15 18:06:28.778939119 +0000 UTC m=+19053.925176613 (durationBeforeRetry 2m2s). Error: "AttachVolume.Attach failed for volume \"pvc-id\" (UniqueName: \"kubernetes.io/gce-pd/gcp-dynam-pvc-id") from node \"node2\" : googleapi: Error 400: The disk resource 'projects/blabla/zones/blablazone/disks/gcp-dynam-pvc-id' is already being used by 'projects/blabla/zones/blablazone/instances/node1'"

After checking volumesAttached and volumesInUse for both nodes, we saw:

  • on node 1 there were no volumesAttached and no volumesInUse
  • on node 2 there were no volumesAttached, however there was one entry in volumesInUse:
    volumesInUse:
        gcp-dynam-pvc-id
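For context, both fields live in the node object's status. Below is a rough sketch of what the relevant part of node2 might have looked like in the stuck state described above (e.g. as seen in kubectl get node node2 -o yaml); the volume name is the redacted one from this report:

    # Sketch only: an excerpt of node2's status in the stuck state; names are placeholders.
    status:
      volumesAttached: []                        # maintained by the attach/detach controller in the controller manager
      volumesInUse:                              # reported by the kubelet for volumes it has mounted or is about to mount
        - kubernetes.io/gce-pd/gcp-dynam-pvc-id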
    

By draining node 2 and having the pod restart on node 1 (which happened by accident, since we have more than 2 nodes in our cluster), the problem was fixed.

Because a manual fix is required there is some downtime, and this is a critical pod for our cluster.

Hi, I’ve run into this issue as well. We are using the persistent volume to store user-uploaded files, much like the WordPress example in the Google Cloud documentation.

None of the solutions presented work (strategy Recreate or maxSurge: 0). Both caused the new pod to get stuck in the ContainerCreating stage, locked while waiting on the volume mount (even after the old pod was removed).

The issue was fixed by deleting the deployment entirely and applying it again. This, however, leads to an outage, which is terrible for production purposes (which we thought Kubernetes should solve). I think these disadvantages should be NOTED in the documentation very clearly.

Also, since this will always lead to an outage (unless fixed some other way), maybe Kubernetes/GCE should provide some easy way to run GlusterFS- or NFS-backed persistent volumes, which would enable users to have persistent storage and update their app without an outage.
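A network filesystem volume does sidestep the single-attach limitation because it can be mounted ReadWriteMany, so a rescheduled pod never hits the “already being used by” attach error; you still have to run and maintain the NFS (or GlusterFS) server yourself. Below is a minimal sketch of an NFS-backed PV and PVC; the server address, export path, names, and sizes are placeholders:

    # Sketch only: server, path, names, and sizes are placeholders.
    apiVersion: v1
    kind: PersistentVolume
    metadata:
      name: nfs-data
    spec:
      capacity:
        storage: 10Gi
      accessModes:
        - ReadWriteMany           # many nodes may mount it read-write at once
      nfs:
        server: nfs-server.example.com
        path: /exports/data
    ---
    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: nfs-data
    spec:
      accessModes:
        - ReadWriteMany
      storageClassName: ""        # bind to the pre-provisioned PV above rather than a dynamically provisioned one
      resources:
        requests:
          storage: 10Gi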