kubernetes: Deployments with GCE PD fail with "...is already being used by..."
Is this a BUG REPORT or FEATURE REQUEST?:
/kind bug
What happened:
From user report on reddit:
I have a deployment with a persistent volume claim in Google Cloud. One pod is using this volume. The deployment is of the "Recreate" type. But each time a node is feeling under the weather and this pod gets rescheduled to another one, it fails to start with:
googleapi: Error 400: The disk resource 'projects/...-pvc-...' is already being used by 'projects/.../instances/..node-bla-bla'
I've stumbled across some issues on GitHub, but don't see a definitive solution. Due to the nature of the problem, I cannot reliably recreate it manually; artificial overload would need to be created.
What I considered doing:
1. Create some sort of gluster/ceph/whateverfs cluster and use it as a PV. Con: an additional point of failure that needs setup/maintenance of its own.
2. Create a separate node pool with 1 node in it and schedule the deployment strictly to that pool. Con: doesn't scale up or down; at this point there is no need for a whole node just for that deployment, but if it grows the problem starts all over.
I've upgraded the cluster and nodes to 1.6.7, but don't know if it will matter. Any help appreciated.
Other reports here:
What you expected to happen: Volume should attach to new node without issue.
How to reproduce it (as minimally and precisely as possible):
Anything else we need to know?:
Environment:
- Kubernetes version (use `kubectl version`):
- Cloud provider or hardware configuration:
- OS (e.g. from /etc/os-release):
- Kernel (e.g. `uname -a`):
- Install tools:
- Others:
This report mentioned Deployments along with GCE PDs. This can get tricky because in some cases it can result in multiple pods (scheduled to different nodes) referencing the same (read-write once) volume which will cause the second pod to not start.
To prevent this from happening, the general recommendations for using Deployments with GCE PDs are:
1. Set the `replicas` count to 1 – Because GCE PDs only support read-write attachment to a single node at a time, and if you have more than 1 replica, pods may be scheduled to different nodes.
2. Set the strategy type to `Recreate` instead of the default `RollingUpdate` (a minimal example is sketched below). The rolling update strategy has two parameters, `maxUnavailable` and `maxSurge`; when not specified they default to 1 and 1 respectively. This means that during a rolling update at least one pod from the old deployment must remain, and an extra new pod (beyond the requested number of replicas) is permitted to be created. When this happens, if the new pod lands on a different node, it will fail to start because the old pod still has the disk attached as read-write.

However, the reporter mentioned they used the "Recreate" strategy, which means that there must be a bug here.
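For reference, a minimal Deployment sketch following the two recommendations above (the names, image, and claim are placeholders, not taken from the report):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app                 # placeholder name
spec:
  replicas: 1                  # recommendation 1: a single replica
  strategy:
    type: Recreate             # recommendation 2: kill the old pod before creating the new one
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
      - name: app
        image: nginx           # placeholder image
        volumeMounts:
        - name: data
          mountPath: /data
      volumes:
      - name: data
        persistentVolumeClaim:
          claimName: my-data   # PVC backed by a GCE PD (ReadWriteOnce)
```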
To help us debug, if you run into this issue, please provide:
* The `kube-controller-manager` logs from your master (if you're on GKE, contact customer support, reference this issue, and ask them to grab the logs for you).
* Your deployment YAML.
* A description of what commands you ran and when.

Let's figure this out!
CC @kubernetes/sig-storage-bugs
I have looked into different failure scenarios with our recommended settings and have the following conclusions:
Note: This may not be an exhaustive list of failure scenarios. Please contact me if you are experiencing issues with a different scenario.
Note: All scenarios were tested with `replicas: 1` and a deployment with a PVC referencing a GCE PD (which only supports single-node attach).
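For context, a minimal sketch of the kind of PVC used in such a setup (the claim name and the `standard` GCE PD storage class are assumptions, not taken from the test configuration):

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: my-data
spec:
  accessModes:
  - ReadWriteOnce              # GCE PDs only support single-node read-write attachment
  storageClassName: standard   # default GCE PD storage class on GKE (assumption)
  resources:
    requests:
      storage: 10Gi
```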
I can confirm that the following works, while `type: Recreate` does not. It still feels clumsy: it does not prevent the situation from happening but, if I understand right, just recreates the pod until it lands on the "right" node, where the volume was mounted. I can see the pod fail after a deployment because the volume is stuck on another node, but after a restart or two the pod starts working.
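The working configuration is not included above; based on the `maxSurge: 0` workaround mentioned later in this thread, the relevant part of the Deployment spec might look roughly like this (an assumption, not the commenter's actual manifest):

```yaml
spec:
  replicas: 1
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 0        # never create an extra pod beyond the requested replicas
      maxUnavailable: 1  # the single old pod may be taken down first, releasing the disk
```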
We also see this issue happening. The most annoying thing about GCE is that it does "silent" migration of VMs, after which PVs are not released. So, when Kubernetes wants to start a pod on that node, it fails with a "resource already in use" type of error.
cc: @sadlil
/remove-lifecycle rotten
Hi, we experienced the same problem with Kubernetes 1.9.6:
Kubernetes version (use kubectl version): Client Version: version.Info{Major:"1", Minor:"10", GitVersion:"v1.10.0", GitCommit:"fc32d2f3698e36b93322a3465f63a14e9f0eaead", GitTreeState:"clean", BuildDate:"2018-03-27T00:13:02Z", GoVersion:"go1.9.4", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"9", GitVersion:"v1.9.6", GitCommit:"9f8ebd171479bec0ada837d7ee641dec2f8c6dd1", GitTreeState:"clean", BuildDate:"2018-03-21T15:13:31Z", GoVersion:"go1.9.3", Compiler:"gc", Platform:"linux/amd64"}
Cloud provider or hardware configuration: GCE
Some more details: We use a StatefulSet for our pod, configured with the RollingUpdate strategy.
The pod was rescheduled from node1 to node2 and got stuck in the ContainerCreating state due to:
controller-manager logs:
After checking for volumesAttached and volumesInUse for both nodes we saw:
By draining node2 and having the pod restart on node1 (this happened by accident, since we have more than 2 nodes in our cluster), the problem was fixed.
Because a manual fix is required there is some downtime, and this is a critical pod for our cluster.
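For readers unfamiliar with that setup, a minimal StatefulSet sketch using a per-pod GCE PD-backed claim (names, image, and size are illustrative, not the commenter's manifest):

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: my-stateful-app
spec:
  serviceName: my-stateful-app
  replicas: 1
  updateStrategy:
    type: RollingUpdate        # the strategy the commenter reports using
  selector:
    matchLabels:
      app: my-stateful-app
  template:
    metadata:
      labels:
        app: my-stateful-app
    spec:
      containers:
      - name: app
        image: nginx           # placeholder image
        volumeMounts:
        - name: data
          mountPath: /data
  volumeClaimTemplates:        # each replica gets its own GCE PD-backed PVC
  - metadata:
      name: data
    spec:
      accessModes: [ "ReadWriteOnce" ]
      resources:
        requests:
          storage: 10Gi
```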
Hi, I've run into this issue as well. We are using the persistent volume to store user-uploaded files, much like the WordPress example in the Google Cloud documentation.
Neither of the solutions presented (the Recreate strategy or maxSurge: 0) works. Both caused the new pod to be stuck in the ContainerCreating stage, locked while waiting on the volume mount (even after the old pod was removed).
The issue was fixed by deleting the deployment entirely and applying it again. This, however, leads to an outage, which is terrible for production purposes (something we thought Kubernetes should solve). I think these disadvantages should be NOTED very clearly in the documentation.
Also, since this will always lead to an outage (unless fixed some other way), maybe Kubernetes/GCE should provide some easy way to run GlusterFS- or NFS-mounted persistent volumes, which would let users have persistent storage and update their app without an outage.
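To illustrate what the commenter is asking for, a sketch of an NFS-backed volume that supports ReadWriteMany, so pods on different nodes can mount it during an update (the server address, export path, and names are placeholders; running the NFS or GlusterFS server itself is left out):

```yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: uploads-nfs
spec:
  capacity:
    storage: 10Gi
  accessModes:
  - ReadWriteMany            # multiple nodes can mount read-write, so rolling updates do not conflict
  nfs:
    server: 10.0.0.2         # placeholder NFS server address
    path: /exports/uploads   # placeholder export path
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: uploads
spec:
  accessModes:
  - ReadWriteMany
  storageClassName: ""       # bind to the pre-provisioned PV above instead of dynamic provisioning
  resources:
    requests:
      storage: 10Gi
```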