kubernetes: When a Pod with a PV is moved to another node, it is stuck in ContainerCreating for a long time
Is this a BUG REPORT or FEATURE REQUEST?:
/kind bug
What happened:
When I move a Pod with a `nodeSelector:` expression to another node of the Kubernetes cluster, the Pod waits about 8 minutes in the "ContainerCreating" status.
Errors:
Warning FailedAttachVolume Multi-Attach error for volume "pvc-7ec40eec-949e-11e7-b96d-fa163ef575ff" Volume is already exclusively attached to one node and can't be attached to another
Multi-Attach error for volume "pvc-7ec40eec-949e-11e7-b96d-fa163ef575ff" (UniqueName: "kubernetes.io/cinder/ab54e390-cace-466f-8624-bdb270fa49ff") from node "knode3" Volume is already exclusively attached to one node and can't be attached to another
After 6 minutes the OpenStack Cinder volume is attached to the selected node and the Pod is initialized. For an application this delay is far too long.
What you expected to happen:
It is expected that after the Pod is ordered to move to another node, the Cinder volume is moved to the selected node as well and the Pod starts quickly.
How to reproduce it (as minimally and precisely as possible):
Move a Pod with a Persistent Volume (OpenStack Cinder) to another node of the Kubernetes cluster.
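A minimal sketch of such a reproduction, assuming hypothetical names and node labels (the Deployment name, claim name and `apps/v1` API group are not from the reporter's cluster; the 1.5/1.7-era clusters in this report would use the older beta API groups): the Pod mounts a Cinder-backed PVC, and changing the `nodeSelector` reschedules it while the volume is still attached to the old node.

```yaml
# Hypothetical reproduction manifest: a single-replica Deployment whose Pod
# mounts a Cinder-backed PVC. Changing nodeSelector (e.g. from knode2 to
# knode3) and re-applying it reschedules the Pod while the volume is still
# attached to the old node, producing the Multi-Attach error above.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: pv-move-demo
spec:
  replicas: 1
  selector:
    matchLabels:
      app: pv-move-demo
  template:
    metadata:
      labels:
        app: pv-move-demo
    spec:
      nodeSelector:
        kubernetes.io/hostname: knode3   # was knode2 before the move
      containers:
      - name: app
        image: nginx
        volumeMounts:
        - name: data
          mountPath: /data
      volumes:
      - name: data
        persistentVolumeClaim:
          claimName: pv-move-demo-pvc    # PVC bound to an OpenStack Cinder volume
```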
Anything else we need to know?:
Log file kubelet: kubelet.txt
Log file kube-controller-manager: kube-controller-manager.txt
Environment:
- Kubernetes version (use `kubectl version`):
  Client Version: version.Info{Major:"1", Minor:"5", GitVersion:"v1.5.2", GitCommit:"269f928217957e7126dc87e6adfa82242bfe5b1e", GitTreeState:"clean", BuildDate:"2017-07-03T15:31:10Z", GoVersion:"go1.7.4", Compiler:"gc", Platform:"linux/amd64"}
  Server Version: version.Info{Major:"1", Minor:"7", GitVersion:"v1.7.5", GitCommit:"17d7182a7ccbb167074be7a87f0a68bd00d58d97", GitTreeState:"clean", BuildDate:"2017-08-31T08:56:23Z", GoVersion:"go1.8.3", Compiler:"gc", Platform:"linux/amd64"}
- Cloud provider or hardware configuration: OpenStack Mitaka
- OS (e.g. from /etc/os-release): NAME="CentOS Linux" VERSION="7 (Core)" ID="centos"
- Kernel (e.g. `uname -a`): Linux knode2 3.10.0-514.16.1.el7.x86_64 #1 SMP Wed Apr 12 15:04:24 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
- Install tools:
- Others:
About this issue
- State: closed
- Created 7 years ago
- Reactions: 24
- Comments: 59 (21 by maintainers)
The lesson learned here: don't use Kubernetes for databases.
I have a similar issue on DigitalOcean. If a pod is scheduled onto another node during a deployment, it breaks: the current node and pod are already linked, and the old pod will not detach its volume before the new one is attached.
FIX attempt 1: set `RollingUpdate` with `maxUnavailable: 100%` --> FAILED
FIX attempt 2: FIX 1 + add affinity to deploy the pod only to one node --> SUCCESS
This means that your service will be offline for a few seconds, and you will not be able to use the cluster to scale the service to different nodes.
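A rough sketch of what that combination can look like, with hypothetical names, labels and image (this is an interpretation of the two fixes described above, not the commenter's actual manifest):

```yaml
# Hypothetical sketch of "FIX 1 + FIX 2": allow the old replica to be taken
# down before the new one starts, and pin the pod to a single node so the
# ReadWriteOnce volume never has to move between nodes.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: single-node-app
spec:
  replicas: 1
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 100%   # FIX 1: old pod may go away before the new one is up
      maxSurge: 0
  selector:
    matchLabels:
      app: single-node-app
  template:
    metadata:
      labels:
        app: single-node-app
    spec:
      affinity:              # FIX 2: keep the pod on one specific node
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: kubernetes.io/hostname
                operator: In
                values:
                - worker-1
      containers:
      - name: app
        image: nginx
        volumeMounts:
        - name: data
          mountPath: /data
      volumes:
      - name: data
        persistentVolumeClaim:
          claimName: app-data
```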
DigitalOcean volumes, like many others, support only ReadWriteOnce. That means we need to find a better solution, because deploying to one node and accepting downtime is not what Kubernetes is about, and it heavily undermines the entire idea of persistent volumes.
Version:
Server Version: version.Info{Major:"1", Minor:"13", GitVersion:"v1.13.1", GitCommit:"eec55b9ba98609a46fee712359c7b5b365bdd920", GitTreeState:"clean", BuildDate:"2018-12-13T10:31:33Z", GoVersion:"go1.11.2", Compiler:"gc", Platform:"linux/amd64"}
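For context on the ReadWriteOnce point above, the access mode is declared on the claim itself; a minimal, hypothetical PVC of the kind used with such block-storage providers:

```yaml
# Hypothetical PVC: ReadWriteOnce block volumes (DigitalOcean volumes, Cinder,
# EBS, RBD, ...) can be attached to only one node at a time, which is exactly
# why the Multi-Attach error above cannot be satisfied during a rolling update.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: app-data
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi
```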
I’ve got the same issue with k8s 1.11.0 and Ceph using dynamic provisioning. This issue also occurs when I do a `kubectl apply -f deployment.yml`. As such it’s not possible to modify something without redeploying using delete/apply… 😦 (For me it took much longer than 6 min.)
@fejta-bot: Closing this issue.
IMO this "bug" exists for all volume types. If you have a pod with a PVC (any type, RWX types excluded) running on node1 and you shut down node1, the pod will be started again on some other node, but failing over the volume takes 6-10 minutes (it returns that Multi-Attach error) because it waits for the force detach.
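On clusters that expose the VolumeAttachment API, one way to watch that failover from the outside is to check where the volume is still attached while the new pod waits; a sketch with a hypothetical pod name:

```sh
# See which node each volume is still attached to while the new pod is stuck.
kubectl get volumeattachments
# Confirm the Multi-Attach error on the rescheduled pod (hypothetical name).
kubectl describe pod my-app-pod | grep -A 2 FailedAttachVolume
# Or list the corresponding events directly.
kubectl get events --field-selector reason=FailedAttachVolume
```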
I have the same issue on k8s 1.17.2, with rook-ceph as storage. One worker node gets turned off, the pod is evicted after 5 minutes, but it cannot start because "Volume is exclusively used … by the old pod". The old pod gets stuck in "Terminating". Workaround: kill the old pod, kill the new pod, wait, see that the new pod is still unable to start, and kill the new pod again. Pretty weak for a cluster solution.
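For reference, "killing" a pod that is stuck in Terminating on a dead node usually means a force delete; a sketch with hypothetical pod names:

```sh
# Remove the old pod's API object without waiting for the unreachable kubelet.
kubectl delete pod old-pod --grace-period=0 --force
# Then delete the new pod so its controller recreates it and retries the attach.
kubectl delete pod new-pod
```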
@rootfs @jsafrane @thockin do you guys have any idea how we could improve this situation? This volume mount problem has been a problem for a long time. I have tried to solve this twice, but the storage or node SIGs always say that my solution is incorrect.
We have a customer who runs cronjobs every 5 minutes, and those jobs have a volume in them as well. You can imagine what happens when you ask volumes to mount every 5 minutes while the force detach time is 6 minutes. I think we can modify the force detach time in the cluster, but that still does not remove the problem. It seems that this volume mount problem exists in all cloud providers; sometimes it takes 5-20 minutes to get the volume in place. 20 minutes is a huge amount of time if your application is running in production.
edit: there is another issue for this #65392 (it might solve some of these issues)
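To make the cronjob scenario concrete, a hypothetical sketch (names, image and schedule invented): a job runs every 5 minutes and mounts a ReadWriteOnce PVC, so any run scheduled onto a node where the volume cannot be attached yet can sit behind the ~6-minute force detach.

```yaml
# Hypothetical CronJob illustrating the scenario above. batch/v1 is shown;
# the older cluster versions discussed in this thread would use batch/v1beta1.
apiVersion: batch/v1
kind: CronJob
metadata:
  name: report-job
spec:
  schedule: "*/5 * * * *"      # a new pod (and volume attach) every 5 minutes
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
          - name: report
            image: busybox
            command: ["sh", "-c", "date >> /data/report.log"]
            volumeMounts:
            - name: data
              mountPath: /data
          volumes:
          - name: data
            persistentVolumeClaim:
              claimName: report-data   # ReadWriteOnce claim
```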
Got the same error. We modified the resource requests/limits for one StatefulSet with 3 replicas. K8s moved one of the replicas to another node, which had enough resources, but the volume was still attached to the old node.
K8s version: v1.8.1+coreos.0, running on AWS
Warning FailedAttachVolume 7m (x2987 over 12m) attachdetach Multi-Attach error for volume "pvc-4fe430e8-db4d-11e7-9931-02138f142c30" Volume is already exclusively attached to one node and can't be attached to another
What is the status of this on AWS/EBS? I have the same problem on AWS with v1.9.3.
Having the same issue on DigitalOcean; there are two things involved:
1. `RollingUpdate` vs `Recreate`. Obviously, for zero downtime `RollingUpdate` is preferred: it keeps the old pod until the new pod is ready. Here comes the problem: the new pod fails with "Multi-Attach error for volume "pvc-xxx" Volume is already used by pod(s) xxx". Changing to `Recreate` seems to eliminate this error, which makes sense: it destroys the old pod first, leaving some downtime, but it ensures the volume is completely detached before the new pod is scheduled and attaches it. Not sure if @MichaelOrtho's FIX 1 equals `Recreate`, but like @MichaelOrtho said, this defeats one of k8s's main purposes, zero downtime. What I would ideally like to see is that with `RollingUpdate`, k8s can transfer the volume attachment from the old pod to the new pod. Is this a bug, or is it just not possible and an expected limitation of k8s's `RollingUpdate`?
2. `ReadWriteOnce`, which only allows the volume to be mounted on one node. This error occurs even if your update strategy is `Recreate`. The current workaround is, as @MichaelOrtho mentioned, to add affinity to ensure scheduling on the same node. The question is: is this a bug in k8s? At least for `Recreate`, can k8s detach the volume from one node/old pod and attach it to another node/new pod?
@adipascu and other people on StackOverflow mentioned `StatefulSet` for stateful apps; I haven't tried it yet. If the above are not considered k8s bugs, then from a user/developer experience perspective I really think we should either disable PVC support on Deployments completely, or, if a PVC is used, configure (or enforce) `Recreate` and affinity for the user by default, or at least highlight this in the Deployment documentation and guide people to use `StatefulSet`, since users will absolutely hit this wall when using a PVC on a Deployment.
/reopen /remove-lifecycle rotten
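For reference, switching a Deployment from the default `RollingUpdate` to `Recreate`, as discussed above, is a small strategy change; a minimal, hypothetical sketch:

```yaml
# Hypothetical Deployment using the Recreate strategy: the old pod is deleted
# (and its ReadWriteOnce volume detached) before the replacement is created,
# at the cost of downtime during every rollout.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: stateful-ish-app
spec:
  replicas: 1
  strategy:
    type: Recreate
  selector:
    matchLabels:
      app: stateful-ish-app
  template:
    metadata:
      labels:
        app: stateful-ish-app
    spec:
      containers:
      - name: db
        image: postgres:11
        volumeMounts:
        - name: data
          mountPath: /var/lib/postgresql/data
      volumes:
      - name: data
        persistentVolumeClaim:
          claimName: postgres-data
```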
I am still having this exact issue on version v1.12.8 on Google Kubernetes Engine. It happens when I run `kubectl apply -f app.yaml` and make a pod recreate itself. My current fix is to run `kubectl delete -f app.yaml` beforehand to release the disk and to wait a bit before recreating the pod. How is this still not fixed? Am I using Kubernetes incorrectly?
Edit: I think `StatefulSet` should solve this issue.
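Spelled out as commands, that delete-then-apply workaround might look like this (hypothetical manifest name and wait time):

```sh
# Delete the workload first so the disk is detached from its current node...
kubectl delete -f app.yaml
# ...give the cloud provider time to actually detach the volume...
sleep 120
# ...then recreate the workload so the fresh pod can attach the disk cleanly.
kubectl apply -f app.yaml
```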
Just experienced the same on Digital Ocean. The pod is still in ContainerCreating after 13 mins…
postgres-deployment-77c874df64-k4hn9 0/1 ContainerCreating 0 13m
We are facing the same issue with k8s 1.9.8 and RBD volumes, but in our case the pod was just redeployed on another node due to changes made via `kubectl edit deployment ...`
I think I just experienced the same issue on AWS …