kubernetes: EBS detach fails and volume remains busy - v1.5.0-beta.2

Is this a request for help? (If yes, you should use our troubleshooting guide and community support channels, see http://kubernetes.io/docs/troubleshooting/.):

  • No

What keywords did you search in Kubernetes issues before filing this one? (If you have found any duplicates, you should instead reply there.):

  • ebs
  • ebs detach

Is this a BUG REPORT or FEATURE REQUEST? (choose one):

BUG REPORT

Kubernetes version (use kubectl version):

Client Version: version.Info{Major:"1", Minor:"5+", GitVersion:"v1.5.0-beta.2", GitCommit:"0776eab45fe28f02bbeac0f05ae1a203051a21eb", GitTreeState:"clean", BuildDate:"2016-11-24T22:35:03Z", GoVersion:"go1.7.3", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"5+", GitVersion:"v1.5.0-beta.2", GitCommit:"0776eab45fe28f02bbeac0f05ae1a203051a21eb", GitTreeState:"clean", BuildDate:"2016-11-24T22:30:23Z", GoVersion:"go1.7.3", Compiler:"gc", Platform:"linux/amd64"}

Environment:

  • Cloud provider or hardware configuration: AWS
  • OS (e.g. from /etc/os-release): Ubuntu 16.04.1 LTS
  • Kernel (e.g. uname -a):
Linux kubecontroller-1 4.4.0-47-generic #68-Ubuntu SMP Wed Oct 26 19:39:52 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
  • Install tools: None
  • Others: None

What happened:

I’ve seen this behavior in 1.4.6 and earlier. When a pod with a PersistentVolumeClaim (in this case backed by a StorageClass) is “moved” to another node — terminated on one node and then run on another — the persistent volume occasionally has trouble detaching from the old node. The move can be triggered either by kubectl apply -f <new_configuration.yml> or, as in this particular example, by draining a node with a command such as:

kubectl drain worker-3

This does sometimes work, but even when it works there are plenty of errors around the EBS volume. The kube-controller-manager throws lots of errors saying it cannot attach the volume to the new node because it is still attached to the old one, such as:

Dec 01 17:41:47 kubecontroller-1 kube-controller-manager[5662]: E1201 17:41:47.344806    5662 attacher.go:73] Error attaching volume "aws://us-east-1c/vol-802faf11": Error attaching EBS volume "vol-802faf11" to instance "i-ca043f59": VolumeInUse: vol-802faf11 is already attached to an instance
Dec 01 17:41:47 kubecontroller-1 kube-controller-manager[5662]:         status code: 400, request id:

There will also be other nestedpendingoperations.go operation errors; I’m not sure whether those are a symptom of a misconfiguration. Again, it appears to SOMETIMES work. If/when it works, I’ll see this in the kube-controller-manager log:

Dec 01 17:42:30 kubecontroller-1 kube-controller-manager[5662]: I1201 17:42:30.821032    5662 aws.go:1492] AttachVolume volume="vol-802faf11" instance="i-ca043f59" request returned {
Dec 01 17:42:30 kubecontroller-1 kube-controller-manager[5662]:   AttachTime: 2016-12-01 17:42:30.662 +0000 UTC,
Dec 01 17:42:30 kubecontroller-1 kube-controller-manager[5662]:   Device: "/dev/xvdba",
Dec 01 17:42:30 kubecontroller-1 kube-controller-manager[5662]:   InstanceId: "i-ca043f59",
Dec 01 17:42:30 kubecontroller-1 kube-controller-manager[5662]:   State: "attaching",
Dec 01 17:42:30 kubecontroller-1 kube-controller-manager[5662]:   VolumeId: "vol-802faf11"
Dec 01 17:42:30 kubecontroller-1 kube-controller-manager[5662]: }
Dec 01 17:42:30 kubecontroller-1 kube-controller-manager[5662]: I1201 17:42:30.920339    5662 aws.go:1366] Waiting for volume "vol-802faf11" state: actual=attaching, desired=attached
Dec 01 17:42:41 kubecontroller-1 kube-controller-manager[5662]: I1201 17:42:41.019194    5662 aws.go:1265] Releasing in-process attachment entry: ba -> volume vol-802faf11

But it will occasionally fail, and then the volume never detaches. On the kubelet node, this log appears:

Dec 01 19:38:55 worker-3 kubelet[1032]: I1201 19:38:55.432000    1032 reconciler.go:189] UnmountVolume operation started for volume "kubernetes.io/aws-ebs/aws://us-east-1c/vol-802faf11" (spec.Name: "ebstest-volume") from pod "fff0ca70-b7ee-11e6-a0c7-0e82f0b9f336" (UID: "fff0ca70-b7ee-11e6-a0c7-0e82f0b9f336").
Dec 01 19:38:55 worker-3 kubelet[1032]: I1201 19:38:55.432070    1032 aws_ebs.go:398] Error checking if mountpoint /var/lib/kubelet/pods/fff0ca70-b7ee-11e6-a0c7-0e82f0b9f336/volumes/kubernetes.io~aws-ebs/pvc-bfd01b97-b7d2-11e6-8057-0e71c9ba25de: stat /var/lib/kubelet/pods/fff0ca70-b7ee-11e6-a0c7-0e82f0b9f336/volumes/kubernetes.io~aws-ebs/pvc-bfd01b97-b7d2-11e6-8057-0e71c9ba25de: no such file or directory
Dec 01 19:38:55 worker-3 kubelet[1032]: E1201 19:38:55.432132    1032 nestedpendingoperations.go:262] Operation for "\"kubernetes.io/aws-ebs/aws://us-east-1c/vol-802faf11\" (\"fff0ca70-b7ee-11e6-a0c7-0e82f0b9f336\")" failed. No retries permitted until 2016-12-01 19:40:55.432095954 +0000 UTC (durationBeforeRetry 2m0s). Error: UnmountVolume.TearDown failed for volume "kubernetes.io/aws-ebs/aws://us-east-1c/vol-802faf11" (volume.spec.Name: "ebstest-volume") pod "fff0ca70-b7ee-11e6-a0c7-0e82f0b9f336" (UID: "fff0ca70-b7ee-11e6-a0c7-0e82f0b9f336") with: stat /var/lib/kubelet/pods/fff0ca70-b7ee-11e6-a0c7-0e82f0b9f336/volumes/kubernetes.io~aws-ebs/pvc-bfd01b97-b7d2-11e6-8057-0e71c9ba25de: no such file or directory

And it will repeat. If I look at the path specified above on that kubelet, /var/lib/kubelet/pods/fff0ca70-b7ee-11e6-a0c7-0e82f0b9f336/volumes/kubernetes.io~aws-ebs/ exists, but the pvc-* directory inside it does not.

On the kubelet node I can actually see the volume still mounted. Again, sometimes the drain is successful and the pod moves to another node; sometimes not.
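For reference, here is a rough sketch of the checks I run when it gets stuck; the volume ID, PVC name, and node name below are taken from the logs above, and the AWS CLI commands assume credentials for the account the cluster runs in:

# On the worker the pod was drained from: is the device still mounted?
lsblk
mount | grep pvc-bfd01b97

# What does AWS think the attachment state of the volume is?
aws ec2 describe-volumes --volume-ids vol-802faf11 --query 'Volumes[0].Attachments'

# Which volumes does Kubernetes believe are still attached to the old node?
kubectl get node worker-3 -o jsonpath='{.status.volumesAttached}'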

What you expected to happen:

Upon kubectl drain worker-3, the following actions should occur:

  • Pod terminated on worker-3.
  • EBS persistent volume unmounted and detached from worker-3.
  • EBS persistent volume attached and mounted on another node.
  • Pod created and successfully run on the node where the EBS volume now is.

How to reproduce it (as minimally and precisely as possible):

  • Create a StorageClass for EBS volumes, a PersistentVolumeClaim against it, and a Deployment whose pod mounts the claimed EBS volume (roughly as sketched below).
  • Drain the node the pod is running on. Sometimes this will work as intended; sometimes the volume gets stuck detaching.
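A minimal sketch of the kind of configuration I use to reproduce. The object names and sizes here are placeholders rather than my exact yml files (which I can attach); the volume name ebstest-volume matches the kubelet logs above:

cat <<EOF | kubectl apply -f -
kind: StorageClass
apiVersion: storage.k8s.io/v1beta1
metadata:
  name: ebs-gp2
provisioner: kubernetes.io/aws-ebs
parameters:
  type: gp2
---
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: ebstest-claim
  annotations:
    volume.beta.kubernetes.io/storage-class: ebs-gp2
spec:
  accessModes: ["ReadWriteOnce"]
  resources:
    requests:
      storage: 10Gi
---
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: ebstest
spec:
  replicas: 1
  template:
    metadata:
      labels:
        app: ebstest
    spec:
      containers:
      - name: ebstest
        image: busybox
        command: ["sh", "-c", "sleep 3600"]
        volumeMounts:
        - name: ebstest-volume
          mountPath: /data
      volumes:
      - name: ebstest-volume
        persistentVolumeClaim:
          claimName: ebstest-claim
EOF

# Find the node the pod landed on, then drain it and watch whether the volume detaches cleanly.
kubectl get pod -l app=ebstest -o wide
kubectl drain worker-3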

Anything else we need to know:

  • I can provide the yml files as necessary.
  • There is a second EBS volume, attached as /dev/xvde and mounted at /mnt/ebs, because the standard AMI root drive is very small. The Docker data directory lives there, and the kubelet directory is symlinked to it (/var/lib/kubelet -> /mnt/ebs/kubelet), roughly as sketched below. I will test this deployment WITHOUT the extra EBS volume and symlinks to verify.
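For completeness, this is roughly how that extra volume is wired in on each worker (approximate provisioning steps, not my exact script):

# Format and mount the extra EBS volume attached as /dev/xvde
mkfs -t ext4 /dev/xvde
mkdir -p /mnt/ebs
mount /dev/xvde /mnt/ebs

# Docker's data directory lives under /mnt/ebs, and the kubelet directory is a symlink into it
mkdir -p /mnt/ebs/kubelet
ln -s /mnt/ebs/kubelet /var/lib/kubelet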

Most upvoted comments

I’ve been repeatedly testing this one. This might be a case where a non-Kubernetes-managed EBS volume attached to the instance causes issues. I have been testing with all the workers having ONLY a root volume (no /mnt/ebs as mentioned in the original issue). So far, no Kubernetes EBS volumes have gotten stuck detaching.

Will continue to test.