kubernetes: EBS detach fails and volume remains busy - v1.5.0-beta.2

Is this a request for help? (If yes, you should use our troubleshooting guide and community support channels, see http://kubernetes.io/docs/troubleshooting/.):

  • No

What keywords did you search in Kubernetes issues before filing this one? (If you have found any duplicates, you should instead reply there.):

  • ebs
  • ebs detach

Is this a BUG REPORT or FEATURE REQUEST? (choose one):

BUG REPORT

Kubernetes version (use kubectl version):

Client Version: version.Info{Major:"1", Minor:"5+", GitVersion:"v1.5.0-beta.2", GitCommit:"0776eab45fe28f02bbeac0f05ae1a203051a21eb", GitTreeState:"clean", BuildDate:"2016-11-24T22:35:03Z", GoVersion:"go1.7.3", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"5+", GitVersion:"v1.5.0-beta.2", GitCommit:"0776eab45fe28f02bbeac0f05ae1a203051a21eb", GitTreeState:"clean", BuildDate:"2016-11-24T22:30:23Z", GoVersion:"go1.7.3", Compiler:"gc", Platform:"linux/amd64"}

Environment:

  • Cloud provider or hardware configuration: AWS
  • OS (e.g. from /etc/os-release): Ubuntu 16.04.1 LTS
  • Kernel (e.g. uname -a):
Linux kubecontroller-1 4.4.0-47-generic #68-Ubuntu SMP Wed Oct 26 19:39:52 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
  • Install tools: None
  • Others: None

What happened:

I’ve seen this behavior in 1.4.6 and earlier. When a pod with a PersistentVolumeClaim (in this case backed by a StorageClass) is “moved” to another node — terminated on one node and then run on another — the persistent volume occasionally has trouble detaching from the old node. The move can be triggered either by kubectl apply -f <new_configuration.yml> or, as in this particular example, by draining a node with a command such as:

kubectl drain worker-3

This does sometimes work, but even when it works there are plenty of errors around the EBS volume. The kube-controller-manager throws lots of errors saying it cannot attach the volume to the new node because it is still attached to the old one, such as:

Dec 01 17:41:47 kubecontroller-1 kube-controller-manager[5662]: E1201 17:41:47.344806    5662 attacher.go:73] Error attaching volume "aws://us-east-1c/vol-802faf11": Error attaching EBS volume "vol-802faf11" to instance "i-ca043f59": VolumeInUse: vol-802faf11 is already attached to an instance
Dec 01 17:41:47 kubecontroller-1 kube-controller-manager[5662]:         status code: 400, request id:

There will also be other nestedpendingoperations.go operation errors; I’m not sure whether those are a symptom of a misconfiguration. Again, it appears to SOMETIMES work. If/when it works, I’ll see this in the kube-controller-manager log:

Dec 01 17:42:30 kubecontroller-1 kube-controller-manager[5662]: I1201 17:42:30.821032    5662 aws.go:1492] AttachVolume volume="vol-802faf11" instance="i-ca043f59" request returned {
Dec 01 17:42:30 kubecontroller-1 kube-controller-manager[5662]:   AttachTime: 2016-12-01 17:42:30.662 +0000 UTC,
Dec 01 17:42:30 kubecontroller-1 kube-controller-manager[5662]:   Device: "/dev/xvdba",
Dec 01 17:42:30 kubecontroller-1 kube-controller-manager[5662]:   InstanceId: "i-ca043f59",
Dec 01 17:42:30 kubecontroller-1 kube-controller-manager[5662]:   State: "attaching",
Dec 01 17:42:30 kubecontroller-1 kube-controller-manager[5662]:   VolumeId: "vol-802faf11"
Dec 01 17:42:30 kubecontroller-1 kube-controller-manager[5662]: }
Dec 01 17:42:30 kubecontroller-1 kube-controller-manager[5662]: I1201 17:42:30.920339    5662 aws.go:1366] Waiting for volume "vol-802faf11" state: actual=attaching, desired=attached
Dec 01 17:42:41 kubecontroller-1 kube-controller-manager[5662]: I1201 17:42:41.019194    5662 aws.go:1265] Releasing in-process attachment entry: ba -> volume vol-802faf11

But it will occasionally fail, and then the volume never detaches. On the kubelet node, this log appears:

Dec 01 19:38:55 worker-3 kubelet[1032]: I1201 19:38:55.432000    1032 reconciler.go:189] UnmountVolume operation started for volume "kubernetes.io/aws-ebs/aws://us-east-1c/vol-802faf11" (spec.Name: "ebstest-volume") from pod "fff0ca70-b7ee-11e6-a0c7-0e82f0b9f336" (UID: "fff0ca70-b7ee-11e6-a0c7-0e82f0b9f336").
Dec 01 19:38:55 worker-3 kubelet[1032]: I1201 19:38:55.432070    1032 aws_ebs.go:398] Error checking if mountpoint /var/lib/kubelet/pods/fff0ca70-b7ee-11e6-a0c7-0e82f0b9f336/volumes/kubernetes.io~aws-ebs/pvc-bfd01b97-b7d2-11e6-8057-0e71c9ba25de: stat /var/lib/kubelet/pods/fff0ca70-b7ee-11e6-a0c7-0e82f0b9f336/volumes/kubernetes.io~aws-ebs/pvc-bfd01b97-b7d2-11e6-8057-0e71c9ba25de: no such file or directory
Dec 01 19:38:55 worker-3 kubelet[1032]: E1201 19:38:55.432132    1032 nestedpendingoperations.go:262] Operation for "\"kubernetes.io/aws-ebs/aws://us-east-1c/vol-802faf11\" (\"fff0ca70-b7ee-11e6-a0c7-0e82f0b9f336\")" failed. No retries permitted until 2016-12-01 19:40:55.432095954 +0000 UTC (durationBeforeRetry 2m0s). Error: UnmountVolume.TearDown failed for volume "kubernetes.io/aws-ebs/aws://us-east-1c/vol-802faf11" (volume.spec.Name: "ebstest-volume") pod "fff0ca70-b7ee-11e6-a0c7-0e82f0b9f336" (UID: "fff0ca70-b7ee-11e6-a0c7-0e82f0b9f336") with: stat /var/lib/kubelet/pods/fff0ca70-b7ee-11e6-a0c7-0e82f0b9f336/volumes/kubernetes.io~aws-ebs/pvc-bfd01b97-b7d2-11e6-8057-0e71c9ba25de: no such file or directory

And it will repeat. If I look at the path specified above on that kubelet, /var/lib/kubelet/pods/fff0ca70-b7ee-11e6-a0c7-0e82f0b9f336/volumes/kubernetes.io~aws-ebs/ exists, but the pvc-* directory inside it does not.

On the kubelet node I can actually see the volume still mounted. Again, sometimes the drain is successful and the pod moves to another node; sometimes not.
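For reference, here is a rough sketch of the checks I run when it gets stuck; the volume ID, PVC name, and node name below are taken from the logs above, and the AWS CLI commands assume credentials for the account the cluster runs in:

# On the worker the pod was drained from: is the device still mounted?
lsblk
mount | grep pvc-bfd01b97

# What does AWS think the attachment state of the volume is?
aws ec2 describe-volumes --volume-ids vol-802faf11 --query 'Volumes[0].Attachments'

# Which volumes does Kubernetes believe are still attached to the old node?
kubectl get node worker-3 -o jsonpath='{.status.volumesAttached}'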

What you expected to happen:

Upon kubectl drain worker-3, the following actions should occur:

  • Pod terminated on worker-3.
  • EBS persistent volume unmounted and detached from worker-3.
  • EBS persistent volume attached and mounted on another node.
  • Pod created and successfully run on the node where the EBS volume now is.

How to reproduce it (as minimally and precisely as possible):

  • Create a StorageClass for EBS volumes, a PersistentVolumeClaim against it, and a Deployment whose pod mounts the claimed EBS volume (roughly as sketched below).
  • Drain the node the pod is running on. Sometimes this will work as intended; sometimes the volume gets stuck detaching.
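A minimal sketch of the kind of configuration I use to reproduce. The object names and sizes here are placeholders rather than my exact yml files (which I can attach); the volume name ebstest-volume matches the kubelet logs above:

cat <<EOF | kubectl apply -f -
kind: StorageClass
apiVersion: storage.k8s.io/v1beta1
metadata:
  name: ebs-gp2
provisioner: kubernetes.io/aws-ebs
parameters:
  type: gp2
---
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: ebstest-claim
  annotations:
    volume.beta.kubernetes.io/storage-class: ebs-gp2
spec:
  accessModes: ["ReadWriteOnce"]
  resources:
    requests:
      storage: 10Gi
---
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: ebstest
spec:
  replicas: 1
  template:
    metadata:
      labels:
        app: ebstest
    spec:
      containers:
      - name: ebstest
        image: busybox
        command: ["sh", "-c", "sleep 3600"]
        volumeMounts:
        - name: ebstest-volume
          mountPath: /data
      volumes:
      - name: ebstest-volume
        persistentVolumeClaim:
          claimName: ebstest-claim
EOF

# Find the node the pod landed on, then drain it and watch whether the volume detaches cleanly.
kubectl get pod -l app=ebstest -o wide
kubectl drain worker-3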

Anything else we need to know:

  • I can provide the yml files as necessary.
  • There is a second EBS volume, attached as /dev/xvde and mounted at /mnt/ebs, because the standard AMI root drive is very small. The Docker data directory lives there, and the kubelet directory is symlinked to it (/var/lib/kubelet -> /mnt/ebs/kubelet), roughly as sketched below. I will test this deployment WITHOUT the extra EBS volume and symlinks to verify.
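For completeness, this is roughly how that extra volume is wired in on each worker (approximate provisioning steps, not my exact script):

# Format and mount the extra EBS volume attached as /dev/xvde
mkfs -t ext4 /dev/xvde
mkdir -p /mnt/ebs
mount /dev/xvde /mnt/ebs

# Docker's data directory lives under /mnt/ebs, and the kubelet directory is a symlink into it
mkdir -p /mnt/ebs/kubelet
ln -s /mnt/ebs/kubelet /var/lib/kubelet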

Most upvoted comments

I’ve been repeatedly testing this one. This might be a case where a non-Kubernetes-managed EBS volume attached to the instance causes issues. I have been testing with all the workers having ONLY a root volume (no /mnt/ebs as mentioned in the original issue). So far, no Kubernetes EBS volumes have gotten stuck detaching.

Will continue to test.