kubernetes: Second Pod Using Same PVC Fails WaitForAttach Flakily
The test `[sig-storage] CSI Volumes CSI plugin test using CSI driver: hostPath should provision storage` flakes very occasionally. Below are our findings on the circumstances of the flake.
During `testDynamicProvisioning`, two pods are created in quick succession. The test flakes when the second Pod fails to come up with the following error:
MountVolume.WaitForAttach failed for volume "pvc-bc2294b873ef11e8" : resource name may not be empty
Exploring the logs, we discovered that there was only one Attach call, made at the beginning, and one Detach call upon timeout of the test (see timestamps):
I0613 19:00:41.888473 1 reconciler.go:291] attacherDetacher.AttachVolume started for volume "pvc-0c896e426f3c11e8" (UniqueName: "kubernetes.io/csi/csi-hostpath^0da20430-6f3c-11e8-a80a-0a580a40016c") from node "e2e-40645-674b9-minion-group-gz10"
I0613 19:00:42.533343 1 operation_generator.go:335] AttachVolume.Attach succeeded for volume "pvc-0c896e426f3c11e8" (UniqueName: "kubernetes.io/csi/csi-hostpath^0da20430-6f3c-11e8-a80a-0a580a40016c") from node "e2e-40645-674b9-minion-group-gz10"
I0613 19:15:52.528902 1 reconciler.go:233] attacherDetacher.DetachVolume started for volume "pvc-0c896e426f3c11e8" (UniqueName: "kubernetes.io/csi/csi-hostpath^0da20430-6f3c-11e8-a80a-0a580a40016c") on node "e2e-40645-674b9-minion-group-gz10"
I0613 19:15:52.615262 1 operation_generator.go:1225] Verified volume is safe to detach for volume "pvc-0c896e426f3c11e8" (UniqueName: "kubernetes.io/csi/csi-hostpath^0da20430-6f3c-11e8-a80a-0a580a40016c") on node "e2e-40645-674b9-minion-group-gz10"
I0613 19:15:53.367487 1 operation_generator.go:419] DetachVolume.Detach succeeded for volume "pvc-0c896e426f3c11e8" (UniqueName: "kubernetes.io/csi/csi-hostpath^0da20430-6f3c-11e8-a80a-0a580a40016c") on node "e2e-40645-674b9-minion-group-gz10"
Based on the error message returned, we tracked down the `resource name may not be empty` error to the likely culprit in `func waitForVolumeAttachmentInternal`:
attach, err := c.k8s.StorageV1beta1().VolumeAttachments().Get(attachID, meta.GetOptions{})
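For illustration, here is a minimal, hypothetical sketch (the client construction, kubeconfig path, and variable names are assumptions, not the test code; it uses the same pre-context client-go `Get` signature as the line above) showing that client-go rejects an empty resource name before the request is even sent, producing exactly this error:

```go
package main

import (
	"fmt"

	meta "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Hypothetical client setup; inside the kubelet this client already exists as c.k8s.
	config, err := clientcmd.BuildConfigFromFlags("", "/path/to/kubeconfig") // placeholder path
	if err != nil {
		panic(err)
	}
	client, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	// If attachID is empty, client-go refuses to build the request and returns
	// the error seen in the flake.
	attachID := ""
	_, err = client.StorageV1beta1().VolumeAttachments().Get(attachID, meta.GetOptions{})
	fmt.Println(err) // "resource name may not be empty"
}
```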
This is likely caused by the `attachID` string being empty [unconfirmed], which is what the Kubelet would get from the `DevicePath` field on the Node object. This `DevicePath` field is normally added on Attach and removed on Detach of the volume.
The `DevicePath` is clearly present on the Node when the first pod is brought up, but it is possible that it is somehow missing, or cannot be retrieved, by the time the second Pod is starting up (and it remains unavailable until the 15m test timeout is exhausted), even though no Detach operation was performed on the volume in the meantime (according to the logs). We are currently unable to reproduce this flake on command; all analysis is based on the logs from the two flakes that occurred naturally across the hundreds of runs of the test we performed.
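To make the suspected failure mode easier to inspect, here is a small sketch (the kubeconfig path is a placeholder; the node name is taken from the logs above) that dumps the `VolumesAttached` entries on the Node, which is where the `DevicePath` that later becomes `attachID` is published on Attach and cleared on Detach:

```go
package main

import (
	"fmt"

	meta "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	config, err := clientcmd.BuildConfigFromFlags("", "/path/to/kubeconfig") // placeholder path
	if err != nil {
		panic(err)
	}
	client, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	// Node name taken from the attach/detach controller logs above.
	node, err := client.CoreV1().Nodes().Get("e2e-40645-674b9-minion-group-gz10", meta.GetOptions{})
	if err != nil {
		panic(err)
	}

	// Each AttachedVolume pairs the unique volume name with the DevicePath that is
	// set on Attach and removed on Detach; for CSI volumes this is the value the
	// kubelet hands to WaitForAttach as attachID.
	for _, av := range node.Status.VolumesAttached {
		fmt.Printf("volume=%s devicePath=%q\n", av.Name, av.DevicePath)
	}
}
```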
This flake was originally seen in #65064 and the test logs can be found here: https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/pr-logs/pull/65064/pull-kubernetes-e2e-gce/40645/
Environment: GCE
- Kubernetes version (use `kubectl version`):
Client Version: version.Info{Major:"1", Minor:"8", GitVersion:"v1.8.6", GitCommit:"6260bb08c46c31eea6cb538b34a9ceb3e406689c", GitTreeState:"clean", BuildDate:"2017-12-21T06:34:11Z", GoVersion:"go1.8.3", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"12+", GitVersion:"v1.12.0-alpha.0.975+98194b8630958e", GitCommit:"98194b8630958ee48db662f42404dc0bc6a2ed80", GitTreeState:"clean", BuildDate:"2018-06-19T18:16:46Z", GoVersion:"go1.10.2", Compiler:"gc", Platform:"linux/amd64"}
- Cloud provider or hardware configuration: GCP
- OS (e.g. from /etc/os-release):
BUILD_ID=10323.64.0
NAME="Container-Optimized OS"
KERNEL_COMMIT_ID=94bdaa2af49562e7aba0168cae050911d105f5cb
GOOGLE_CRASH_ID=Lakitu
VERSION_ID=65
BUG_REPORT_URL=https://crbug.com/new
PRETTY_NAME="Container-Optimized OS from Google"
VERSION=65
GOOGLE_METRICS_PRODUCT_ID=26
HOME_URL="https://cloud.google.com/compute/docs/containers/vm-image/"
ID=cos
- Kernel (e.g. `uname -a`): Linux e2e-test-dyzz-master 4.4.111+ #1 SMP Wed Mar 28 13:44:38 PDT 2018 x86_64 Intel(R) Xeon(R) CPU @ 2.30GHz GenuineIntel GNU/Linux
/kind flake
/kind bug
/sig storage
/cc @msau42 @jingxu97 @saad-ali @verult
About this issue
- State: closed
- Created 6 years ago
- Comments: 45 (38 by maintainers)
I will have a look. Right now the log output of the sidecar containers and the CSI driver is not captured, which makes debugging a bit hard. In PR #69868 I am adding code which includes those messages in the test log. Let me finish that PR; once it is merged, debugging the next test flake might provide some better insight into what went wrong.
Kubernetes v1.13 speaks CSI v1.0 and no longer understands CSI v0.x. Any CSI drivers deployed on previous versions of Kubernetes speaking CSI v0.x must be removed, updated to be CSI 1.0 compatible, and redeployed on Kubernetes v1.13.
If only the master is upgraded to 1.13 (and nodes are 1.12 or earlier) older drivers will continue to work. When nodes are upgraded to 1.13, then older drivers will stop working.
I will capture this in detail in the release notes “Action Required” section.
@pohly Instead of https://github.com/kubernetes/kubernetes/pull/70746, which may have negative side effects for non-CSI volumes, let’s do this instead: change the `WaitForAttach(...)` method in https://github.com/kubernetes/kubernetes/blob/master/pkg/volume/csi/csi_attacher.go#L126 to ignore the passed-in `attachID` string (which is populated from `devicePath`). Instead, let’s have that method generate the `attachID` from scratch, e.g. `attachID := getAttachmentName(volID, driverName, string(nodeName))`.
This will remove any CSI dependency on `devicePath` and not affect other volume plugins.
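For illustration, a minimal sketch of that deterministic derivation; the hashing here is assumed to mirror the upstream `getAttachmentName` helper in `pkg/volume/csi`, and the values are taken from the logs earlier in this issue, so treat it as a sketch rather than the final patch:

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

// getAttachmentName derives the VolumeAttachment object name deterministically from
// the CSI volume handle, driver name, and node name, so WaitForAttach can recompute
// it without ever reading DevicePath from the Node object.
// (Assumed to match the upstream helper; shown here only as an illustration.)
func getAttachmentName(volID, driverName, nodeName string) string {
	result := sha256.Sum256([]byte(fmt.Sprintf("%s%s%s", volID, driverName, nodeName)))
	return fmt.Sprintf("csi-%x", result)
}

func main() {
	attachID := getAttachmentName(
		"0da20430-6f3c-11e8-a80a-0a580a40016c", // CSI volume handle from the logs above
		"csi-hostpath",                         // driver name
		"e2e-40645-674b9-minion-group-gz10",    // node name
	)
	// Same name the attach/detach controller used when creating the VolumeAttachment,
	// so an empty DevicePath on the Node no longer matters.
	fmt.Println(attachID)
}
```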