azuredisk-csi-driver: Snapshot restore PVC cannot mount

What happened: External Storage [Driver: disk.csi.azure.com] [Testpattern: Pre-provisioned Snapshot (delete policy)] snapshottable[Feature:VolumeSnapshotDataSource] volume snapshot controller should check snapshot fields, check restore correctly works after modifying source data, check deletion

I am running the snapshottable e2e test against the azuredisk driver. When running this test, the pod using the restored PVC fails to start.

The following events occur during mount:

Normal   Scheduled               <unknown>            default-scheduler                          Successfully assigned snapshotting-1543 /restored-pvc-tester-9jxsp to nodepool-f985db88-68ff3c05000001                                                                             
Normal   SuccessfulAttachVolume  3m38s                attachdetach-controller                    AttachVolume.Attach succeeded for volume "pvc-94f46f6d-9d0d-4ec9-b8c3-fce0564d401f"                                                                                               
Warning  FailedMount             110s                 kubelet, nodepool-f985db88-68ff3c05000001  Unable to attach or mount volumes: unmounted volumes=[my-volume], unattached volumes=[my-volume default-token-5bmjn]: timed out waiting for the condition                         
Warning  FailedMount             83s (x9 over 3m37s)  kubelet, nodepool-f985db88-68ff3c05000001  MountVolume.MountDevice failed for volume "pvc-94f46f6d-9d0d-4ec9-b8c3-fce0564d401f" : rpc error: code = Internal desc = could not format "/dev/disk/azure/scsi1/lun2"(lun: "2"), and mount it at "/var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-94f46f6d-9d0d-4ec9-b8c3-fce0564d401f/globalmount"

The following is the log from the azuredisk CSI driver:

2020-11-20T18:42:09.570726156Z stderr F I1120 18:42:09.570645       1 utils.go:108] GRPC call: /csi.v1.Controller/ControllerPublishVolume
2020-11-20T18:42:09.573608274Z stderr F I1120 18:42:09.570762       1 utils.go:109] GRPC request: {"node_id":"nodepool-f985db88-68ff3c05000001","volume_capability":{"AccessType":{"Mount":{"fs_type":"ext4"}},"access_mode":{"mode":1}},"volume_context":{"csi.storage.k8s.io/pv/name":"pvc-94f46f6d-9d0d-4ec9-b8c3-fce0564d401f","csi.storage.k8s.io/pvc/name":"pvc-4phg8","csi.storage.k8s.io/pvc/namespace":"snapshotting-1543","skuName":"StandardSSD_LRS","storage.kubernetes.io/csiProvisionerIdentity":"1605893975912-8081-disk.csi.azure.com"},"volume_id":"/subscriptions/e317f310-5d67-42b3-8d6d-2a8c757f8cb8/resourceGroups/anthos_cluster_f985db88/providers/Microsoft.Compute/disks/pvc-94f46f6d-9d0d-4ec9-b8c3-fce0564d401f"}
2020-11-20T18:42:09.607054884Z stderr F I1120 18:42:09.599641       1 controllerserver.go:402] GetDiskLun returned: <nil>. Initiating attaching volume "/subscriptions/e317f310-5d67-42b3-8d6d-2a8c757f8cb8/resourceGroups/anthos_cluster_f985db88/providers/Microsoft.Compute/disks/pvc-94f46f6d-9d0d-4ec9-b8c3-fce0564d401f" to node "nodepool-f985db88-68ff3c05000001".
2020-11-20T18:42:09.607073084Z stderr F I1120 18:42:09.599676       1 azure_vmss.go:185] Couldn't find VMSS VM with nodeName nodepool-f985db88-68ff3c05000001, refreshing the cache
2020-11-20T18:42:09.662052827Z stderr F I1120 18:42:09.661792       1 controllerserver.go:419] Trying to attach volume "/subscriptions/e317f310-5d67-42b3-8d6d-2a8c757f8cb8/resourceGroups/anthos_cluster_f985db88/providers/Microsoft.Compute/disks/pvc-94f46f6d-9d0d-4ec9-b8c3-fce0564d401f" to node "nodepool-f985db88-68ff3c05000001"
2020-11-20T18:42:09.731785063Z stderr F I1120 18:42:09.731666       1 azure_controller_common.go:183] Trying to attach volume "/subscriptions/e317f310-5d67-42b3-8d6d-2a8c757f8cb8/resourceGroups/anthos_cluster_f985db88/providers/Microsoft.Compute/disks/pvc-94f46f6d-9d0d-4ec9-b8c3-fce0564d401f" lun 2 to node "nodepool-f985db88-68ff3c05000001".
2020-11-20T18:42:09.731925464Z stderr F I1120 18:42:09.731862       1 azure_controller_vmss.go:101] azureDisk - update(anthos_cluster_f985db88): vm(nodepool-f985db88-68ff3c05000001) - attach disk(pvc-94f46f6d-9d0d-4ec9-b8c3-fce0564d401f, /subscriptions/e317f310-5d67-42b3-8d6d-2a8c757f8cb8/resourceGroups/anthos_cluster_f985db88/providers/Microsoft.Compute/disks/pvc-94f46f6d-9d0d-4ec9-b8c3-fce0564d401f) with DiskEncryptionSetID()
2020-11-20T18:42:15.287784494Z stderr F I1120 18:42:15.287706       1 utils.go:108] GRPC call: /csi.v1.Controller/ControllerUnpublishVolume
2020-11-20T18:42:15.288790301Z stderr F I1120 18:42:15.287983       1 utils.go:109] GRPC request: {"node_id":"nodepool-f985db88-68ff3c05000001","volume_id":"/subscriptions/e317f310-5d67-42b3-8d6d-2a8c757f8cb8/resourceGroups/anthos_cluster_f985db88/providers/Microsoft.Compute/disks/pvc-b302c090-9204-4071-a506-1977f5cbe141"}
2020-11-20T18:42:15.288911101Z stderr F I1120 18:42:15.288861       1 controllerserver.go:470] Trying to detach volume /subscriptions/e317f310-5d67-42b3-8d6d-2a8c757f8cb8/resourceGroups/anthos_cluster_f985db88/providers/Microsoft.Compute/disks/pvc-b302c090-9204-4071-a506-1977f5cbe141 from node nodepool-f985db88-68ff3c05000001
2020-11-20T18:42:15.289058102Z stderr F I1120 18:42:15.288995       1 azure_controller_common.go:208] detach /subscriptions/e317f310-5d67-42b3-8d6d-2a8c757f8cb8/resourceGroups/anthos_cluster_f985db88/providers/Microsoft.Compute/disks/pvc-b302c090-9204-4071-a506-1977f5cbe141 from node "nodepool-f985db88-68ff3c05000001"
2020-11-20T18:42:24.95906011Z stderr F I1120 18:42:24.958952       1 azure_controller_vmss.go:113] azureDisk - update(anthos_cluster_f985db88): vm(nodepool-f985db88-68ff3c05000001) - attach disk(pvc-94f46f6d-9d0d-4ec9-b8c3-fce0564d401f, /subscriptions/e317f310-5d67-42b3-8d6d-2a8c757f8cb8/resourceGroups/anthos_cluster_f985db88/providers/Microsoft.Compute/disks/pvc-94f46f6d-9d0d-4ec9-b8c3-fce0564d401f)  returned with <nil>
2020-11-20T18:42:24.959193211Z stderr F I1120 18:42:24.959146       1 controllerserver.go:423] Attach operation successful: volume "/subscriptions/e317f310-5d67-42b3-8d6d-2a8c757f8cb8/resourceGroups/anthos_cluster_f985db88/providers/Microsoft.Compute/disks/pvc-94f46f6d-9d0d-4ec9-b8c3-fce0564d401f" attached to node "nodepool-f985db88-68ff3c05000001".
2020-11-20T18:42:24.959266711Z stderr F I1120 18:42:24.959225       1 controllerserver.go:438] attach volume "/subscriptions/e317f310-5d67-42b3-8d6d-2a8c757f8cb8/resourceGroups/anthos_cluster_f985db88/providers/Microsoft.Compute/disks/pvc-94f46f6d-9d0d-4ec9-b8c3-fce0564d401f" to node "nodepool-f985db88-68ff3c05000001" successfully
2020-11-20T18:42:24.959777715Z stderr F I1120 18:42:24.959319       1 utils.go:114] GRPC response: {"publish_context":{"LUN":"2"}}
2020-11-20T18:42:24.959914715Z stderr F I1120 18:42:24.959879       1 azure_vmss.go:185] Couldn't find VMSS VM with nodeName nodepool-f985db88-68ff3c05000001, refreshing the cache
2020-11-20T18:42:25.020763898Z stderr F I1120 18:42:25.020664       1 azure_controller_vmss.go:145] azureDisk - detach disk: name "pvc-b302c090-9204-4071-a506-1977f5cbe141" uri "/subscriptions/e317f310-5d67-42b3-8d6d-2a8c757f8cb8/resourceGroups/anthos_cluster_f985db88/providers/Microsoft.Compute/disks/pvc-b302c090-9204-4071-a506-1977f5cbe141"
2020-11-20T18:42:25.020853799Z stderr F I1120 18:42:25.020817       1 azure_controller_vmss.go:171] azureDisk - update(anthos_cluster_f985db88): vm(nodepool-f985db88-68ff3c05000001) - detach disk(pvc-b302c090-9204-4071-a506-1977f5cbe141, /subscriptions/e317f310-5d67-42b3-8d6d-2a8c757f8cb8/resourceGroups/anthos_cluster_f985db88/providers/Microsoft.Compute/disks/pvc-b302c090-9204-4071-a506-1977f5cbe141)
2020-11-20T18:42:40.215779083Z stderr F I1120 18:42:40.215669       1 azure_controller_vmss.go:183] azureDisk - update(anthos_cluster_f985db88): vm(nodepool-f985db88-68ff3c05000001) - detach disk(pvc-b302c090-9204-4071-a506-1977f5cbe141, /subscriptions/e317f310-5d67-42b3-8d6d-2a8c757f8cb8/resourceGroups/anthos_cluster_f985db88/providers/Microsoft.Compute/disks/pvc-b302c090-9204-4071-a506-1977f5cbe141) returned with <nil>
2020-11-20T18:42:40.215917284Z stderr F I1120 18:42:40.215875       1 azure_controller_common.go:242] azureDisk - detach disk(pvc-b302c090-9204-4071-a506-1977f5cbe141, /subscriptions/e317f310-5d67-42b3-8d6d-2a8c757f8cb8/resourceGroups/anthos_cluster_f985db88/providers/Microsoft.Compute/disks/pvc-b302c090-9204-4071-a506-1977f5cbe141) succeeded
2020-11-20T18:42:40.215988784Z stderr F I1120 18:42:40.215952       1 controllerserver.go:479] detach volume /subscriptions/e317f310-5d67-42b3-8d6d-2a8c757f8cb8/resourceGroups/anthos_cluster_f985db88/providers/Microsoft.Compute/disks/pvc-b302c090-9204-4071-a506-1977f5cbe141 from node nodepool-f985db88-68ff3c05000001 successfully
2020-11-20T18:42:40.216452587Z stderr F I1120 18:42:40.216022       1 utils.go:114] GRPC response: {}

What you expected to happen: The pre-provisioned snapshot test passes.

How to reproduce it: Run the snapshottable e2e test against the azuredisk driver; a sketch of the invocation is shown below.
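
As a rough sketch (not verified against this exact setup): assuming the e2e.test binary has been built from kubernetes/kubernetes and a test-driver.yaml manifest exists for disk.csi.azure.com, the external-storage snapshot tests can be focused like this; the paths and focus regex here are placeholders:

# Run only the snapshottable external-storage tests against the azuredisk CSI driver
./e2e.test \
  -ginkgo.focus='External.Storage.*snapshottable' \
  -storage.testdriver=/path/to/test-driver.yaml \
  -kubeconfig="$HOME/.kube/config"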

Anything else we need to know?:

Environment:

  • CSI Driver version: 0.9.0
  • Kubernetes version (use kubectl version):
  • OS (e.g. from /etc/os-release):
  • Kernel (e.g. uname -a):
  • Install tools:
  • Others:

/cc @msau42

Most upvoted comments

@Jiawei0227 As discussed earlier today, I tried to repro this on a single node, i.e. without the e2e test, and the issue does appear to be OS/multipath specific. I don’t see multipath enabled on Ubuntu 18.04, and reading online suggests that it’s enabled by default on 20.04; that’s probably why it repros on your end and not mine.

To find the right resolution, can you send me the contents of /etc/multipath.conf? It appears that the simplest fix would be to set something like this in the config at boot, or to disable multipath altogether:

blacklist {
    devnode "^sd[a-z]"    # blacklist all SCSI devices
}
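
If that blacklist works for you, it should be possible to apply it without a reboot, roughly like this (assuming multipathd is managed by systemd; the map name mpatha is taken from the multipath -ll output further down and may differ on your node):

# Pick up the updated /etc/multipath.conf and drop the stale map claiming the disk
sudo systemctl reload multipathd
sudo multipath -f mpatha    # map name as reported by 'multipath -ll'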

Also, where exactly are you picking up this OS image from: is it from the Azure catalog, or is it a custom image on your end?

We had a debug session and found some more interesting information. It seems the bug is related to multipath. We tried attaching the restored PVC to a freshly created node and it succeeded without any issue, which proves that the snapshot itself is correct.

Somehow, when we attach the disk to the original node, there is a multipath service running, whereas on the new node there is not.
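
For reference, a quick way to compare the two nodes is to check whether the daemon is active and whether a dm-* holder sits on top of the attached disk (standard systemd/util-linux commands; the device names below are just the ones from the multipath output shown later):

systemctl is-active multipathd      # is the multipath daemon running on this node?
lsblk /dev/sdc /dev/sdd             # a dm-* child under the disk means multipath has claimed it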

Another interesting experiment: we attached both the new and the old PVC to the old node, and multipath treated the two devices as a single device with two different paths.

ubuntu@vmss-np-a3d27726-anthos-ae2ff2ea000000:/dev$ sudo multipath -ll
mpatha (3600224801400649cefe46cf53e8f174b) dm-0 Msft,Virtual Disk
size=5.0G features='0' hwhandler='0' wp=rw
|-+- policy='service-time 0' prio=1 status=active
| `- 1:0:0:1 sdd 8:48 active ready running
`-+- policy='service-time 0' prio=1 status=enabled
  `- 1:0:0:0 sdc 8:32 active ready running
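
The output above shows both sdc and sdd grouped under a single WWID (3600224801400649cefe46cf53e8f174b), i.e. multipath considers them the same device. That can be double-checked by printing the SCSI identifier each disk reports (scsi_id lives under /lib/udev on Ubuntu; device names as above):

# Identical WWIDs here explain why multipath coalesces the two disks into one map
sudo /lib/udev/scsi_id -g -u /dev/sdc
sudo /lib/udev/scsi_id -g -u /dev/sdd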

However, on the new node no multipath map is created when the disk is attached.

Will do some more investigation on it.