vsphere-csi-driver: CNS volumes disappear and all in-cluster volume operations fail

Is this a BUG REPORT or FEATURE REQUEST?:

/kind bug

What happened: In two separate Kubernetes clusters we have observed failures when pods are rescheduled and, as a result, their PVs have to be attached to a new node. The CNS volume object seems to simply disappear, while the backing FCD is still intact on the vSAN datastore. Nothing has triggered a deletion as far as I can see…
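For reference, one way to cross-check this from the CLI with govc (a sketch only; the datastore name below is a placeholder for our vSAN datastore, and the disk ID is the one from the events that follow):

# CNS no longer has any record of the volume
govc volume.ls eb05ab8b-dcd0-4217-9c34-3ec8bda666a9

# ...but the backing FCD is still present on the datastore
govc disk.ls -ds vsanDatastore eb05ab8b-dcd0-4217-9c34-3ec8bda666a9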

Events from the Kubernetes side:

kubectl get event -n workload
LAST SEEN   TYPE      REASON               OBJECT                       MESSAGE
26s         Normal    Scheduled            pod/redis-754fbf4bd-26lbf    Successfully assigned workload/redis-754fbf4bd-26lbf to worker3
11s         Warning   FailedAttachVolume   pod/redis-754fbf4bd-26lbf    AttachVolume.Attach failed for volume "pvc-202662bf-3ce7-40a0-96af-d22d58198dce" : rpc error: code = Internal desc = failed to attach disk: "eb05ab8b-dcd0-4217-9c34-3ec8bda666a9" with node: "worker3" err ServerFaultCode: Received SOAP response fault from [<cs p:00007fa4dc0a6290, TCP:localhost:443>]: retrieveVStorageObject

Logs from the vsphere-csi-controller pod:

2021-09-17 12:27:36	
I0917 09:27:36.946521       1 reflector.go:381] k8s.io/client-go/informers/factory.go:134: forcing resync
2021-09-17 12:27:36	
I0917 09:27:36.340379       1 reflector.go:381] k8s.io/client-go/informers/factory.go:134: forcing resync
2021-09-17 12:27:35	
I0917 09:27:35.969819       1 csi_handler.go:226] Error processing "csi-697720ade9eeaa3b9851f3276fb8a3270cda2ff287e7a44690e09c7a9b3bcfba": failed to detach: rpc error: code = Internal desc = volumeID "eb05ab8b-dcd0-4217-9c34-3ec8bda666a9" not found in QueryVolume
2021-09-17 12:27:35	
I0917 09:27:35.969770       1 csi_handler.go:612] Saved detach error to "csi-697720ade9eeaa3b9851f3276fb8a3270cda2ff287e7a44690e09c7a9b3bcfba"
2021-09-17 12:27:35	
I0917 09:27:35.967871       1 controller.go:158] Ignoring VolumeAttachment "csi-697720ade9eeaa3b9851f3276fb8a3270cda2ff287e7a44690e09c7a9b3bcfba" change
2021-09-17 12:27:35	
I0917 09:27:35.952349       1 csi_handler.go:601] Saving detach error to "csi-697720ade9eeaa3b9851f3276fb8a3270cda2ff287e7a44690e09c7a9b3bcfba"
2021-09-17 12:27:35	
{"level":"error","time":"2021-09-17T09:27:35.951799411Z","caller":"vanilla/controller.go:883","msg":"volumeID \"eb05ab8b-dcd0-4217-9c34-3ec8bda666a9\" not found in QueryVolume","TraceId":"e3830e8e-5954-433b-a777-e5623668c7b3","stacktrace":"sigs.k8s.io/vsphere-csi-driver/pkg/csi/service/vanilla.(*controller).ControllerUnpublishVolume.func1\n\t/build/pkg/csi/service/vanilla/controller.go:883\nsigs.k8s.io/vsphere-csi-driver/pkg/csi/service/vanilla.(*controller).ControllerUnpublishVolume\n\t/build/pkg/csi/service/vanilla/controller.go:937\ngithub.com/container-storage-interface/spec/lib/go/csi._Controller_ControllerUnpublishVolume_Handler.func1\n\t/go/pkg/mod/github.com/container-storage-interface/spec@v1.2.0/lib/go/csi/csi.pb.go:5200\ngithub.com/rexray/gocsi/middleware/serialvolume.(*interceptor).controllerUnpublishVolume\n\t/go/pkg/mod/github.com/rexray/gocsi@v1.2.2/middleware/serialvolume/serial_volume_locker.go:141\ngithub.com/rexray/gocsi/middleware/serialvolume.(*interceptor).handle\n\t/go/pkg/mod/github.com/rexray/gocsi@v1.2.2/middleware/serialvolume/serial_volume_locker.go:88\ngithub.com/rexray/gocsi/utils.ChainUnaryServer.func2.1.1\n\t/go/pkg/mod/github.com/rexray/gocsi@v1.2.2/utils/utils_middleware.go:99\ngithub.com/rexray/gocsi/middleware/specvalidator.(*interceptor).handleServer.func1\n\t/go/pkg/mod/github.com/rexray/gocsi@v1.2.2/middleware/specvalidator/spec_validator.go:178\ngithub.com/rexray/gocsi/middleware/specvalidator.(*interceptor).handle\n\t/go/pkg/mod/github.com/rexray/gocsi@v1.2.2/middleware/specvalidator/spec_validator.go:218\ngithub.com/rexray/gocsi/middleware/specvalidator.(*interceptor).handleServer\n\t/go/pkg/mod/github.com/rexray/gocsi@v1.2.2/middleware/specvalidator/spec_validator.go:177\ngithub.com/rexray/gocsi/utils.ChainUnaryServer.func2.1.1\n\t/go/pkg/mod/github.com/rexray/gocsi@v1.2.2/utils/utils_middleware.go:99\ngithub.com/rexray/gocsi.(*StoragePlugin).injectContext\n\t/go/pkg/mod/github.com/rexray/gocsi@v1.2.2/middleware.go:231\ngithub.com/rexray/gocsi/utils.ChainUnaryServer.func2.1.1\n\t/go/pkg/mod/github.com/rexray/gocsi@v1.2.2/utils/utils_middleware.go:99\ngithub.com/rexray/gocsi/utils.ChainUnaryServer.func2\n\t/go/pkg/mod/github.com/rexray/gocsi@v1.2.2/utils/utils_middleware.go:106\ngithub.com/container-storage-interface/spec/lib/go/csi._Controller_ControllerUnpublishVolume_Handler\n\t/go/pkg/mod/github.com/container-storage-interface/spec@v1.2.0/lib/go/csi/csi.pb.go:5202\ngoogle.golang.org/grpc.(*Server).processUnaryRPC\n\t/go/pkg/mod/google.golang.org/grpc@v1.26.0/server.go:1024\ngoogle.golang.org/grpc.(*Server).handleStream\n\t/go/pkg/mod/google.golang.org/grpc@v1.26.0/server.go:1313\ngoogle.golang.org/grpc.(*Server).serveStreams.func1.1\n\t/go/pkg/mod/google.golang.org/grpc@v1.26.0/server.go:722"}
2021-09-17 12:27:35	
{"level":"info","time":"2021-09-17T09:27:35.904885335Z","caller":"vanilla/controller.go:857","msg":"ControllerUnpublishVolume: called with args {VolumeId:eb05ab8b-dcd0-4217-9c34-3ec8bda666a9 NodeId:worker3 Secrets:map[] XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}","TraceId":"e3830e8e-5954-433b-a777-e5623668c7b3"}
2021-09-17 12:27:35	
I0917 09:27:35.903737       1 csi_handler.go:715] Found NodeID worker3 in CSINode worker3

And some corresponding entries from the vsanvcmgmtd log on vCenter:

2021-09-15T22:18:29.157Z verbose vsanvcmgmtd[10186] [vSAN@6876 sub=PyBackedMO opId=sps-Main-13723-334-8a1a]  Enter vasa.NotificationManager.getAlarm, Pending: 1 (5269718e-b4b7-0511-b82b-b8f707de46d5)
2021-09-15T22:18:29.157Z verbose vsanvcmgmtd[10186] [vSAN@6876 sub=PyBackedMO opId=sps-Main-13723-334-8a1a]  Exit  vasa.NotificationManager.getAlarm (0 ms)
2021-09-15T22:18:38.683Z info vsanvcmgmtd[07066] [vSAN@6876 sub=AccessChecker] User CLISTER.LOCAL\user@10.170.100.178 was authenticated with soap session id. 52db629f-f84a-15b2-6401-b41b25af9ec7 (52a55bd9-9ac9-d4bd-d9f0-d6145bb4f7a5)
2021-09-15T22:18:38.704Z verbose vsanvcmgmtd[07065] [vSAN@6876 sub=PyBackedMO opId=07dc8a1b]  Enter vim.cns.VolumeManager.queryAll, Pending: 1 (52db629f-f84a-15b2-6401-b41b25af9ec7)
2021-09-15T22:18:38.706Z verbose vsanvcmgmtd[07065] [vSAN@6876 sub=PyBackedMO opId=07dc8a1b]  Exit  vim.cns.VolumeManager.queryAll (1 ms)
2021-09-15T22:18:38.731Z verbose vsanvcmgmtd[07104] [vSAN@6876 sub=PyBackedMO opId=07dc8a1c]  Enter vim.cns.VolumeManager.query, Pending: 1 (52db629f-f84a-15b2-6401-b41b25af9ec7)
2021-09-15T22:18:38.885Z verbose vsanvcmgmtd[07104] [vSAN@6876 sub=PyBackedMO opId=07dc8a1c]  Exit  vim.cns.VolumeManager.query (154 ms)
2021-09-15T22:18:38.923Z verbose vsanvcmgmtd[10186] [vSAN@6876 sub=PyBackedMO opId=07dc8a1d]  Enter vim.cns.VolumeManager.create, Pending: 1 (52db629f-f84a-15b2-6401-b41b25af9ec7)
2021-09-15T22:18:38.928Z info vsanvcmgmtd[10186] [vSAN@6876 sub=VolumeManager opId=07dc8a1d] CNS: create volume task created: task-314869
2021-09-15T22:18:38.928Z info vsanvcmgmtd[10186] [vSAN@6876 sub=VolumeManager opId=07dc8a1d] CNS: Creating volume task scheduled.
2021-09-15T22:18:38.928Z verbose vsanvcmgmtd[10186] [vSAN@6876 sub=PyBackedMO opId=07dc8a1d]  Exit  vim.cns.VolumeManager.create (4 ms)
2021-09-15T22:18:38.933Z info vsanvcmgmtd[25351] [vSAN@6876 sub=VolumeManager opId=07dc8a1d] CNS: Creating volume task started
2021-09-15T22:18:39.029Z error vsanvcmgmtd[25351] [vSAN@6876 sub=VolumeManager opId=07dc8a1d] CNS: backingDiskId not found: eb05ab8b-dcd0-4217-9c34-3ec8bda666a9, N3Vim5Fault8NotFound9ExceptionE(Fault cause: vim.fault.NotFound
--> )
--> [context]zKq7AVECAAAAAFnREgEVdnNhbnZjbWdtdGQAAFy7KmxpYnZtYWNvcmUuc28AAAw5GwB+uxgBNG38bGlidmltLXR5cGVzLnNvAIG7mg8BgVrlDwECtWEObGlidm1vbWkuc28AAspdEQJXXxEDHuoCbGliUHlDcHBWbW9taS5zbwACHJISAvGKEgTozgNsaWJ2c2xtLXR5cGVzLnNvAAXF3AdfY25zLnNvAAVN1gUFixQGABYkIwCSJiMA5RIrBtRzAGxpYnB0aHJlYWQuc28uMAAHzY4ObGliYy5zby42AA==[/context]
2021-09-15T22:18:39.031Z info vsanvcmgmtd[25351] [vSAN@6876 sub=VolumeManager opId=07dc8a1d] Create volume completed: task-314869
2021-09-15T22:18:39.054Z verbose vsanvcmgmtd[12908] [vSAN@6876 sub=PyBackedMO opId=07dc8a1e]  Enter vim.cns.VolumeManager.create, Pending: 1 (52db629f-f84a-15b2-6401-b41b25af9ec7)
2021-09-15T22:18:39.057Z info vsanvcmgmtd[12908] [vSAN@6876 sub=VolumeManager opId=07dc8a1e] CNS: create volume task created: task-314870
2021-09-15T22:18:39.057Z info vsanvcmgmtd[12908] [vSAN@6876 sub=VolumeManager opId=07dc8a1e] CNS: Creating volume task scheduled.
2021-09-15T22:18:39.057Z verbose vsanvcmgmtd[12908] [vSAN@6876 sub=PyBackedMO opId=07dc8a1e]  Exit  vim.cns.VolumeManager.create (3 ms)
2021-09-15T22:18:39.062Z info vsanvcmgmtd[43408] [vSAN@6876 sub=VolumeManager opId=07dc8a1e] CNS: Creating volume task started
2021-09-15T22:18:39.158Z verbose vsanvcmgmtd[07049] [vSAN@6876 sub=PyBackedMO opId=sps-Main-13723-334-8a1f]  Enter vasa.NotificationManager.getAlarm, Pending: 1 (5269718e-b4b7-0511-b82b-b8f707de46d5)
2021-09-15T22:18:39.158Z verbose vsanvcmgmtd[07049] [vSAN@6876 sub=PyBackedMO opId=sps-Main-13723-334-8a1f]  

What you expected to happen:

CNS volume objects should not just disappear.

How to reproduce it (as minimally and precisely as possible):

Anything else we need to know?:

Environment:

  • csi-vsphere version: (image attachment)
  • vsphere-cloud-controller-manager version: v1.20.0
  • Kubernetes version: 1.20.10
  • vSphere version: 6.7u3 (6.7.0.48000)
  • OS (e.g. from /etc/os-release): ubuntu 20.04
  • Kernel (e.g. uname -a): 5.4.0-65-generic #73-Ubuntu SMP Mon Jan 18 17:25:17 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
  • Install tools:
  • Others:

About this issue

  • Original URL
  • State: closed
  • Created 3 years ago
  • Comments: 26 (8 by maintainers)

Most upvoted comments

@jingxu97 Have you deployed the 2 K8s clusters with the same cluster-id? We have seen similar behavior in the past when two K8s clusters share the same cluster-id.
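For anyone checking this: the cluster-id is set in the [Global] section of csi-vsphere.conf, which is stored in the driver's config secret, and it has to be unique for every Kubernetes cluster registered against the same vCenter. A rough way to compare it across clusters (the secret name and namespace below are common defaults and may differ per install, e.g. kube-system on older deployments):

# print the cluster-id from the CSI driver config secret
kubectl -n vmware-system-csi get secret vsphere-config-secret \
  -o jsonpath='{.data.csi-vsphere\.conf}' | base64 -d | grep -i cluster-id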