vsphere-csi-driver: failed to get CsiNodeTopology for the node

What happened: The node-driver-registrar container of the vsphere-csi-node DaemonSet fails with "failed to get CsiNodeTopology for the node":

I0322 12:55:25.883806       1 main.go:166] Version: v2.5.0
I0322 12:55:25.883841       1 main.go:167] Running node-driver-registrar in mode=registration
I0322 12:55:25.884289       1 main.go:191] Attempting to open a gRPC connection with: "/csi/csi.sock"
I0322 12:55:25.884310       1 connection.go:154] Connecting to unix:///csi/csi.sock
I0322 12:55:25.884693       1 main.go:198] Calling CSI driver to discover driver name
I0322 12:55:25.884717       1 connection.go:183] GRPC call: /csi.v1.Identity/GetPluginInfo
I0322 12:55:25.884721       1 connection.go:184] GRPC request: {}
I0322 12:55:25.886858       1 connection.go:186] GRPC response: {"name":"csi.vsphere.vmware.com","vendor_version":"v2.5.0"}
I0322 12:55:25.886926       1 connection.go:187] GRPC error: <nil>
I0322 12:55:25.886933       1 main.go:208] CSI driver name: "csi.vsphere.vmware.com"
I0322 12:55:25.886971       1 node_register.go:53] Starting Registration Server at: /registration/csi.vsphere.vmware.com-reg.sock
I0322 12:55:25.887559       1 node_register.go:62] Registration Server started at: /registration/csi.vsphere.vmware.com-reg.sock
I0322 12:55:25.887693       1 node_register.go:92] Skipping HTTP server because endpoint is set to: ""
I0322 12:55:27.616658       1 main.go:102] Received GetInfo call: &InfoRequest{}
I0322 12:55:27.617124       1 main.go:109] "Kubelet registration probe created" path="/var/lib/kubelet/plugins/csi.vsphere.vmware.com/registration"
I0322 12:55:27.636091       1 main.go:120] Received NotifyRegistrationStatus call: &RegistrationStatus{PluginRegistered:false,Error:RegisterPlugin error -- plugin registration failed with err: rpc error: code = Internal desc = failed to get CsiNodeTopology for the node: "talos-10-120-8-82". Error: no matches for kind "CSINodeTopology" in version "cns.vmware.com/v1alpha1",}
E0322 12:55:27.636112       1 main.go:122] Registration process failed with error: RegisterPlugin error -- plugin registration failed with err: rpc error: code = Internal desc = failed to get CsiNodeTopology for the node: "talos-10-120-8-82". Error: no matches for kind "CSINodeTopology" in version "cns.vmware.com/v1alpha1", restarting registration container.

What you expected to happen: The only information I can find on CSINodeTopology with respect to this driver is in the guide "Deploying vSphere Container Storage Plug-in with Topology". However, I do NOT have the two arguments for the external-provisioner sidecar uncommented as that guide instructs. Beyond that, I can’t even locate the CSINodeTopology cns.vmware.com/v1alpha1 CRD.
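One way to confirm whether the API server knows the CSINodeTopology kind at all is to list the resources registered under the cns.vmware.com API group. The sketch below assumes kubectl is pointed at the affected cluster; the helper name has_csinodetopology is made up for illustration and is not part of the driver.

```shell
# Hypothetical helper: succeeds if a resource whose name starts with
# "csinodetopolog" is registered under the cns.vmware.com API group.
has_csinodetopology() {
  kubectl api-resources --api-group=cns.vmware.com -o name \
    | grep -qi '^csinodetopolog'
}

# Usage (needs cluster access):
#   has_csinodetopology && echo "CRD registered" || echo "CRD missing"
```

If the helper reports the CRD as missing, the "no matches for kind" error in the log above is the expected symptom.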

How to reproduce it (as minimally and precisely as possible): Deploy the vsphere-csi-driver as instructed in "Install vSphere Container Storage Plug-in".

Anything else we need to know?:

Environment:

  • csi-vsphere version: v2.5.0
  • vsphere-cloud-controller-manager version: v1.22.5
  • Kubernetes version: v1.23.4
  • vSphere version: v7.0.3
  • OS (e.g. from /etc/os-release): talos v1.0.0-beta.1
  • Install tools: kubectl

/kind bug

About this issue

  • State: closed
  • Created 2 years ago
  • Reactions: 1
  • Comments: 36 (1 by maintainers)

Most upvoted comments

I can confirm that I’ve also hit this problem (new deployment, vSphere 7.0u3, k3s v1.24.4+k3s1).

As mentioned here and in https://github.com/kubernetes-sigs/vsphere-csi-driver/issues/1948, the default setting of improved-volume-topology: 'true' in vsphere-csi-driver.yaml seems to be the cause, and changing it to 'false' allows the pods to deploy.

If you don’t need the feature, you can set improved-volume-topology: 'false' in ConfigMaps/internal-feature-states.csi.vsphere.vmware.com. Otherwise, registration can fail for multiple reasons (e.g., as pointed out, because of missing permissions in vCenter). Simply disabling the feature we didn’t want to use fixed the issue for us. It seems to be enabled by default in more recent vSphere CSI releases. I’m not sure why this is needed, since the manifest still has the commented-out args you’d need to enable topology awareness. The new feature gates are not very well documented.
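The change described above can be sketched as a kubectl patch. The ConfigMap name comes from this comment, and the vmware-system-csi namespace and vsphere-csi-node DaemonSet name come from logs elsewhere in this thread. The vsphere-csi-controller Deployment name and the need to restart the pods afterwards are assumptions, so check both against your own manifest.

```shell
NS=vmware-system-csi                               # namespace seen in this thread's logs
CM=internal-feature-states.csi.vsphere.vmware.com  # ConfigMap named in the comment above

# Hypothetical helper: turn the feature gate off, then restart the driver pods
# so they pick up the new value (the restart step is an assumption).
disable_improved_topology() {
  kubectl -n "$NS" patch configmap "$CM" --type merge \
    -p '{"data":{"improved-volume-topology":"false"}}' &&
  kubectl -n "$NS" rollout restart daemonset/vsphere-csi-node &&
  kubectl -n "$NS" rollout restart deployment/vsphere-csi-controller  # assumed name
}

# Usage (needs cluster access):
#   disable_improved_topology
```

A merge patch only touches the one key, so the other feature states in the ConfigMap stay as deployed.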

Enabling the "Enable Improved Volume Topology" option causes this error for us. Clearing the selection and redeploying brings the driver online and stable.

@shalini-b I think I’m making forward progress. My current deployment succeeds, but it fails to launch all the containers in the DaemonSet:

Each DaemonSet container, e.g. vmware-system-csi/vsphere-csi-node-2pdr4:node-driver-registrar, logs a similar error:

I0518 18:00:18.524418       1 main.go:120] Received NotifyRegistrationStatus call: &RegistrationStatus{PluginRegistered:false,Error:RegisterPlugin error -- plugin registration failed with err: rpc error: code = Internal desc = failed to retrieve topology information for Node: "juju-78116a-5". Error: "failed to retrieve nodeVM \"f3ef2942-6724-aff1-6ddb-cea417d0f5aa\" using the node manager. Error: virtual machine wasn't found",}
E0518 18:00:18.524476       1 main.go:122] Registration process failed with error: RegisterPlugin error -- plugin registration failed with err: rpc error: code = Internal desc = failed to retrieve topology information for Node: "juju-78116a-5". Error: "failed to retrieve nodeVM \"f3ef2942-6724-aff1-6ddb-cea417d0f5aa\" using the node manager. Error: virtual machine wasn't found", restarting registration container.

the machine’s UUID:

ubuntu@juju-78116a-5:~$ sudo dmidecode | grep UUID
    UUID: 4229eff3-2467-f1af-6ddb-cea417d0f5aa

the machine’s provider-id:

Name:               juju-78116a-5
ProviderID:                   vsphere://4229eff3-2467-f1af-6ddb-cea417d0f5aa

Both match. But the CSI node driver seems to swap the bytes around?

4229eff3-2467-f1af-6ddb-cea417d0f5aa  # from provider-id and dmidecode
f3ef2942-6724-aff1-6ddb-cea417d0f5aa # from container logs

If I reverse the byte order within each of the first three groups (the first 8 bytes), they match:

AABBCCDD-EEFF-GGHH-IIJJ-KKLLMMNNOOPP
DDCCBBAA-FFEE-HHGG-IIJJ-KKLLMMNNOOPP
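The pattern above can be reproduced with a short shell sketch, which makes it easy to check whether two UUIDs differ only in the byte order of their first three groups. swap_uuid is a hypothetical helper written for this illustration, not something the driver ships.

```shell
# Hypothetical helper: reverse the byte order inside each of the first three
# dash-separated groups of a UUID and leave the last two groups untouched.
swap_uuid() {
  printf '%s\n' "$1" \
    | sed -E 's/^(..)(..)(..)(..)-(..)(..)-(..)(..)-/\4\3\2\1-\6\5-\8\7-/'
}

# Usage, with the UUID reported by dmidecode and the provider-id:
#   swap_uuid 4229eff3-2467-f1af-6ddb-cea417d0f5aa
#   prints f3ef2942-6724-aff1-6ddb-cea417d0f5aa (the value in the container logs)
```

Applying the helper twice returns the original string, so it works in either direction.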

Edit: I think this is https://github.com/kubernetes-sigs/vsphere-csi-driver/issues/1629; updating to the v2.5.1 images.

Final edit: cool, I can make PVCs! Way to go.

So for my setup, this issue was caused by the vSphere CPI not working correctly. It never removed the taints from the nodes, so the CSI pods were never allowed to run, and I believe one of them is responsible for creating the CRD.

My CPI issue is documented here: https://github.com/kubernetes/cloud-provider-vsphere/issues/614
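A node that the cloud provider has not yet initialized normally still carries the node.cloudprovider.kubernetes.io/uninitialized taint, so listing taint keys per node is a quick way to spot this situation. This is a minimal sketch, assuming kubectl access to the cluster; the helper name uninitialized_nodes is hypothetical.

```shell
# Hypothetical helper: print the names of nodes that still carry the
# "uninitialized" taint that a working cloud provider is expected to remove.
uninitialized_nodes() {
  kubectl get nodes \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.taints[*].key}{"\n"}{end}' \
    | grep -F 'node.cloudprovider.kubernetes.io/uninitialized' \
    | cut -f1
}

# Usage (needs cluster access):
#   uninitialized_nodes
```

An empty result suggests the taint is gone and the cause lies elsewhere.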