secrets-store-csi-driver-provider-azure: "driver name secrets-store.csi.k8s.io not found in the list of registered CSI drivers" errors during AKS upgrade

What steps did you take and what happened: Today I upgraded my two-node AKS cluster (40+ custom pods in total) from 1.16.9 to 1.17.7. During the upgrade (which creates a new 1.17.7 node, then cordons and drains the existing nodes one by one), I got a number of errors like this:

pod/xxx-yyyyy-zzz-zzz-6c7b7fd9f8-sqq69 MountVolume.SetUp failed for volume "csi-key-vault-func-secrets-svc-yyyyy-zzz" : kubernetes.io/csi: mounter.SetUpAt failed to get CSI client: driver name secrets-store.csi.k8s.io not found in the list of registered CSI drivers

where xxx-yyyyy-zzz-zzz-6c7b7fd9f8-sqq69 is the pod name (anonymized) and csi-key-vault-func-secrets-svc-yyyyy-zzz is the volume name mapped using driver: secrets-store.csi.k8s.io …
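For context, this is roughly how such a volume is declared in the pod spec; the names below are placeholders standing in for the anonymized values above, not the actual manifest:

```yaml
volumes:
  - name: csi-key-vault-secrets            # placeholder for the anonymized volume name
    csi:
      driver: secrets-store.csi.k8s.io     # the driver kubelet could not find during the upgrade
      readOnly: true
      volumeAttributes:
        secretProviderClass: my-keyvault-spc   # placeholder SecretProviderClass name
```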

In total I had 62 such errors during the whole upgrade (2 nodes, 40+ custom pods).

What did you expect to happen: No such errors at all

Anything else you would like to add: Related to https://github.com/Azure/secrets-store-csi-driver-provider-azure/issues/101, but I cannot modify any K8s configuration myself, as my cluster is AKS-hosted.

Which access mode did you use to access the Azure Key Vault instance: usePodIdentity: "true"
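For reference, the access mode is selected in the SecretProviderClass. A minimal sketch, assuming the v1alpha1 API used by driver v0.0.11; all names are placeholders:

```yaml
apiVersion: secrets-store.csi.x-k8s.io/v1alpha1
kind: SecretProviderClass
metadata:
  name: my-keyvault-spc          # placeholder; referenced from the pod's volumeAttributes
spec:
  provider: azure
  parameters:
    usePodIdentity: "true"       # the access mode used in this issue
    keyvaultName: my-keyvault    # placeholder Key Vault name
    objects: |
      array:
        - |
          objectName: my-secret  # placeholder secret name
          objectType: secret
    tenantId: "<tenant-id>"      # placeholder Azure AD tenant ID
```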

Environment:

  • Secrets Store CSI Driver version: (use the image tag):
Image:         docker.io/deislabs/secrets-store-csi:v0.0.11
Image ID:      docker-pullable://deislabs/secrets-store-csi@sha256:824f71fba93d4e43a59c866082ce812d69b6faf16a083ad36233008a5f51a5d6
  • Azure Key Vault provider version: (use the image tag):
Image:          mcr.microsoft.com/k8s/csi/secrets-store/provider-azure:0.0.6
Image ID:       docker-pullable://mcr.microsoft.com/k8s/csi/secrets-store/provider-azure@sha256:92a5de47c31e22c92d2937cb1cb58842cfc9d079255665274e2b391fb9002ab4
  • Kubernetes version: (use kubectl version and kubectl get nodes -o wide): 1.16.9 -> 1.17.7
  • Cluster type: (e.g. AKS, aks-engine, etc): AKS

About this issue

  • State: closed
  • Created 4 years ago
  • Comments: 34 (15 by maintainers)

Most upvoted comments

@nilekhc I mitigated this issue by mirroring the images required for the CSI driver in our container registry and using those in the helm chart.
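A sketch of that mitigation, assuming Azure CLI access to an ACR named myregistry and that the chart exposes the usual image.repository overrides (the exact value keys vary by chart version, so check your chart's values.yaml):

```bash
# Mirror the upstream images into your own registry (tags match the versions above)
az acr import --name myregistry \
  --source docker.io/deislabs/secrets-store-csi:v0.0.11 \
  --image secrets-store-csi:v0.0.11
az acr import --name myregistry \
  --source mcr.microsoft.com/k8s/csi/secrets-store/provider-azure:0.0.6 \
  --image secrets-store/provider-azure:0.0.6

# Point the helm chart at the mirrored images (these value keys are an assumption)
helm upgrade --install csi \
  csi-secrets-store-provider-azure/csi-secrets-store-provider-azure \
  --set image.repository=myregistry.azurecr.io/secrets-store/provider-azure \
  --set secrets-store-csi-driver.linux.image.repository=myregistry.azurecr.io/secrets-store-csi
```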

@aramase After upgrading the Azure Container Registry to the Standard tier I did another upgrade, from 1.17.7 -> 1.17.9, but the disaster repeated: my pods were recreated before the csi-secrets-store and csi-secrets-store-provider-azure pods, so of course the volume mounts failed, I again had the unauthorized errors from ACR, and so on …

The fix for the ACR unauthorized error (https://github.com/kubernetes/kubernetes/pull/92330) has been merged to master and backported to previous Kubernetes releases. It should be available in the next set of Kubernetes releases and will be supported in AKS. The additional delay during upgrade caused by image pull errors should be resolved with that fix.

One thing I do not understand - you said there is no way to tell K8s to first create the csi-secrets-store* pods on the new node, and then the rest of the pods … but this implies that I will have downtime on every upgrade … I have many pods with only 1 replica, as I don’t currently need more than 1 … does it mean I have to increase the replicas to 2 for all my pods?

K8s schedules all pods to the new node as part of the upgrade. There is no primitive to tell K8s to schedule critical addons before workload pods, so in this scenario the CSI driver is scheduled along with the workload pods. The image pull for all 3 containers takes 1m on average, and in some cases close to 1m20s. During this time kubelet’s attempts to mount volumes for workload pods fail, as the driver is not yet up and running. From the events posted, the driver successfully mounted the volume right after it started. Kubelet then tried to pull the images for the workload pod, which failed with an unauthorized error and took close to 2m30s to succeed. That is why the workload pod was down for so long: pod startup time is the sum of scheduling, sandbox creation, volume mount, image pull, and container start.
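On the replicas question: one conventional way to keep a single-replica service available through a node drain (a suggestion on my part, not guidance from the maintainers) is to run 2 replicas and add a PodDisruptionBudget so the drain evicts only one pod at a time:

```yaml
apiVersion: policy/v1beta1       # policy/v1 on Kubernetes 1.21+
kind: PodDisruptionBudget
metadata:
  name: my-app-pdb               # placeholder name
spec:
  minAvailable: 1                # the drain waits until a replacement pod is ready
  selector:
    matchLabels:
      app: my-app                # placeholder label matching the Deployment's pods
```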

I just can’t believe that everyone but me is fine with the fact that the CSI driver pods cannot somehow be scheduled before the rest on a new node … this makes the CSI driver almost unusable for me …

The CSI driver pods are scheduled at the same time as the workload. It’s the window around image pull that can generate mount-failed events, but once the CSI driver is up the mount succeeds, as kubelet periodically reconciles volume mounts. In my tests the workload images aren’t in ACR; the mount succeeds as soon as the driver pod is running, and my workload is also up and running. We’ll switch to images hosted in MCR as part of the next release, instead of images hosted on Docker Hub or Quay. These images will also be added to the VHD, so the CSI driver starts quickly.

Hi @aramase. You are correct. The cluster was provisioned to support Windows and I did add a Windows nodepool. I saw I needed to be on a later version of AKS and have since upgraded to 1.18, but the error was still there. So yes, the helm charts were originally installed on the older AKS version … should I uninstall and reinstall the helm charts, or is there something else I need to do? For my scenario, everything should run on Linux.

@ahmedsza If your workload is running on a Windows node, then you’ll also need to run the driver on the Windows nodes. When deploying through helm, you can set windows.enabled=true and secrets-store-csi-driver.windows.enabled=true. The helm chart configuration options are documented here - https://github.com/Azure/secrets-store-csi-driver-provider-azure/tree/master/charts/csi-secrets-store-provider-azure#configuration
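A minimal sketch of that helm invocation (release and repo names are placeholders; the --set keys come from the comment above):

```bash
helm upgrade --install csi-secrets-store-provider-azure \
  csi-secrets-store-provider-azure/csi-secrets-store-provider-azure \
  --set windows.enabled=true \
  --set secrets-store-csi-driver.windows.enabled=true
```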

A note that the Windows nodepool was not explicitly upgraded, so I guess I need to do that. Is there a chance my pod also got allocated to a Linux node, and that I need to put in the relevant node selector or taint?
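On the node-selector point: if the workload must stay on Linux nodes in a mixed-OS cluster, a standard nodeSelector in the pod spec pins it there (a sketch using the well-known OS label, not something from this thread):

```yaml
nodeSelector:
  kubernetes.io/os: linux   # well-known node label; keeps the pod off Windows nodes
```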

If using Windows, the required version is 1.18+, as documented here:

Recommended Kubernetes version:

  • For Linux - v1.16.0+
  • For Windows - v1.18.0+