kubeflow: MountVolume.SetUp failed for volume "webhook-tls-certs" : secret "webhook-server-tls" not found

/kind bug

What steps did you take and what happened: After installing Kubeflow v1.1.0, none of the workloads listed below is created. Each fails with the error "ReplicaSet "metadata-writer-694c48ccdc" has timed out progressing.; Deployment does not have minimum availability."

In the pod events, I see the error below. The same errors appear on all the pods of the workloads in the namespace mentioned below.

Warning FailedMount MountVolume.SetUp failed for volume "webhook-tls-certs" : secret "webhook-server-tls" not found a minute ago
Warning FailedMount Unable to mount volumes for pod "cache-server-65596854d-9r77s_kubeflow(645871c2-acb6-4a72-a0ee-e1c276a639e3)": timeout expired waiting for volumes to attach or mount for pod "kubeflow"/"cache-server-65596854d-9r77s". list of unmounted volumes=[webhook-tls-certs]. list of unattached volumes=[webhook-tls-certs kubeflow-pipelines-cache-token-55pr6 istio-envoy sds-uds-path istio-token] 6 minutes ago
Warning FailedMount Unable to mount volumes for pod "cache-server-65596854d-9r77s_kubeflow(645871c2-acb6-4a72-a0ee-e1c276a639e3)": timeout expired waiting for volumes to attach or mount for pod "kubeflow"/"cache-server-65596854d-9r77s". list of unmounted volumes=[webhook-tls-certs istio-token]. list of unattached volumes=[webhook-tls-certs kubeflow-pipelines-cache-token-55pr6 istio-envoy sds-uds-path istio-token] an hour ago
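
A quick way to confirm what the events report (a diagnostic sketch; webhook-server-tls is the secret named in the events above, and the cache-deployer appears to be what creates it, per its log later in this thread):

# Check whether the webhook TLS secret exists in the kubeflow namespace:
kubectl -n kubeflow get secret webhook-server-tls
# If it is missing, inspect the deployer that should have created it:
kubectl -n kubeflow logs deployment/cache-deployer-deployment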

Namespace: kubeflow. Affected workloads: cache-server, cache-deployer-deployment.

What did you expect to happen: Pods to initialize and become active.

Environment:

  • Kubeflow version: (version number can be found at the bottom left corner of the Kubeflow dashboard): v1.1.0; tried both with DEX and without

  • kfctl version: (use kfctl version): kfctl v1.1.0-0-g9a3621e

  • Kubernetes platform: (e.g. minikube)

  • Kubernetes version: (use kubectl version): Client Version: version.Info{Major:"1", Minor:"19", GitVersion:"v1.19.0", GitCommit:"e19964183377d0ec2052d1f1fa930c4d7575bd50", GitTreeState:"clean", BuildDate:"2020-08-26T14:30:33Z", GoVersion:"go1.15", Compiler:"gc", Platform:"linux/amd64"} Server Version: version.Info{Major:"1", Minor:"15", GitVersion:"v1.15.12", GitCommit:"e2a822d9f3c2fdb5c9bfbe64313cf9f657f0a725", GitTreeState:"clean", BuildDate:"2020-05-06T05:09:48Z", GoVersion:"go1.12.17", Compiler:"gc", Platform:"linux/amd64"}

  • OS (e.g. from /etc/os-release): CentOS 7.8

Most upvoted comments

@ShilpaGopal I'm not sure what the difference is between single-user and multi-user with regard to the cache deployer, but it does sound like the issue you are having is due to the cluster signer. I have faced this issue with both Kubeflow 1.1 and Kubeflow 1.2 on an on-prem deployment.

You could try setting --cluster-signing-cert-file and --cluster-signing-key-file on the kube-controller-manager, then removing the cache-deployer and cache-server deployments, and then running kfctl apply -V -f <your-kdef>; a sketch of these steps follows below. Do note that any changes you have made to your configuration that are not in the kustomize folder will be removed when doing this. Also, changing the cluster signing cert and key files might break other things in the cluster, so try it at your own risk.
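
A minimal sketch of those steps, assuming a kubeadm-style control plane where kube-controller-manager runs as a static pod (the manifest path and CA file locations below are assumptions; adjust them to your distribution):

# 1. Add the signing flags to the kube-controller-manager static pod manifest
#    (assumed path: /etc/kubernetes/manifests/kube-controller-manager.yaml on
#    each control-plane node; the kubelet restarts the pod automatically):
#      --cluster-signing-cert-file=/etc/kubernetes/pki/ca.crt
#      --cluster-signing-key-file=/etc/kubernetes/pki/ca.key

# 2. Remove the cache deployments so they are recreated from scratch:
kubectl -n kubeflow delete deployment cache-deployer-deployment cache-server

# 3. Re-apply the KfDef (reverts any changes not kept in the kustomize folder):
kfctl apply -V -f <your-kdef>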

This works for me: Kubeflow 1.2.0, Rancher 1.9.4. After configuring kube-controller-manager as described in this comment, https://github.com/rancher/rancher/issues/14674#issuecomment-535234849, I deleted the KfDef and recreated it. Thanks a lot @ShilpaGopal

I had set up Kubeflow on an on-prem cluster without auth, and everything was working fine. I then added OIDC and redeployed Kubeflow; now the cache server is not coming up for the same reason. The cache deployer fails with:

$ k logs -f pod/cache-deployer-deployment-6f7b78cb7c-6q4cs -n kubeflow main
Start deploying cache service to existing cluster:
+ echo 'Start deploying cache service to existing cluster:'
+ NAMESPACE=kubeflow
+ MUTATING_WEBHOOK_CONFIGURATION_NAME=cache-webhook-kubeflow
+ WEBHOOK_SECRET_NAME=webhook-server-tls
+ kubectl get mutatingwebhookconfigurations cache-webhook-kubeflow --namespace kubeflow --ignore-not-found
The connection to the server 10.96.0.1:443 was refused - did you specify the right host or port?

I don't see any mutatingwebhookconfigurations named cache-webhook-kubeflow in the cluster. Any input on this is appreciated.
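
For reference, the two objects the deployer script checks for can also be inspected from outside the pod (names taken from the log above; note that mutatingwebhookconfigurations are cluster-scoped, so the --namespace flag in the script is effectively ignored):

# The webhook configuration the deployer looks for:
kubectl get mutatingwebhookconfigurations cache-webhook-kubeflow --ignore-not-found
# The TLS secret it would create once the webhook is set up:
kubectl -n kubeflow get secret webhook-server-tls --ignore-not-found
# The "connection refused" on 10.96.0.1:443 means the pod never reached the API
# server service at all, so the script's check failed before it could run.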