kubeflow: Several pods not starting due to various errors related to using NFS as dynamic provisioner

/kind bug

What steps did you take and what happened:

I started by creating a dynamic NFS provisioner by applying the following manifests:

  • https://github.com/justmeandopensource/kubernetes/blob/master/yamls/nfs-provisioner/rbac.yaml
  • https://github.com/justmeandopensource/kubernetes/blob/master/yamls/nfs-provisioner/default-sc.yaml
  • https://github.com/kubernetes-incubator/external-storage/blob/master/nfs-client/deploy/deployment.yaml

which are based on https://github.com/kubernetes-incubator/external-storage/tree/master/nfs-client
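The deployment.yaml is where the provisioner learns about the NFS server. As a sketch only (the upstream file ships with placeholder values), the relevant excerpt looks roughly like this, filled in with the server and export path that show up in the mount errors later in this report:

```yaml
# Illustrative excerpt of the nfs-client-provisioner Deployment (not the
# full file). The env vars and the volume must both point at the same export.
      containers:
        - name: nfs-client-provisioner
          image: quay.io/external_storage/nfs-client-provisioner:latest
          env:
            - name: PROVISIONER_NAME
              value: fuseim.pri/ifs        # must match the StorageClass provisioner
            - name: NFS_SERVER
              value: dell-ds1.example.com  # your NFS server
            - name: NFS_PATH
              value: /k8                   # your export path
      volumes:
        - name: nfs-client-root
          nfs:
            server: dell-ds1.example.com
            path: /k8
```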

I then installed kubeflow with https://raw.githubusercontent.com/kubeflow/manifests/v1.0-branch/kfdef/kfctl_istio_dex.v1.0.0.yaml

Several of the pods do not seem to be starting up due to various issues.

$ kubectl describe -n istio-system pod authservice-0
Name:           authservice-0
Namespace:      istio-system
Priority:       0
Node:           node1.kr.example.com/10.75.38.135
Start Time:     Tue, 03 Mar 2020 15:03:29 -0600
Labels:         app=authservice
                app.kubernetes.io/component=oidc-authservice
                app.kubernetes.io/instance=oidc-authservice-v1.0.0
                app.kubernetes.io/managed-by=kfctl
                app.kubernetes.io/name=oidc-authservice
                app.kubernetes.io/part-of=kubeflow
                app.kubernetes.io/version=v1.0.0
                controller-revision-hash=authservice-5f786759c5
                statefulset.kubernetes.io/pod-name=authservice-0
Annotations:    sidecar.istio.io/inject: false
Status:         Pending
IP:
Controlled By:  StatefulSet/authservice
Containers:
  authservice:
    Container ID:
    Image:          gcr.io/arrikto/kubeflow/oidc-authservice:28c59ef
    Image ID:
    Port:           8080/TCP
    Host Port:      0/TCP
    State:          Waiting
      Reason:       ContainerCreating
    Ready:          False
    Restart Count:  0
    Readiness:      http-get http://:8081/ delay=0s timeout=1s period=10s #success=1 #failure=3
    Environment:
      USERID_HEADER:  kubeflow-userid
      USERID_PREFIX:
      USERID_CLAIM:   email
      OIDC_PROVIDER:  http://dex.auth.svc.cluster.local:5556/dex
      OIDC_AUTH_URL:  /dex/auth
      OIDC_SCOPES:    profile email groups
      REDIRECT_URL:   /login/oidc
      SKIP_AUTH_URI:  /dex
      PORT:           8080
      CLIENT_ID:      kubeflow-oidc-authservice
      CLIENT_SECRET:  pUBnBOY80SnXgjibTYM9ZWNzY2xreNGQok
      STORE_PATH:     /var/lib/authservice/data.db
    Mounts:
      /var/lib/authservice from data (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-6wg9h (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  data:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  authservice-pvc
    ReadOnly:   false
  default-token-6wg9h:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-6wg9h
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type     Reason       Age                    From                                          Message
  ----     ------       ----                   ----                                          -------
  Warning  FailedMount  34m (x397 over 17h)    kubelet, node1.kr.example.com  Unable to mount volumes for pod "authservice-0_istio-system(10199390-900a-4276-acea-b7aecdf456d7)": timeout expired waiting for volumes to attach or mount for pod "istio-system"/"authservice-0". list of unmounted volumes=[data]. list of unattached volumes=[data default-token-6wg9h]
  Warning  FailedMount  4m39s (x593 over 17h)  kubelet, node1.kr.example.com  (combined from similar events): MountVolume.SetUp failed for volume "pvc-72502148-4c02-4f45-a9aa-4cc19d701503" : mount failed: exit status 32
Mounting command: systemd-run
Mounting arguments: --description=Kubernetes transient mount for /var/lib/kubelet/pods/10199390-900a-4276-acea-b7aecdf456d7/volumes/kubernetes.io~nfs/pvc-72502148-4c02-4f45-a9aa-4cc19d701503 --scope -- mount -t nfs dell-ds1.example.com:/k8/istio-system-authservice-pvc-pvc-72502148-4c02-4f45-a9aa-4cc19d701503 /var/lib/kubelet/pods/10199390-900a-4276-acea-b7aecdf456d7/volumes/kubernetes.io~nfs/pvc-72502148-4c02-4f45-a9aa-4cc19d701503
Output: Running scope as unit: run-r177080a878e5475f952104755b41a3e9.scope
mount.nfs: Protocol not supported
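"mount.nfs: Protocol not supported" generally means the node is requesting an NFS protocol version the server does not offer (or the node lacks the NFS client packages, e.g. nfs-common on Ubuntu). One hedged workaround is to recreate the StorageClass with mountOptions pinning the version the server actually exports; this is a sketch, and the right version depends on what dell-ds1.example.com serves:

```yaml
# Sketch: StorageClass with mountOptions pinning the NFS protocol version.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: managed-nfs-storage
  annotations:
    storageclass.kubernetes.io/is-default-class: "true"
provisioner: fuseim.pri/ifs   # must match PROVISIONER_NAME in the deployment
mountOptions:
  - nfsvers=4.1               # or vers=3, depending on what the server exports
parameters:
  archiveOnDelete: "false"
```

Existing PVs keep their old mount options, so the affected PVCs may need to be deleted and re-provisioned after the change.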
$ kubectl logs -n kubeflow mysql-6bcbfbb6b8-rzlf8
2020-03-04 13:35:11+00:00 [Note] [Entrypoint]: Entrypoint script for MySQL Server 5.6.47-1debian9 started.
2020-03-04 13:35:11+00:00 [Note] [Entrypoint]: Switching to dedicated user 'mysql'
2020-03-04 13:35:11+00:00 [Note] [Entrypoint]: Entrypoint script for MySQL Server 5.6.47-1debian9 started.
mkdir: cannot create directory '/var/lib/mysql/': File exists
$ kubectl logs -n kubeflow katib-db-manager-54b66f9f9d-d5dch
E0304 13:42:21.619878       1 mysql.go:62] Ping to Katib db failed: dial tcp 10.233.23.201:3306: connect: connection refused
E0304 13:42:26.611773       1 mysql.go:62] Ping to Katib db failed: dial tcp 10.233.23.201:3306: connect: connection refused
E0304 13:42:31.635879       1 mysql.go:62] Ping to Katib db failed: dial tcp 10.233.23.201:3306: connect: connection refused
E0304 13:42:36.627814       1 mysql.go:62] Ping to Katib db failed: dial tcp 10.233.23.201:3306: connect: connection refused
E0304 13:42:41.619904       1 mysql.go:62] Ping to Katib db failed: dial tcp 10.233.23.201:3306: connect: connection refused
E0304 13:42:46.611779       1 mysql.go:62] Ping to Katib db failed: dial tcp 10.233.23.201:3306: connect: connection refused
E0304 13:42:51.635869       1 mysql.go:62] Ping to Katib db failed: dial tcp 10.233.23.201:3306: connect: connection refused
E0304 13:42:56.627784       1 mysql.go:62] Ping to Katib db failed: dial tcp 10.233.23.201:3306: connect: connection refused
E0304 13:43:01.619889       1 mysql.go:62] Ping to Katib db failed: dial tcp 10.233.23.201:3306: connect: connection refused
E0304 13:43:06.611712       1 mysql.go:62] Ping to Katib db failed: dial tcp 10.233.23.201:3306: connect: connection refused
E0304 13:43:11.635854       1 mysql.go:62] Ping to Katib db failed: dial tcp 10.233.23.201:3306: connect: connection refused
E0304 13:43:16.627642       1 mysql.go:62] Ping to Katib db failed: dial tcp 10.233.23.201:3306: connect: connection refused
F0304 13:43:16.627719       1 main.go:83] Failed to open db connection: DB open failed: Timeout waiting for DB conn successfully opened.
goroutine 1 [running]:
github.com/kubeflow/katib/vendor/k8s.io/klog.stacks(0xc00024a200, 0xc0002520e0, 0x89, 0xd1)
        /go/src/github.com/kubeflow/katib/vendor/k8s.io/klog/klog.go:830 +0xb8
github.com/kubeflow/katib/vendor/k8s.io/klog.(*loggingT).output(0xdf1ca0, 0xc000000003, 0xc000278000, 0xd93a76, 0x7, 0x53, 0x0)
        /go/src/github.com/kubeflow/katib/vendor/k8s.io/klog/klog.go:781 +0x2d0
github.com/kubeflow/katib/vendor/k8s.io/klog.(*loggingT).printf(0xdf1ca0, 0x3, 0x9b448c, 0x20, 0xc0001e1f20, 0x1, 0x1)
        /go/src/github.com/kubeflow/katib/vendor/k8s.io/klog/klog.go:678 +0x14b
github.com/kubeflow/katib/vendor/k8s.io/klog.Fatalf(...)
        /go/src/github.com/kubeflow/katib/vendor/k8s.io/klog/klog.go:1209
main.main()
        /go/src/github.com/kubeflow/katib/cmd/db-manager/v1alpha3/main.go:83 +0x165
$ kubectl logs -n kubeflow katib-mysql-dcf7dcbd5-djx45
2020-03-04 13:46:11+00:00 [Note] [Entrypoint]: Entrypoint script for MySQL Server 8.0.19-1debian9 started.
2020-03-04 13:46:11+00:00 [Note] [Entrypoint]: Switching to dedicated user 'mysql'
2020-03-04 13:46:11+00:00 [Note] [Entrypoint]: Entrypoint script for MySQL Server 8.0.19-1debian9 started.
mkdir: cannot create directory '/var/lib/mysql': Permission denied
$ kubectl logs -n kubeflow metadata-db-65fb5b695d-656hw
mkdir: cannot create directory '/var/lib/mysql': Permission denied
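The "Permission denied" (and the earlier "File exists") failures from the MySQL pods are a different symptom: the MySQL entrypoints drop to an unprivileged mysql user, which cannot create or chown directories on an NFS export that squashes root. A hedged sketch of an /etc/exports entry on the NFS server that avoids squashing (adjust the network range to your cluster and weigh the security implications of no_root_squash):

```
# /etc/exports on the NFS server (illustrative)
/k8  10.75.38.0/24(rw,sync,no_subtree_check,no_root_squash)
```

After editing, re-export with `exportfs -ra` on the NFS server.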
$ kubectl logs -n kubeflow metadata-grpc-deployment-75f9888cbf-d9q5m
2020-03-04 13:50:04.660297: F ml_metadata/metadata_store/metadata_store_server_main.cc:219] Non-OK-status: status status: Internal: mysql_real_connect failed: errno: 2002, error: Can't connect to MySQL server on 'metadata-db' (115)MetadataStore cannot be created with the given connection config.

What did you expect to happen: All Kubeflow pods to be up and the application functioning.

Anything else you would like to add:

All of my PVs and PVCs are bound.

$ kubectl get pv,pvc -A
NAME                                                        CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS   CLAIM                          STORAGECLASS          REASON   AGE
persistentvolume/pvc-0b7c97b5-a650-4355-91bc-00d5de17c4c3   10Gi       RWO            Delete           Bound    kubeflow/metadata-mysql        managed-nfs-storage            16h
persistentvolume/pvc-72502148-4c02-4f45-a9aa-4cc19d701503   10Gi       RWO            Delete           Bound    istio-system/authservice-pvc   managed-nfs-storage            18h
persistentvolume/pvc-bdfbde9e-b056-4f6f-8415-2e8e18bcff7b   20Gi       RWO            Delete           Bound    kubeflow/mysql-pv-claim        managed-nfs-storage            16h
persistentvolume/pvc-c6791418-ae14-42ca-9193-037ef31688d4   10Gi       RWO            Delete           Bound    kubeflow/katib-mysql           managed-nfs-storage            16h
persistentvolume/pvc-c988ed2d-d121-41fd-9cb5-f22c1906d64b   20Gi       RWO            Delete           Bound    kubeflow/minio-pv-claim        managed-nfs-storage            16h

NAMESPACE      NAME                                    STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS          AGE
istio-system   persistentvolumeclaim/authservice-pvc   Bound    pvc-72502148-4c02-4f45-a9aa-4cc19d701503   10Gi       RWO            managed-nfs-storage   18h
kubeflow       persistentvolumeclaim/katib-mysql       Bound    pvc-c6791418-ae14-42ca-9193-037ef31688d4   10Gi       RWO            managed-nfs-storage   16h
kubeflow       persistentvolumeclaim/metadata-mysql    Bound    pvc-0b7c97b5-a650-4355-91bc-00d5de17c4c3   10Gi       RWO            managed-nfs-storage   16h
kubeflow       persistentvolumeclaim/minio-pv-claim    Bound    pvc-c988ed2d-d121-41fd-9cb5-f22c1906d64b   20Gi       RWO            managed-nfs-storage   16h
kubeflow       persistentvolumeclaim/mysql-pv-claim    Bound    pvc-bdfbde9e-b056-4f6f-8415-2e8e18bcff7b   20Gi       RWO            managed-nfs-storage   16h

Environment:

  • Kubeflow version: (version number can be found at the bottom left corner of the Kubeflow dashboard):
  • kfctl version: (use kfctl version): kfctl v1.0-0-g94c35cf
  • Kubernetes platform: kubespray
  • Kubernetes version: (use kubectl version):
    Client Version: version.Info{Major:"1", Minor:"15", GitVersion:"v1.15.3", GitCommit:"2d3c76f9091b6bec110a5e63777c332469e0cba2", GitTreeState:"clean", BuildDate:"2019-08-19T11:05:50Z", GoVersion:"go1.12.9", Compiler:"gc", Platform:"linux/amd64"}
    Server Version: version.Info{Major:"1", Minor:"15", GitVersion:"v1.15.3", GitCommit:"2d3c76f9091b6bec110a5e63777c332469e0cba2", GitTreeState:"clean", BuildDate:"2019-08-19T11:05:50Z", GoVersion:"go1.12.9", Compiler:"gc", Platform:"linux/amd64"}
  • OS (e.g. from /etc/os-release): Ubuntu 18.04.3

About this issue

  • Original URL
  • State: closed
  • Created 4 years ago
  • Comments: 16 (1 by maintainers)

Most upvoted comments

@vaskokj Just to add: I did a completely new install with the Kubeflow v1.0 release without any issues and without having to change anything. This was using PVCs from nfs-client-provisioner.