pachyderm: Can't deploy Pach to GKE default k8s version

Pachyderm won’t deploy to the latest default version of k8s in GKE. This was reported by a user, and I reproduced the issue. The default version in GKE is 1.8.8-gke.0. For this version or greater, the pachd pod errors out and goes into CrashLoopBackOff with the following serviceaccount-related errors:

time="2018-03-16T20:16:44Z" level=error msg="unable to access kubernetes nodeslist, Pachyderm will continue to work but it will not be possible to use COEFFICIENT parallelism. error: nodes is forbidden: User "system:serviceaccount:default:pachyderm" cannot list nodes at the cluster scope: Unknown user "system:serviceaccount:default:pachyderm""
time="2018-03-16T20:16:44Z" level=error msg="unable to access kubernetes pods, Pachyderm will continue to work but certain pipeline errors will result in pipelines being stuck indefinitely in "starting" state. error: unknown (get pods)"
time="2018-03-16T20:16:44Z" level=error msg="unable to access kubernetes pods, Pachyderm will continue to work but get-logs will not work. error: pods is forbidden: User "system:serviceaccount:default:pachyderm" cannot list pods in the namespace "default": Unknown user "system:serviceaccount:default:pachyderm""
time="2018-03-16T20:16:44Z" level=error msg="unable to create kubernetes replication controllers, Pachyderm will not function properly until this is fixed. error: replicationcontrollers is forbidden: User "system:serviceaccount:default:pachyderm" cannot create replicationcontrollers in the namespace "default": Unknown user "system:serviceaccount:default:pachyderm""
time="2018-03-16T20:16:44Z" level=error msg="unable to delete kubernetes replication controllers, Pachyderm function properly but pipeline cleanup will not work. error: replicationcontrollers "ceb8a1da36ad4700811aa32da3ea8c29" is forbidden: User "system:serviceaccount:default:pachyderm" cannot delete replicationcontrollers in the namespace "default": Unknown user "system:serviceaccount:default:pachyderm""
2018-03-16T20:16:44Z INFO authclient.API.GetCapability {"request":{}}
2018-03-16T20:16:44Z INFO authclient.API.GetCapability {"duration":0.001143887,"request":{},"response":{"capability":"5273272262ac4b06a76752cce2582e35"}}
endpoints "pachd" is forbidden: User "system:serviceaccount:default:pachyderm" cannot get endpoints in the namespace "default": Unknown user "system:serviceaccount:default:pachyderm"

However, if you use --cluster-version 1.7.14-gke.1 or earlier, everything seems to be ok.

To reproduce, follow the GCP docs for deployment with Pachyderm version 1.7.0rc2.

About this issue

  • State: closed
  • Created 6 years ago
  • Reactions: 3
  • Comments: 18 (10 by maintainers)

Most upvoted comments

This is the workaround that worked for me with the current default version 1.8.8-gke.0 in GCP and Pachyderm version 1.7.0. I’ve also installed Pachyderm in a separate namespace. The installation uses the default RBAC setup as per the official documentation[1].

[1] http://pachyderm.readthedocs.io/en/latest/deployment/google_cloud_platform.html

The workaround is as follows, after running the Pachyderm deployment steps:

$ kubectl delete clusterrolebinding pachyderm
$ kubectl create clusterrolebinding pachyderm --clusterrole=cluster-admin --serviceaccount=pachyderm:pachyderm --namespace=pachyderm --user=system:serviceaccount:default:pachyderm
$ kubectl delete pods --all

The key thing is that the user is set to system:serviceaccount:default:pachyderm and given the cluster-admin role. Is this something that can be set for the serviceaccount settings somewhere?
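One way to check whether the binding took effect is to ask the API server directly, using `kubectl auth can-i` with impersonation. The following is only a sketch: the function name `check_pachd_permissions` and the variables `KUBECTL` and `SA` are made up for illustration, the verb/resource pairs are copied from the "forbidden" log lines above, and it assumes your own user is allowed to impersonate service accounts (a cluster admin normally is).

```shell
#!/bin/sh
# Hypothetical helper (not part of Pachyderm): report whether the service
# account named in the pachd errors may perform each action that was denied.
# KUBECTL can be overridden for a dry run; it defaults to the real kubectl.
KUBECTL="${KUBECTL:-kubectl}"
SA="system:serviceaccount:default:pachyderm"

check_pachd_permissions() {
  # Verb/resource pairs taken from the "is forbidden" log lines.
  for pair in "list nodes" "list pods" "create replicationcontrollers" \
              "delete replicationcontrollers" "get endpoints"; do
    printf '%s: ' "$pair"
    # $pair is intentionally unquoted so it splits into verb and resource.
    # can-i prints "yes" or "no" and exits non-zero on "no".
    $KUBECTL auth can-i $pair --as="$SA" --namespace=default
  done
}
```

Sourcing this and running `check_pachd_permissions` before and after the workaround should show the answers change from "no" to "yes" if the binding is what fixed it. kubectl may print a warning for `nodes`, since that resource is not namespace scoped.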

We should update the docs with the steps needed to ensure the GCP service account has the appropriate permissions for Pachyderm to be deployed properly.