kubeflow: GKE Kubeflow Google Cloud SDK "ERROR: timed out"

/kind bug

What steps did you take and what happened:

  1. Deploy Kubeflow 0.7 on GKE.
  2. Run a Pod (Docker image google/cloud-sdk:273.0.0).
  3. Start using the Google Cloud SDK (gsutil) inside the Pod.
  4. A few hours later, gsutil and gcloud start to time out (a minimal repro sketch follows this list).
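
A minimal sketch of the repro setup, assuming a throwaway Pod name (cloud-sdk-test) and a sleep command to keep the container alive; the actual Pod spec in the report may differ:

# start a test Pod from the same image in the kubeflow namespace (Pod name is hypothetical)
kubectl run cloud-sdk-test -n kubeflow --image=google/cloud-sdk:273.0.0 --restart=Never --command -- sleep infinity

# open a shell in the Pod and exercise the SDK
kubectl exec -it -n kubeflow cloud-sdk-test -- /bin/bash
gsutil ls gs://<GS_URL>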

The timeout error does not occur at first, but starts after a few hours; once it starts, it keeps occurring. I did some research and couldn't find the cause. If anyone knows the reason for this, please help.
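
To pinpoint when the timeouts begin, a simple polling loop (illustrative only, not part of the original report) can be left running inside the Pod; the bucket URL is the same placeholder used below:

# log a timestamped result every 5 minutes so the onset of timeouts is visible
while true; do
  if gsutil ls gs://<GS_URL> > /dev/null 2>&1; then
    echo "$(date -u) OK"
  else
    echo "$(date -u) FAILED (likely timeout)"
  fi
  sleep 300
done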

gsutil timeout message

yoon@daangn-gpu2:~$ k exec -it -n kubeflow yoon-preprocess-test-pod /bin/bash
root@yoon-preprocess-test-pod:/app# gsutil ls gs://<GS_URL>
ERROR: (gsutil) timed out
This may be due to network connectivity issues. Please check your network settings, and the status of the service you are trying to reach.
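
On GKE, gsutil and gcloud fetch credentials from the metadata server, so a quick way to narrow this down (a diagnostic sketch, not part of the original report; assumes curl is available in the image) is to query the metadata endpoints from inside the Pod:

# both calls should return almost immediately when the metadata server is healthy
curl -s -H "Metadata-Flavor: Google" http://metadata.google.internal/computeMetadata/v1/instance/service-accounts/default/email
curl -s -H "Metadata-Flavor: Google" http://metadata.google.internal/computeMetadata/v1/instance/service-accounts/default/token

If these calls hang or time out while plain internet access still works, the problem is likely between the Pod and the metadata server rather than with the SDK itself.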

gcloud timeout message

yoon@daangn-gpu2:~$ k exec -it -n kubeflow yoon-preprocess-test-pod /bin/bash
root@yoon-preprocess-test-pod:/app# gcloud auth list
ERROR: (gcloud.auth.list) timed out
This may be due to network connectivity issues. Please check your network settings, and the status of the service you are trying to reach.
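
If the cluster uses Workload Identity (as the comments below suggest), it is also worth confirming that the Kubernetes service account the Pod runs as is annotated with the GCP service account it should impersonate. A sketch of that check, assuming the Pod uses the default service account in the kubeflow namespace:

# the iam.gke.io/gcp-service-account annotation should name the GCP service account used for impersonation
kubectl get serviceaccount default -n kubeflow -o yaml | grep iam.gke.io/gcp-service-account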

What did you expect to happen: The Google Cloud SDK (gsutil/gcloud) keeps working inside the Pod without timing out.

Anything else you would like to add:

Environment:

  • Kubeflow version: build version 0.7.0

  • kfctl version: kfctl v0.7.0

  • Kubernetes platform: GKE

  • Kubernetes version:
    Client Version: version.Info{Major:"1", Minor:"15", GitVersion:"v1.15.6", GitCommit:"7015f71e75f670eb9e7ebd4b5749639d42e20079", GitTreeState:"clean", BuildDate:"2019-11-13T11:20:18Z", GoVersion:"go1.12.12", Compiler:"gc", Platform:"linux/amd64"}
    Server Version: version.Info{Major:"1", Minor:"14+", GitVersion:"v1.14.9-gke.2", GitCommit:"0f206d1d3e361e1bfe7911e1e1c686bc9a1e0aa5", GitTreeState:"clean", BuildDate:"2019-11-25T19:35:58Z", GoVersion:"go1.12.12b4", Compiler:"gc", Platform:"linux/amd64"}

  • OS (e.g. from /etc/os-release):

  • Inside Pod - NAME="Ubuntu" VERSION="16.04.5 LTS (Xenial Xerus)" ID=ubuntu ID_LIKE=debian PRETTY_NAME="Ubuntu 16.04.5 LTS" VERSION_ID="16.04" HOME_URL="http://www.ubuntu.com/" SUPPORT_URL="http://help.ubuntu.com/" BUG_REPORT_URL="http://bugs.launchpad.net/ubuntu/" VERSION_CODENAME=xenial

(Attached screenshots: gke-kubeflow-gsutil-timeout and gke-kubeflow-gcloud-timeout, showing the gsutil and gcloud timeout errors.)


Most upvoted comments

We just filed a ticket with Google Support about this today. We see this problem on 1.15.4-gke.22 but not on 1.14.8-gke.18 (corrected from 1.14.9-gke.2). We are using Workload Identity for our pipelines. The gke-metadata-server DaemonSet Pods report:

getServiceAccountMetadata("workload-identity-test",
 "kubeflow-prod-user@myproject.iam.gserviceaccount.com"): 
 surfaceable error (status 500, message "Unable to generate access token"): 
getFederatedToken("myproject.svc.id.goog", pod): Post 
https://securetoken.googleapis.com/v1/identitybindingtoken: context canceled
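
For anyone debugging this, the same errors can be pulled straight from the metadata-server DaemonSet (a diagnostic sketch; k8s-app=gke-metadata-server in kube-system is the standard label and namespace for that DaemonSet):

# tail recent logs from all gke-metadata-server Pods
kubectl logs -n kube-system -l k8s-app=gke-metadata-server --tail=50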

We are able to temporarily resolve the issue by cycling the gke-metadata-server daemonset pods.

kubectl delete pods -n kube-system --selector=k8s-app=gke-metadata-server
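
After cycling them, a quick way to confirm the workaround took effect (a verification sketch; the Pod name is the one from the report above):

# the metadata-server Pods should come back Running
kubectl get pods -n kube-system -l k8s-app=gke-metadata-server
# then retry a credential call from the affected Pod
kubectl exec -it -n kubeflow yoon-preprocess-test-pod -- gcloud auth list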

Related Google Ticket: https://issuetracker.google.com/issues/146622472

@louisvernon I fixed it by downgrading the cluster to version 1.14.8-gke.18. I will keep an eye on the ticket you linked and upgrade the cluster later. Thanks.

I got the same issue: https://github.com/kubeflow/pipelines/issues/2773

I think this is a GKE issue; Workload Identity is unstable right now.