kubeflow: [GCP] CLI deployment (kfctl) fails to create cloud endpoint correctly

I deployed Kubeflow several times following the instructions at https://www.kubeflow.org/docs/gke/deploy/deploy-cli/. Everything seems to run fine, but when I try to reach the endpoint <kfapp>.endpoints.<project>.cloud.goog, the browser reports DNS_PROBE_FINISHED_NXDOMAIN. Indeed, no endpoint is listed in the Cloud Console or via the CLI.

I checked the pod logs for cloud-endpoints-controller-... and see the following lines repeating every second or so:

2019/05/09 09:44:49 [DEBUG][<kfapp>] Changed because parent sig different
2019/05/09 09:44:49 [DEBUG][<kfapp>] Changed because ingress target IP changed
2019/05/09 09:44:50 [INFO][<kfapp>] Service does not yet exist, creating: <kfapp>.endpoints.<project>.cloud.goog
2019/05/09 09:44:51 [ERROR] Could not sync state: [ERROR] Failed to creat Cloud Endpoints service: serviceName: <kfapp>.endpoints.<project>.cloud.goog, err: googleapi: Error 400: Service <kfapp>.endpoints.<project>.cloud.goog has been deleted and will be purged after 30 days. To reuse this service, please undelete the service following https://cloud.google.com/service-management/create-delete., failedPrecondition
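
To pick this failure out of a noisy controller log, a small filter can help. The following is only a sketch: it assumes the error wording stays exactly as in the log line above, and the function name `find_soft_deleted_services` is made up for illustration.

```python
import re

# Wording copied from the controller log above; this is an assumption and may
# differ in other versions of the API or the controller.
DELETED_SERVICE_RE = re.compile(
    r"Service (?P<service>\S+) has been deleted and will be purged after \d+ days"
)


def find_soft_deleted_services(log_lines):
    """Return the set of service names that the log reports as deleted but not yet purged."""
    found = set()
    for line in log_lines:
        match = DELETED_SERVICE_RE.search(line)
        if match:
            found.add(match.group("service"))
    return found
```

It could be fed the lines of `kubectl logs` output for the cloud-endpoints-controller pod, for example after reading them from a saved file.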

I replaced the kfapp and project names with placeholders on purpose, for potential security reasons; the real values are well-formed, or at least nothing complains about them.

I don’t see any earlier problems such as access-denied errors, and the kfctl run itself reports no errors. I also tried specifying the version with -v v0.5.0 and -v v0.5.1, but both give the same result.

Earlier I tried the web UI deployment and it worked, but I wanted to customize the deployment and test different machine pool settings.

Not sure if it’s relevant, but I run it on Ubuntu under WSL on Windows 10, so from the application’s perspective it should, at least in theory, be Linux.

About this issue

  • State: closed
  • Created 5 years ago
  • Comments: 16 (7 by maintainers)

Most upvoted comments

The finding so far, aside from the quota issue, which is not occurring at the moment, is this: if one creates a deployment with a given kfapp name, say kubeflow, the first deployment works fine and the endpoint gets created. But if we then run:

kfctl delete all -V
# .... do something or nothing here, doesn't matter really...
kfctl apply all -V

Then the second deployment goes through fine, but the endpoint doesn’t get created, so the deployment is not readily usable.

Further inspection shows that iap-enabler keeps crashing:

$ kubectl get pods
NAME                                                        READY   STATUS             RESTARTS   AGE
...
gcp-cred-webhook-6fbfc849c8-dnffj                           1/1     Running            0          55m
iap-enabler-99779c66d-m9wjk                                 0/1     CrashLoopBackOff   15         55m
ingress-bootstrap-52xg4                                     1/1     Running            0          55m
...

The endpoint-controller pod itself is running, but its logs look as follows:

$ kubectl logs --tail 10 cloud-endpoints-controller-5888c755cb-pv9pq
2019/05/29 12:08:13 [INFO][kubeflow] Service does not yet exist, creating: <kfapp>.endpoints.<project>.cloud.goog
2019/05/29 12:08:14 [ERROR] Could not sync state: [ERROR] Failed to creat Cloud Endpoints service: serviceName: <kfapp>.endpoints.<project>.cloud.goog, err: googleapi: Error 400: Service <kfapp>.endpoints.<project>.cloud.goog has been deleted and will be purged after 30 days. To reuse this service, please undelete the service following https://cloud.google.com/service-management/create-delete., failedPrecondition
2019/05/29 12:08:15 [DEBUG][kubeflow] Changed because parent sig different
2019/05/29 12:08:15 [DEBUG][kubeflow] Changed because ingress target IP changed

The list of endpoint services is empty, but I tried undeleting the service with the command below, which returns an error:

$ gcloud endpoints services undelete <kfapp>.endpoints.<project>.cloud.goog
Waiting for async operation operations/services.<kfapp>.endpoints.<project>.cloud.goog-3 to complete...
ERROR: (gcloud.endpoints.services.undelete) The operation with ID s<kfapp>.endpoints.<project>.cloud.goog-3 resulted in a failure.

After the above operation the endpoint-controller logs show:

$ kubectl logs --tail 10 cloud-endpoints-controller-5888c755cb-pv9pq -f
2019/05/29 13:34:21 [INFO][kubeflow] Endpoint service already exists, skipping create.
2019/05/29 13:34:21 [INFO][kubeflow] Current state: ENDPOINT_CREATE_PENDING
2019/05/29 13:34:23 [INFO][kubeflow] Create pending
2019/05/29 13:34:23 [INFO][kubeflow] Endpoint created: <kfapp>.endpoints.<project>.cloud.goog, submitting endpoint config.
2019/05/29 13:34:24 [INFO][kubeflow] Current state: ENDPOINT_SUBMIT_PENDING
2019/05/29 13:34:25 [INFO][kubeflow] Service config submit complete for endpoint <kfapp>.endpoints.<project>.cloud.goog, config: 2019-05-29r1
2019/05/29 13:34:26 [INFO][kubeflow] Creating endpoint service config rollout for: endpoint: <kfapp>.endpoints.<project>.cloud.goog, config: 2019-05-29r1
2019/05/29 13:34:27 [INFO][kubeflow] Current state: ENDPOINT_ROLLOUT_PENDING
2019/05/29 13:34:45 [INFO][kubeflow] Service config rollout complete for: endpoint: <kfapp>.endpoints.<project>.cloud.goog, config: 2019-05-29r1
2019/05/29 13:34:45 [INFO][kubeflow] Current state: IDLE

After undeleting and waiting for a while, the endpoint works again.
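
The recovery visible in the log above is a progression through the controller's states, ending in IDLE. As a rough way to confirm that from a log capture, the sketch below extracts the `Current state:` entries per endpoint name. It assumes the log format shown above; the helper names `state_history` and `reached_idle` are hypothetical.

```python
import re

# Matches lines such as "[INFO][kubeflow] Current state: IDLE" (format taken
# from the log above; treat it as an assumption for other controller versions).
STATE_RE = re.compile(r"\[INFO\]\[(?P<name>[^\]]+)\] Current state: (?P<state>\w+)")


def state_history(log_lines):
    """Map each endpoint name to its ordered list of states, collapsing consecutive repeats."""
    history = {}
    for line in log_lines:
        match = STATE_RE.search(line)
        if not match:
            continue
        states = history.setdefault(match.group("name"), [])
        if not states or states[-1] != match.group("state"):
            states.append(match.group("state"))
    return history


def reached_idle(log_lines, name):
    """Return True if the last recorded state for `name` is IDLE."""
    states = state_history(log_lines).get(name, [])
    return bool(states) and states[-1] == "IDLE"
```

If `reached_idle` returns False for a long time after an undelete, the earlier "has been deleted and will be purged" error is probably still recurring.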

Hence there should be either a deployment command or an endpoint-controller mechanism to create or undelete the endpoint, since creating a recently deleted endpoint doesn’t currently work.

The correct fix is for the cloud-endpoints controller to handle the case where the endpoint has been deleted by undeleting it.

The cloud-endpoints controller we are using is set here: https://github.com/kubeflow/manifests/blob/ffede944f18343271f526bd217cde2edbe6e0e38/gcp/cloud-endpoints/base/deployment.yaml#L13

gcr.io/cloud-solutions-group/cloud-endpoints-controller:0.2.1

Source is here: https://github.com/danisla/cloud-endpoints-controller

I don’t see any logic in the controller to deal with this use case https://github.com/danisla/cloud-endpoints-controller/blob/master/cmd/cloud-endpoints-controller/main.go

So it looks like we still need to fix the controller to work with this.
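
The missing logic could look roughly like the sketch below: try to create the service, and if the API reports it as deleted but not yet purged, undelete it instead. This is not the controller's code (the real controller is written in Go); `ApiError`, `ensure_service`, and the `client` object with `create` and `undelete` methods are all hypothetical names used to show the control flow, and the matched error text is taken from the log in this issue.

```python
class ApiError(Exception):
    """Hypothetical stand-in for an API error carrying an HTTP status code."""

    def __init__(self, code, message):
        super().__init__(message)
        self.code = code
        self.message = message


def is_soft_deleted(err):
    """True if the error matches the 'deleted, purge pending' failure from the log."""
    return err.code == 400 and "has been deleted and will be purged" in err.message


def ensure_service(client, service_name):
    """Create the service, or undelete it when it is soft-deleted.

    `client` is a hypothetical object exposing create(name) and undelete(name).
    Returns "created" or "undeleted"; any other API error is re-raised.
    """
    try:
        client.create(service_name)
        return "created"
    except ApiError as err:
        if not is_soft_deleted(err):
            raise
        client.undelete(service_name)
        return "undeleted"
```

Matching on the error text is fragile; a real fix would more likely check the service's state through the API before deciding between create and undelete.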

A longer-term solution might be for Kubernetes Config Connector (https://github.com/GoogleCloudPlatform/k8s-config-connector) to support Cloud Endpoints.