kubeflow: Istio ingress fails to reach ready state on EKS, statuscode 503

/kind bug

Upon deploying Kubeflow 0.7.0 on an EKS cluster running Kubernetes 1.14 (following this guide: https://www.kubeflow.org/docs/aws/deploy/install-kubeflow/), all services start up correctly and kubectl get ingress -n istio-system returns an address after a few minutes. However, that address is not reachable. kubectl -n istio-system describe pod/istio-ingressgateway-xxxxxxxxxx-yyyy then yields:

Events:
  Type     Reason       Age               From                                    Message
  ----     ------       ----              ----                                    -------
  Normal   Scheduled    2m                default-scheduler                       Successfully assigned istio-system/istio-ingressgateway-565b894b5f-b56pz to ip-192-168-24-84.ec2.internal
  Warning  FailedMount  2m                kubelet, ip-192-168-24-84.ec2.internal  MountVolume.SetUp failed for volume "istio-ingressgateway-service-account-token-tzgwg" : couldn't propagate object cache: timed out waiting for the condition
  Warning  FailedMount  2m                kubelet, ip-192-168-24-84.ec2.internal  MountVolume.SetUp failed for volume "ingressgateway-ca-certs" : couldn't propagate object cache: timed out waiting for the condition
  Warning  FailedMount  2m                kubelet, ip-192-168-24-84.ec2.internal  MountVolume.SetUp failed for volume "ingressgateway-certs" : couldn't propagate object cache: timed out waiting for the condition
  Normal   Pulled       2m                kubelet, ip-192-168-24-84.ec2.internal  Container image "docker.io/istio/proxyv2:1.1.6" already present on machine
  Normal   Created      2m                kubelet, ip-192-168-24-84.ec2.internal  Created container istio-proxy
  Normal   Started      2m                kubelet, ip-192-168-24-84.ec2.internal  Started container istio-proxy
  Warning  Unhealthy    1m (x19 over 2m)  kubelet, ip-192-168-24-84.ec2.internal  Readiness probe failed: HTTP probe failed with statuscode: 503
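To see why the readiness probe keeps returning 503, the gateway's own proxy logs are usually the quickest clue. The commands below are only a sketch and assume the default Istio pod label (istio=ingressgateway) and container name (istio-proxy); adjust them if your manifests differ:

  # Tail the gateway's proxy logs for connection or configuration-sync errors
  kubectl -n istio-system logs -l istio=ingressgateway -c istio-proxy --tail=100

  # The probe's exact path and port are visible in the deployment spec
  kubectl -n istio-system get deployment istio-ingressgateway -o yaml | grep -A 5 readinessProbe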

I am using kfctl_aws.0.7.0.yaml, not the cognito version.

Environment:

  • Kubeflow version: 0.7.0
  • kfctl version: 0.7.0
  • Kubernetes platform: EKS
  • Kubernetes version (kubectl version):
Client Version: version.Info{Major:"1", Minor:"11", GitVersion:"v1.11.9", GitCommit:"16236ce91790d4c75b79f6ce96841db1c843e7d2", GitTreeState:"clean", BuildDate:"2019-03-27T14:42:18Z", GoVersion:"go1.10.8", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"14+", GitVersion:"v1.14.8-eks-b7174d", GitCommit:"b7174db5ee0e30c94a0b9899c20ac980c0850fc8", GitTreeState:"clean", BuildDate:"2019-10-18T17:56:01Z", GoVersion:"go1.12.10", Compiler:"gc", Platform:"linux/amd64"}

About this issue

  • State: closed
  • Created 5 years ago
  • Comments: 20 (4 by maintainers)

Most upvoted comments

@karlschriek does this look like it has more to do with Istio? I think it would be better to reach out to https://github.com/istio/istio/ regarding this issue.

I understand the reasoning that this should probably be addressed by the Istio team, but on the other hand Kubeflow is an ecosystem made up of a great many such projects.

From my perspective, Kubeflow is interesting to the community of machine learning developers precisely because it promises to abstract away a lot of the underlying complexity. That promise gets watered down quite a lot when the responsibility for ensuring that the ecosystem works simply gets delegated to the projects Kubeflow is built on!

Also, it is quite possible that the problem is already solved with a newer version (1.3.5) of Istio (which @chanwit’s comment above seems to suggest), but the default Kubeflow distribution is currently pinned to 1.1.6. Based on my tests it isn’t addressed by 1.3.1 though (which was the latest version being discussed as becoming the new standard). This is probably quite important for the Kubeflow team to know.
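For reference, a quick way to confirm which Istio version a cluster is actually running is to look at the control-plane image tags. This is only a generic sketch, assuming the control plane lives in the istio-system namespace as in the default Kubeflow install:

  # Print the image (and therefore the Istio version) behind each deployment in istio-system
  kubectl -n istio-system get deployments -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.template.spec.containers[*].image}{"\n"}{end}'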

My way of installing Istio 1.3.5 is not the standard way used by kfctl. Basically, I removed all the Istio deployments and services and installed it manually via Helm.
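For anyone who wants to try the same workaround, the rough shape of a Helm-based Istio install is sketched below. This is the generic Istio 1.3 Helm 2 procedure, not the exact commands used above; the chart repo URL is the one the Istio 1.3 docs list and should be double-checked against the release notes, and replacing the kfctl-managed Istio objects like this is unsupported by Kubeflow:

  # Inspect (and then remove) the Istio deployments and services that kfctl installed;
  # the exact list depends on your configuration
  kubectl -n istio-system get deployments,services

  # Add the Istio 1.3.5 release charts and install with Helm 2
  helm repo add istio.io https://storage.googleapis.com/istio-release/releases/1.3.5/charts/
  helm install istio.io/istio-init --name istio-init --namespace istio-system
  # Wait for the istio-init CRD jobs to complete before installing the main chart
  helm install istio.io/istio --name istio --namespace istio-system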

Don’t mind me. I upgraded Istio to 1.3.5 via Helm and it’s working fine now.

@chanwit @karlschriek In that case, could you please share the status of the istio-system pods? It’s possible that the ingress gateway is blocked by Pilot, which is blocked by Galley, which is blocked by Citadel, and so on. Let’s make sure your Istio control plane is ready and healthy.
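(For example, the output of the following is usually enough to see where the chain is stuck; <stuck-pod-name> below is just a placeholder:)

  # Show the readiness and restart count of every Istio component
  kubectl -n istio-system get pods -o wide

  # If a component is not Ready, its events usually say why
  kubectl -n istio-system describe pod <stuck-pod-name>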