pipelines: Kubeflow Pipelines timing out on Azure deployment
What steps did you take:
- Run an in-house Kubeflow notebook template.
- Get a timeout when running the pipeline.
What happened:
When submitting a pipeline I’m getting a timeout; this has been happening sporadically.
# Run the pipeline on the Kubeflow cluster
# host, cookies, pipeline, experiment_name, namespace and pipeline_name
# are defined earlier in the notebook template.
import kfp

pipeline_run = (
    kfp
    .Client(host=f'{host}/pipeline', cookies=cookies)
    .create_run_from_pipeline_func(
        pipeline,
        arguments={},
        experiment_name=experiment_name,
        namespace=namespace,
        run_name=pipeline_name
    )
)
/opt/conda/lib/python3.7/site-packages/kfp_server_api/rest.py in request(self, method, url, query_params, headers, body, post_params, _preload_content, _request_timeout)
236
237 if not 200 <= r.status <= 299:
--> 238 raise ApiException(http_resp=r)
239
240 return r
ApiException: (504)
Reason: Gateway Timeout
HTTP response headers: HTTPHeaderDict({'content-length': '24', 'content-type': 'text/plain', 'date': 'Mon, 24 Aug 2020 20:24:57 GMT', 'server': 'envoy', 'x-envoy-upstream-service-time': '300028'})
HTTP response body: upstream request timeout
What did you expect to happen:
Consistent behavior from the Kubeflow Pipelines API (runs created without gateway timeouts).
Environment:
Azure AKS
How did you deploy Kubeflow Pipelines (KFP)?
KFP version: 1.0.0
KFP SDK version: 1.0.0
Anything else you would like to add:
I’m looking for a way to manipulate the TCP keepalive on Kubeflow Pipelines; it’s hard to tell whether this error is in Kubeflow Pipelines or in Argo. On the Kubeflow Pipelines API these calls hung for a while and never seemed to release:
I0824 20:20:00.099206 6 util.go:396] Authorized user e23e799e-de9b-4388-99e7-8efe8ab6c072 in namespace e23e799e-de9b-4388-99e7-8efe8ab6c072
I0824 20:21:58.067818 6 interceptor.go:29] /api.RunService/CreateRun handler starting
I0824 20:21:58.753968 6 util.go:396] Authorized user e23e799e-de9b-4388-99e7-8efe8ab6c072 in namespace e23e799e-de9b-4388-99e7-8efe8ab6c072
I0824 20:23:58.117694 6 interceptor.go:29] /api.RunService/CreateRun handler starting
I0824 20:23:58.798816 6 util.go:396] Authorized user e23e799e-de9b-4388-99e7-8efe8ab6c072 in namespace e23e799e-de9b-4388-99e7-8efe8ab6c072
We know the Azure platform has issues with connections to the Kubernetes API and that you have to tweak the TCP keepalive in your applications, so perhaps this could be a solution.
/kind bug
So it turns out this was an issue with AKS; I developed a simple Kubernetes API job to prove this:
https://github.com/maganaluis/k8s-api-golang/blob/time-out/analysis.md
The job uses just about the same library versions as the Kubeflow Pipelines API, and it uses the Argo client to submit a small workflow, mimicking what KFP does here. The job stays idle for 5 minutes between submissions, which is normal behavior; you don’t expect the API to be used 100 percent of the time.
If you run this job without the Istio sidecar enabled, the job completes, because Go's default TCP keepalive settings are fairly robust. However, the Istio sidecar alters these settings and probably falls back to the Linux defaults.
This is not to say that the fault here lies with the Istio sidecar: we ran the same job on AWS and GCP and it passed without timeouts or dropped connections. Regardless, the sidecar is required for multi-user mode, so it must be enabled.
Istio provides a quite powerful mechanism that lets you set TCP keepalive for traffic going to any service in Kubernetes. You can read more about it here:
https://preliminary.istio.io/latest/docs/reference/config/networking/destination-rule/#ConnectionPoolSettings
So the solution here is to set up a DestinationRule for the Kubernetes API, along the lines of the sketch below:
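A minimal sketch of such a rule, shown here being created through the kubernetes Python client purely for illustration; the rule name, target namespace, and keepalive values are assumptions, and the same spec can be written as a plain YAML manifest and applied with kubectl:

from kubernetes import client, config

# Illustrative Istio DestinationRule enabling TCP keepalive on traffic from the
# sidecar to the Kubernetes API server; name and timing values are assumptions.
destination_rule = {
    "apiVersion": "networking.istio.io/v1alpha3",
    "kind": "DestinationRule",
    "metadata": {"name": "kubernetes-api-keepalive"},
    "spec": {
        "host": "kubernetes.default.svc.cluster.local",
        "trafficPolicy": {
            "connectionPool": {
                "tcp": {
                    "tcpKeepalive": {"time": "60s", "interval": "30s", "probes": 3}
                }
            }
        },
    },
}

config.load_kube_config()  # or config.load_incluster_config() inside the cluster
client.CustomObjectsApi().create_namespaced_custom_object(
    group="networking.istio.io",
    version="v1alpha3",
    namespace="kubeflow",  # assumed namespace for the rule
    plural="destinationrules",
    body=destination_rule,
)

The intent is that the sidecar then sends keepalive probes on otherwise idle connections to the Kubernetes API instead of relying on the Linux defaults mentioned above.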
I would like to see this rule added at the Kubeflow installation level, to ensure Kubeflow works on any cloud platform. It would solve timeouts not only for Kubeflow Pipelines but for any other service that uses the Kubernetes API. Anyway, I’m leaving this here in case anyone else stumbles upon the same issue.
@Ark-kun @Bobgy @rmgogogo @dtzar
For contrast, we had to modify the Jupyter Web API on Kubeflow to avoid timeouts by adding the code below before instantiating the Kubernetes API client; this also solved issues with other applications relying on that API, such as Airflow and JupyterHub. We are wondering whether this will also be an issue with KFP.
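A minimal sketch of that kind of change, assuming the kubernetes Python client (which opens its connections through urllib3); this is not the exact patch applied to the Jupyter Web API, and the keepalive timing values are illustrative:

import socket

import urllib3
from kubernetes import client, config

# Extend urllib3's default socket options so every connection the kubernetes
# client opens sends TCP keepalive probes instead of sitting idle; the
# TCP_KEEP* constants below are Linux-specific, and the timing values are
# illustrative assumptions.
urllib3.connection.HTTPConnection.default_socket_options = (
    urllib3.connection.HTTPConnection.default_socket_options + [
        (socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1),     # enable keepalive on the socket
        (socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, 60),   # seconds idle before the first probe
        (socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, 30),  # seconds between probes
        (socket.IPPROTO_TCP, socket.TCP_KEEPCNT, 3),     # failed probes before the connection is dropped
    ]
)

# The options must be patched before the API client is instantiated.
config.load_incluster_config()
core_v1 = client.CoreV1Api()

The point of patching the options first is that connections created afterwards inherit the keepalive settings, which keeps otherwise idle connections to the API server alive through the Istio sidecar, the same failure mode described above for the Pipelines API.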