dask-kubernetes: KubeCluster(scheduler_service_wait_timeout) parameter is not working.

Hi,

Thank you for providing Dask functionality on Kubernetes.

I am encountering a problem with KubeCluster() when deploying with deploy_mode="remote", passing a scheduler_pod_template of type kubernetes.client.V1Pod, and configuring Dask to create the scheduler service as type LoadBalancer.

This issue concerns the scheduler_service_wait_timeout parameter in particular. When I set it to 300 (an int, i.e. three hundred seconds), KubeCluster() still times out after 30 seconds and raises an error, even though my scheduler Pod and Service are running without issue. Regardless of the value I pass to the parameter, it times out after 30 seconds. The timeout triggers a cleanup, which then kills the Pod and Service in my EKS cluster.

I've read through the code, and I do not see where this timeout actually occurs within the dask-kubernetes package; it appears to come from the distributed package instead (in particular from distributed/deploy/spec.py).
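
As a rough check (a sketch; assuming the config keys registered by distributed and dask-kubernetes on import), both timeout settings can be inspected directly, and the 30 seconds I observe matches distributed's default connect timeout:

import dask
import distributed       # registers the distributed.* configuration defaults
import dask_kubernetes   # registers the kubernetes.* configuration defaults

# distributed's connect timeout defaults to "30s", which matches the
# 30-second failure I observe regardless of scheduler_service_wait_timeout.
print(dask.config.get("distributed.comm.timeouts.connect"))
print(dask.config.get("kubernetes.scheduler-service-wait-timeout"))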

What happened:

I get a Timeout error after 30 seconds.

What you expected to happen:

I do not get a Timeout error, since provisioning an ELB in EKS normally takes roughly 120 seconds, well within the 300 seconds I requested. Ideally there are no errors at all, but even just confirming that the scheduler_service_wait_timeout parameter is respected would count as solving this issue.

Minimal Complete Verifiable Example:



import dask
from kubernetes import client as k8sclient
from dask_kubernetes import KubeCluster, KubeConfig

# Scheduler pod template, tolerating the taint on the dedicated scheduler node
scheduler_pod = k8sclient.V1Pod(
    metadata=k8sclient.V1ObjectMeta(annotations={}),
    spec=k8sclient.V1PodSpec(
        containers=[
            k8sclient.V1Container(
                name="scheduler",
                image="daskdev/dask:latest",
                args=["dask-scheduler"],
            )
        ],
        tolerations=[
            k8sclient.V1Toleration(
                effect="NoSchedule",
                operator="Equal",
                key="nodeType",
                value="scheduler",
            )
        ],
    ),
)

# Expose the scheduler via a LoadBalancer service and allow up to 300 s for it
dask.config.set({"kubernetes.scheduler-service-type": "LoadBalancer"})
dask.config.set({"kubernetes.scheduler-service-wait-timeout": 300})

# Authenticate against the cluster using the local kubeconfig
auth = KubeConfig(config_file="~/.kube/config")

cluster = KubeCluster(pod_template="worker.yaml",
                      namespace="dask",
                      deploy_mode="remote",
                      n_workers=0,
                      scheduler_service_wait_timeout=300,
                      scheduler_pod_template=scheduler_pod)

Anything else we need to know?:

My EKS cluster is private. The Jupyter Lab machines are in a different subnet from the EKS cluster, but in the same VPC. Deployment mode is set to remote with KubeCluster().

We do not want to use Port-Forwarding from the Jupyter Lab machines.

Environment:

  • Dask version: Docker image is daskdev/dask:latest
  • Python version: 3.8.11
  • Operating System: Ubuntu 20.04.2 LTS
  • Install method (conda, pip, source): Conda

About this issue

  • Original URL
  • State: closed
  • Created 3 years ago
  • Comments: 22 (8 by maintainers)

Most upvoted comments

@jacobtomlinson I can now confirm that I was able to get KubeCluster() to work; further testing of functionality is now taking place. In the end I had to change some network firewall rules to make this work and avoid the TimeoutError.

Many thanks for the help!

Your template isn’t quite right. annotations isn’t a top-level key, it should be under the metadata key. So it should probably look like this.

import dask

dask.config.set(
    {
        "kubernetes.scheduler-service-template": {
            "apiVersion": "v1",
            "kind": "Service",
            "metadata": {
                "annotations": {
                    "service.beta.kubernetes.io/aws-load-balancer-internal": "true",
                },
            },
            "spec": {
                "selector": {
                    "dask.org/cluster-name": "",
                    "dask.org/component": "scheduler",
                },
                "ports": [
                    {
                        "name": "comm",
                        "protocol": "TCP",
                        "port": 8786,
                        "targetPort": 8786,
                    },
                    {
                        "name": "dashboard",
                        "protocol": "TCP",
                        "port": 8787,
                        "targetPort": 8787,
                    },
                ],
            },
        }
    }
)

Can you confirm that the Python and Dask versions are the same in the Jupyter Notebook and the container image that you are using (looks like the default daskdev/dask:latest)?
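
One way to check (a sketch; it assumes you only need to compare against what ships in the daskdev/dask:latest image):

import sys
import dask
import distributed

# Versions on the Jupyter Lab side, to compare against those inside the
# daskdev/dask:latest image used for the scheduler and workers.
print(sys.version)
print(dask.__version__, distributed.__version__)

# Once a cluster does come up, this raises if the client, scheduler and
# worker versions are inconsistent:
# from dask.distributed import Client
# Client(cluster).get_versions(check=True)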

Ah ok, that looks like AWS is provisioning the ELB but then the ELB takes time to actually become responsive. So the Dask Kubernetes service is being created successfully and then Distributed is timing out trying to connect to the ELB.

Instead of setting the Dask Kubernetes timeout you should be setting distributed.comm.timeouts.connect.
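
For example (a sketch; set this before creating the cluster, either in Python or via the DASK_DISTRIBUTED__COMM__TIMEOUTS__CONNECT environment variable):

import dask

# Allow up to five minutes for the initial connection to the scheduler,
# giving the ELB time to become responsive.
dask.config.set({"distributed.comm.timeouts.connect": "300s"})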

Thanks for raising this, I’ll dig into it. Can you please share the timeout error too?