dask-kubernetes: Dask Operator becomes unresponsive after ~1hr

This one’s a bit tricky. I haven’t been able to reproduce it in a kind or minikube cluster — it only happens in our hosted environments (e.g., AKS, EKS). The dask operator deployment works fine for a while and then becomes unresponsive. I’ve seen it freeze after being up for almost exactly 1 hour and 2 hours.

The pod’s status is Running, but its logs are frozen and the operator no longer responds to KubeCluster instantiation. Has anyone seen this? If not, I’d really appreciate any guidance on how to efficiently narrow in on the root cause (e.g., increasing log levels, inspecting heartbeats, querying operator health/status).

Anything else we need to know?:

Note that restarting the dask operator deployment via kubectl rollout restart deployment dask-kubernetes-operator -n dask-operator recovers the operator after 60+ seconds (which is how long it takes to terminate the pod once it is frozen).
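If the 60+ second recovery time is dominated by the pod’s termination grace period (a guess, not confirmed in this issue), lowering terminationGracePeriodSeconds on the deployment should shorten the restart, since a frozen process will never exit gracefully anyway. A sketch of the patch, assuming the deployment and namespace names from the command above; the value is illustrative:

```yaml
# Hypothetical strategic-merge patch: shorten how long Kubernetes waits
# for the frozen operator process to exit before sending SIGKILL.
spec:
  template:
    spec:
      terminationGracePeriodSeconds: 5
```

Applied with something like kubectl patch deployment dask-kubernetes-operator -n dask-operator --patch-file patch.yaml.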

Here are the final lines of the dask-kubernetes-operator pod logs:

[2022-11-28 00:45:46,745] kopf.objects         [DEBUG   ] [6556cd3b-4385-4d5d-98dd-af70b0ad8f80/sedaro-dask] Updating diff: (('change', ('status', 'phase'), 'Created', 'Running'),)
[2022-11-28 00:45:46,745] kopf.objects         [DEBUG   ] [6556cd3b-4385-4d5d-98dd-af70b0ad8f80/sedaro-dask] Patching with: <redacted>
[2022-11-28 00:45:46,754] kubernetes_asyncio.c [DEBUG   ] response body: <redacted>
[2022-11-28 00:45:46,764] kubernetes_asyncio.c [DEBUG   ] response body: <redacted>
[2022-11-28 00:45:46,764] kopf.objects         [INFO    ] [6556cd3b-4385-4d5d-98dd-af70b0ad8f80/sedaro-dask-default] Successfully adopted by sedaro-dask
[2022-11-28 00:45:46,784] kubernetes_asyncio.c [DEBUG   ] response body: <redacted>
[2022-11-28 00:45:46,801] kubernetes_asyncio.c [DEBUG   ] response body: <redacted>
[2022-11-28 00:45:46,805] kopf.objects         [INFO    ] [6556cd3b-4385-4d5d-98dd-af70b0ad8f80/sedaro-dask-default] Scaled worker group sedaro-dask-default up to 1 workers.
[2022-11-28 00:45:46,805] kopf.objects         [INFO    ] [6556cd3b-4385-4d5d-98dd-af70b0ad8f80/sedaro-dask-default] Handler 'daskworkergroup_create' succeeded.
[2022-11-28 00:45:46,806] kopf.objects         [INFO    ] [6556cd3b-4385-4d5d-98dd-af70b0ad8f80/sedaro-dask-default] Creation is processed: 1 succeeded; 0 failed.
[2022-11-28 00:45:46,806] kopf.objects         [DEBUG   ] [6556cd3b-4385-4d5d-98dd-af70b0ad8f80/sedaro-dask-default] Patching with: <redacted>
[2022-11-28 00:45:46,858] kopf.objects         [DEBUG   ] [6556cd3b-4385-4d5d-98dd-af70b0ad8f80/sedaro-dask] Something has changed, but we are not interested (the essence is the same).
[2022-11-28 00:45:46,858] kopf.objects         [DEBUG   ] [6556cd3b-4385-4d5d-98dd-af70b0ad8f80/sedaro-dask] Handling cycle is finished, waiting for new changes.
[2022-11-28 00:45:46,927] kopf.objects         [DEBUG   ] [6556cd3b-4385-4d5d-98dd-af70b0ad8f80/sedaro-dask-default] Something has changed, but we are not interested (the essence is the same).
[2022-11-28 00:45:46,927] kopf.objects         [DEBUG   ] [6556cd3b-4385-4d5d-98dd-af70b0ad8f80/sedaro-dask-default] Handling cycle is finished, waiting for new changes.
[2022-11-28 00:46:32,107] kopf.objects         [DEBUG   ] [6556cd3b-4385-4d5d-98dd-af70b0ad8f80/sedaro-dask] Deleted, really deleted, and we are notified.
[2022-11-28 00:46:32,152] kopf.objects         [DEBUG   ] [6556cd3b-4385-4d5d-98dd-af70b0ad8f80/sedaro-dask-default] Deleted, really deleted, and we are notified.

Environment:

  • Dask version: 2022.10.1
  • Python version: 3.9.15
  • Operating System: Linux
  • Install method (conda, pip, source): pip

About this issue

  • State: closed
  • Created 2 years ago
  • Comments: 23 (22 by maintainers)

Most upvoted comments

Awesome, thanks @jacobtomlinson. I will get this tested and closed ASAP!

@baswelsh once #626 passes CI I’ll merge it and release 2022.11.2 so you can try it out.