distributed: dask-ssh fails if Python is installed in different paths across the workers
I tested distributed on a very simple office “cluster”: my laptop and an office server. Both run Ubuntu 14.04, but I installed Python differently on each machine: on my laptop I did a user install of Miniconda, and on the server I installed Anaconda as root. The corresponding Python paths are:
- 10.1.0.115 --> `/home/aguirre/miniconda2/bin/python` (my laptop)
- 10.1.0.118 --> `/opt/anaconda2/bin/python` (server)
If I manually launch dask-worker and dask-scheduler, everything works fine.
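For reference, the working manual launch looked roughly like this (a hedged sketch; the exact flags may differ slightly from what I actually ran):

```
# On 10.1.0.115 (laptop): start the scheduler
$ dask-scheduler --port 8786

# On 10.1.0.115 and 10.1.0.118: start a worker pointed at the scheduler
$ dask-worker 10.1.0.115:8786
```

But if I try dask-ssh, it does not work: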
```
$ dask-ssh 10.1.0.{115,118}
---------------------------------------------------------------
Dask.distributed v1.11.0
Worker nodes:
0: 10.1.0.115
1: 10.1.0.118
scheduler node: 10.1.0.115:8786
---------------------------------------------------------------
[ scheduler 10.1.0.115:8786 ] : /home/aguirre/miniconda2/bin/python -m distributed.cli.dask_scheduler --port 8786
[ worker 10.1.0.115 ] : /home/aguirre/miniconda2/bin/python -m distributed.cli.dask_worker 10.1.0.115:8786 --host 10.1.0.115 --nthreads 0 --nprocs 1
[ worker 10.1.0.118 ] : /home/aguirre/miniconda2/bin/python -m distributed.cli.dask_worker 10.1.0.115:8786 --host 10.1.0.118 --nthreads 0 --nprocs 1
[ scheduler 10.1.0.115:8786 ] : distributed.scheduler - INFO - Scheduler at: 10.1.0.115:8786
[ scheduler 10.1.0.115:8786 ] : distributed.scheduler - INFO - http at: 10.1.0.115:9786
[ scheduler 10.1.0.115:8786 ] : distributed.scheduler - WARNING - Could not start Bokeh web UI
[ scheduler 10.1.0.115:8786 ] : Traceback (most recent call last):
[ scheduler 10.1.0.115:8786 ] : File "/home/aguirre/miniconda2/lib/python2.7/site-packages/distributed/cli/dask_scheduler.py", line 92, in main
[ scheduler 10.1.0.115:8786 ] : bokeh_proc = subprocess.Popen(args)
[ scheduler 10.1.0.115:8786 ] : File "/home/aguirre/miniconda2/lib/python2.7/subprocess.py", line 710, in __init__
[ scheduler 10.1.0.115:8786 ] : errread, errwrite)
[ scheduler 10.1.0.115:8786 ] : File "/home/aguirre/miniconda2/lib/python2.7/subprocess.py", line 1335, in _execute_child
[ scheduler 10.1.0.115:8786 ] : raise child_exception
[ scheduler 10.1.0.115:8786 ] : OSError: [Errno 2] No such file or directory
[ worker 10.1.0.118 ] : bash: /home/aguirre/miniconda2/bin/python: No such file or directory
[ worker 10.1.0.118 ] : remote process exited with exit status 127
```
As you can see, the worker on 10.1.0.118 tries to call Python at the wrong path (`/home/aguirre/miniconda2/bin/python`), which is actually the Python path on the scheduler node (10.1.0.115).
I took a look at the code and I think the problem lies on line 189 of cluster.py: it builds the command to be launched by each worker using the Python path of the node where dask-ssh was launched. Just to check, I hard-coded the Python path of 10.1.0.118 on line 189 of cluster.py, and it correctly launches that worker! However, it then fails to launch the worker on 10.1.0.115, which is to be expected…
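I have not worked out the actual code on line 189, but conceptually the bug seems to be this pattern: the worker command embeds the interpreter path of the machine running dask-ssh. Here is a hypothetical sketch of the pattern and one possible fix; the function and parameter names are made up for illustration and are not the real cluster.py code:

```python
import sys

def build_worker_command(scheduler_addr, remote_python=None):
    # Buggy pattern: sys.executable is the interpreter path on the
    # machine running dask-ssh, which may not exist on remote hosts.
    # python = sys.executable

    # Possible fix: let each remote shell resolve "python" from its
    # own PATH, or accept an explicit per-host interpreter override.
    python = remote_python or "python"
    return "%s -m distributed.cli.dask_worker %s" % (python, scheduler_addr)

# The same command then works on hosts with different install
# locations, as long as some "python" is on each host's PATH.
print(build_worker_command("10.1.0.115:8786"))
```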
By the way, I don’t think the exception raised by the scheduler (10.1.0.115) is related… it seems it simply does not find bokeh in the PATH. However, when I launch the scheduler by itself, it does manage to launch the Bokeh web UI. But let’s handle one problem at a time and focus on the Python path part of my case.
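One hedged guess at why Bokeh works interactively but not under dask-ssh: ssh runs commands in a non-interactive shell, which often skips the PATH setup done in ~/.bashrc. Comparing these two commands on the scheduler machine would confirm or rule that out (the mechanism is an assumption on my part):

```
# PATH as seen by an interactive shell
$ which bokeh

# PATH as seen by a non-interactive ssh command, similar to what dask-ssh uses
$ ssh 10.1.0.115 'which bokeh'
```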
I don’t have many clues about how this could be solved, but with some guidance, I’m willing to give a hand!
About this issue
- Original URL
- State: closed
- Created 8 years ago
- Comments: 15 (9 by maintainers)
Commits related to this issue
- Add section to docs about worker environments Fixes https://github.com/dask/distributed/issues/341 — committed to mrocklin/distributed by mrocklin 8 years ago
- Add section to docs about worker environments (#424) Fixes https://github.com/dask/distributed/issues/341 — committed to dask/distributed by mrocklin 8 years ago
@felipeam86 I see two starting options:
1. Add a note to the docs (`dask/docs/source/...rst`) saying that `dask-ssh` is assuming similar environments, such as you might see on a system with a shared file system.
2. Dive into `paramiko` and learn how to create a connection that respects user environments. This probably involves some googling, some doc reading, and some experimentation on your own two-machine cluster setup. Then play with the implementation in `dask/cluster.py` to implement the changes that you need in order to make things work well in your experiments.
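To make option 2 concrete, here is a minimal sketch of the kind of paramiko call being suggested, assuming key-based SSH authentication; running the command through a login shell (`bash -l`) lets each host resolve `python` from its own environment. The host, username, and command below are illustrative, not the actual cluster.py implementation:

```python
import paramiko

# Connect to a worker host (key-based authentication assumed).
client = paramiko.SSHClient()
client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
client.connect("10.1.0.118", username="aguirre")

# Run the worker through a login shell so the remote user's own PATH
# is used to resolve "python", instead of reusing the interpreter path
# of the machine that launched dask-ssh.
command = 'bash -l -c "python -m distributed.cli.dask_worker 10.1.0.115:8786"'
stdin, stdout, stderr = client.exec_command(command)

# Note: reading stdout blocks until the remote process exits; a real
# implementation would stream output in a background thread instead.
print(stdout.read())
client.close()
```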