distributed: dask-ssh fails if Python is installed in different paths across the workers

I tested distributed on a very simple office “cluster”: my laptop and an office server. Both run Ubuntu 14.04, but I installed Python differently on each: on my laptop I did a user install of Miniconda, and on the server I installed Anaconda as root. The corresponding Python paths are:

10.1.0.115 --> /home/aguirre/miniconda2/bin/python (my laptop)
10.1.0.118 --> /opt/anaconda2/bin/python (server)

If I manually launch dask-worker and dask-scheduler, everything works fine. But if I try dask-ssh, it does not work:

$ dask-ssh 10.1.0.{115,118}
---------------------------------------------------------------
                 Dask.distributed v1.11.0

Worker nodes:
  0: 10.1.0.115
  1: 10.1.0.118

scheduler node: 10.1.0.115:8786
---------------------------------------------------------------


[ scheduler 10.1.0.115:8786 ] : /home/aguirre/miniconda2/bin/python -m distributed.cli.dask_scheduler --port 8786
[ worker 10.1.0.115 ] : /home/aguirre/miniconda2/bin/python -m distributed.cli.dask_worker 10.1.0.115:8786 --host 10.1.0.115 --nthreads 0 --nprocs 1
[ worker 10.1.0.118 ] : /home/aguirre/miniconda2/bin/python -m distributed.cli.dask_worker 10.1.0.115:8786 --host 10.1.0.118 --nthreads 0 --nprocs 1
[ scheduler 10.1.0.115:8786 ] : distributed.scheduler - INFO - Scheduler at:           10.1.0.115:8786
[ scheduler 10.1.0.115:8786 ] : distributed.scheduler - INFO -      http at:           10.1.0.115:9786
[ scheduler 10.1.0.115:8786 ] : distributed.scheduler - WARNING - Could not start Bokeh web UI
[ scheduler 10.1.0.115:8786 ] : Traceback (most recent call last):
[ scheduler 10.1.0.115:8786 ] :   File "/home/aguirre/miniconda2/lib/python2.7/site-packages/distributed/cli/dask_scheduler.py", line 92, in main
[ scheduler 10.1.0.115:8786 ] :     bokeh_proc = subprocess.Popen(args)
[ scheduler 10.1.0.115:8786 ] :   File "/home/aguirre/miniconda2/lib/python2.7/subprocess.py", line 710, in __init__
[ scheduler 10.1.0.115:8786 ] :     errread, errwrite)
[ scheduler 10.1.0.115:8786 ] :   File "/home/aguirre/miniconda2/lib/python2.7/subprocess.py", line 1335, in _execute_child
[ scheduler 10.1.0.115:8786 ] :     raise child_exception
[ scheduler 10.1.0.115:8786 ] : OSError: [Errno 2] No such file or directory
[ worker 10.1.0.118 ] : bash: /home/aguirre/miniconda2/bin/python: No such file or directory
[ worker 10.1.0.118 ] : remote process exited with exit status 127

As you can see, the worker on 10.1.0.118 tries to call Python at the wrong path (/home/aguirre/miniconda2/bin/python), which happens to be the Python path of the node running the scheduler (10.1.0.115).

I took a look at the code and I think the problem lies on line 189 of cluster.py: it builds the command to be launched by each worker using the Python path of the node where dask-ssh was launched. Just to check, I hard-coded the Python path of 10.1.0.118 on line 189 of cluster.py, and it correctly launches that worker! However, it then fails to launch a worker on 10.1.0.115, which is expected.
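To make the issue concrete, here is a rough sketch of what the command construction effectively does today and how it could be made configurable. This is not the actual cluster.py code, and remote_python is a hypothetical parameter:

import sys

def worker_command(scheduler_addr, remote_python=None):
    # cluster.py currently interpolates sys.executable -- the Python of
    # the machine where dask-ssh was launched -- into the command that is
    # run on every remote node. That path may not exist on other nodes.
    python = remote_python or sys.executable
    return ('%s -m distributed.cli.dask_worker %s --nthreads 0 --nprocs 1'
            % (python, scheduler_addr))

A per-host mapping, or simply letting each node resolve plain `python` through its own login PATH, would avoid hard-coding any one path.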

BTW, I don’t think the exception raised by the scheduler (10.1.0.115) is related; it seems it simply does not find bokeh in the PATH. However, when I launch the scheduler by itself, it does manage to launch the Bokeh web UI. But let’s handle one problem at a time and focus on the Python path part of my case.
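For what it’s worth, if someone wants to confirm that theory, a quick check (assuming the scheduler shells out to a bokeh executable; the exact command it spawns may differ) on Python 2 would be:

# Run on the scheduler node, in the same environment dask-ssh uses;
# prints None if no 'bokeh' executable is found on PATH.
from distutils.spawn import find_executable
print(find_executable('bokeh'))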

I don’t have many clues about how this could be solved, but with some guidance I’m willing to lend a hand!

About this issue

  • State: closed
  • Created 8 years ago
  • Comments: 15 (9 by maintainers)

Most upvoted comments

@felipeam86 I see two starting options:

  1. Add a note to the documentation (dask/docs/source/...rst) stating that dask-ssh assumes similar environments across nodes, such as you might see on a system with a shared file system.
  2. Play with paramiko and learn how to create a connection that respects user environments. This probably involves some googling, some doc reading, and some experimentation on your own two-machine cluster setup. Then adapt the implementation in dask/cluster.py to apply the changes that worked in your experiments (a rough sketch of the idea follows below).
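A minimal sketch of that second option, assuming paramiko as suggested above; remote_python here is a hypothetical helper, and the trick is to run the probe through a login shell so the remote user’s own PATH (including conda directories) is honoured:

import paramiko

def remote_python(host, username):
    # Ask each node for its own Python instead of assuming the local one.
    client = paramiko.SSHClient()
    client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
    client.connect(host, username=username)
    # 'bash -l -c' starts a login shell, which sources the user's profile
    # files -- typically where a conda/miniconda install extends PATH.
    _, stdout, _ = client.exec_command("bash -l -c 'which python'")
    path = stdout.read().strip()
    client.close()
    return path

The worker command could then be built per host from whatever this returns, rather than from the launching node’s sys.executable.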