dask-cloudprovider: Not able to create Dask client using AzureMLCluster object

I have created a Dask cluster on Azure ML using the following API.

amlcluster = AzureMLCluster(ws,
                            vm_size="STANDARD_D1",
                            datastores=[Datastore.get(ws, "my_datastore")], 
                            environment_definition=ws.environments['AzureML-Dask-CPU'], 
                            initial_node_count=2, 
                            scheduler_idle_timeout=600,
                            vnet='some_vnet',
                            subnet='subnet1',
                            vnet_resource_group='some_rsrc_group',
                            ct_name="my_dask_cluster"
)

Once the cluster is created, if I try to print the variable amlcluster in Jupyter Lab, it throws the following error.

KeyError Traceback (most recent call last) /anaconda/envs/azureml_custom_py37/lib/python3.7/site-packages/IPython/core/formatters.py in call(self, obj) 916 method = get_real_method(obj, self.print_method) 917 if method is not None: –> 918 method() 919 return True 920

/anaconda/envs/azureml_custom_py37/lib/python3.7/site-packages/distributed/deploy/cluster.py in ipython_display(self, **kwargs) 361 from IPython.display import display 362 –> 363 data = {“text/plain”: repr(self), “text/html”: self.repr_html()} 364 display(data, raw=True) 365

/anaconda/envs/azureml_custom_py37/lib/python3.7/site-packages/distributed/deploy/cluster.py in repr(self) 389 self._cluster_class_name, 390 self.scheduler_address, –> 391 len(self.scheduler_info[“workers”]), 392 sum(w[“nthreads”] for w in self.scheduler_info[“workers”].values()), 393 )

KeyError: ‘workers’

After the error, it provides just a Dashboard Link. Not sure if it is supposed to print anything else.

image

If I try to create a Dask Client that alos fails:

client = Client(amlcluster)

This is how the library version looks like for me:

dask 2.20.0 py_0 dask-cloudprovider 0.4.1 <pip> dask-core 2.20.0 py_0 dask-glm 0.2.0 py_1 conda-forge dask-ml 1.6.0 py_0 conda-forge dask-xgboost 0.1.10 <pip>

About this issue

  • Original URL
  • State: closed
  • Created 4 years ago
  • Comments: 16 (16 by maintainers)

Commits related to this issue

Most upvoted comments

Huh, ok! Guess the problem is more recent than 2.20! Maybe there’s another line of code added somewhere elses. I’ll change the bump suggestion in #165 to 2.30.

Replicated! I downgraded distributed to 2.11, and I got this:

image

Opening new issue about bumping the version requirement.

@arnabbiswas1 what version of distributed do you have? Looks like my students had 2.11, and comparing 2.11 to 2.30, it looks like we might have an explanation:

image

See distributed/deploy/cluster.py diff here: https://github.com/dask/distributed/compare/2.11.0...2.30.1

I have three students who are getting the same thing, and we cannot for the life of us get it figured out. We’ve tried upgrading dask (to 2.30), dask-cloudprovider (0.4.1), upgrading jupyter, making sure widgets work, etc.

When the students try and print amlcluster, they get:

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
~/opt/anaconda3/lib/python3.7/site-packages/IPython/core/formatters.py in __call__(self, obj)
    916             method = get_real_method(obj, self.print_method)
    917             if method is not None:
--> 918                 method()
    919                 return True
    920 
~/opt/anaconda3/lib/python3.7/site-packages/distributed/deploy/cluster.py in _ipython_display_(self, **kwargs)
    339 
    340     def _ipython_display_(self, **kwargs):
--> 341         widget = self._widget()
    342         if widget is not None:
    343             return widget._ipython_display_(**kwargs)
~/opt/anaconda3/lib/python3.7/site-packages/dask_cloudprovider/providers/azure/azureml.py in _widget(self)
    858         jupyter = HTML(jupyter_link)
    859 
--> 860         status = HTML(self._widget_status(), layout=Layout(min_width="150px"))
    861 
    862         if self._supports_scaling:
~/opt/anaconda3/lib/python3.7/site-packages/dask_cloudprovider/providers/azure/azureml.py in _widget_status(self)
    765     def _widget_status(self):
    766         ### reporting proper number of nodes vs workers in a multi-GPU worker scenario
--> 767         nodes = len(self.scheduler_info["workers"])
    768 
    769         if self.use_gpu:
KeyError: 'workers'
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
~/opt/anaconda3/lib/python3.7/site-packages/IPython/core/formatters.py in __call__(self, obj)
    700                 type_pprinters=self.type_printers,
    701                 deferred_pprinters=self.deferred_printers)
--> 702             printer.pretty(obj)
    703             printer.flush()
    704             return stream.getvalue()
~/opt/anaconda3/lib/python3.7/site-packages/IPython/lib/pretty.py in pretty(self, obj)
    403                         if cls is not object \
    404                                 and callable(cls.__dict__.get('__repr__')):
--> 405                             return _repr_pprint(obj, self, cycle)
    406 
    407             return _default_pprint(obj, self, cycle)
~/opt/anaconda3/lib/python3.7/site-packages/IPython/lib/pretty.py in _repr_pprint(obj, p, cycle)
    693     """A pprint that just redirects to the normal repr function."""
    694     # Find newlines and replace them with p.break_()
--> 695     output = repr(obj)
    696     lines = output.splitlines()
    697     with p.group():
~/opt/anaconda3/lib/python3.7/site-packages/distributed/deploy/cluster.py in __repr__(self)
    367             self._cluster_class_name,
    368             self.scheduler_address,
--> 369             len(self.scheduler_info["workers"]),
    370             sum(w["nthreads"] for w in self.scheduler_info["workers"].values()),
    371         )
KeyError: 'workers'

And if they try and pass it to Client(), they get:

---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
<ipython-input-7-3c5778c46797> in <module>
      1 from dask.distributed import Client
      2 
----> 3 c = Client(amlcluster)
~/opt/anaconda3/lib/python3.7/site-packages/distributed/client.py in __init__(self, address, loop, timeout, set_as_default, scheduler_file, security, asynchronous, name, heartbeat_interval, serializers, deserializers, extensions, direct_to_workers, **kwargs)
    721             ext(self)
    722 
--> 723         self.start(timeout=timeout)
    724         Client._instances.add(self)
    725 
~/opt/anaconda3/lib/python3.7/site-packages/distributed/client.py in start(self, **kwargs)
    894             self._started = asyncio.ensure_future(self._start(**kwargs))
    895         else:
--> 896             sync(self.loop, self._start, **kwargs)
    897 
    898     def __await__(self):
~/opt/anaconda3/lib/python3.7/site-packages/distributed/utils.py in sync(loop, func, callback_timeout, *args, **kwargs)
    346     if error[0]:
    347         typ, exc, tb = error[0]
--> 348         raise exc.with_traceback(tb)
    349     else:
    350         return result[0]
~/opt/anaconda3/lib/python3.7/site-packages/distributed/utils.py in f()
    330             if callback_timeout is not None:
    331                 future = asyncio.wait_for(future, callback_timeout)
--> 332             result[0] = yield future
    333         except Exception as exc:
    334             error[0] = sys.exc_info()
~/opt/anaconda3/lib/python3.7/site-packages/tornado/gen.py in run(self)
    733 
    734                     try:
--> 735                         value = future.result()
    736                     except Exception:
    737                         exc_info = sys.exc_info()
~/opt/anaconda3/lib/python3.7/site-packages/distributed/client.py in _start(self, timeout, **kwargs)
    989 
    990         try:
--> 991             await self._ensure_connected(timeout=timeout)
    992         except OSError:
    993             await self._close()
~/opt/anaconda3/lib/python3.7/site-packages/distributed/client.py in _ensure_connected(self, timeout)
   1072         else:
   1073             msg = await comm.read()
-> 1074         assert len(msg) == 1
   1075         assert msg[0]["op"] == "stream-start"
   1076 
AssertionError: 

Meanwhile, on Azure they seem to have three happy active nodes, so seems like the cluster was created, there’s just a subsequent issue…

(@arnabbiswas1 you’re making me feel bad because that’s my site! And yet… I don’t know the problem! Sorry. 😦 )

It’s also extremely consistent – we spent an hour and a half trying to figure it out, and over and over we got this.

@arnabbiswas1 are you on the same vnet as the cluster trying to access it?

@quasiben @drabastomek fyi I setup a test which runs every 2 hours here: https://github.com/Azure/azureml-examples/actions?query=workflow%3Arun-tutorial-ud - it is mostly green, majority of the failures have been my fault