distributed: Potential race condition in Nanny

Hi everybody, for the past few days we’ve been seeing “random” failures in our CI because distributed emits:

tornado.application - ERROR - Exception in callback <bound method Nanny.memory_monitor of <Nanny: None, threads: 1>>
Traceback (most recent call last):
  File "/opt/hostedtoolcache/Python/3.7.7/x64/lib/python3.7/site-packages/tornado/ioloop.py", line 907, in _run
    return self.callback()
  File "/opt/hostedtoolcache/Python/3.7.7/x64/lib/python3.7/site-packages/distributed/nanny.py", line 414, in memory_monitor
    process = self.process.process
AttributeError: 'NoneType' object has no attribute 'process'

This looks like a race condition in some shutdown path, e.g. while the Nanny is closing: the periodic callbacks are still running, but Nanny.process has already been set to None.
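A minimal, stdlib-only sketch of what I suspect is happening (hypothetical class names, not distributed’s actual code): a None-guard like the one below in memory_monitor would make the callback a harmless no-op once the process handle has been cleared during close.

```python
class ProcessHandle:
    """Stand-in for the nanny's child-process wrapper (hypothetical)."""
    process = "child-process"

class ToyNanny:
    """Not distributed's Nanny; just illustrates the suspected race."""

    def __init__(self):
        self.process = ProcessHandle()

    def memory_monitor(self):
        # Defensive guard: close() may already have cleared the handle
        # while this periodic callback is still scheduled on the loop.
        if self.process is None:
            return None
        return self.process.process  # the line that raises in the report

    def close(self):
        # If the periodic callback is not cancelled first, memory_monitor
        # can still fire after this assignment and hit AttributeError.
        self.process = None

nanny = ToyNanny()
print(nanny.memory_monitor())  # child-process
nanny.close()
print(nanny.memory_monitor())  # None, instead of AttributeError
```

Without the guard, the second call reproduces exactly the `'NoneType' object has no attribute 'process'` seen above.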

I think (I’m still investigating) we only started seeing this after 2.20 was released.

I cannot attach an MFE yet, simply because we don’t have one 😬; we only experience this on CI. I’d welcome any sort of feedback. I can try to give some context, though: the error is triggered in a Jupyter notebook by a cell that calls scipy.optimize.minimize from a Dask worker (cell 14 here; look for optimize.minimize). I doubt it’s ever going to be useful, but here’s an excerpt of the raw log (part of which I pasted above); note that it repeats over and over again for hundreds or thousands of lines…

Thanks!

About this issue

  • State: open
  • Created 4 years ago
  • Reactions: 2
  • Comments: 28 (11 by maintainers)

Most upvoted comments

I just ran into this issue on my local machine, so it is alive and well.

from distributed import Client
client = Client(asynchronous=False)

Results in this being output over and over again:

tornado.application - ERROR - Exception in callback <bound method Nanny.memory_monitor of <Nanny: None, threads: 3>>
Traceback (most recent call last):
  File "/home/arthur/.local/lib/python3.8/site-packages/tornado/ioloop.py", line 907, in _run
    return self.callback()
  File "/home/arthur/.local/lib/python3.8/site-packages/distributed/nanny.py", line 414, in memory_monitor
    process = self.process.process
AttributeError: 'NoneType' object has no attribute 'process'
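For what it’s worth, here is a stdlib-only sketch (hypothetical names, not distributed’s code) of the teardown ordering that would avoid this kind of race: cancel and await the monitoring task before dropping the process handle, so the callback can never observe the half-torn-down state.

```python
import asyncio

class ToyWorker:
    """Illustrative only: shows cancel-before-clear teardown ordering."""

    def __init__(self):
        self.process = object()   # stand-in for the child-process handle
        self._monitor_task = None

    async def start(self):
        self._monitor_task = asyncio.ensure_future(self._monitor())

    async def _monitor(self):
        while True:
            # With cancel-first ordering in close(), this can never
            # observe self.process as None mid-iteration.
            assert self.process is not None
            await asyncio.sleep(0.01)

    async def close(self):
        # Correct order: cancel and await the monitor *before* clearing
        # the handle the monitor dereferences.
        if self._monitor_task is not None:
            self._monitor_task.cancel()
            try:
                await self._monitor_task
            except asyncio.CancelledError:
                pass
            self._monitor_task = None
        self.process = None

async def main():
    worker = ToyWorker()
    await worker.start()
    await asyncio.sleep(0.05)  # let the monitor run a few times
    await worker.close()
    return worker.process is None

print(asyncio.run(main()))  # True
```

Reversing the two steps in close() (clearing self.process first) is the shape of bug the traceback above suggests.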

Known Causes:

  • Running this when os.getcwd() is in a directory without write privileges triggers the error every time.
  • My Django project via PyCharm’s console also triggers it every time.
    • I’m still investigating why.

Version Information

  • help(distributed) gives version 2.30.0
  • Ubuntu 20.04.1 LTS
  • Kernel 5.4.0-48-generic
  • Dask installed via pip3 install dask[complete] as root
  • free -h:
              total        used        free      shared  buff/cache   available
Mem:           15Gi        10Gi       318Mi       762Mi       4.8Gi       4.1Gi
Swap:          31Gi       1.0Gi        31Gi

@fjetter thanks a bunch for the help. It looks like the conda environment displayed in the upper right of the notebook wasn’t actually active for some reason. I ran the following in IPython instead, with the environment actually active:

import distributed
cluster = distributed.LocalCluster(nworkers=2)

and was able to get a more informative traceback with the main branch installed:

TypeError: __init__() got an unexpected keyword argument 'nworkers'

Using the correct argument, n_workers, solved my problem.

I’ve been able to trigger this error by setting silence_logs='error'.

This example doesn’t work on my local machine:

import contextlib
from concurrent.futures import as_completed

import dask
import distributed
from distributed.cfexecutor import ClientExecutor


@contextlib.contextmanager
def executor():
    with dask.config.set({"distributed.worker.daemon": True}):
        with distributed.LocalCluster(
                n_workers=5,
                processes=True,
                silence_logs='error'
        ) as cluster:
            with distributed.Client(cluster) as client:
                yield ClientExecutor(client)

def add(z):
    return z + 1

if __name__ == '__main__':

    with executor() as pool:
        futures = [pool.submit(add, k) for k in range(10)]
        for f in as_completed(futures):
            print(f.result())

Failing with:

tornado.application - ERROR - Exception in callback <bound method Nanny.memory_monitor of <Nanny: None, threads: 3>>
Traceback (most recent call last):
  File "/home/cjwright/mc/lib/python3.8/site-packages/tornado/ioloop.py", line 905, in _run
    return self.callback()
  File "/home/cjwright/mc/lib/python3.8/site-packages/distributed/nanny.py", line 414, in memory_monitor
    process = self.process.process
AttributeError: 'NoneType' object has no attribute 'process'

This works, though:

import contextlib
from concurrent.futures import as_completed

import dask
import distributed
from distributed.cfexecutor import ClientExecutor


@contextlib.contextmanager
def executor():
    with dask.config.set({"distributed.worker.daemon": True}):
        with distributed.LocalCluster(
                n_workers=5,
                processes=True,
                # silence_logs='error'
        ) as cluster:
            with distributed.Client(cluster) as client:
                yield ClientExecutor(client)

def add(z):
    return z + 1

if __name__ == '__main__':

    with executor() as pool:
        futures = [pool.submit(add, k) for k in range(10)]
        for f in as_completed(futures):
            print(f.result())

Does that work as a reproducible example?

(dask version 2020.12.0)