distributed: Are reference cycles a performance problem?

@gjoseph92 noticed that, under some profiling conditions, turning off garbage collection had a significant impact on scheduler performance. I’m going to include some notes from him in the summary below

Notes from Gabe

See https://github.com/dask/distributed/discussions/4825 for initial discussion of the problem. It also comes up on https://github.com/dask/distributed/issues/4881#issuecomment-859208156.

I’ve also run these with GC debug mode on (https://github.com/gjoseph92/dask-profiling-coiled/commit/c0ea2aa1) and looked at the GC logs. Interestingly, GC debug mode generally reports GC as taking zero time:

gc: done, 0 unreachable, 0 uncollectable, 0.0000s elapsed

Some of those logs are here: https://rawcdn.githack.com/gjoseph92/dask-profiling-coiled/61fc875173a5b2f9195346f2a523cb1d876c48ad/results/cython-shuffle-gc-debug-noprofiling-ecs-prod-nopyspy.txt?raw=true
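
For reference, the standard library’s gc debug flags are enough to produce lines like the one above; a minimal sketch (the linked commit presumably wires something along these lines into the scheduler process, which isn’t reproduced here):

import gc

# DEBUG_STATS prints the "gc: done, N unreachable, M uncollectable,
# X.XXXXs elapsed" summary for every collection; the other two flags list
# the objects found to be collectable/uncollectable.
gc.set_debug(gc.DEBUG_STATS | gc.DEBUG_COLLECTABLE | gc.DEBUG_UNCOLLECTABLE)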

The types of objects listed as collectable are interesting (cells, frames, tracebacks, asyncio Futures/Tasks, SelectorKey), since those are exactly the sorts of things you’d expect to create cycles. It’s also interesting that there are already ~150k objects in the oldest generation before the computation has even started, and ~300k (and growing) once it’s been running for a little while.
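
Those per-generation object counts can be checked directly from the gc module (a quick sketch; gc.get_objects(generation=...) needs Python 3.8+):

import gc
from collections import Counter

# Every object currently tracked in the oldest generation.
oldest = gc.get_objects(generation=2)
print(len(oldest), "objects in the oldest generation")

# Rough histogram of which types dominate it.
print(Counter(type(o).__name__ for o in oldest).most_common(10))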

I’ve also tried:

  • turning off statistical profiling
  • turning off the bokeh dashboard
  • using uvloop instead of native asyncio

But none of those affected the issue.

What I wanted to do next was use refcycle or objgraph or a similar tool to try to see what’s causing the cycles. Or possibly use tracemalloc + GC hooks to try to log where the objects that were being collected were initially created.
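
A rough sketch of that second idea, in case anyone wants to pick it up: with gc.DEBUG_SAVEALL the collected objects are kept in gc.garbage instead of being freed, and tracemalloc can then report where each one was allocated (illustrative only, not existing distributed code):

import gc
import tracemalloc

tracemalloc.start(25)           # keep up to 25 frames per allocation
gc.set_debug(gc.DEBUG_SAVEALL)  # collected objects end up in gc.garbage


def report_collected(phase, info):
    # gc callback: runs after every collection ("stop" phase)
    if phase != "stop":
        return
    for obj in gc.garbage:
        tb = tracemalloc.get_object_traceback(obj)
        if tb is not None:
            print(type(obj).__name__)
            print("\n".join(tb.format()))
    gc.garbage.clear()


gc.callbacks.append(report_collected)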

I notice that we have reference cycles in our scheduler state:

In [1]: from dask.distributed import Client
In [2]: client = Client()
In [3]: import dask.array as da
In [4]: x = da.random.random((1000, 1000)).sum().persist()
In [5]: s = client.cluster.scheduler
In [6]: a, b = s.tasks.values()

In [7]: a
Out[7]: <TaskState "('sum-aggregate-832c859ad539eafe39d0e7207de9f1e7',)" memory>

In [8]: b
Out[8]: <TaskState "('random_sample-sum-sum-aggregate-832c859ad539eafe39d0e7207de9f1e7',)" released>

In [9]: a in b.dependents
Out[9]: True
In [10]: b in a.dependencies
Out[10]: True
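
For anyone skimming, here is a stripped-down illustration (a toy class, not the real TaskState) of why those bidirectional links matter: once two objects point at each other, reference counting alone can never free them, and they sit around until a cyclic GC pass:

import gc
import weakref


class FakeTaskState:
    # toy stand-in for scheduler TaskState: just the dependency links
    def __init__(self, key):
        self.key = key
        self.dependencies = set()
        self.dependents = set()


t1, t2 = FakeTaskState("sum"), FakeTaskState("random_sample")
t1.dependencies.add(t2)
t2.dependents.add(t1)            # t1 <-> t2 is now a reference cycle

alive = weakref.ref(t1)
del t1, t2                       # refcounts never reach zero on their own
assert alive() is not None

gc.collect()                     # only the cyclic collector reclaims them
assert alive() is None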

Should we be concerned about our use of reference cycles?

cc @jakirkham @pitrou

About this issue

  • State: open
  • Created 3 years ago
  • Comments: 25 (17 by maintainers)

Most upvoted comments

@dhirschfeld thanks a ton for these links—they’re very helpful. Something like https://github.com/python-trio/trio/pull/1805#issuecomment-730633206 is what we’ll want to do to debug this.

Also, I wouldn’t be surprised if our issue is similar to this Trio one (raising exceptions within frames that hold references to other exceptions/frames/traceback objects), with a similar solution. I’ve been suspicious of this since seeing how so many of the objects being collected in the debug logs are Cells, Tracebacks, Frames, etc. Intuitively you’d read that sort of code and think “how does adding a del as the last line of a function matter; the function is about to go out of scope anyway.” But it’s a different story entirely when something (the raised exception) holds a reference to that Frame.
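
To make that concrete, here is a toy version of the pattern (illustrative only, not code from distributed or Trio): the escaping exception keeps a traceback, the traceback keeps the frame, and the frame keeps every local alive, so the seemingly pointless del actually releases things:

import gc


def handler():
    big = bytearray(10 ** 7)     # any heavy local; could itself be an exception
    try:
        1 / 0
    except ZeroDivisionError as exc:
        saved = exc              # keep the exception beyond the except block
    # saved.__traceback__ references this frame, and the frame's locals
    # include `big` and `saved` itself, so returning `saved` pins `big`
    # in memory and forms a frame <-> traceback cycle.
    del big                      # the "pointless-looking" del that matters
    return saved


err = handler()
del err
print(gc.collect())              # the frame/traceback cycle is reclaimed here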

The progressive increase of runtime over runs could be explained by a progressive increase of Python objects that gc.collect() has to go through at every iteration - read: a leak. Do you have a measure of that? Does it correlate with runtime? If so, I suggest that you (see the sketch after this list):

  1. do one run (to make sure everything’s warm)
  2. run pympler to log the id of all objects in the interpreter
  3. run many more runs and observe the progressive slowdown
  4. run pympler again to list all the objects in the interpreter, then remove those that were already present in (2)
  5. use pympler.refbrowser (warning: mortally slow) to find out what’s referencing them and why they aren’t being garbage collected
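
A sketch of what steps (2), (4), and (5) might look like with pympler (run_workload_many_times is a hypothetical placeholder for step (3); comparing ids is approximate, since CPython can reuse addresses of freed objects):

from pympler import muppy, refbrowser


def run_workload_many_times():
    # placeholder for step (3): repeat the real computation here
    pass


# (2) snapshot the ids of every object the interpreter currently tracks
baseline = {id(o) for o in muppy.get_objects()}

run_workload_many_times()

# (4) objects that exist now but were not in the baseline
leaked = [o for o in muppy.get_objects() if id(o) not in baseline]
print(len(leaked), "objects survived the runs")

# (5) walk the referrers of a suspicious survivor (warning: very slow)
if leaked:
    refbrowser.ConsoleBrowser(leaked[0], maxdepth=3).print_tree()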