wandb: Unable to launch multiple runs with Joblib
Hi, thanks for the awesome library. I found it really helpful for my research!
However, I have run into problems when trying to launch multiple runs with joblib.
from math import sqrt
from joblib import Parallel, delayed
import wandb

def f(x):
    wandb.init(project="symppl", reinit=True)
    for i in range(10):
        loss = i
        # Log metrics with wandb
        # wandb.log({"Loss": loss})
    wandb.finish()
    return sqrt(x)

def main():
    res = Parallel(n_jobs=2)(delayed(f)(i**2) for i in range(4))
    print(res)

if __name__ == "__main__":
    main()
This fails with the following exception:
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/Users/ethan/anaconda3/envs/symppl/lib/python3.8/multiprocessing/spawn.py", line 116, in spawn_main
exitcode = _main(fd, parent_sentinel)
File "/Users/ethan/anaconda3/envs/symppl/lib/python3.8/multiprocessing/spawn.py", line 125, in _main
prepare(preparation_data)
File "/Users/ethan/anaconda3/envs/symppl/lib/python3.8/multiprocessing/spawn.py", line 231, in prepare
set_start_method(data['start_method'], force=True)
File "/Users/ethan/anaconda3/envs/symppl/lib/python3.8/multiprocessing/context.py", line 247, in set_start_method
self._actual_context = self.get_context(method)
File "/Users/ethan/anaconda3/envs/symppl/lib/python3.8/multiprocessing/context.py", line 239, in get_context
return super().get_context(method)
File "/Users/ethan/anaconda3/envs/symppl/lib/python3.8/multiprocessing/context.py", line 193, in get_context
raise ValueError('cannot find context for %r' % method) from None
ValueError: cannot find context for 'loky'
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/Users/ethan/anaconda3/envs/symppl/lib/python3.8/multiprocessing/spawn.py", line 116, in spawn_main
exitcode = _main(fd, parent_sentinel)
File "/Users/ethan/anaconda3/envs/symppl/lib/python3.8/multiprocessing/spawn.py", line 125, in _main
prepare(preparation_data)
File "/Users/ethan/anaconda3/envs/symppl/lib/python3.8/multiprocessing/spawn.py", line 231, in prepare
set_start_method(data['start_method'], force=True)
File "/Users/ethan/anaconda3/envs/symppl/lib/python3.8/multiprocessing/context.py", line 247, in set_start_method
self._actual_context = self.get_context(method)
File "/Users/ethan/anaconda3/envs/symppl/lib/python3.8/multiprocessing/context.py", line 239, in get_context
return super().get_context(method)
File "/Users/ethan/anaconda3/envs/symppl/lib/python3.8/multiprocessing/context.py", line 193, in get_context
raise ValueError('cannot find context for %r' % method) from None
ValueError: cannot find context for 'loky'
Changing joblib’s backend to multiprocessing also causes an error.
Problem at: /Users/ethan/dev/wandb/run.py 6 f
Problem at: /Users/ethan/dev/wandb/run.py 6 f
Traceback (most recent call last):
File "/Users/ethan/anaconda3/envs/symppl/lib/python3.8/site-packages/wandb/sdk/wandb_init.py", line 575, in init
run = wi.init()
File "/Users/ethan/anaconda3/envs/symppl/lib/python3.8/site-packages/wandb/sdk/wandb_init.py", line 367, in init
backend.ensure_launched(
File "/Users/ethan/anaconda3/envs/symppl/lib/python3.8/site-packages/wandb/sdk/backend/backend.py", line 81, in ensure_launched
self.wandb_process.start()
File "/Users/ethan/anaconda3/envs/symppl/lib/python3.8/multiprocessing/process.py", line 118, in start
assert not _current_process._config.get('daemon'), \
AssertionError: daemonic processes are not allowed to have children
Traceback (most recent call last):
File "/Users/ethan/anaconda3/envs/symppl/lib/python3.8/site-packages/wandb/sdk/wandb_init.py", line 575, in init
run = wi.init()
File "/Users/ethan/anaconda3/envs/symppl/lib/python3.8/site-packages/wandb/sdk/wandb_init.py", line 367, in init
backend.ensure_launched(
File "/Users/ethan/anaconda3/envs/symppl/lib/python3.8/site-packages/wandb/sdk/backend/backend.py", line 81, in ensure_launched
self.wandb_process.start()
File "/Users/ethan/anaconda3/envs/symppl/lib/python3.8/multiprocessing/process.py", line 118, in start
assert not _current_process._config.get('daemon'), \
AssertionError: daemonic processes are not allowed to have children
wandb: ERROR Abnormal program exit
wandb: ERROR Abnormal program exit
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
File "/Users/ethan/anaconda3/envs/symppl/lib/python3.8/site-packages/wandb/sdk/wandb_init.py", line 575, in init
run = wi.init()
File "/Users/ethan/anaconda3/envs/symppl/lib/python3.8/site-packages/wandb/sdk/wandb_init.py", line 367, in init
backend.ensure_launched(
File "/Users/ethan/anaconda3/envs/symppl/lib/python3.8/site-packages/wandb/sdk/backend/backend.py", line 81, in ensure_launched
self.wandb_process.start()
File "/Users/ethan/anaconda3/envs/symppl/lib/python3.8/multiprocessing/process.py", line 118, in start
assert not _current_process._config.get('daemon'), \
AssertionError: daemonic processes are not allowed to have children
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/Users/ethan/anaconda3/envs/symppl/lib/python3.8/multiprocessing/pool.py", line 125, in worker
result = (True, func(*args, **kwds))
File "/Users/ethan/anaconda3/envs/symppl/lib/python3.8/site-packages/joblib/_parallel_backends.py", line 595, in __call__
return self.func(*args, **kwargs)
File "/Users/ethan/anaconda3/envs/symppl/lib/python3.8/site-packages/joblib/parallel.py", line 252, in __call__
return [func(*args, **kwargs)
File "/Users/ethan/anaconda3/envs/symppl/lib/python3.8/site-packages/joblib/parallel.py", line 252, in <listcomp>
return [func(*args, **kwargs)
File "/Users/ethan/dev/wandb/run.py", line 6, in f
wandb.init(project="symppl", reinit=True)
File "/Users/ethan/anaconda3/envs/symppl/lib/python3.8/site-packages/wandb/sdk/wandb_init.py", line 612, in init
six.raise_from(Exception("problem"), error_seen)
File "<string>", line 3, in raise_from
Exception: problem
"""
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "run.py", line 19, in <module>
main()
File "run.py", line 15, in main
res = Parallel(n_jobs=2, backend="multiprocessing")(delayed(f)(i**2) for i in range(4))
File "/Users/ethan/anaconda3/envs/symppl/lib/python3.8/site-packages/joblib/parallel.py", line 1042, in __call__
self.retrieve()
File "/Users/ethan/anaconda3/envs/symppl/lib/python3.8/site-packages/joblib/parallel.py", line 921, in retrieve
self._output.extend(job.get(timeout=self.timeout))
File "/Users/ethan/anaconda3/envs/symppl/lib/python3.8/multiprocessing/pool.py", line 771, in get
raise self._value
Exception: problem
/Users/ethan/anaconda3/envs/symppl/lib/python3.8/multiprocessing/resource_tracker.py:216: UserWarning: resource_tracker: There appear to be 12 leaked semaphore objects to clean up at shutdown
warnings.warn('resource_tracker: There appear to be %d '
My workflow consists of sweeping over hyper-parameters with hydra. This currently blocks me from using Hydra with a parallel launcher (e.g. joblib). I think this would also block support for hydra multiruns; see https://github.com/wandb/client/issues/1233#issuecomment-693139976
Is there a way that I can work around this issue?
About this issue
- State: closed
- Created 4 years ago
- Reactions: 1
- Comments: 18 (7 by maintainers)
Hey @tiborsekera, with the current release of the client library you can set the following environment variable so that everything works with joblib:
os.environ["WANDB_START_METHOD"] = "thread"
@ethanluoyc Thanks for reporting this. As a temporary workaround until we fix this you can do:
The only other change you will need to make in your sample script is to replace wandb.finish() with wandb.join().
We are using a different approach in 0.10.x that appears to be conflicting with joblib’s multiprocessing implementations.
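The conflict with the multiprocessing backend can be reproduced without wandb or joblib. The traceback above shows wandb.init starting a process (self.wandb_process.start()) from inside a pool worker. Pool workers are daemonic, and Python refuses to let a daemonic process start children. Below is a minimal stdlib-only sketch of that rule, assuming a POSIX system where the "fork" start method is available. The helper names are hypothetical and nothing here is wandb's or joblib's actual code.

```python
import multiprocessing as mp


def _noop():
    """Target for the child process; it never actually runs."""


def try_to_start_child(_):
    """Runs inside a pool worker and tries to start a child process."""
    try:
        child = mp.Process(target=_noop)
        child.start()  # the daemon check happens here, before anything is launched
        child.join()
        return "child started"
    except AssertionError as exc:
        return f"AssertionError: {exc}"


def start_child_from_pool_worker():
    """Run try_to_start_child in a one-worker pool and return its report."""
    # Assumption: "fork" keeps the sketch short; it is not available on Windows.
    ctx = mp.get_context("fork")
    with ctx.Pool(processes=1) as pool:
        return pool.map(try_to_start_child, [0])[0]
```

Calling start_child_from_pool_worker() should return a string that starts with "AssertionError" and carries the same "daemonic processes are not allowed to have children" message seen in the report.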
@tiborsekera @fses91 good to hear. We’ve updated our documentation here: https://docs.wandb.ai/guides/track/advanced/distributed-training#hanging-at-the-beginning-of-training
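For anyone landing here, the environment-variable workaround applied to the script from the original report might look roughly like the sketch below. It assumes a client release that honours WANDB_START_METHOD, as described in this thread. The lazy imports inside the functions are a choice made for this sketch, not something the maintainers prescribed.

```python
import os
from math import sqrt

# Set this before wandb.init() is called. Worker processes started by joblib
# inherit the parent's environment, so they see the same value.
os.environ["WANDB_START_METHOD"] = "thread"


def f(x):
    # Imported here so that each worker sets up wandb on its own.
    import wandb

    wandb.init(project="symppl", reinit=True)
    for i in range(10):
        loss = i
        wandb.log({"Loss": loss})
    wandb.finish()
    return sqrt(x)


def main():
    # Imported here so that importing this module stays free of side effects.
    from joblib import Parallel, delayed

    return Parallel(n_jobs=2)(delayed(f)(i**2) for i in range(4))
```

As in the original script, main() should be called from under an `if __name__ == "__main__":` guard. Running it needs joblib, wandb, and a logged-in wandb account.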
I was able to reproduce the issue and made a hot fix. We’ll need to put this through some more rigorous testing, but if @varun19299 or @ethanluoyc want to give joblib a try with 0.10.x you can install my hotfix branch with: