ray: Running Ray with PyTorch Lightning in a SLURM job causes failure with error "ValueError: signal only works in main thread"

Hi,

I am a newbie trying to integrate Ray with PyTorch Lightning. I followed the instructions at https://docs.ray.io/en/master/tune/tutorials/tune-pytorch-lightning.html when setting up hyperparameter tuning with Ray. However, I encountered two issues while using Ray.

ISSUE 1

Importing the PyTorch Lightning integration from Ray Tune throws an error with ray 0.8.7.

Code: from ray.tune.integration.pytorch_lightning import TuneReportCallback, TuneReportCheckpointCallback

Error: ModuleNotFoundError: No module named 'ray.tune.integration.pytorch_lightning'

Module versions: ray 0.8.7, tensorflow 2.1.0, python 3.7.4

ISSUE 1: FIX

I fixed this by installing Ray from a nightly wheel (ray 0.9.0.dev0).
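
As a quick sanity check after installing the wheel (a minimal sketch; the import below is the same one that failed above), the integration module should now resolve:

# The ray.tune <-> PyTorch Lightning integration is absent in ray 0.8.7
# and present in the 0.9.0.dev0 nightly, so this import succeeding
# confirms the fix.
from ray.tune.integration.pytorch_lightning import (
    TuneReportCallback,
    TuneReportCheckpointCallback,
)
print("integration import OK")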

ISSUE 2

With the new Ray version, when I submit a SLURM job to run the tuning, I get the following error:

ray.tune.error.TuneError: Trial raised an exception. Traceback:
ray::ImplicitFunc.train() (pid=4432, ip=172.26.92.190)
  File "/home/user/.local/lib/python3.7/site-packages/ray/tune/function_runner.py", line 227, in run
    self._entrypoint()
  File "/home/user/.local/lib/python3.7/site-packages/ray/tune/function_runner.py", line 290, in entrypoint
    self._status_reporter.get_checkpoint())
  File "/home/user/.local/lib/python3.7/site-packages/ray/tune/function_runner.py", line 497, in _trainable_func
    output = train_func(config)
  File "tune.py", line 261, in train_run
    trainer.fit(model)
  File "/home/user/.local/lib/python3.7/site-packages/pytorch_lightning/trainer/states.py", line 48, in wrapped_fn
    result = fn(self, *args, **kwargs)
  File "/home/user/.local/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1073, in fit
    results = self.accelerator_backend.train(model)
  File "/home/user/.local/lib/python3.7/site-packages/pytorch_lightning/accelerators/gpu_backend.py", line 51, in train
    results = self.trainer.run_pretrain_routine(model)
  File "/home/user/.local/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1184, in run_pretrain_routine
    self.register_slurm_signal_handlers()
  File "/home/user/.local/lib/python3.7/site-packages/pytorch_lightning/trainer/training_io.py", line 240, in register_slurm_signal_handlers
    signal.signal(signal.SIGUSR1, self.sig_handler)
  File "/usr/local/easybuild-2019/easybuild/software/mpi/gcc/8.3.0/openmpi/3.1.4/python/3.7.4/lib/python3.7/signal.py", line 47, in signal
    handler = _signal.signal(_enum_to_int(signalnum), _enum_to_int(handler))
ValueError: signal only works in main thread

Can I get some advice on how to proceed after this?

About this issue

  • State: closed
  • Created 4 years ago
  • Comments: 15 (6 by maintainers)

Most upvoted comments

Can you try this hack? Add

os.environ["SLURM_JOB_NAME"] = "bash"

to your Python script?
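
For context, here is a minimal, self-contained sketch of where the hack can go, loosely following the Tune + PyTorch Lightning tutorial (MyLightningModule, train_run, and the toy data are placeholders, not from the original script). The key point is that the variable must be set before trainer.fit() runs in the Tune worker: a job name of "bash" makes Lightning treat the run as an interactive session and skip register_slurm_signal_handlers(), which is the signal.signal() call that fails off the main thread.

import os

import torch
import pytorch_lightning as pl
from torch.utils.data import DataLoader, TensorDataset
from ray import tune
from ray.tune.integration.pytorch_lightning import TuneReportCallback


class MyLightningModule(pl.LightningModule):
    # Tiny placeholder model so the sketch runs end to end.
    def __init__(self, config):
        super().__init__()
        self.layer = torch.nn.Linear(8, 1)
        self.lr = config["lr"]

    def training_step(self, batch, batch_idx):
        x, y = batch
        return torch.nn.functional.mse_loss(self.layer(x), y)

    def validation_step(self, batch, batch_idx):
        x, y = batch
        self.log("val_loss", torch.nn.functional.mse_loss(self.layer(x), y))

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=self.lr)

    def train_dataloader(self):
        return DataLoader(TensorDataset(torch.randn(64, 8), torch.randn(64, 1)), batch_size=16)

    def val_dataloader(self):
        return DataLoader(TensorDataset(torch.randn(16, 8), torch.randn(16, 1)), batch_size=16)


def train_run(config):
    # Set this inside the trainable so it is guaranteed to run in the Tune
    # worker before trainer.fit(); Lightning then skips
    # register_slurm_signal_handlers() and never calls signal.signal()
    # from this non-main thread.
    os.environ["SLURM_JOB_NAME"] = "bash"
    model = MyLightningModule(config)
    trainer = pl.Trainer(
        max_epochs=2,
        callbacks=[TuneReportCallback({"loss": "val_loss"}, on="validation_end")],
    )
    trainer.fit(model)


if __name__ == "__main__":
    tune.run(train_run, config={"lr": tune.loguniform(1e-4, 1e-1)}, num_samples=4)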

I also have this problem, downgrading PTL to 1.4.8 seems to solve it. Pretty sure something in 1.5.0 broke it, but I don’t know if 1.5.1 fixes it.

Same issue. PTL 1.4.8 works, 1.5.1 and 1.5.2 do not.

Thank you sooooo much.

I also encountered the same issue, ValueError: signal only works in main thread of the main interpreter, while following the tutorial Using PyTorch Lightning with Tune.

The problem was finally solved by downgrading PTL from 1.5.2 to 1.4.8; see the version-check sketch after the list below.

Package manager:

  • conda 4.10.1

Module versions and the change:

  • pytorch 1.10.0
  • pytorch-lightning 1.5.2 => 1.4.8
  • ray 1.9.0
  • python 3.9.7
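
A minimal check (a sketch, assuming the packages above are installed) to confirm the downgrade actually took effect in the environment the job runs in:

# Verify the interpreter sees the downgraded pytorch-lightning.
import pytorch_lightning as pl
import ray

assert pl.__version__.startswith("1.4"), pl.__version__  # expecting 1.4.8
print("pytorch-lightning:", pl.__version__)
print("ray:", ray.__version__)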

Is this an issue with Ray or an issue with Lightning? I’m having the same problem.

I’m running into the same issue, except with Weights & Biases, and the posted solution does not work. Have you guys determined whether this is a PL or a Ray issue?

Hi @richardliaw ,

The hack seems to have fixed it.

Thank you!