ray: Running Ray with PyTorch Lightning in a SLURM job causes failure with error "ValueError: signal only works in main thread"

Hi,

I am a newbie trying to integrate Ray with PyTorch Lightning. I followed the instructions at https://docs.ray.io/en/master/tune/tutorials/tune-pytorch-lightning.html when setting up hyperparameter tuning with Ray. However, I encountered two issues while using Ray.

ISSUE 1

Importing the PyTorch Lightning integration from Ray Tune throws an error with ray 0.8.7.

Code: from ray.tune.integration.pytorch_lightning import TuneReportCallback, TuneReportCheckpointCallback

Error: ModuleNotFoundError: No module named 'ray.tune.integration.pytorch_lightning'

Module versions: ray 0.8.7, tensorflow 2.1.0, python 3.7.4

ISSUE 1: FIX

I fixed this by installing Ray from a nightly wheel (ray 0.9.0.dev0).
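
As a quick sanity check after installing the wheel (a minimal sketch; the import below is the same one that failed above), the integration module should now resolve:

# The ray.tune <-> PyTorch Lightning integration is absent in ray 0.8.7
# and present in the 0.9.0.dev0 nightly, so this import succeeding
# confirms the fix.
from ray.tune.integration.pytorch_lightning import (
    TuneReportCallback,
    TuneReportCheckpointCallback,
)
print("integration import OK")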

ISSUE 2

With the new Ray version, when I submit a SLURM job to run the tuning, I get the following error:

ray.tune.error.TuneError: Trial raised an exception. Traceback:
ray::ImplicitFunc.train() (pid=4432, ip=172.26.92.190)
  File "/home/user/.local/lib/python3.7/site-packages/ray/tune/function_runner.py", line 227, in run
    self._entrypoint()
  File "/home/user/.local/lib/python3.7/site-packages/ray/tune/function_runner.py", line 290, in entrypoint
    self._status_reporter.get_checkpoint())
  File "/home/user/.local/lib/python3.7/site-packages/ray/tune/function_runner.py", line 497, in _trainable_func
    output = train_func(config)
  File "tune.py", line 261, in train_run
    trainer.fit(model)
  File "/home/user/.local/lib/python3.7/site-packages/pytorch_lightning/trainer/states.py", line 48, in wrapped_fn
    result = fn(self, *args, **kwargs)
  File "/home/user/.local/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1073, in fit
    results = self.accelerator_backend.train(model)
  File "/home/user/.local/lib/python3.7/site-packages/pytorch_lightning/accelerators/gpu_backend.py", line 51, in train
    results = self.trainer.run_pretrain_routine(model)
  File "/home/user/.local/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1184, in run_pretrain_routine
    self.register_slurm_signal_handlers()
  File "/home/user/.local/lib/python3.7/site-packages/pytorch_lightning/trainer/training_io.py", line 240, in register_slurm_signal_handlers
    signal.signal(signal.SIGUSR1, self.sig_handler)
  File "/usr/local/easybuild-2019/easybuild/software/mpi/gcc/8.3.0/openmpi/3.1.4/python/3.7.4/lib/python3.7/signal.py", line 47, in signal
    handler = _signal.signal(_enum_to_int(signalnum), _enum_to_int(handler))
ValueError: signal only works in main thread

Can I get some advice on how to proceed after this?

About this issue

  • State: closed
  • Created 4 years ago
  • Comments: 15 (6 by maintainers)

Most upvoted comments

Can you try this hack? Add

os.environ["SLURM_JOB_NAME"] = "bash"

to your Python script?
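
For context, here is a minimal, self-contained sketch of where the hack can go, loosely following the Tune + PyTorch Lightning tutorial (MyLightningModule, train_run, and the toy data are placeholders, not from the original script). The key point is that the variable must be set before trainer.fit() runs in the Tune worker: a job name of "bash" makes Lightning treat the run as an interactive session and skip register_slurm_signal_handlers(), which is the signal.signal() call that fails off the main thread.

import os

import torch
import pytorch_lightning as pl
from torch.utils.data import DataLoader, TensorDataset
from ray import tune
from ray.tune.integration.pytorch_lightning import TuneReportCallback


class MyLightningModule(pl.LightningModule):
    # Tiny placeholder model so the sketch runs end to end.
    def __init__(self, config):
        super().__init__()
        self.layer = torch.nn.Linear(8, 1)
        self.lr = config["lr"]

    def training_step(self, batch, batch_idx):
        x, y = batch
        return torch.nn.functional.mse_loss(self.layer(x), y)

    def validation_step(self, batch, batch_idx):
        x, y = batch
        self.log("val_loss", torch.nn.functional.mse_loss(self.layer(x), y))

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=self.lr)

    def train_dataloader(self):
        return DataLoader(TensorDataset(torch.randn(64, 8), torch.randn(64, 1)), batch_size=16)

    def val_dataloader(self):
        return DataLoader(TensorDataset(torch.randn(16, 8), torch.randn(16, 1)), batch_size=16)


def train_run(config):
    # Set this inside the trainable so it is guaranteed to run in the Tune
    # worker before trainer.fit(); Lightning then skips
    # register_slurm_signal_handlers() and never calls signal.signal()
    # from this non-main thread.
    os.environ["SLURM_JOB_NAME"] = "bash"
    model = MyLightningModule(config)
    trainer = pl.Trainer(
        max_epochs=2,
        callbacks=[TuneReportCallback({"loss": "val_loss"}, on="validation_end")],
    )
    trainer.fit(model)


if __name__ == "__main__":
    tune.run(train_run, config={"lr": tune.loguniform(1e-4, 1e-1)}, num_samples=4)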

I also have this problem, downgrading PTL to 1.4.8 seems to solve it. Pretty sure something in 1.5.0 broke it, but I don’t know if 1.5.1 fixes it.

Same issue. PTL 1.4.8 works, 1.5.1 and 1.5.2 do not.

Thank you sooooo much.

I also encountered the same issue, ValueError: signal only works in main thread of the main interpreter, while following the tutorial Using PyTorch Lightning with Tune.

The problem was finally solved by downgrading PTL from 1.5.2 to 1.4.8; see the version-check sketch after the list below.

Package manager:

  • conda 4.10.1

Module versions and the change:

  • pytorch 1.10.0
  • pytorch-lightning 1.5.2 => 1.4.8
  • ray 1.9.0
  • python 3.9.7
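
A minimal check (a sketch, assuming the packages above are installed) to confirm the downgrade actually took effect in the environment the job runs in:

# Verify the interpreter sees the downgraded pytorch-lightning.
import pytorch_lightning as pl
import ray

assert pl.__version__.startswith("1.4"), pl.__version__  # expecting 1.4.8
print("pytorch-lightning:", pl.__version__)
print("ray:", ray.__version__)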

Is this an issue with Ray or an issue with Lightning? I’m having the same problem.

I’m running into the same issue, except with Weights & Biases, and the posted solution does not work. Have you guys determined whether this is a PL or a Ray issue?

Hi @richardliaw ,

The hack seems to have fixed it.

Thank you!