OpenNMT-py: Option -gpuid not working as it should

When I use the -gpuid option in the current master of OpenNMT-py, the training script always uses GPU 0, regardless of which GPU I choose. If I use multiple GPUs, for example -gpuid 2 3, the training goes to GPUs 0 and 1 instead.

Is this a known issue?
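A minimal way to confirm which physical GPU a given index actually maps to on a machine (a hypothetical standalone diagnostic, not OpenNMT-py code; the index 2 just mirrors the -gpuid 2 example above):

```python
import torch

# Hypothetical diagnostic, not part of OpenNMT-py: allocate a tensor on the
# index passed to -gpuid and watch nvidia-smi to see which physical GPU
# actually receives the memory.
requested = 2  # same index as in "-gpuid 2"
print("visible GPUs:", torch.cuda.device_count())
x = torch.zeros(1024, 1024, device=torch.device("cuda", requested))
input("check nvidia-smi now, then press Enter to exit")
```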

About this issue

  • State: closed
  • Created 6 years ago
  • Comments: 27 (20 by maintainers)

Most upvoted comments

For anyone who’s having this problem: if you want to train on GPUs other than GPU 0, you can change the visible devices with CUDA_VISIBLE_DEVICES. For example, export CUDA_VISIBLE_DEVICES=1,2 will make your experiments run on GPU 1 and then GPU 2 with the current master.
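The same workaround from inside Python, as a minimal sketch (assuming the variable is set before CUDA is initialized in the process; device names and sizes are illustrative):

```python
import os

# Must be set before CUDA is initialized in this process (i.e. before the
# first torch.cuda call), otherwise it has no effect.
os.environ["CUDA_VISIBLE_DEVICES"] = "1,2"

import torch

# Physical GPUs 1 and 2 are now remapped to cuda:0 and cuda:1 inside this
# process, so code that hard-codes device 0 actually runs on physical GPU 1.
print(torch.cuda.device_count())        # -> 2
x = torch.zeros(10, device="cuda:0")    # lands on physical GPU 1
```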

The process never appears in nvidia-smi. The exception happens just after starting OpenNMT-py, so I ran OpenNMT-py several times (in a loop) while watching nvidia-smi, and it never appeared in the output.

However, I was mistaken: the problem is not 100% reproducible. On two different nodes I can reproduce it systematically, but on a third node I cannot, so there are other factors at play. I am running on a Slurm cluster; in theory that should not affect anything, but I am no Slurm expert, so I am not totally sure.
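One thing worth checking on the Slurm nodes is whether the scheduler itself already exports CUDA_VISIBLE_DEVICES for the job, since that changes which physical GPUs the indices refer to. A small check along these lines (my own sketch, not from the thread):

```python
import os
import torch

# Print what the job actually sees. With GPU allocation (e.g. --gres=gpu),
# Slurm commonly sets CUDA_VISIBLE_DEVICES itself, which remaps the
# device indices inside the job.
print("CUDA_VISIBLE_DEVICES =", os.environ.get("CUDA_VISIBLE_DEVICES"))
print("visible GPU count    =", torch.cuda.device_count())
for i in range(torch.cuda.device_count()):
    print(i, torch.cuda.get_device_name(i))
```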

@vince62s, I’m pretty sure this was not the previous behavior of OpenNMT-py, as I used that option all the time to train on GPUs other than 0 without changing CUDA_VISIBLE_DEVICES.
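For reference, the usual way a PyTorch training script honors a GPU-index option looks roughly like the sketch below. This is a generic illustration of the technique, not the actual OpenNMT-py code, so treat the names (set_training_device, gpuid) as placeholders:

```python
import torch

def set_training_device(gpuid):
    """Select the GPU index given on the command line (generic sketch, not OpenNMT-py)."""
    device = torch.device("cuda", gpuid)
    # Make bare .cuda() calls default to the requested GPU too; without this,
    # any tensor created with a plain .cuda() falls back to GPU 0, which is
    # exactly the symptom described in this issue.
    torch.cuda.set_device(device)
    return device

device = set_training_device(2)   # e.g. -gpuid 2
x = torch.zeros(4).cuda()         # now lands on cuda:2, not cuda:0
```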