OpenNMT-py: Option -gpuid not working as it should
When I use the `-gpuid` option in the current master of OpenNMT-py, the training script always uses GPU 0, regardless of which GPU I choose.
If I use multiple GPUs, for example `-gpuid 2 3`, training goes to GPUs 0 and 1.
Is this a known issue?
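For reference, a quick way to check where allocations actually land is the following minimal diagnostic sketch (this is not OpenNMT-py code; the GPU index 2 is just an example):

```python
# Minimal diagnostic sketch (not OpenNMT-py code): compare which device a tensor
# lands on with what PyTorch considers the "current" device.
import torch

requested = 2                       # example index, as with `-gpuid 2`
x = torch.zeros(1).cuda(requested)  # explicitly place a tensor on GPU 2
print(torch.cuda.current_device())  # stays 0 unless torch.cuda.set_device() was called
print(x.get_device())               # 2, the tensor itself is on the requested GPU
```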
About this issue
- Original URL
- State: closed
- Created 6 years ago
- Comments: 27 (20 by maintainers)
For anyone who's having this problem: if you want to train on GPUs other than GPU 0, you can change the visible devices with `CUDA_VISIBLE_DEVICES`. For example, doing `export CUDA_VISIBLE_DEVICES=1,2` will make your experiments go to GPU 1 and then GPU 2 in the current master.
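As a sketch of that workaround from inside Python (an assumption about how one might launch it, not the poster's exact setup), the variable can also be set in the launching script before anything initializes CUDA; the visible devices are then renumbered starting from 0:

```python
# Sketch of the CUDA_VISIBLE_DEVICES workaround applied from Python.
# Must run before CUDA is initialized (i.e., before the first .cuda() call).
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "1,2"  # expose physical GPUs 1 and 2 only

import torch
# With the mask above, devices are renumbered: cuda:0 -> physical GPU 1,
# cuda:1 -> physical GPU 2. So `-gpuid 0 1` would now land on physical GPUs 1 and 2.
print(torch.cuda.device_count())  # 2
```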
The process never appears in `nvidia-smi`. The exception happens just after starting OpenNMT-py, so I ran OpenNMT-py several times in a loop while watching `nvidia-smi`, and OpenNMT-py never appeared in the output. However, I was mistaken: the problem is not 100% reproducible. On two different nodes I can reproduce it systematically, but on a third node I cannot, so there are other factors at play here. I am running on a Slurm cluster; in theory that should not affect how things work, but I am no expert in Slurm, so I'm not totally sure.
@vince62s, I'm pretty sure this was not the behavior of OpenNMT-py before, as I used that option all the time to train on GPUs other than 0 without changing `CUDA_VISIBLE_DEVICES`.
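For context, here is a hedged sketch of how a `-gpuid` option is typically honored in PyTorch training code (illustrative only, not the actual OpenNMT-py implementation): the first listed device is made the current one, so CUDA context and any bare `.cuda()` allocations follow it instead of defaulting to GPU 0.

```python
# Illustrative sketch only; not the actual OpenNMT-py code path.
import torch

gpuid = [2, 3]  # as passed via `-gpuid 2 3`

if gpuid:
    # Make the first requested GPU the current device; without this,
    # anything created with a bare .cuda() call defaults to GPU 0.
    torch.cuda.set_device(gpuid[0])

model = torch.nn.Linear(4, 4).cuda()          # now lands on GPU 2, not GPU 0
print(next(model.parameters()).get_device())  # 2
```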