avalanche: Tests fail with CUDA 10.1

I am launching tests on a server with 4 V100 GPUs and CUDA 10.1. Tests fail with the following error: RuntimeError: The NVIDIA driver on your system is too old (found version 10010). Please update your GPU driver [bla bla]. Alternatively, go to PyTorch and install a PyTorch version that has been compiled with your version of the CUDA driver. I also recreated the environment from scratch with conda env create -f environment.yml.

The tests complete successfully on CPU.

Any ideas? How can I specify the CUDA toolkit version when installing PyTorch from the environment file?
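A quick way to confirm the mismatch from Python, before touching the environment file; this is a minimal diagnostic sketch using only standard PyTorch calls, nothing Avalanche-specific:

import torch

# The CUDA toolkit this PyTorch build was compiled against ("10.2", "11.0", ...)
# or None for CPU-only builds.
print("PyTorch built for CUDA:", torch.version.cuda)

# False when the installed NVIDIA driver is too old for that toolkit,
# which is exactly the situation the RuntimeError above describes.
print("CUDA usable here:", torch.cuda.is_available())

if torch.cuda.is_available():
    print("Device 0:", torch.cuda.get_device_name(0))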

About this issue

  • State: closed
  • Created 3 years ago
  • Comments: 17 (11 by maintainers)

Most upvoted comments

I agree with @AntonioCarta. I specified the pytorch version in the .yml (e.g. pytorch::pytorch==1.7.1) but the error remains. I think it’s up to the user to specify their CUDA version, something like the sketch below:

# Bash sketch: the user passes the CUDA toolkit version matching their driver.
CUDA_VERSION=$1   # one of: 9.2, 10.1, 10.2, 11.0, cpu
conda env create -f environment.yml
conda activate avalanche-env
if [ "$CUDA_VERSION" = "cpu" ]; then
    conda install pytorch torchvision torchaudio cpuonly -c pytorch
else
    conda install pytorch torchvision torchaudio cudatoolkit="$CUDA_VERSION" -c pytorch
fi
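Saved as a small helper script, this could be invoked as e.g. ./install_env.sh 10.1 (the script name is hypothetical). Note that conda activate inside a non-interactive script usually requires conda init to have been run in that shell, or sourcing conda.sh first.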

Yes, but it depends on the drivers installed on the machine. We cannot control or force the installation of the drivers, so the user should provide the CUDA toolkit version supported by their currently installed driver. As an example, one of our servers has NVIDIA driver version 440.100, which supports a cudatoolkit only up to 10.2. If you install the conda environment and try to run some tests, the same message appears and the computation is done on the CPU.
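A small sketch of how test or example code could make that fallback explicit instead of silent; the get_device helper below is illustrative and not part of Avalanche:

import warnings
import torch

def get_device() -> torch.device:
    # torch.cuda.is_available() returns False when the installed driver is
    # older than the CUDA toolkit PyTorch was compiled against, so we fall
    # back to CPU with an explicit warning rather than failing quietly.
    if torch.cuda.is_available():
        return torch.device("cuda")
    warnings.warn(
        f"CUDA not usable (PyTorch built for CUDA {torch.version.cuda}); running on CPU."
    )
    return torch.device("cpu")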

We probably didn’t notice this bug because we have never recreated our environments from scratch, and the remote tests run on CPU.