DeepSpeed: [BUG] cpu_adam warning

I encountered a warning as training begins:

cpu_adam cuda is missing or is incompatible with installed torch, only cpu ops can be compiled!

Here is my environment:

Ubuntu 18.04 LTS
CUDA 11.8
python=3.9
torch=2.0.0
deepspeed=0.9.2
python3.9-dev

ds_report: (screenshot in the original issue)

I searched through the existing issues and found no working solution.

About this issue

  • State: closed
  • Created a year ago
  • Comments: 16 (11 by maintainers)

Most upvoted comments

If you are using conda, we have an environment.yaml that you can use; it has worked for others.

But to debug, I would try the following:

nvcc --version
python -c "import torch; print('torch:', torch.__version__, torch)"
python -c "import torch; print('CUDA available:', torch.cuda.is_available())"

That should tell you more about your CUDA and torch+CUDA installs, so you can work out whether CUDA is missing, installed but not visible to Python, or installed at a version that does not match the one torch was built with.
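The three commands above can be rolled into one script. This is only a rough sketch: the helper names (parse_nvcc_release, same_major, report) are made up for illustration, and comparing major versions is a simplification, not DeepSpeed's actual compatibility check.

```python
import re
import shutil
import subprocess


def parse_nvcc_release(nvcc_output):
    """Return the 'major.minor' release from `nvcc --version` output, or None."""
    match = re.search(r"release (\d+)\.(\d+)", nvcc_output)
    return f"{match.group(1)}.{match.group(2)}" if match else None


def same_major(system_cuda, torch_cuda):
    """True only when both versions are known and share a major version."""
    if not system_cuda or not torch_cuda:
        return False
    return system_cuda.split(".")[0] == torch_cuda.split(".")[0]


def report():
    # Locate nvcc on PATH; a missing nvcc usually means the toolkit is not visible.
    nvcc = shutil.which("nvcc")
    system_cuda = None
    if nvcc:
        result = subprocess.run([nvcc, "--version"], capture_output=True, text=True)
        system_cuda = parse_nvcc_release(result.stdout)

    try:
        import torch
        torch_cuda = torch.version.cuda  # None for CPU-only torch builds
    except ImportError:
        torch_cuda = None

    print("nvcc CUDA:", system_cuda)
    print("torch CUDA:", torch_cuda)
    print("major versions match:", same_major(system_cuda, torch_cuda))


if __name__ == "__main__":
    report()
```

If "nvcc CUDA" prints None, the toolkit is probably not on PATH; if "torch CUDA" prints None, torch is either not installed or is a CPU-only build.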

OK, after messing around with some other things and reinstalling (there were some other errors with FusedAdam), I ended up just allocating more CPU memory. It seems that the CPU Adam ops are very memory hungry. Thanks for your help with the installation!

Hi @kiddyboots216 - the warning is thrown from the op_builder. When you install DeepSpeed, can you try running DS_BUILD_CPU_ADAM=1 pip install deepspeed, so the op is pre-compiled at install time and we can debug the error that way?

I tried a few more debugging steps, but the minimal example I provided above still results in DeepSpeed killing the subprocess. Let me know if you are able to reproduce it.

@KeeratKG - the error is here:

AssertionError: CUDA_HOME does not exist, unable to compile CUDA op(s)

It seems you need to set the CUDA_HOME environment variable. However, this error can also be a symptom of CUDA not being installed correctly, so you may want to reinstall CUDA to be sure.
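Before reinstalling, it may be worth checking what CUDA_HOME actually points to. The sketch below is a hypothetical helper (cuda_home_status is not part of DeepSpeed or torch) and assumes the usual Linux toolkit layout, where nvcc lives at $CUDA_HOME/bin/nvcc.

```python
import os


def cuda_home_status(env=None):
    """Describe whether CUDA_HOME is set and looks like a CUDA toolkit directory.

    Assumption: a usable toolkit has an nvcc binary under <CUDA_HOME>/bin (Linux layout).
    """
    env = os.environ if env is None else env
    cuda_home = env.get("CUDA_HOME")
    if not cuda_home:
        return "CUDA_HOME is not set"
    if not os.path.isdir(cuda_home):
        return f"CUDA_HOME points to a missing directory: {cuda_home}"
    if not os.path.isfile(os.path.join(cuda_home, "bin", "nvcc")):
        return f"no bin/nvcc under CUDA_HOME: {cuda_home}"
    return f"CUDA_HOME looks usable: {cuda_home}"


if __name__ == "__main__":
    print(cuda_home_status())
```

If the variable is unset but the toolkit is installed, exporting CUDA_HOME to the toolkit directory (often /usr/local/cuda, though that path is an assumption about your system) before running pip install may be enough.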