DeepSpeed: [BUG] cpu_adam warning
I encountered a warning as training begins:
cpu_adam cuda is missing or is incompatible with installed torch, only cpu ops can be compiled!
Here is my environment:
Ubuntu 18.04 LTS
CUDA 11.8
python=3.9
torch=2.0.0
deepspeed=0.9.2
python3.9-dev
ds_report:
I searched through all the existing issues and found no working solution.
About this issue
- Original URL
- State: closed
- Created a year ago
- Comments: 16 (11 by maintainers)
If you are using conda, we have an environment.yaml here that you can use and has worked for others.
But to debug, I would try the following:
That should tell you more about your CUDA and torch+CUDA installs, so you can then work out whether they are actually missing or installed but not visible to Python.
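The exact commands suggested above did not survive in this copy of the thread, so here is a minimal sketch of that kind of check, not the maintainer's original snippet. It assumes only that `torch` may or may not be importable, and it reads `torch.version.cuda`, `torch.cuda.is_available()` and `torch.utils.cpp_extension.CUDA_HOME`; the helper name `collect_cuda_info` is made up for illustration.

```python
import importlib.util
import os
import shutil


def collect_cuda_info():
    """Gather what Python can see of the CUDA toolkit and of torch's CUDA build.

    Hypothetical helper for illustration; not part of DeepSpeed or PyTorch.
    """
    info = {
        "CUDA_HOME env": os.environ.get("CUDA_HOME"),  # None if unset
        "nvcc on PATH": shutil.which("nvcc"),          # None if nvcc is not found
        "torch version": None,
        "torch built with CUDA": None,
        "torch sees a GPU": None,
        "CUDA_HOME seen by torch": None,
    }

    # Skip the torch checks cleanly if torch is not installed at all.
    if importlib.util.find_spec("torch") is None:
        return info

    import torch

    info["torch version"] = torch.__version__
    # CUDA version torch was compiled against; None for CPU-only builds.
    info["torch built with CUDA"] = torch.version.cuda
    info["torch sees a GPU"] = torch.cuda.is_available()

    try:
        # The toolkit root torch will use when compiling extensions.
        from torch.utils.cpp_extension import CUDA_HOME
        info["CUDA_HOME seen by torch"] = CUDA_HOME
    except ImportError:
        pass  # leave as None if the extension helpers cannot be imported

    return info


if __name__ == "__main__":
    for key, value in collect_cuda_info().items():
        print(f"{key}: {value}")
```

If "torch built with CUDA" is None you have a CPU-only torch build, and if "CUDA_HOME seen by torch" is None then torch could not locate a CUDA toolkit, which would be consistent with the cpu_adam warning.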
OK, after messing around with some other things and reinstalling (there were also some other errors with fusedADAM), I ended up just allocating more CPU memory. It seems that the CPUAdamOps were very memory-hungry. Thanks for your help with the installation!
I tried a few more debugging things but the minimal example I provided above still results in deepspeed killing the subprocess. Let me know if you are able to repro.
@KeeratKG - the error is here:
It seems you’d need to set your CUDA_HOME env var. However, this can sometimes be a symptom of CUDA not being installed correctly. You may want to just re-install CUDA to be sure.
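As a rough sketch of how you might find a value for CUDA_HOME, the snippet below looks at the existing environment first, then at where `nvcc` lives, then at a default install root. The `/usr/local/cuda` default and the helper name `guess_cuda_home` are assumptions for illustration; your toolkit may be installed somewhere else entirely.

```python
import os
import shutil
from pathlib import Path


def guess_cuda_home(env=None, default_root="/usr/local/cuda"):
    """Return a likely CUDA toolkit root, or None if nothing is found.

    Hypothetical helper; default_root is an assumed install location.
    """
    env = os.environ if env is None else env

    # 1. Respect an explicit setting if one already exists.
    for var in ("CUDA_HOME", "CUDA_PATH"):
        value = env.get(var)
        if value:
            return value

    # 2. Otherwise derive the root from nvcc, which lives in <root>/bin/nvcc.
    nvcc = shutil.which("nvcc", path=env.get("PATH"))
    if nvcc:
        return str(Path(nvcc).resolve().parent.parent)

    # 3. Fall back to the assumed default location if it exists.
    if Path(default_root).is_dir():
        return default_root

    return None


if __name__ == "__main__":
    home = guess_cuda_home()
    if home is None:
        print("No CUDA toolkit found; it may not be installed correctly.")
    else:
        # Print a line you can paste into your shell or shell profile.
        print(f"export CUDA_HOME={home}")
```

Set the variable in the shell that launches training, before DeepSpeed tries to build its ops. If the script finds nothing, that points towards the re-install suggested above.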