safe-rlhf: [BUG][Upstream] py310_cu117/fused_adam/fused_adam.so: cannot open shared object file: No such file or directory

Required prerequisites

Questions

Hello, after setting up the environment following the README, I hit the following error while training the SFT model:

│ /opt/conda/envs/safe-rlhf/lib/python3.10/site-packages/torch/utils/cpp_extension.py:1535 in      │
│ _jit_compile                                                                                     │
│                                                                                                  │
│   1532 │   if is_standalone:                                                                     │
│   1533 │   │   return _get_exec_path(name, build_directory)                                      │
│   1534 │                                                                                         │
│ ❱ 1535 │   return _import_module_from_library(name, build_directory, is_python_module)           │
│   1536                                                                                           │
│   1537                                                                                           │
│   1538 def _write_ninja_file_and_compile_objects(                                                │
│                                                                                                  │
│ /opt/conda/envs/safe-rlhf/lib/python3.10/site-packages/torch/utils/cpp_extension.py:1929 in      │
│ _import_module_from_library                                                                      │
│                                                                                                  │
│   1926 │   │   # https://stackoverflow.com/questions/67631/how-to-import-a-module-given-the-ful  │
│   1927 │   │   spec = importlib.util.spec_from_file_location(module_name, filepath)              │
│   1928 │   │   assert spec is not None                                                           │
│ ❱ 1929 │   │   module = importlib.util.module_from_spec(spec)                                    │
│   1930 │   │   assert isinstance(spec.loader, importlib.abc.Loader)                              │
│   1931 │   │   spec.loader.exec_module(module)                                                   │
│   1932 │   │   return module                                                                     │
│ <frozen importlib._bootstrap>:571 in module_from_spec                                            │
│ <frozen importlib._bootstrap_external>:1176 in create_module                                     │
│ <frozen importlib._bootstrap>:241 in _call_with_frames_removed                                   │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
ImportError: /root/.cache/torch_extensions/py310_cu117/fused_adam/fused_adam.so: cannot open shared object file: No such file or directory

How can I fix this?

About this issue

  • Original URL
  • State: closed
  • Created a year ago
  • Comments: 27 (10 by maintainers)

Most upvoted comments

@XuehaiPan Hello, I finally got it running. To summarize, after repeated rebuilds and log analysis, the core problems were the environment variables and the build process:

  1. ~/.bashrc must not contain extra environment variables; in particular, do not initialize conda's default variables there ahead of time. The config only needs the following entries:
export CUDA_HOME="/usr/local/cuda-xx.x"
export PATH="${CUDA_HOME}/bin${PATH:+:"${PATH}"}"
export LD_LIBRARY_PATH="${CUDA_HOME}/lib64:${CUDA_HOME}/extras/CUPTI/lib64${LD_LIBRARY_PATH:+:"${LD_LIBRARY_PATH}"}"
export NCCL_P2P_DISABLE=1

After setting these, be sure to `source` the file.

  2. Always delete the build caches before building:
rm -r ~/.cache/torch
rm -r ~/.cache/torch_extensions

In particular, ~/.cache/torch_extensions must be removed after every failed build.

  3. Run conda env create --file conda-recipe.yaml as described in the README.

  4. Start the program and wait for the build. The first build takes a long time (mine took about 3 hours); if it is interrupted midway, immediately repeat step 2 before building again.

One more note: mpi4py is best pinned to 3.1.3. When running the test script @XuehaiPan provided, I found mpi4py was reporting problems; after reinstalling it via conda install, the program finally ran on my last attempt.
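Steps 1 and 2 and the mpi4py note above can be rolled into a small pre-build sanity check. This is an illustrative sketch only: the helper names are made up, the paths are torch's default cache locations, and the pinned version is the one that happened to work in this thread.

```python
import importlib.metadata
import os
import shutil
from pathlib import Path


def check_cuda_env() -> list:
    """Verify the environment variables from step 1 are in effect."""
    problems = []
    cuda_home = os.environ.get("CUDA_HOME")
    if not cuda_home or not Path(cuda_home).is_dir():
        problems.append("CUDA_HOME is unset or points to a missing directory")
    if shutil.which("nvcc") is None:
        problems.append("nvcc is not on PATH (did you source the rc file?)")
    if cuda_home:
        lib64 = str(Path(cuda_home) / "lib64")
        if lib64 not in os.environ.get("LD_LIBRARY_PATH", "").split(":"):
            problems.append("CUDA_HOME/lib64 missing from LD_LIBRARY_PATH")
    return problems


def clear_torch_caches(home: Path = Path.home()) -> list:
    """Step 2: remove the JIT caches that a failed build leaves behind."""
    removed = []
    for name in (".cache/torch", ".cache/torch_extensions"):
        target = home / name
        if target.exists():
            shutil.rmtree(target, ignore_errors=True)
            removed.append(str(target))
    return removed


def mpi4py_pinned(required: str = "3.1.3") -> bool:
    """Check that mpi4py is installed at the version that worked here."""
    try:
        return importlib.metadata.version("mpi4py") == required
    except importlib.metadata.PackageNotFoundError:
        return False
```

Running the two checks before kicking off a long build catches the common failure modes (unsourced rc file, stale extension cache) in seconds rather than hours.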

Final run log:

Training 1/3 epoch:   0%|          | 0/4878 [00:00<?, ?it/s]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
Training 1/3 epoch (loss 1.4824):   0%|          | 7/4878 [00:53<9:59:11,  7.38s/it] [2023-05-18 13:59:11,317] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 4294967296, reducing to 2147483648
Training 1/3 epoch (loss 1.6123):   0%|          | 15/4878 [01:51<9:46:47,  7.24s/it][2023-05-18 14:00:09,294] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 2147483648, reducing to 1073741824
Training 1/3 epoch (loss 1.6680):   0%|          | 23/4878 [02:49<9:45:02,  7.23s/it][2023-05-18 14:01:07,138] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 1073741824, reducing to 536870912
Training 1/3 epoch (loss 1.6025):   1%|          | 31/4878 [03:47<9:44:48,  7.24s/it][2023-05-18 14:02:05,047] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 536870912, reducing to 268435456
Training 1/3 epoch (loss 1.6426):   1%|          | 39/4878 [04:45<9:44:14,  7.24s/it][2023-05-18 14:03:02,992] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 268435456, reducing to 134217728

Finally, many, many thanks to the project author @XuehaiPan for the daily Q&A and analysis, which helped me finally pin down the problem.

@iamsile The solution I have found so far is simply to clear the cache. You could try:

export NCCL_P2P_DISABLE=1
rm -r ~/.cache/torch
rm -r ~/.cache/torch_extensions
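Before rerunning, it can help to confirm that a stale or incomplete extension cache is really what the import is tripping over. A minimal diagnostic sketch, assuming torch's default cache location (`~/.cache/torch_extensions`); the helper name is made up for illustration:

```python
from pathlib import Path


def list_jit_artifacts(
    cache: Path = Path.home() / ".cache" / "torch_extensions",
) -> list:
    """List the shared objects left behind by previous torch JIT builds.

    A cache directory that exists but contains no .so files usually means
    an earlier build died partway through; that is exactly the state the
    `rm -r` commands above are meant to clear out.
    """
    if not cache.exists():
        return []
    return sorted(cache.rglob("*.so"))
```

If the `fused_adam` build directory exists but `fused_adam.so` is missing from the listing, deleting the cache and rebuilding is the right move.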

Ref:

  • microsoft/DeepSpeed#3416
  • microsoft/DeepSpeed#2176
  • huggingface/transformers#12418