safe-rlhf: [BUG][Upstream] py310_cu117/fused_adam/fused_adam.so: cannot open shared object file: No such file or directory

Required prerequisites

Questions

Hello, after setting up the environment following the README, I hit the following error while training the SFT model:

│ /opt/conda/envs/safe-rlhf/lib/python3.10/site-packages/torch/utils/cpp_extension.py:1535 in      │
│ _jit_compile                                                                                     │
│                                                                                                  │
│   1532 │   if is_standalone:                                                                     │
│   1533 │   │   return _get_exec_path(name, build_directory)                                      │
│   1534 │                                                                                         │
│ ❱ 1535 │   return _import_module_from_library(name, build_directory, is_python_module)           │
│   1536                                                                                           │
│   1537                                                                                           │
│   1538 def _write_ninja_file_and_compile_objects(                                                │
│                                                                                                  │
│ /opt/conda/envs/safe-rlhf/lib/python3.10/site-packages/torch/utils/cpp_extension.py:1929 in      │
│ _import_module_from_library                                                                      │
│                                                                                                  │
│   1926 │   │   # https://stackoverflow.com/questions/67631/how-to-import-a-module-given-the-ful  │
│   1927 │   │   spec = importlib.util.spec_from_file_location(module_name, filepath)              │
│   1928 │   │   assert spec is not None                                                           │
│ ❱ 1929 │   │   module = importlib.util.module_from_spec(spec)                                    │
│   1930 │   │   assert isinstance(spec.loader, importlib.abc.Loader)                              │
│   1931 │   │   spec.loader.exec_module(module)                                                   │
│   1932 │   │   return module                                                                     │
│ <frozen importlib._bootstrap>:571 in module_from_spec                                            │
│ <frozen importlib._bootstrap_external>:1176 in create_module                                     │
│ <frozen importlib._bootstrap>:241 in _call_with_frames_removed                                   │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
ImportError: /root/.cache/torch_extensions/py310_cu117/fused_adam/fused_adam.so: cannot open shared object file: No such file or directory

How can I fix this?

About this issue

  • Original URL
  • State: closed
  • Created a year ago
  • Comments: 27 (10 by maintainers)

Most upvoted comments

@XuehaiPan Hello, I finally got it running. To summarize, after repeated rebuilds and log analysis, the core problems were the environment variables and the build process:

  1. ~/.bashrc must not contain extra environment variables; in particular, do not initialize conda's default variables there ahead of time. The config only needs the following entries:
export CUDA_HOME="/usr/local/cuda-xx.x"
export PATH="${CUDA_HOME}/bin${PATH:+:"${PATH}"}"
export LD_LIBRARY_PATH="${CUDA_HOME}/lib64:${CUDA_HOME}/extras/CUPTI/lib64${LD_LIBRARY_PATH:+:"${LD_LIBRARY_PATH}"}"
export NCCL_P2P_DISABLE=1

After setting these, be sure to `source` the file.

  2. Always delete the build caches before building:
rm -r ~/.cache/torch
rm -r ~/.cache/torch_extensions

In particular, ~/.cache/torch_extensions must be removed after every failed build.

  3. Run conda env create --file conda-recipe.yaml as described in the README.

  4. Start the program and wait for the build. The first build takes a long time (mine took about 3 hours); if it is interrupted midway, immediately repeat step 2 before building again.

One more note: mpi4py is best pinned to 3.1.3. When running the test script @XuehaiPan provided, I found mpi4py was reporting problems; after reinstalling it via conda install, the program finally ran on my last attempt.
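Steps 1 and 2 and the mpi4py note above can be rolled into a small pre-build sanity check. This is an illustrative sketch only: the helper names are made up, the paths are torch's default cache locations, and the pinned version is the one that happened to work in this thread.

```python
import importlib.metadata
import os
import shutil
from pathlib import Path


def check_cuda_env() -> list:
    """Verify the environment variables from step 1 are in effect."""
    problems = []
    cuda_home = os.environ.get("CUDA_HOME")
    if not cuda_home or not Path(cuda_home).is_dir():
        problems.append("CUDA_HOME is unset or points to a missing directory")
    if shutil.which("nvcc") is None:
        problems.append("nvcc is not on PATH (did you source the rc file?)")
    if cuda_home:
        lib64 = str(Path(cuda_home) / "lib64")
        if lib64 not in os.environ.get("LD_LIBRARY_PATH", "").split(":"):
            problems.append("CUDA_HOME/lib64 missing from LD_LIBRARY_PATH")
    return problems


def clear_torch_caches(home: Path = Path.home()) -> list:
    """Step 2: remove the JIT caches that a failed build leaves behind."""
    removed = []
    for name in (".cache/torch", ".cache/torch_extensions"):
        target = home / name
        if target.exists():
            shutil.rmtree(target, ignore_errors=True)
            removed.append(str(target))
    return removed


def mpi4py_pinned(required: str = "3.1.3") -> bool:
    """Check that mpi4py is installed at the version that worked here."""
    try:
        return importlib.metadata.version("mpi4py") == required
    except importlib.metadata.PackageNotFoundError:
        return False
```

Running the two checks before kicking off a long build catches the common failure modes (unsourced rc file, stale extension cache) in seconds rather than hours.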

Final run log:

Training 1/3 epoch:   0%|          | 0/4878 [00:00<?, ?it/s]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
Training 1/3 epoch (loss 1.4824):   0%|          | 7/4878 [00:53<9:59:11,  7.38s/it] [2023-05-18 13:59:11,317] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 4294967296, reducing to 2147483648
Training 1/3 epoch (loss 1.6123):   0%|          | 15/4878 [01:51<9:46:47,  7.24s/it][2023-05-18 14:00:09,294] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 2147483648, reducing to 1073741824
Training 1/3 epoch (loss 1.6680):   0%|          | 23/4878 [02:49<9:45:02,  7.23s/it][2023-05-18 14:01:07,138] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 1073741824, reducing to 536870912
Training 1/3 epoch (loss 1.6025):   1%|          | 31/4878 [03:47<9:44:48,  7.24s/it][2023-05-18 14:02:05,047] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 536870912, reducing to 268435456
Training 1/3 epoch (loss 1.6426):   1%|          | 39/4878 [04:45<9:44:14,  7.24s/it][2023-05-18 14:03:02,992] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 268435456, reducing to 134217728

Finally, many, many thanks to the project author @XuehaiPan for the daily Q&A and analysis, which helped me finally pin down the problem.

@iamsile The solution I have found so far is simply to clear the cache. You could try:

export NCCL_P2P_DISABLE=1
rm -r ~/.cache/torch
rm -r ~/.cache/torch_extensions
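Before rerunning, it can help to confirm that a stale or incomplete extension cache is really what the import is tripping over. A minimal diagnostic sketch, assuming torch's default cache location (`~/.cache/torch_extensions`); the helper name is made up for illustration:

```python
from pathlib import Path


def list_jit_artifacts(
    cache: Path = Path.home() / ".cache" / "torch_extensions",
) -> list:
    """List the shared objects left behind by previous torch JIT builds.

    A cache directory that exists but contains no .so files usually means
    an earlier build died partway through; that is exactly the state the
    `rm -r` commands above are meant to clear out.
    """
    if not cache.exists():
        return []
    return sorted(cache.rglob("*.so"))
```

If the `fused_adam` build directory exists but `fused_adam.so` is missing from the listing, deleting the cache and rebuilding is the right move.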

Ref:

  • microsoft/DeepSpeed#3416
  • microsoft/DeepSpeed#2176
  • huggingface/transformers#12418