safe-rlhf: [BUG][Upstream] py310_cu117/fused_adam/fused_adam.so: cannot open shared object file: No such file or directory
Required prerequisites
- I have read the documentation https://safe-rlhf.readthedocs.io.
- I have searched the Issue Tracker and Discussions that this hasn’t already been reported. (+1 or comment there if it has.)
- Consider asking first in a Discussion.
Questions
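Hello, after setting up the environment by following the tutorial in the README, I get an error when training the SFT model. The details are as follows: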
│ /opt/conda/envs/safe-rlhf/lib/python3.10/site-packages/torch/utils/cpp_extension.py:1535 in │
│ _jit_compile │
│ │
│ 1532 │ if is_standalone: │
│ 1533 │ │ return _get_exec_path(name, build_directory) │
│ 1534 │ │
│ ❱ 1535 │ return _import_module_from_library(name, build_directory, is_python_module) │
│ 1536 │
│ 1537 │
│ 1538 def _write_ninja_file_and_compile_objects( │
│ │
│ /opt/conda/envs/safe-rlhf/lib/python3.10/site-packages/torch/utils/cpp_extension.py:1929 in │
│ _import_module_from_library │
│ │
│ 1926 │ │ # https://stackoverflow.com/questions/67631/how-to-import-a-module-given-the-ful │
│ 1927 │ │ spec = importlib.util.spec_from_file_location(module_name, filepath) │
│ 1928 │ │ assert spec is not None │
│ ❱ 1929 │ │ module = importlib.util.module_from_spec(spec) │
│ 1930 │ │ assert isinstance(spec.loader, importlib.abc.Loader) │
│ 1931 │ │ spec.loader.exec_module(module) │
│ 1932 │ │ return module │
│ <frozen importlib._bootstrap>:571 in module_from_spec │
│ <frozen importlib._bootstrap_external>:1176 in create_module │
│ <frozen importlib._bootstrap>:241 in _call_with_frames_removed │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
ImportError: /root/.cache/torch_extensions/py310_cu117/fused_adam/fused_adam.so: cannot open shared object file: No such file or directory
How can I fix this?
About this issue
- State: closed
- Created a year ago
- Comments: 27 (10 by maintainers)
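@XuehaiPan Hello, I finally got it running. Here is a rough summary: after repeated rebuilds and log analysis, the core problems turned out to be in the environment variables and the build:
1. ~/.cache/.baserc must not contain any extra environment variables. In particular, it must not initialize conda's default variables ahead of time. Only the few required environment variables need to be added to the config. After setting them, be sure to source the file.
2. In particular, delete ~/.cache/torch_extensions. It has to be deleted every time a build fails.
3. Run the command from the README: conda env create --file conda-recipe.yaml
4. Start the program and wait for the build. The first build takes a very long time (my first build took 3 hours). If it is interrupted partway through, immediately repeat step 2 and then build again.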
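The cache cleanup in step 2 can be sketched in Python as follows. This is only an illustration, not a command from the thread: the helper names default_cache_dir and clear_torch_extensions_cache are hypothetical, and the sketch assumes the cache sits at ~/.cache/torch_extensions unless the TORCH_EXTENSIONS_DIR environment variable points somewhere else.

```python
import os
import shutil
from pathlib import Path


def default_cache_dir() -> Path:
    """Guess where PyTorch keeps JIT-built extensions (hypothetical helper)."""
    # Assumption: TORCH_EXTENSIONS_DIR, when set, overrides the default location.
    override = os.environ.get("TORCH_EXTENSIONS_DIR")
    if override:
        return Path(override)
    # Assumption: the default matches the path shown in the ImportError above.
    return Path.home() / ".cache" / "torch_extensions"


def clear_torch_extensions_cache(cache_dir: Path) -> bool:
    """Delete cache_dir recursively; return True if something was removed."""
    if not cache_dir.is_dir():
        return False
    # Removes half-built artifacts so the next run rebuilds fused_adam from scratch.
    shutil.rmtree(cache_dir)
    return True


# Example call (commented out so that importing this file deletes nothing):
# clear_torch_extensions_cache(default_cache_dir())
```

Running the call after every failed or interrupted build mirrors the advice above. The next launch then recompiles the extension, so expect the long first-build wait again.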
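One additional note: mpi4py 3.1.3 is the best choice. When I ran the test script provided by @XuehaiPan, mpi4py reported a problem. After I installed it with conda install, the program ran successfully on my final attempt.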
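To confirm which mpi4py version actually ended up in the environment, a small check like the one below may help. It is a sketch using only the standard library. The helper names are hypothetical, and the 3.1.3 default simply repeats the recommendation from this comment rather than any official requirement.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version string of package, or None if it is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


def matches_recommended(found: Optional[str], recommended: str = "3.1.3") -> bool:
    """Compare against the version recommended in this thread (an assumption)."""
    return found == recommended


if __name__ == "__main__":
    found = installed_version("mpi4py")
    print(f"mpi4py installed: {found}, matches 3.1.3: {matches_recommended(found)}")
```

Run it inside the activated safe-rlhf conda environment so that it inspects the same site-packages directory the training job uses.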
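Log of the final successful run:
Finally, many, many thanks to the project author @XuehaiPan for answering questions and analyzing the problem every day, which helped me finally pin down the issue.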
@iamsile The solution I have found so far is that clearing the cache is enough. Perhaps you could try:
Ref: