infinigen: OSError: No such file or directory when using CUDA terrain with Slurm jobs

Describe the bug

In the rendering stage (task = render), with CUDA terrain enabled and the task executed through Slurm jobs, I encountered the following error:

  File "/viscam/projects/concepts/engine/engine/third_party/infinigen/infinigen/core/execute_tasks.py", line 418, in main
    execute_tasks(
  File "/viscam/u/yzzhang/miniconda3/envs/infinigen/lib/python3.10/site-packages/gin/config.py", line 1605, in gin_wrapper
    utils.augment_exception_message_and_reraise(e, err_str)
  File "/viscam/u/yzzhang/miniconda3/envs/infinigen/lib/python3.10/site-packages/gin/utils.py", line 41, in augment_exception_message_and_reraise
    raise proxy.with_traceback(exception.__traceback__) from None
  File "/viscam/u/yzzhang/miniconda3/envs/infinigen/lib/python3.10/site-packages/gin/config.py", line 1582, in gin_wrapper
    return fn(*new_args, **new_kwargs)
  File "/viscam/projects/concepts/engine/engine/third_party/infinigen/infinigen/core/execute_tasks.py", line 340, in execute_tasks
    terrain = Terrain(scene_seed, surface.registry, task=task, on_the_fly_asset_folder=output_folder/"assets")
  File "/viscam/u/yzzhang/miniconda3/envs/infinigen/lib/python3.10/site-packages/gin/config.py", line 1605, in gin_wrapper
    utils.augment_exception_message_and_reraise(e, err_str)
  File "/viscam/u/yzzhang/miniconda3/envs/infinigen/lib/python3.10/site-packages/gin/utils.py", line 41, in augment_exception_message_and_reraise
    raise proxy.with_traceback(exception.__traceback__) from None
  File "/viscam/u/yzzhang/miniconda3/envs/infinigen/lib/python3.10/site-packages/gin/config.py", line 1582, in gin_wrapper
    return fn(*new_args, **new_kwargs)
  File "/viscam/projects/concepts/engine/engine/third_party/infinigen/infinigen/terrain/core.py", line 120, in __init__
    self.elements, scene_infos = scene(seed, Path(on_the_fly_asset_folder), asset_path, device)
  File "/viscam/u/yzzhang/miniconda3/envs/infinigen/lib/python3.10/site-packages/gin/config.py", line 1605, in gin_wrapper
    utils.augment_exception_message_and_reraise(e, err_str)
  File "/viscam/u/yzzhang/miniconda3/envs/infinigen/lib/python3.10/site-packages/gin/utils.py", line 41, in augment_exception_message_and_reraise
    raise proxy.with_traceback(exception.__traceback__) from None
  File "/viscam/u/yzzhang/miniconda3/envs/infinigen/lib/python3.10/site-packages/gin/config.py", line 1582, in gin_wrapper
    return fn(*new_args, **new_kwargs)
  File "/viscam/projects/concepts/engine/engine/third_party/infinigen/infinigen/terrain/scene.py", line 56, in scene
    elements[ElementNames.LandTiles] = LandTiles(device, caves, on_the_fly_asset_folder, reused_asset_folder)
  File "/viscam/u/yzzhang/miniconda3/envs/infinigen/lib/python3.10/site-packages/gin/config.py", line 1605, in gin_wrapper
    utils.augment_exception_message_and_reraise(e, err_str)
  File "/viscam/u/yzzhang/miniconda3/envs/infinigen/lib/python3.10/site-packages/gin/utils.py", line 41, in augment_exception_message_and_reraise
    raise proxy.with_traceback(exception.__traceback__) from None
  File "/viscam/u/yzzhang/miniconda3/envs/infinigen/lib/python3.10/site-packages/gin/config.py", line 1582, in gin_wrapper
    return fn(*new_args, **new_kwargs)
  File "/viscam/projects/concepts/engine/engine/third_party/infinigen/infinigen/terrain/elements/landtiles.py", line 115, in __init__
    Element.__init__(self, "landtiles", material, transparency)
  File "/viscam/u/yzzhang/miniconda3/envs/infinigen/lib/python3.10/site-packages/gin/config.py", line 1605, in gin_wrapper
    utils.augment_exception_message_and_reraise(e, err_str)
  File "/viscam/u/yzzhang/miniconda3/envs/infinigen/lib/python3.10/site-packages/gin/utils.py", line 41, in augment_exception_message_and_reraise
    raise proxy.with_traceback(exception.__traceback__) from None
  File "/viscam/u/yzzhang/miniconda3/envs/infinigen/lib/python3.10/site-packages/gin/config.py", line 1582, in gin_wrapper
    return fn(*new_args, **new_kwargs)
  File "/viscam/projects/concepts/engine/engine/third_party/infinigen/infinigen/terrain/elements/core.py", line 28, in __init__
    dll = load_cdll(f"terrain/lib/{self.device}/elements/{lib_name_X}.so")
  File "/viscam/projects/concepts/engine/engine/third_party/infinigen/infinigen/terrain/utils/ctype_util.py", line 29, in load_cdll
    return CDLL(root/path, mode=RTLD_LOCAL)
  File "/viscam/u/yzzhang/miniconda3/envs/infinigen/lib/python3.10/ctypes/__init__.py", line 374, in __init__
    self._handle = _dlopen(self._name, mode)
OSError: /viscam/projects/concepts/engine/engine/third_party/infinigen/infinigen/terrain/lib/cuda/elements/landtiles_2.so: cannot open shared object file: No such file or directory
  In call to configurable 'Element' (<class 'infinigen.terrain.elements.core.Element'>)
  In call to configurable 'LandTiles' (<class 'infinigen.terrain.elements.landtiles.LandTiles'>)
  In call to configurable 'scene' (<function scene at 0x7faeac694f70>)
  In call to configurable 'Terrain' (<class 'infinigen.terrain.core.Terrain'>)
  In call to configurable 'execute_tasks' (<function execute_tasks at 0x7faea1d5bd90>)
keep_placeholder=True placeholder.name='BushFactory(543568399).spawn_placeholder(514)' list(placeholder.children)=[bpy.data.objects['BushFactory(543568399).spawn_asset(514)']] obj.name='BushFactory(543568399).spawn_asset(514)' list(obj.children)=[bpy.data.objects['Tree.656']]
ground already loaded, loading ground_1 instead
landtiles already loaded, loading landtiles_2 instead

The same script runs successfully in a Slurm interactive session.
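
One way to compare the batch and interactive environments is to list which compiled element libraries are actually present in the directory named in the traceback (`.../infinigen/terrain/lib/cuda/elements/`). The following is a minimal sketch, not part of Infinigen: the helper names `list_element_libs` and `missing_libs` are hypothetical, and it only checks for files on disk, not whether they load.

```python
from pathlib import Path


def list_element_libs(lib_dir):
    """Return the sorted names of all .so files directly inside lib_dir."""
    lib_dir = Path(lib_dir)
    if not lib_dir.is_dir():
        # A missing directory means no libraries can be found there.
        return []
    return sorted(p.name for p in lib_dir.glob("*.so"))


def missing_libs(lib_dir, expected):
    """Return the names in `expected` that are not present in lib_dir."""
    present = set(list_element_libs(lib_dir))
    return [name for name in expected if name not in present]
```

Calling `missing_libs(lib_dir, ["landtiles.so", "landtiles_2.so"])` from both a batch job and an interactive session would show whether the two environments see the same set of files.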

What version of the code were you using?

commit 5132903cd68704367d1c44c841e5163158e0f33d (HEAD -> main, origin/main, origin/HEAD)

What are your FULL output logs?

7348694_0_7348695_default.log

Platform

  • OS & OS Version: Linux
  • GPU: A5000
  • GPU Driver Version: cuda 11.7

About this issue

  • State: open
  • Created 4 months ago
  • Comments: 15 (7 by maintainers)

Most upvoted comments

I see, this one is tricky. Terrain() gets called several times when multiple tasks run in a single command. We didn’t test that case, and it caused this bug. We will fix it; in the meantime, you can try running the tasks separately, at least separating coarse, fineterrain, and render. Separating tasks is also recommended anyway for better use of resources.