TensorRT-LLM: building Llama fails

# pip list | grep torch
pytorch-quantization     2.1.2
torch                    1.12.1+cu113
torch-tensorrt           2.0.0.dev0
torchdata                0.7.0a0
torchtext                0.16.0a0
torchvision              0.16.0a0
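When debugging an import failure like the one below, it can also help to record the versions of the other packages involved; a broad check along these lines (the exact package names shipped in the container are an assumption on my part) would be:

pip list | grep -Ei 'transform|accelerate'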

# nvidia-smi
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.105.17   Driver Version: 525.105.17   CUDA Version: 12.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A10          On   | 00000000:00:07.0 Off |                    0 |
|  0%   24C    P8    15W / 150W |      0MiB / 23028MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                   |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

# make -C docker release_build CUDA_ARCHS="89-real;90-real"
# make -C docker release_run
# python3 ./scripts/build_wheel.py --trt_root /usr/local/tensorrt
# pip install ./build/tensorrt_llm*.whl
# python build.py --model_dir /code/model/llama/llama-2-7b-hf \
    --dtype float16 \
    --remove_input_padding \
    --use_gpt_attention_plugin float16 \
    --enable_context_fmha \
    --use_gemm_plugin float16 \
    --output_dir /code/model/llama_tensor \
    --world_size 8 \
    --tp_size 8

Building Llama fails with the following error:

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/transformers/utils/import_utils.py", line 1099, in _get_module
    return importlib.import_module("." + module_name, self.__name__)
  File "/usr/lib/python3.10/importlib/__init__.py", line 126, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1050, in _gcd_import
  File "<frozen importlib._bootstrap>", line 1027, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1006, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 688, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 883, in exec_module
  File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/llama/modeling_llama.py", line 32, in <module>
    from ...modeling_utils import PreTrainedModel
  File "/usr/local/lib/python3.10/dist-packages/transformers/modeling_utils.py", line 86, in <module>
    from accelerate import dispatch_model, infer_auto_device_map, init_empty_weights
  File "/usr/local/lib/python3.10/dist-packages/accelerate/__init__.py", line 3, in <module>
    from .accelerator import Accelerator
  File "/usr/local/lib/python3.10/dist-packages/accelerate/accelerator.py", line 34, in <module>
    from .checkpointing import load_accelerator_state, load_custom_state, save_accelerator_state, save_custom_state
  File "/usr/local/lib/python3.10/dist-packages/accelerate/checkpointing.py", line 24, in <module>
    from .utils import (
  File "/usr/local/lib/python3.10/dist-packages/accelerate/utils/__init__.py", line 112, in <module>
    from .launch import (
  File "/usr/local/lib/python3.10/dist-packages/accelerate/utils/launch.py", line 27, in <module>
    from ..utils.other import merge_dicts
  File "/usr/local/lib/python3.10/dist-packages/accelerate/utils/other.py", line 24, in <module>
    from .transformer_engine import convert_model
  File "/usr/local/lib/python3.10/dist-packages/accelerate/utils/transformer_engine.py", line 21, in <module>
    import transformer_engine.pytorch as te
  File "/usr/local/lib/python3.10/dist-packages/transformer_engine/pytorch/__init__.py", line 6, in <module>
    from .module import LayerNormLinear
  File "/usr/local/lib/python3.10/dist-packages/transformer_engine/pytorch/module/__init__.py", line 6, in <module>
    from .layernorm_linear import LayerNormLinear
  File "/usr/local/lib/python3.10/dist-packages/transformer_engine/pytorch/module/layernorm_linear.py", line 15, in <module>
    from .. import cpp_extensions as tex
  File "/usr/local/lib/python3.10/dist-packages/transformer_engine/pytorch/cpp_extensions/__init__.py", line 6, in <module>
    from transformer_engine_extensions import *
ImportError: /usr/local/lib/python3.10/dist-packages/transformer_engine_extensions.cpython-310-x86_64-linux-gnu.so: undefined symbol: _ZN2at4_ops5zeros4callEN3c108ArrayRefINS2_6SymIntEEENS2_8optionalINS2_10ScalarTypeEEENS6_INS2_6LayoutEEENS6_INS2_6DeviceEEENS6_IbEE

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/code/tensorrt_llm/examples/llama/build.py", line 24, in <module>
    from transformers import LlamaConfig, LlamaForCausalLM
  File "<frozen importlib._bootstrap>", line 1075, in _handle_fromlist
  File "/usr/local/lib/python3.10/dist-packages/transformers/utils/import_utils.py", line 1090, in __getattr__
    value = getattr(module, name)
  File "/usr/local/lib/python3.10/dist-packages/transformers/utils/import_utils.py", line 1089, in __getattr__
    module = self._get_module(self._class_to_module[name])
  File "/usr/local/lib/python3.10/dist-packages/transformers/utils/import_utils.py", line 1101, in _get_module
    raise RuntimeError(
RuntimeError: Failed to import transformers.models.llama.modeling_llama because of the following error (look up to see its traceback):
/usr/local/lib/python3.10/dist-packages/transformer_engine_extensions.cpython-310-x86_64-linux-gnu.so: undefined symbol: _ZN2at4_ops5zeros4callEN3c108ArrayRefINS2_6SymIntEEENS2_8optionalINS2_10ScalarTypeEEENS6_INS2_6LayoutEEENS6_INS2_6DeviceEEENS6_IbEE
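For what it's worth, the undefined symbol can be decoded and checked directly (this assumes binutils' c++filt and nm are available inside the container). If the installed libtorch does not export it, the prebuilt transformer_engine extension was most likely compiled against a newer torch than the torch 1.12.1+cu113 listed above:

echo '_ZN2at4_ops5zeros4callEN3c108ArrayRefINS2_6SymIntEEENS2_8optionalINS2_10ScalarTypeEEENS6_INS2_6LayoutEEENS6_INS2_6DeviceEEENS6_IbEE' | c++filt
# demangles to at::_ops::zeros::call(c10::ArrayRef<c10::SymInt>, ...), a SymInt-based
# overload that, as far as I know, only newer torch releases export
nm -D "$(python3 -c 'import os, torch; print(os.path.join(os.path.dirname(torch.__file__), "lib", "libtorch_cpu.so"))')" \
  | grep -c '5zeros4callEN3c108ArrayRefINS2_6SymIntE' || echo "SymInt zeros overload not found in this torch"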

About this issue

  • State: closed
  • Created 7 months ago
  • Comments: 18

Most upvoted comments

@ixp9891 I recommend increasing your swap. I was able to build Llama-7b on a G5.xlarge with only 16 GB of RAM once I increased the swap.

While the build runs, you can monitor your RAM usage with tools like nmon or htop.
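If neither tool is installed in the image, a plain watch over free gives roughly the same picture, e.g.:

watch -n 5 free -h   # refresh RAM + swap usage every 5 seconds while the wheel builds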

sudo fallocate -l 128G /swapfile && sudo chmod 600 /swapfile && sudo mkswap /swapfile && sudo swapon /swapfile && free -h

Additionally, I believe you need to build TensorRT-LLM with both SM80 and SM86 together.

make -C docker release_build CUDA_ARCHS="80-real;86-real"
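As a sanity check for the CUDA_ARCHS value, reasonably recent drivers let you query the compute capability straight from nvidia-smi (the A10 above should report 8.6, i.e. SM86):

nvidia-smi --query-gpu=name,compute_cap --format=csv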