TensorRT-LLM: gptManagerBenchmark std::bad_alloc error

Machine: NVIDIA RTX 4090 (24 GB)
Model: llama13B-gptq (the GPU memory should be enough)
Problem: std::bad_alloc error when starting GptManager
Expected: runs successfully

root@ubuntu-devel:/code/tensorrt_llm/cpp/build/benchmarks# CUDA_VISIBLE_DEVICES=0  ./gptManagerBenchmark     --model llama13b_gptq_compiled     --engine_dir /code/tensorrt_llm/models/llama13b_gptq_compiled     --type IFB     --dataset /code/tensorrt_llm/models/llama13b_gptq/preprocessed_dataset.json  --log_level verbose --kv_cache_free_gpu_mem_fraction 0.2
[TensorRT-LLM][INFO] Set logger level by TRACE
[TensorRT-LLM][DEBUG] Registered plugin creator Identity version 1 in namespace tensorrt_llm
[TensorRT-LLM][DEBUG] Registered plugin creator BertAttention version 1 in namespace tensorrt_llm
[TensorRT-LLM][DEBUG] Registered plugin creator GPTAttention version 1 in namespace tensorrt_llm
[TensorRT-LLM][DEBUG] Registered plugin creator Gemm version 1 in namespace tensorrt_llm
[TensorRT-LLM][DEBUG] Registered plugin creator Send version 1 in namespace tensorrt_llm
[TensorRT-LLM][DEBUG] Registered plugin creator Recv version 1 in namespace tensorrt_llm
[TensorRT-LLM][DEBUG] Registered plugin creator AllReduce version 1 in namespace tensorrt_llm
[TensorRT-LLM][DEBUG] Registered plugin creator AllGather version 1 in namespace tensorrt_llm
[TensorRT-LLM][DEBUG] Registered plugin creator Layernorm version 1 in namespace tensorrt_llm
[TensorRT-LLM][DEBUG] Registered plugin creator Rmsnorm version 1 in namespace tensorrt_llm
[TensorRT-LLM][DEBUG] Registered plugin creator SmoothQuantGemm version 1 in namespace tensorrt_llm
[TensorRT-LLM][DEBUG] Registered plugin creator LayernormQuantization version 1 in namespace tensorrt_llm
[TensorRT-LLM][DEBUG] Registered plugin creator QuantizePerToken version 1 in namespace tensorrt_llm
[TensorRT-LLM][DEBUG] Registered plugin creator QuantizeTensor version 1 in namespace tensorrt_llm
[TensorRT-LLM][DEBUG] Registered plugin creator RmsnormQuantization version 1 in namespace tensorrt_llm
[TensorRT-LLM][DEBUG] Registered plugin creator WeightOnlyGroupwiseQuantMatmul version 1 in namespace tensorrt_llm
[TensorRT-LLM][DEBUG] Registered plugin creator WeightOnlyQuantMatmul version 1 in namespace tensorrt_llm
[TensorRT-LLM][DEBUG] Registered plugin creator Lookup version 1 in namespace tensorrt_llm
[TensorRT-LLM][INFO] Initializing MPI with thread mode 1
[TensorRT-LLM][INFO] MPI size: 1, rank: 0
[TensorRT-LLM][ERROR] std::bad_alloc

About this issue

  • State: closed
  • Created 8 months ago
  • Comments: 19 (2 by maintainers)

Most upvoted comments

The fix MR is https://github.com/NVIDIA/TensorRT-LLM/pull/152. We have tested it and will merge it. Thank you all for your support and help! @zhaoyang-star @ljayx @ryxli @gesanqiu @clockfly

Thanks for your patience. We have found the root cause and are working on the fix. We will push the fix (along with other enhancements) in the coming days, and when it lands, a new “announcement” will also be posted.

June

Hi June, how’s the issue going? I’m stuck here. From the backtrace, it looks like the binary threw from the std::filesystem::path ctor. Not sure if CXX11_ABI matters here, since std::filesystem::path is a C++17 API.

terminate called after throwing an instance of 'std::bad_alloc'
  what():  std::bad_alloc
*** Process received signal ***
...
...
[ 8] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(_ZSt28__throw_bad_array_new_lengthv+0x0)[0x7f790fd70265]
[ 9] /ljay/workspace/local/TensorRT-LLM/cpp/bbb/tensorrt_llm/libtensorrt_llm.so(_ZNSt10filesystem7__cxx114pathC1ERKS1_+0xff)[0x7f794d1995ef]
[10] /ljay/workspace/local/TensorRT-LLM/cpp/bbb/tensorrt_llm/libtensorrt_llm.so(_ZN12tensorrt_llm13batch_manager18TrtGptModelFactory6createERKNSt10filesystem7__cxx114pathENS0_15TrtGptModelTypeEiNS0_15batch_scheduler15SchedulerPolicyERKNS0_25TrtGptModelOptionalParamsE+0xcf)[0x7f794d19bf8f]
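
Demangling the symbol in frame [ 9] confirms the throw site (a quick check on my side with c++filt; the __cxx11 tag in the name only appears when the object was compiled with the new libstdc++ ABI, i.e. -D_GLIBCXX_USE_CXX11_ABI=1):

echo _ZNSt10filesystem7__cxx114pathC1ERKS1_ | c++filt
# std::filesystem::__cxx11::path::path(std::filesystem::__cxx11::path const&)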

I resolved the issue. The root cause is CXX11_ABI-related.

Root cause: my Dockerfile skipped installing PyTorch, so the default PyTorch inside nvcr.io/nvidia/pytorch:23.08-py3, which is built with the CXX11 ABI, was used, and the CMakeLists.txt enabled USE_CXX11_ABI to match PyTorch's ABI. Since GptManager uses std::filesystem::path, whose layout differs between the two ABIs, the binary threw from the ctor.
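
For anyone who wants to verify the first half of this on their own container, a minimal check (torch.compiled_with_cxx11_abi() is the standard PyTorch helper; it reports the ABI the installed torch was built with, which is what the USE_CXX11_ABI detection then follows):

# Prints True for a CXX11-ABI PyTorch build (e.g. the one shipped in the NGC image)
python3 -c "import torch; print(torch.compiled_with_cxx11_abi())"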

Solution: just run:

bash install_pytorch.sh src_non_cxx11_abi
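
After reinstalling PyTorch this way and rebuilding, a rough sanity check (using the library path from my backtrace above, so adjust it to your build dir) is that the new-ABI symbol tag disappears:

# Counts __cxx11-tagged (new-ABI) symbols defined by the library; it should drop to zero (or close to it)
nm -D --defined-only /ljay/workspace/local/TensorRT-LLM/cpp/bbb/tensorrt_llm/libtensorrt_llm.so | grep -c __cxx11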

This is an issue that could easily confuse users.