FasterTransformer: Illegal memory access error when running BLOOM-176B
Branch/Tag/Commit: main/38847d07740fbfb4b3a74b428b3b8b57cc13c5c8
Docker Image Version: nvcr.io/nvidia/pytorch:22.09-py3
GPU name: A100 * 8
CUDA Driver: 515.43.04
Reproduced Steps
When I try to run BLOOM-176B with FasterTransformer on 8 GPUs, the error "CUDA error: an illegal memory access was encountered" is raised. The detailed log is attached at the end.
I followed the `Run BLOOM on PyTorch` example in `gpt_guide.md` almost exactly. The only difference is that I set the `--data-type` argument of huggingface_bloom_convert.py to "fp16" instead of the default "fp32" (a quick size check on the converted shards is sketched after the steps below).
Reproduction procedure:
1. Convert BLOOM-176B to a FasterTransformer c-model:
`python ../examples/pytorch/gpt/utils/huggingface_bloom_convert.py --input-dir bloom --output-dir bloom/c-model --data-type "fp16" -tp 8 -p 4 -v`
2. Run the FT benchmark (tensor parallel size 8):
`mpirun -n 8 --allow-run-as-root python ../examples/pytorch/gpt/bloom_lambada.py --checkpoint-path bloom/c-model/8-gpu --tokenizer-path bloom --dataset-path ../datasets/lambada/lambada_test.jsonl --show-progress`
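Since the only deviation from the guide is converting the weights with `--data-type "fp16"`, a cheap sanity check is whether the shards on disk really come out at half the fp32 size. Below is a minimal sketch; the checkpoint directory `bloom/c-model/8-gpu` and the `.bin` extension are taken from the commands above, and the expected totals are just arithmetic on the known parameter count.

```python
import os

ckpt_dir = "bloom/c-model/8-gpu"  # output directory used in step 1

# Sum the size of all weight shards written by the converter.
total_bytes = sum(
    os.path.getsize(os.path.join(root, name))
    for root, _dirs, files in os.walk(ckpt_dir)
    for name in files
    if name.endswith(".bin")
)
print(f"converted checkpoint: {total_bytes / 1024**3:.1f} GiB")

# BLOOM-176B has roughly 176e9 parameters, so an fp16 conversion should
# total about 176e9 * 2 bytes ~= 330 GiB; an fp32 conversion is roughly
# double that. A total near the fp32 figure would suggest the --data-type
# flag was not honored, which could cause out-of-bounds reads at runtime.
```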
Part of the error log:
assert(self.pre_embed_idx < self.post_embed_idx, "Pre decoder embedding index should be lower than post decoder embedding index.")
  0%|          | 0/645 [00:00<?, ?it/s]
[FT][ERROR] CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Exception raised from alloc_block at /opt/pytorch/pytorch/c10/cuda/CUDACachingAllocator.cpp:1345 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x6c (0x7f23b3470d0c in /opt/conda/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x431aa (0x7f23b34f01aa in /opt/conda/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #2: <unknown function> + 0x43efe (0x7f23b34f0efe in /opt/conda/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #3: <unknown function> + 0x457d2 (0x7f23b34f27d2 in /opt/conda/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #4: <unknown function> + 0x45ac8 (0x7f23b34f2ac8 in /opt/conda/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #5: at::detail::empty_generic(c10::ArrayRef<long>, c10::Allocator*, c10::DispatchKeySet, c10::ScalarType, c10::optional<c10::MemoryFormat>) + 0x9a7 (0x7f23e585eaf7 in /opt/conda/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #6: at::detail::empty_cuda(c10::ArrayRef<long>, c10::ScalarType, c10::optional<c10::Device>, c10::optional<c10::MemoryFormat>) + 0x9a (0x7f23b431fe0a in /opt/conda/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so)
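As the traceback itself suggests, rerunning with `CUDA_LAUNCH_BLOCKING=1` makes kernel launches synchronous, so the reported stack points at the call that actually faulted rather than a later allocation. The variable needs to reach every rank; assuming Open MPI's `mpirun`, its `-x` flag can export it, reusing the command from step 2:
`mpirun -x CUDA_LAUNCH_BLOCKING=1 -n 8 --allow-run-as-root python ../examples/pytorch/gpt/bloom_lambada.py --checkpoint-path bloom/c-model/8-gpu --tokenizer-path bloom --dataset-path ../datasets/lambada/lambada_test.jsonl --show-progress`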
About this issue
- Original URL
- State: open
- Created a year ago
- Comments: 23
It works! Thanks a lot.