FasterTransformer: Illegal memory access error when running BLOOM-176B

Branch/Tag/Commit

main/38847d07740fbfb4b3a74b428b3b8b57cc13c5c8

Docker Image Version

nvcr.io/nvidia/pytorch:22.09-py3

GPU name

A100 * 8

CUDA Driver

515.43.04

Reproduced Steps

When I try to run BLOOM-176B with FasterTransformer on 8 GPUs, the error "CUDA error: an illegal memory access was encountered" is raised. The detailed log is attached at the end.
I follow the instructions of the `Run BLOOM on PyTorch` example in `gpt_guide.md` almost exactly. The only difference is that I set the `--data-type` argument of huggingface_bloom_convert.py to "fp16", while the default is "fp32".
Reproduction steps:
1. Convert BLOOM-176B to a FasterTransformer c-model:
`python ../examples/pytorch/gpt/utils/huggingface_bloom_convert.py --input-dir bloom --output-dir bloom/c-model --data-type "fp16" -tp 8 -p 4 -v`
2. Run the FT benchmark (tensor parallel size 8):
`mpirun -n 8 --allow-run-as-root python ../examples/pytorch/gpt/bloom_lambada.py --checkpoint-path bloom/c-model/8-gpu --tokenizer-path bloom --dataset-path ../datasets/lambada/lambada_test.jsonl --show-progress`
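Since the fp16 conversion is the only deviation from the guide, one quick sanity check is to confirm that the data type the converter recorded in the generated checkpoint matches what the runtime expects. A minimal sketch, assuming the converter writes a `config.ini` with a `weight_data_type` field (check the actual file in your c-model directory):

```python
import configparser

def checkpoint_dtype(config_path):
    # Read the data type the converter recorded for the checkpoint.
    # The field name is an assumption; verify it against your config.ini.
    cfg = configparser.ConfigParser()
    cfg.read(config_path)
    section = cfg.sections()[0]  # the converter writes a single model section
    return cfg[section]["weight_data_type"]
```

For example, `checkpoint_dtype("bloom/c-model/8-gpu/config.ini")` should report "fp16" after the conversion above.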

Part of the error log:

assert(self.pre_embed_idx < self.post_embed_idx, "Pre decoder embedding index should be lower than post decoder embedding index.")
  0%|          | 0/645 [00:00<?, ?it/s]
[FT][ERROR] CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Exception raised from alloc_block at /opt/pytorch/pytorch/c10/cuda/CUDACachingAllocator.cpp:1345 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x6c (0x7f23b3470d0c in /opt/conda/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x431aa (0x7f23b34f01aa in /opt/conda/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #2: <unknown function> + 0x43efe (0x7f23b34f0efe in /opt/conda/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #3: <unknown function> + 0x457d2 (0x7f23b34f27d2 in /opt/conda/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #4: <unknown function> + 0x45ac8 (0x7f23b34f2ac8 in /opt/conda/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #5: at::detail::empty_generic(c10::ArrayRef<long>, c10::Allocator*, c10::DispatchKeySet, c10::ScalarType, c10::optional<c10::MemoryFormat>) + 0x9a7 (0x7f23e585eaf7 in /opt/conda/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #6: at::detail::empty_cuda(c10::ArrayRef<long>, c10::ScalarType, c10::optional<c10::Device>, c10::optional<c10::MemoryFormat>) + 0x9a (0x7f23b431fe0a in /opt/conda/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so)
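As the traceback itself suggests, rerunning with synchronous kernel launches usually pins the fault to the kernel that actually triggered it, instead of a later allocator call. The environment variable must be set before the process initializes CUDA, e.g.:

```python
import os

# Must be set before CUDA is initialized (i.e. before any torch CUDA call),
# otherwise it has no effect. With it set, each kernel launch blocks until
# completion, so the reported stack trace points at the faulting kernel.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"
```

Under mpirun it can also be exported to all ranks from the shell (OpenMPI's `-x CUDA_LAUNCH_BLOCKING=1`).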

Most upvoted comments

It seems nothing has changed. For BLOOM 175B, TP=8:

export FT_DEBUG_LEVEL=ON
mpirun -n 8 -mca btl ^openib ./bin/multi_gpu_gpt_example
set-zw04-kubernetes-pc38:175172:175172 [5] NCCL INFO comm 0xa7b34a0 rank 0 nranks 1 cudaDev 5 busId a7000 - Init COMPLETE
[FT][INFO] NCCL initialized rank=5 world_size=8 tensor_para=NcclParam[rank=5, world_size=8, nccl_comm=0xa6da2a0] pipeline_para=NcclParam[rank=0, world_size=1, nccl_comm=0xa7b34a0]
set-zw04-kubernetes-pc38:175170:175170 [3] NCCL INFO Connected all rings
set-zw04-kubernetes-pc38:175170:175170 [3] NCCL INFO Connected all trees
set-zw04-kubernetes-pc38:175170:175170 [3] NCCL INFO 32 coll channels, 32 p2p channels, 32 p2p channels per peer
set-zw04-kubernetes-pc38:175168:175168 [1] NCCL INFO Channel 00/32 :    0
set-zw04-kubernetes-pc38:175168:175168 [1] NCCL INFO Channel 01/32 :    0
set-zw04-kubernetes-pc38:175168:175168 [1] NCCL INFO Channel 02/32 :    0
set-zw04-kubernetes-pc38:175168:175168 [1] NCCL INFO Channel 03/32 :    0
set-zw04-kubernetes-pc38:175168:175168 [1] NCCL INFO Channel 04/32 :    0
set-zw04-kubernetes-pc38:175168:175168 [1] NCCL INFO Channel 05/32 :    0
set-zw04-kubernetes-pc38:175168:175168 [1] NCCL INFO Channel 06/32 :    0
set-zw04-kubernetes-pc38:175168:175168 [1] NCCL INFO Channel 07/32 :    0
set-zw04-kubernetes-pc38:175168:175168 [1] NCCL INFO Channel 08/32 :    0
set-zw04-kubernetes-pc38:175168:175168 [1] NCCL INFO Channel 09/32 :    0
set-zw04-kubernetes-pc38:175168:175168 [1] NCCL INFO Channel 10/32 :    0
set-zw04-kubernetes-pc38:175168:175168 [1] NCCL INFO Channel 11/32 :    0
set-zw04-kubernetes-pc38:175168:175168 [1] NCCL INFO Channel 12/32 :    0
set-zw04-kubernetes-pc38:175168:175168 [1] NCCL INFO Channel 13/32 :    0
set-zw04-kubernetes-pc38:175168:175168 [1] NCCL INFO Channel 14/32 :    0
set-zw04-kubernetes-pc38:175168:175168 [1] NCCL INFO Channel 15/32 :    0
set-zw04-kubernetes-pc38:175168:175168 [1] NCCL INFO Channel 16/32 :    0
set-zw04-kubernetes-pc38:175168:175168 [1] NCCL INFO Channel 17/32 :    0
set-zw04-kubernetes-pc38:175168:175168 [1] NCCL INFO Channel 18/32 :    0
set-zw04-kubernetes-pc38:175168:175168 [1] NCCL INFO Channel 19/32 :    0
set-zw04-kubernetes-pc38:175168:175168 [1] NCCL INFO Channel 20/32 :    0
set-zw04-kubernetes-pc38:175168:175168 [1] NCCL INFO Channel 21/32 :    0
set-zw04-kubernetes-pc38:175168:175168 [1] NCCL INFO Channel 22/32 :    0
set-zw04-kubernetes-pc38:175168:175168 [1] NCCL INFO Channel 23/32 :    0
set-zw04-kubernetes-pc38:175168:175168 [1] NCCL INFO Channel 24/32 :    0
set-zw04-kubernetes-pc38:175168:175168 [1] NCCL INFO Channel 25/32 :    0
set-zw04-kubernetes-pc38:175168:175168 [1] NCCL INFO Channel 26/32 :    0
set-zw04-kubernetes-pc38:175168:175168 [1] NCCL INFO Channel 27/32 :    0
set-zw04-kubernetes-pc38:175168:175168 [1] NCCL INFO Channel 28/32 :    0
set-zw04-kubernetes-pc38:175168:175168 [1] NCCL INFO Channel 29/32 :    0
set-zw04-kubernetes-pc38:175168:175168 [1] NCCL INFO Channel 30/32 :    0
set-zw04-kubernetes-pc38:175168:175168 [1] NCCL INFO Channel 31/32 :    0
set-zw04-kubernetes-pc38:175168:175168 [1] NCCL INFO Trees [0] -1/-1/-1->0->-1 [1] -1/-1/-1->0->-1 [2] -1/-1/-1->0->-1 [3] -1/-1/-1->0->-1 [4] -1/-1/-1->0->-1 [5] -1/-1/-1->0->-1 [6] -1/-1/-1->0->-1 [7] -1/-1/-1->0->-1 [8] -1/-1/-1->0->-1 [9] -1/-1/-1->0->-1 [10] -1/-1/-1->0->-1 [11] -1/-1/-1->0->-1 [12] -1/-1/-1->0->-1 [13] -1/-1/-1->0->-1 [14] -1/-1/-1->0->-1 [15] -1/-1/-1->0->-1 [16] -1/-1/-1->0->-1 [17] -1/-1/-1->0->-1 [18] -1/-1/-1->0->-1 [19] -1/-1/-1->0->-1 [20] -1/-1/-1->0->-1 [21] -1/-1/-1->0->-1 [22] -1/-1/-1->0->-1 [23] -1/-1/-1->0->-1 [24] -1/-1/-1->0->-1 [25] -1/-1/-1->0->-1 [26] -1/-1/-1->0->-1 [27] -1/-1/-1->0->-1 [28] -1/-1/-1->0->-1 [29] -1/-1/-1->0->-1 [30] -1/-1/-1->0->-1 [31] -1/-1/-1->0->-1
set-zw04-kubernetes-pc38:175168:175168 [1] NCCL INFO Connected all rings
set-zw04-kubernetes-pc38:175168:175168 [1] NCCL INFO Connected all trees
set-zw04-kubernetes-pc38:175168:175168 [1] NCCL INFO 32 coll channels, 32 p2p channels, 32 p2p channels per peer
set-zw04-kubernetes-pc38:175173:175173 [6] NCCL INFO comm 0xacbb500 rank 0 nranks 1 cudaDev 6 busId e1000 - Init COMPLETE
[FT][INFO] NCCL initialized rank=6 world_size=8 tensor_para=NcclParam[rank=6, world_size=8, nccl_comm=0xabe2300] pipeline_para=NcclParam[rank=0, world_size=1, nccl_comm=0xacbb500]
set-zw04-kubernetes-pc38:175174:175174 [7] NCCL INFO comm 0xa749470 rank 0 nranks 1 cudaDev 7 busId e7000 - Init COMPLETE
[FT][INFO] NCCL initialized rank=7 world_size=8 tensor_para=NcclParam[rank=7, world_size=8, nccl_comm=0xa670270] pipeline_para=NcclParam[rank=0, world_size=1, nccl_comm=0xa749470]
set-zw04-kubernetes-pc38:175169:175169 [2] NCCL INFO comm 0xaf40510 rank 0 nranks 1 cudaDev 2 busId 65000 - Init COMPLETE
[FT][INFO] NCCL initialized rank=2 world_size=8 tensor_para=NcclParam[rank=2, world_size=8, nccl_comm=0xae67310] pipeline_para=NcclParam[rank=0, world_size=1, nccl_comm=0xaf40510]
set-zw04-kubernetes-pc38:175167:175167 [0] NCCL INFO comm 0x97a0490 rank 0 nranks 1 cudaDev 0 busId 26000 - Init COMPLETE
[FT][INFO] NCCL initialized rank=0 world_size=8 tensor_para=NcclParam[rank=0, world_size=8, nccl_comm=0x96c7290] pipeline_para=NcclParam[rank=0, world_size=1, nccl_comm=0x97a0490]
set-zw04-kubernetes-pc38:175170:175170 [3] NCCL INFO comm 0xb1623f0 rank 0 nranks 1 cudaDev 3 busId 6a000 - Init COMPLETE
[FT][INFO] NCCL initialized rank=3 world_size=8 tensor_para=NcclParam[rank=3, world_size=8, nccl_comm=0xb0891f0] pipeline_para=NcclParam[rank=0, world_size=1, nccl_comm=0xb1623f0]
set-zw04-kubernetes-pc38:175168:175168 [1] NCCL INFO comm 0xa928490 rank 0 nranks 1 cudaDev 1 busId 2c000 - Init COMPLETE
[FT][INFO] NCCL initialized rank=1 world_size=8 tensor_para=NcclParam[rank=1, world_size=8, nccl_comm=0xa84f290] pipeline_para=NcclParam[rank=0, world_size=1, nccl_comm=0xa928490]
[WARNING] gemm_config.in is not found; using default GEMM algo
[WARNING] gemm_config.in is not found; using default GEMM algo
[WARNING] gemm_config.in is not found; using default GEMM algo
[WARNING] gemm_config.in is not found; using default GEMM algo
[WARNING] gemm_config.in is not found; using default GEMM algo
[WARNING] gemm_config.in is not found; using default GEMM algo
[WARNING] gemm_config.in is not found; using default GEMM algo
[WARNING] gemm_config.in is not found; using default GEMM algo
after allocation    : free: 30.91 GB, total: 79.35 GB, used: 48.44 GB
after allocation    : free: 30.77 GB, total: 79.35 GB, used: 48.58 GB
after allocation    : free: 30.77 GB, total: 79.35 GB, used: 48.58 GB
after allocation    : free: 30.77 GB, total: 79.35 GB, used: 48.58 GB
after allocation    : free: 30.91 GB, total: 79.35 GB, used: 48.44 GB
after allocation    : free: 30.77 GB, total: 79.35 GB, used: 48.58 GB
after allocation    : free: 30.77 GB, total: 79.35 GB, used: 48.58 GB
after allocation    : free: 30.77 GB, total: 79.35 GB, used: 48.58 GB

set-zw04-kubernetes-pc38:175171:175171 [4] misc/strongstream.cc:240 NCCL WARN Cuda failure 'an illegal memory access was encountered'
set-zw04-kubernetes-pc38:175171:175171 [4] NCCL INFO enqueue.cc:947 -> 1
set-zw04-kubernetes-pc38:175171:175171 [4] NCCL INFO group.cc:140 -> 1
set-zw04-kubernetes-pc38:175171:175171 [4] NCCL INFO group.cc:340 -> 1
set-zw04-kubernetes-pc38:175171:175171 [4] NCCL INFO group.cc:421 -> 1
set-zw04-kubernetes-pc38:175171:175171 [4] NCCL INFO group.cc:106 -> 1
Failed, NCCL error /workdir/FasterTransformer/src/fastertransformer/utils/nccl_utils.cc:64 'unhandled cuda error'

set-zw04-kubernetes-pc38:175169:175169 [2] misc/strongstream.cc:240 NCCL WARN Cuda failure 'an illegal memory access was encountered'
set-zw04-kubernetes-pc38:175169:175169 [2] NCCL INFO enqueue.cc:947 -> 1
set-zw04-kubernetes-pc38:175169:175169 [2] NCCL INFO group.cc:140 -> 1
set-zw04-kubernetes-pc38:175169:175169 [2] NCCL INFO group.cc:340 -> 1
set-zw04-kubernetes-pc38:175169:175169 [2] NCCL INFO group.cc:421 -> 1
set-zw04-kubernetes-pc38:175169:175169 [2] NCCL INFO group.cc:106 -> 1
Failed, NCCL error /workdir/FasterTransformer/src/fastertransformer/utils/nccl_utils.cc:64 'unhandled cuda error'

set-zw04-kubernetes-pc38:175168:175168 [1] misc/strongstream.cc:240 NCCL WARN Cuda failure 'an illegal memory access was encountered'
set-zw04-kubernetes-pc38:175168:175168 [1] NCCL INFO enqueue.cc:947 -> 1
set-zw04-kubernetes-pc38:175168:175168 [1] NCCL INFO group.cc:140 -> 1
set-zw04-kubernetes-pc38:175168:175168 [1] NCCL INFO group.cc:340 -> 1
set-zw04-kubernetes-pc38:175168:175168 [1] NCCL INFO group.cc:421 -> 1
set-zw04-kubernetes-pc38:175168:175168 [1] NCCL INFO group.cc:106 -> 1
Failed, NCCL error /workdir/FasterTransformer/src/fastertransformer/utils/nccl_utils.cc:64 'unhandled cuda error'

set-zw04-kubernetes-pc38:175172:175172 [5] misc/strongstream.cc:240 NCCL WARN Cuda failure 'an illegal memory access was encountered'
set-zw04-kubernetes-pc38:175172:175172 [5] NCCL INFO enqueue.cc:947 -> 1
set-zw04-kubernetes-pc38:175172:175172 [5] NCCL INFO group.cc:140 -> 1
set-zw04-kubernetes-pc38:175172:175172 [5] NCCL INFO group.cc:340 -> 1
set-zw04-kubernetes-pc38:175172:175172 [5] NCCL INFO group.cc:421 -> 1
set-zw04-kubernetes-pc38:175172:175172 [5] NCCL INFO group.cc:106 -> 1
Failed, NCCL error /workdir/FasterTransformer/src/fastertransformer/utils/nccl_utils.cc:64 'unhandled cuda error'

set-zw04-kubernetes-pc38:175170:175170 [3] misc/strongstream.cc:240 NCCL WARN Cuda failure 'an illegal memory access was encountered'
set-zw04-kubernetes-pc38:175170:175170 [3] NCCL INFO enqueue.cc:947 -> 1
set-zw04-kubernetes-pc38:175170:175170 [3] NCCL INFO group.cc:140 -> 1
set-zw04-kubernetes-pc38:175170:175170 [3] NCCL INFO group.cc:340 -> 1
set-zw04-kubernetes-pc38:175170:175170 [3] NCCL INFO group.cc:421 -> 1
set-zw04-kubernetes-pc38:175170:175170 [3] NCCL INFO group.cc:106 -> 1
Failed, NCCL error /workdir/FasterTransformer/src/fastertransformer/utils/nccl_utils.cc:64 'unhandled cuda error'

set-zw04-kubernetes-pc38:175167:175167 [0] misc/strongstream.cc:240 NCCL WARN Cuda failure 'an illegal memory access was encountered'
set-zw04-kubernetes-pc38:175167:175167 [0] NCCL INFO enqueue.cc:947 -> 1
set-zw04-kubernetes-pc38:175167:175167 [0] NCCL INFO group.cc:140 -> 1
set-zw04-kubernetes-pc38:175167:175167 [0] NCCL INFO group.cc:340 -> 1
set-zw04-kubernetes-pc38:175167:175167 [0] NCCL INFO group.cc:421 -> 1
set-zw04-kubernetes-pc38:175167:175167 [0] NCCL INFO group.cc:106 -> 1
Failed, NCCL error /workdir/FasterTransformer/src/fastertransformer/utils/nccl_utils.cc:64 'unhandled cuda error'

set-zw04-kubernetes-pc38:175173:175173 [6] misc/strongstream.cc:240 NCCL WARN Cuda failure 'an illegal memory access was encountered'
set-zw04-kubernetes-pc38:175173:175173 [6] NCCL INFO enqueue.cc:947 -> 1
set-zw04-kubernetes-pc38:175173:175173 [6] NCCL INFO group.cc:140 -> 1
set-zw04-kubernetes-pc38:175173:175173 [6] NCCL INFO group.cc:340 -> 1
set-zw04-kubernetes-pc38:175173:175173 [6] NCCL INFO group.cc:421 -> 1
set-zw04-kubernetes-pc38:175173:175173 [6] NCCL INFO group.cc:106 -> 1
Failed, NCCL error /workdir/FasterTransformer/src/fastertransformer/utils/nccl_utils.cc:64 'unhandled cuda error'

set-zw04-kubernetes-pc38:175174:175174 [7] misc/strongstream.cc:240 NCCL WARN Cuda failure 'an illegal memory access was encountered'
set-zw04-kubernetes-pc38:175174:175174 [7] NCCL INFO enqueue.cc:947 -> 1
set-zw04-kubernetes-pc38:175174:175174 [7] NCCL INFO group.cc:140 -> 1
set-zw04-kubernetes-pc38:175174:175174 [7] NCCL INFO group.cc:340 -> 1
set-zw04-kubernetes-pc38:175174:175174 [7] NCCL INFO group.cc:421 -> 1
set-zw04-kubernetes-pc38:175174:175174 [7] NCCL INFO group.cc:106 -> 1
Failed, NCCL error /workdir/FasterTransformer/src/fastertransformer/utils/nccl_utils.cc:64 'unhandled cuda error'
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[60254,1],7]
  Exit code:    1
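For what it's worth, the ~48 GB used per rank reported in the log above is in line with a back-of-the-envelope estimate for the fp16 weights alone (the parameter count is approximate; the remainder is buffers and activations):

```python
# Rough per-GPU weight footprint: BLOOM-176B in fp16, tensor parallelism 8.
params = 176e9        # approximate parameter count
bytes_per_param = 2   # fp16
tp = 8                # tensor-parallel degree
per_gpu_gb = params * bytes_per_param / tp / 1024**3
print(f"{per_gpu_gb:.1f} GB per rank for weights")  # -> 41.0 GB per rank for weights
```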

For BLOOM 7B, TP=8:

set-zw04-kubernetes-pc38:177172:177172 [5] NCCL INFO Connected all rings
set-zw04-kubernetes-pc38:177172:177172 [5] NCCL INFO Connected all trees
set-zw04-kubernetes-pc38:177172:177172 [5] NCCL INFO 32 coll channels, 32 p2p channels, 32 p2p channels per peer
set-zw04-kubernetes-pc38:177167:177167 [0] NCCL INFO comm 0xa9163f0 rank 0 nranks 1 cudaDev 0 busId 26000 - Init COMPLETE
[FT][INFO] NCCL initialized rank=0 world_size=8 tensor_para=NcclParam[rank=0, world_size=8, nccl_comm=0xa83d1f0] pipeline_para=NcclParam[rank=0, world_size=1, nccl_comm=0xa9163f0]
set-zw04-kubernetes-pc38:177169:177169 [2] NCCL INFO comm 0xa0b9420 rank 0 nranks 1 cudaDev 2 busId 65000 - Init COMPLETE
[FT][INFO] NCCL initialized rank=2 world_size=8 tensor_para=NcclParam[rank=2, world_size=8, nccl_comm=0x9fe0220] pipeline_para=NcclParam[rank=0, world_size=1, nccl_comm=0xa0b9420]
set-zw04-kubernetes-pc38:177171:177171 [4] NCCL INFO comm 0xacbe640 rank 0 nranks 1 cudaDev 4 busId a2000 - Init COMPLETE
[FT][INFO] NCCL initialized rank=4 world_size=8 tensor_para=NcclParam[rank=4, world_size=8, nccl_comm=0xabe5440] pipeline_para=NcclParam[rank=0, world_size=1, nccl_comm=0xacbe640]
set-zw04-kubernetes-pc38:177173:177173 [6] NCCL INFO comm 0x9f42540 rank 0 nranks 1 cudaDev 6 busId e1000 - Init COMPLETE
[FT][INFO] NCCL initialized rank=6 world_size=8 tensor_para=NcclParam[rank=6, world_size=8, nccl_comm=0x9e69340] pipeline_para=NcclParam[rank=0, world_size=1, nccl_comm=0x9f42540]
set-zw04-kubernetes-pc38:177170:177170 [3] NCCL INFO comm 0xa1594b0 rank 0 nranks 1 cudaDev 3 busId 6a000 - Init COMPLETE
[FT][INFO] NCCL initialized rank=3 world_size=8 tensor_para=NcclParam[rank=3, world_size=8, nccl_comm=0xa0802b0] pipeline_para=NcclParam[rank=0, world_size=1, nccl_comm=0xa1594b0]
set-zw04-kubernetes-pc38:177175:177175 [7] NCCL INFO comm 0x9b6a4a0 rank 0 nranks 1 cudaDev 7 busId e7000 - Init COMPLETE
[FT][INFO] NCCL initialized rank=7 world_size=8 tensor_para=NcclParam[rank=7, world_size=8, nccl_comm=0x9a912a0] pipeline_para=NcclParam[rank=0, world_size=1, nccl_comm=0x9b6a4a0]
set-zw04-kubernetes-pc38:177172:177172 [5] NCCL INFO comm 0xa35d440 rank 0 nranks 1 cudaDev 5 busId a7000 - Init COMPLETE
[FT][INFO] NCCL initialized rank=5 world_size=8 tensor_para=NcclParam[rank=5, world_size=8, nccl_comm=0xa284240] pipeline_para=NcclParam[rank=0, world_size=1, nccl_comm=0xa35d440]
[WARNING] gemm_config.in is not found; using default GEMM algo
[WARNING] gemm_config.in is not found; using default GEMM algo
[WARNING] gemm_config.in is not found; using default GEMM algo
[WARNING] gemm_config.in is not found; using default GEMM algo
[WARNING] gemm_config.in is not found; using default GEMM algo
[WARNING] gemm_config.in is not found; using default GEMM algo
[WARNING] gemm_config.in is not found; using default GEMM algo
[WARNING] gemm_config.in is not found; using default GEMM algo
after allocation    : free: 74.49 GB, total: 79.35 GB, used:  4.86 GB
after allocation    : free: 74.63 GB, total: 79.35 GB, used:  4.72 GB
after allocation    : free: 74.49 GB, total: 79.35 GB, used:  4.86 GB
after allocation    : free: 74.49 GB, total: 79.35 GB, used:  4.86 GB
after allocation    : free: 74.49 GB, total: 79.35 GB, used:  4.86 GB
after allocation    : free: 74.49 GB, total: 79.35 GB, used:  4.86 GB
after allocation    : free: 74.49 GB, total: 79.35 GB, used:  4.86 GB
after allocation    : free: 74.63 GB, total: 79.35 GB, used:  4.72 GB
Writing 320 elements
  818   262   938  3155   286  1528    11   257    12    29 
zeroCount = 0
[INFO] request_batch_size 8 beam_width 1 head_num 32 size_per_head 128 total_output_len 40 decoder_layers 30 vocab_size 250880 FT-CPP-decoding-beamsearch-time 183.63 ms
[INFO] request_batch_size 8 beam_width 1 head_num 32 size_per_head 128 total_output_len 40 decoder_layers 30 vocab_size 250880 FT-CPP-decoding-beamsearch-time 183.63 ms
[INFO] request_batch_size 8 beam_width 1 head_num 32 size_per_head 128 total_output_len 40 decoder_layers 30 vocab_size 250880 FT-CPP-decoding-beamsearch-time 183.63 ms
[INFO] request_batch_size 8 beam_width 1 head_num 32 size_per_head 128 total_output_len 40 decoder_layers 30 vocab_size 250880 FT-CPP-decoding-beamsearch-time 183.62 ms
[INFO] request_batch_size 8 beam_width 1 head_num 32 size_per_head 128 total_output_len 40 decoder_layers 30 vocab_size 250880 FT-CPP-decoding-beamsearch-time 183.63 ms
[INFO] request_batch_size 8 beam_width 1 head_num 32 size_per_head 128 total_output_len 40 decoder_layers 30 vocab_size 250880 FT-CPP-decoding-beamsearch-time 183.62 ms
[INFO] request_batch_size 8 beam_width 1 head_num 32 size_per_head 128 total_output_len 40 decoder_layers 30 vocab_size 250880 FT-CPP-decoding-beamsearch-time 183.63 ms
[INFO] request_batch_size 8 beam_width 1 head_num 32 size_per_head 128 total_output_len 40 decoder_layers 30 vocab_size 250880 FT-CPP-decoding-beamsearch-time 183.63 ms
set-zw04-kubernetes-pc38:177175:177175 [7] NCCL INFO comm 0x9a912a0 rank 7 nranks 8 cudaDev 7 busId e7000 - Destroy COMPLETE
set-zw04-kubernetes-pc38:177167:177167 [0] NCCL INFO comm 0xa83d1f0 rank 0 nranks 8 cudaDev 0 busId 26000 - Destroy COMPLETE
set-zw04-kubernetes-pc38:177175:177175 [7] NCCL INFO comm 0x9b6a4a0 rank 0 nranks 1 cudaDev 7 busId e7000 - Destroy COMPLETE
set-zw04-kubernetes-pc38:177167:177167 [0] NCCL INFO comm 0xa9163f0 rank 0 nranks 1 cudaDev 0 busId 26000 - Destroy COMPLETE
set-zw04-kubernetes-pc38:177168:177168 [1] NCCL INFO comm 0xaa10300 rank 1 nranks 8 cudaDev 1 busId 2c000 - Destroy COMPLETE
set-zw04-kubernetes-pc38:177172:177172 [5] NCCL INFO comm 0xa284240 rank 5 nranks 8 cudaDev 5 busId a7000 - Destroy COMPLETE
set-zw04-kubernetes-pc38:177173:177173 [6] NCCL INFO comm 0x9e69340 rank 6 nranks 8 cudaDev 6 busId e1000 - Destroy COMPLETE
set-zw04-kubernetes-pc38:177170:177170 [3] NCCL INFO comm 0xa0802b0 rank 3 nranks 8 cudaDev 3 busId 6a000 - Destroy COMPLETE
set-zw04-kubernetes-pc38:177171:177171 [4] NCCL INFO comm 0xabe5440 rank 4 nranks 8 cudaDev 4 busId a2000 - Destroy COMPLETE
set-zw04-kubernetes-pc38:177169:177169 [2] NCCL INFO comm 0x9fe0220 rank 2 nranks 8 cudaDev 2 busId 65000 - Destroy COMPLETE
set-zw04-kubernetes-pc38:177168:177168 [1] NCCL INFO comm 0xaae9500 rank 0 nranks 1 cudaDev 1 busId 2c000 - Destroy COMPLETE
set-zw04-kubernetes-pc38:177172:177172 [5] NCCL INFO comm 0xa35d440 rank 0 nranks 1 cudaDev 5 busId a7000 - Destroy COMPLETE
set-zw04-kubernetes-pc38:177173:177173 [6] NCCL INFO comm 0x9f42540 rank 0 nranks 1 cudaDev 6 busId e1000 - Destroy COMPLETE
set-zw04-kubernetes-pc38:177170:177170 [3] NCCL INFO comm 0xa1594b0 rank 0 nranks 1 cudaDev 3 busId 6a000 - Destroy COMPLETE
set-zw04-kubernetes-pc38:177171:177171 [4] NCCL INFO comm 0xacbe640 rank 0 nranks 1 cudaDev 4 busId a2000 - Destroy COMPLETE
set-zw04-kubernetes-pc38:177169:177169 [2] NCCL INFO comm 0xa0b9420 rank 0 nranks 1 cudaDev 2 busId 65000 - Destroy COMPLETE
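For reference, the `[INFO]` timing lines above correspond to the following decode throughput, assuming all 40 output tokens are generated for each of the 8 sequences:

```python
# Numbers taken from the FT-CPP-decoding-beamsearch-time lines above.
batch_size, output_len, elapsed_ms = 8, 40, 183.63
tokens_per_second = batch_size * output_len / (elapsed_ms / 1000.0)
print(f"{tokens_per_second:.0f} tokens/s")  # -> 1743 tokens/s
```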

How about batch size 1?

It works! Thanks a lot.
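Since dropping to batch size 1 avoids the failure, it may help to note how per-rank KV-cache memory scales linearly with batch size. A rough sketch using BLOOM-176B's published dimensions (70 layers, hidden size 14336); this is a simplification for intuition, not FasterTransformer's exact allocation scheme:

```python
def kv_cache_gb(batch, seq_len, layers=70, hidden=14336, bytes_per_elem=2, tp=8):
    # Key + value tensors for every layer and token, fp16, sharded over TP ranks.
    return 2 * layers * hidden * bytes_per_elem * batch * seq_len / tp / 1024**3

print(kv_cache_gb(batch=8, seq_len=1024))  # ~3.8 GB per rank
print(kv_cache_gb(batch=1, seq_len=1024))  # ~0.5 GB per rank
```

An illegal memory access is not the same as a plain out-of-memory error, but batch-dependent buffer sizing is consistent with the failure appearing only at batch size > 1.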