inference: GPT-J implementation is giving segmentation fault on Nvidia GPU
I’m able to successfully run the GPT-J model using the shared pretrained model on CPU. The result below is for a single input.
Results
{'rouge1': 32.7869, 'rouge2': 6.7797, 'rougeL': 22.9508, 'rougeLsum': 29.5082, 'gen_len': 212, 'gen_num': 1}
But on an Nvidia GPU (RTX 4090 with 24 GB memory) I’m getting a segmentation fault, as shown below.
Run Directory: /home/arjun/CM/repos/ctuning@mlcommons-ck/cm-mlops/script/get-dataset-cnndm
CMD: cd '/home/arjun/CM/repos/local/cache/e15bfdcfabe545b8/inference/language/gpt-j' && /usr/bin/python3 main.py --model-path=/home/arjun/checkpoint-final --dataset-path=/home/arjun/CM/repos/local/cache/31b8797c52f043c1/install/cnn_eval.json --scenario Offline --mlperf_conf '/home/arjun/CM/repos/local/cache/e15bfdcfabe545b8/inference/mlperf.conf' --user_conf '/home/arjun/CM/repos/ctuning@mlcommons-ck/cm-mlops/script/generate-mlperf-inference-user-conf/tmp/3cbe44644bca4f84aaf7a07322c6cc28.conf' --gpu
2023-06-01 20:46:57.721822: I tensorflow/core/util/port.cc:110] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2023-06-01 20:46:57.773481: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F AVX512_VNNI AVX512_BF16 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
Loading PyTorch model...
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:08<00:00, 2.75s/it]
Casting models to GPU...
Constructing QSL
Encoding Samples
Number of Samples in query_samples : 8
./run.sh: line 32: 310967 Segmentation fault (core dumped) /usr/bin/python3 main.py --model-path=/home/arjun/checkpoint-final --dataset-path=/home/arjun/CM/repos/local/cache/31b8797c52f043c1/install/cnn_eval.json --scenario Offline --mlperf_conf '/home/arjun/CM/repos/local/cache/e15bfdcfabe545b8/inference/mlperf.conf' --user_conf '/home/arjun/CM/repos/ctuning@mlcommons-ck/cm-mlops/script/generate-mlperf-inference-user-conf/tmp/3cbe44644bca4f84aaf7a07322c6cc28.conf' --gpu
About this issue
- State: closed
- Created a year ago
- Comments: 32 (30 by maintainers)
@arjunsuresh Yes, the KV cache speeds up inference by a lot, so this is sort of expected. Another option is to parallelize the inference. I saw some hints in this file: https://github.com/huggingface/transformers/blob/fabe17a726bbf6081cfbcc975d8ac451a81f3e2d/src/transformers/models/gptj/modeling_gptj.py#L752 but I haven’t tried it yet.
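For anyone who wants to experiment with this, here is a minimal sketch of one common way to split the model across several GPUs, using transformers' `device_map="auto"` (requires `accelerate` installed). This is not necessarily what the linked code does; the checkpoint path and generation settings are just illustrative, taken from this thread.

```python
# Hedged sketch: shard a GPT-J checkpoint across all visible GPUs.
# The checkpoint path is the one used in this thread; adjust to your setup.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "/home/arjun/checkpoint-final"  # local fine-tuned GPT-J checkpoint

tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype=torch.float16,   # FP16 halves the weight memory
    device_map="auto",           # split layers across available GPUs (needs `accelerate`)
)

# Inputs go to the device holding the first layers (the embedding).
inputs = tokenizer("Summarize: ...", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, num_beams=4, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```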
The question of allowing GPT-J in the edge category was discussed at the 6/20 working group meeting. The resolution is to allow the benchmark in the Edge category. The MultiStream scenario is not appropriate for this benchmark and will be disallowed.
@pgmpablo157321 Please make the appropriate submission checker changes.
@hanyunfan That’s because you have target_qps set to 140. You can try with target_qps=1. Then it’ll run for the minimum number of queries in the Offline scenario, which is 24576.
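For reference, a user.conf override along these lines could look like the snippet below (assuming the `gptj` benchmark name used by the reference implementation; treat it as a sketch, not the exact file from this run):

```
# Hypothetical user.conf override for a quick Offline test run
gptj.Offline.target_qps = 1
```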
Thank you @nvzhihanj for that. Unfortunately, I have only a single GPU, but this can be useful for @psyhtest.
Yes, it’s rather bizarre that GPT-J is even considered for Edge.
Especially if you look at the distribution of latencies from a sample run of the reference implementation on an A5000 GPU (with 2 beams, which barely fits in 24 GB).
@psyhtest @mrasquinha-g The torch implementation of the GPT-J causal model is quite sub-optimal, and below is the required GPU memory for running:
- FP16, beam_size=4, BS=1: 37 GB
- FP32, beam_size=4, BS=1: 78 GB
To save memory, you can either turn off the KV cache or use CPU memory. But note that both will slow down the inference.
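A minimal sketch of those two options using the standard transformers `generate()` API is below. The checkpoint path is the one from this thread, the 20 GiB cap is illustrative, and you would pick one of the two loading strategies rather than running both.

```python
# Hedged sketch of the two memory-saving options mentioned above.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "/home/arjun/checkpoint-final"
tokenizer = AutoTokenizer.from_pretrained(model_path)

# Option 1: keep the whole model on the GPU in FP16 but disable the KV cache.
# Keys/values are recomputed at every step, so it is slower but uses less memory.
model = AutoModelForCausalLM.from_pretrained(
    model_path, torch_dtype=torch.float16
).to("cuda")

# Option 2 (alternative): cap GPU usage and spill the rest to CPU memory via
# accelerate's offloading (requires `accelerate`; memory limits are illustrative).
# model = AutoModelForCausalLM.from_pretrained(
#     model_path,
#     torch_dtype=torch.float16,
#     device_map="auto",
#     max_memory={0: "20GiB", "cpu": "64GiB"},
# )

inputs = tokenizer("Summarize: ...", return_tensors="pt").to("cuda")
out = model.generate(**inputs, num_beams=4, max_new_tokens=128, use_cache=False)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```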