inference: GPT-J implementation is giving segmentation fault on Nvidia GPU

I’m able to run the GPT-J model successfully on CPU using the shared pretrained model. The result below is for a single input.

Results

{'rouge1': 32.7869, 'rouge2': 6.7797, 'rougeL': 22.9508, 'rougeLsum': 29.5082, 'gen_len': 212, 'gen_num': 1}

But on an Nvidia GPU (RTX 4090 with 24 GB of memory) I’m getting a segmentation fault, as shown below.

Run Directory: /home/arjun/CM/repos/ctuning@mlcommons-ck/cm-mlops/script/get-dataset-cnndm

CMD: cd '/home/arjun/CM/repos/local/cache/e15bfdcfabe545b8/inference/language/gpt-j' &&  /usr/bin/python3 main.py --model-path=/home/arjun/checkpoint-final --dataset-path=/home/arjun/CM/repos/local/cache/31b8797c52f043c1/install/cnn_eval.json --scenario Offline  --mlperf_conf '/home/arjun/CM/repos/local/cache/e15bfdcfabe545b8/inference/mlperf.conf' --user_conf '/home/arjun/CM/repos/ctuning@mlcommons-ck/cm-mlops/script/generate-mlperf-inference-user-conf/tmp/3cbe44644bca4f84aaf7a07322c6cc28.conf' --gpu

2023-06-01 20:46:57.721822: I tensorflow/core/util/port.cc:110] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2023-06-01 20:46:57.773481: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F AVX512_VNNI AVX512_BF16 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
Loading PyTorch model...
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:08<00:00,  2.75s/it]
Casting models to GPU...
Constructing QSL
Encoding Samples
Number of Samples in query_samples :  8
./run.sh: line 32: 310967 Segmentation fault      (core dumped) /usr/bin/python3 main.py --model-path=/home/arjun/checkpoint-final --dataset-path=/home/arjun/CM/repos/local/cache/31b8797c52f043c1/install/cnn_eval.json --scenario Offline --mlperf_conf '/home/arjun/CM/repos/local/cache/e15bfdcfabe545b8/inference/mlperf.conf' --user_conf '/home/arjun/CM/repos/ctuning@mlcommons-ck/cm-mlops/script/generate-mlperf-inference-user-conf/tmp/3cbe44644bca4f84aaf7a07322c6cc28.conf' --gpu

Most upvoted comments

@arjunsuresh Yes, KVCache boosts the runtime by a lot, so this is sort of expected. Another option is to parallelize the inference. I saw some hints in this file: https://github.com/huggingface/transformers/blob/fabe17a726bbf6081cfbcc975d8ac451a81f3e2d/src/transformers/models/gptj/modeling_gptj.py#L752 but I haven’t tried it yet.
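
For reference, a minimal sketch of what that model-parallel path might look like, assuming the parallelize() helper defined in modeling_gptj.py (deprecated in newer transformers releases in favour of device_map/accelerate). The checkpoint path and device map below are placeholders, and this has not been validated against the MLPerf harness:

    import torch
    from transformers import AutoTokenizer, GPTJForCausalLM

    # Load the fine-tuned checkpoint in fp16 (placeholder path).
    model = GPTJForCausalLM.from_pretrained(
        "/path/to/checkpoint-final", torch_dtype=torch.float16
    )

    # GPT-J 6B has 28 transformer blocks; split them across two GPUs.
    device_map = {0: list(range(0, 14)), 1: list(range(14, 28))}
    model.parallelize(device_map)

    tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6b")
    # Inputs go to the device holding the embedding layer (first device in the map).
    inputs = tokenizer("Summarize: ...", return_tensors="pt").to("cuda:0")
    outputs = model.generate(**inputs, max_new_tokens=128, num_beams=4, use_cache=True)
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))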

The issue of allowing GPT-J in the edge category was discussed at the 6/20 working group meeting. The resolution is to allow the benchmark in the Edge category. The MultiStream scenario is not appropriate for this benchmark and will be disallowed.

@pgmpablo157321 Please make the appropriate submission checker changes.

@hanyunfan That’s because you have target_qps set to 140. You can try with target_qps=1. Then it’ll run for the minimum number of queries in the Offline scenario, which is 24576.
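
For context, a hedged example of what that user.conf override might look like (the exact benchmark key name is an assumption; adjust it to match the entries in your mlperf.conf):

    # user.conf — assumed key name for the GPT-J benchmark
    gptj.Offline.target_qps = 1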

Thank you @nvzhihanj for that. Unfortunately, I have only a single GPU, but this can be useful for @psyhtest.

It’s strange that bert-99.9 is not in the edge category but gpt-j is.

Yes, rather bizarre that GPT-J is even considered for Edge.

Especially if you look at the distribution of latencies from a sample run of the reference implementation on an A5000 GPU (with 2 beams, which barely fits in 24 GB):

  • Min: 2.0 s
  • Mean: 5.2 s
  • 90th percentile: 8.2 s
  • 99th percentile: 9.2 s
================================================
MLPerf Results Summary
================================================
SUT name : PySUT
Scenario : SingleStream
Mode     : PerformanceOnly
90th percentile latency (ns) : 8183313239
Result is : VALID
  Min duration satisfied : Yes
  Min queries satisfied : Yes
  Early stopping satisfied: Yes
Early Stopping Result:
 * Processed at least 64 queries (117).
 * Would discard 3 highest latency queries.
 * Early stopping 90th percentile estimate: 8797132675
 * Not enough queries processed for 99th percentile
 early stopping estimate (would need to process at
 least 662 total queries).
================================================
Additional Stats
================================================
QPS w/ loadgen overhead         : 0.19
QPS w/o loadgen overhead        : 0.19
Min latency (ns)                : 1971737790
Max latency (ns)                : 9264584392
Mean latency (ns)               : 5183591908
50.00 percentile latency (ns)   : 4911315309
90.00 percentile latency (ns)   : 8183313239
95.00 percentile latency (ns)   : 8537941592
97.00 percentile latency (ns)   : 8797132675
99.00 percentile latency (ns)   : 9018680756
99.90 percentile latency (ns)   : 9264584392

@psyhtest @mrasquinha-g The torch implementation of the GPT-J causal model is quite sub-optimal; the required GPU memory for running it is:

  • FP16, beam_size=4, BS=1: 37 GB
  • FP32, beam_size=4, BS=1: 78 GB

To save memory, you can either turn off the KV cache or offload to CPU memory, but note that both will slow down inference.
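
A hedged sketch of both options with Hugging Face transformers (the checkpoint path is a placeholder, CPU offload assumes the accelerate package is installed, and the reference implementation may wire these flags differently):

    import torch
    from transformers import AutoTokenizer, GPTJForCausalLM

    tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6b")

    # Option 2: let accelerate place layers that don't fit on the GPU in CPU RAM.
    model = GPTJForCausalLM.from_pretrained(
        "/path/to/checkpoint-final",
        torch_dtype=torch.float16,
        device_map="auto",
    )

    inputs = tokenizer("Summarize: ...", return_tensors="pt").to("cuda:0")

    # Option 1: disable the KV cache, trading recomputation for memory.
    outputs = model.generate(**inputs, max_new_tokens=128, num_beams=4, use_cache=False)
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))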