inference: DLRMv2 GPU Reference Implementation crashes with BusError

First issue: the GPU Dockerfile hasn’t been fixed since I brought it up in #1373. I had to replace it with the one I left in the comments: https://github.com/mlcommons/inference/pull/1373#issuecomment-1578510609

System is a DGX-A100 machine with 8x A100-SXM-80GB.

Inside the GPU Docker container, running:

$ ./run_local.sh pytorch dlrm multihot-criteo gpu --scenario Offline --samples-to-aggregate-quantile-file=./tools/dist_quantile.txt --max-batchsize=2048 --samples-per-query-offline=204800 --accuracy

yields

INFO:root:Using on-device cache with admission algorithm CacheAlgorithm.LRU, 1250000 sets, load_factor:  0.200,  19.07GB
INFO:root:Using fused exact_sgd with optimizer_args=OptimizerArgs(stochastic_rounding=True, gradient_clipping=False, max_gradient=1.0, learning_rate=0.01, eps=1e-08, beta1=0.9, beta2=0.999, weight_decay=0.0, weight_decay_mode=0, eta=0.001, momentum=0.9)
Loading model weights...
INFO:torchsnapshot.scheduler:Set process memory budget to 34359738368 bytes.
INFO:torchsnapshot.scheduler:Rank 0 finished loading. Throughput: 111.65MB/s
INFO:main:starting TestScenario.Offline
./run_local.sh: line 14:   153 Bus error               (core dumped) python python/main.py --profile $profile $common_opt --model $model --model-path $model_path --dataset $dataset --dataset-path $DATA_DIR --output $OUTPUT_DIR $EXTRA_OPS $@

Full output in comments.
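A Bus error at this point often means the process touched a memory-mapped page that could not be backed, for example because the container's /dev/shm is too small or the volume holding the mapped dataset is (nearly) full. Below is a minimal diagnostic sketch, not from the reference code, assuming the host path from the du listing later in this thread is visible inside the container (adjust to wherever DATA_DIR is mounted):

import shutil

import numpy as np

# Docker's default /dev/shm is only 64 MB; an undersized tmpfs is a common
# cause of SIGBUS when large shared-memory segments or mapped arrays land there.
total, _, free = shutil.disk_usage("/dev/shm")
print(f"/dev/shm: {free / 2**30:.1f} GiB free of {total / 2**30:.1f} GiB")

# Memory-map the dense day 23 features and touch one row; a Bus error here
# would point at the storage backing the dataset rather than the harness.
dense = np.load("/home/mlperf_inf_dlrmv2/criteo/day23/day_23_dense.npy", mmap_mode="r")
print(dense.shape, dense.dtype, float(dense[0].sum()))

If either check trips, raising --shm-size on docker run or freeing space on the dataset volume would be the first things to try.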

About this issue

  • State: open
  • Created a year ago
  • Comments: 18 (16 by maintainers)

Most upvoted comments

=> du -sh /home/mlperf_inf_dlrmv2/criteo/day23
169G    /home/mlperf_inf_dlrmv2/criteo/day23

The day23 files total around 169 GB. Here is the breakdown:

8.7G    /home/mlperf_inf_dlrmv2/criteo/day23/day_23_dense.npy
681M    /home/mlperf_inf_dlrmv2/criteo/day23/day_23_labels.npy
143G    /home/mlperf_inf_dlrmv2/criteo/day23/day_23_sparse_multi_hot.npz
18G     /home/mlperf_inf_dlrmv2/criteo/day23/day_23_sparse.npy
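As a quick integrity check that avoids loading 169 GB into RAM, the .npy files can be memory-mapped and the .npz opened lazily; a rough sketch, assuming the paths above:

import numpy as np

base = "/home/mlperf_inf_dlrmv2/criteo/day23"

# .npy headers are read without materializing the arrays when mmap_mode is set.
for name in ("day_23_dense.npy", "day_23_labels.npy", "day_23_sparse.npy"):
    arr = np.load(f"{base}/{name}", mmap_mode="r")
    print(name, arr.shape, arr.dtype)

# np.load on an .npz only reads the archive index; member arrays are
# decompressed on access, so listing the keys is cheap even at 143 GB.
with np.load(f"{base}/day_23_sparse_multi_hot.npz") as archive:
    print("multi-hot members:", len(archive.files))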

Thanks @nv-etcheng. Could you also share how much (smaller) disk space is needed to store just the preprocessed data? Users could temporarily acquire the larger disk space for preprocessing, but it would be useful for them to know how much space is needed to keep the preprocessed data around for submitting in each round.

@pgmpablo157321 Could you add the disk space requirement to the DLRMv2 documentation so that users are prepared for it?

@arjunsuresh I believe it took around 6.1 TB of disk space to run the Criteo preprocessing script when I ran it. I'm not sure how much it would take if the scripts were modified to process only day 23, but 6.1 TB covers the raw data, the numpy-preprocessed data, and the synthetic multi-hot datasets for all 24 days.