builder: aarch64 linux: torch.compile performance is 2x slower with the nightly torch wheel than with a wheel built with the 'build_aarch64_wheel.py' script
For torchbench benchmarks with the dynamo backend, the aarch64 linux nightly wheel is 2x slower than the wheel I’ve built using the pytorch/builder/build_aarch64_wheel.py script for the same pytorch commit.
The difference seems to come from the https://github.com/pytorch/builder/blob/main/aarch64_linux/aarch64_ci_build.sh script used for nightly builds. I suspect the cause is libomp.
How to reproduce?
git clone https://github.com/pytorch/benchmark.git
cd benchmark
# apply this PR: https://github.com/pytorch/benchmark/pull/2187
# setting OMP_NUM_THREADS=16 because I'm using a c7g.4xl instance
OMP_NUM_THREADS=16 python3 run_benchmark.py cpu --model hf_DistilBert --test eval --torchdynamo inductor --freeze_prepack_weights --metrics="latencies,cpu_peak_mem"
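To cross-check the torchbench numbers outside the harness, a small timing helper can be run once in each environment (nightly wheel and locally built wheel). This is only a sketch: `measure_latency_ms` is a hypothetical helper, not part of torchbench or PyTorch, and the pure-Python workload in the example stands in for a compiled model call.

```python
import statistics
import time


def measure_latency_ms(fn, warmup=3, iters=10):
    """Call fn repeatedly and return (median latency in ms, all samples in ms).

    The warmup calls are not timed, so one-off costs such as compilation
    on the first call do not skew the reported latency.
    """
    for _ in range(warmup):
        fn()
    samples = []
    for _ in range(iters):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1e3)
    return statistics.median(samples), samples


if __name__ == "__main__":
    # Placeholder workload; replace with a call into the compiled model.
    median_ms, _ = measure_latency_ms(lambda: sum(range(100_000)))
    print(f"median latency: {median_ms:.3f} ms")
```

Running the same script with the same OMP_NUM_THREADS value under both wheels gives two medians that can be compared directly.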
About this issue
- State: closed
- Created 3 months ago
- Comments: 18 (18 by maintainers)
Commits related to this issue
- aarch64: cd: switch from libomp to libgomp (#1787) In the current version of the scripts, torch libraries are linked to llvm openmp because conda openblas-openmp is linked to it. to switch to gnu lib... — committed to xuhancn/pytorch_builder by snadampal 2 months ago
- aarch64: cd: switch from libomp to libgomp (#1787) In the current version of the scripts, torch libraries are linked to llvm openmp because conda openblas-openmp is linked to it. to switch to gnu lib... — committed to pytorch/builder by snadampal 2 months ago
- aarch64: cd: switch from libomp to libgomp In the current version of the CD scripts, torch libraries are linked to llvm openmp because conda openblas-openmp is linked to it. To switch to gnu libgomp,... — committed to snadampal/builder by snadampal 2 months ago
I would love to help get away from these conda-packaged deps and instead use something more OS native (i.e., build in a container using what's provided via apt, yum, dnf, etc.)
I have upgraded the Docker image to manylinux_2_28 and removed the conda dependency completely; everything is installed from manylinux or PyPI. This solves the libomp performance issues.
Here is the draft PR: https://github.com/pytorch/builder/pull/1781
I had to disable building the PyTorch tests, via BUILD_TEST=0, to make it work; that’s where the GCC 12 breakage comes from. I will look into it next.
Ya, I’ll see if I can take a look. I'm still working my way through building from source without conda for a CUDA-based build.
Right now the scripts use the manylinux2014 OS, which is pretty old and comes with Python 3.6; that’s why we had to rely on conda for a lot of packages. Given that its EOL is June 30th, 2024, I’m upgrading the Docker image to manylinux_2_28, the latest one. This might remove the conda dependency. I’m trying it.
@bryantbiggs I think this is the general direction the build process is heading: no Anaconda, just use what’s in the pypa Docker image.
It is indeed the libomp.so from conda in the nightly wheel that is causing the issue. I replaced it with the Debian libgomp and the performance is restored, a 2x improvement. I’m checking how I can switch to libgomp for the aarch64 nightly and release wheel builds.
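One way to confirm which OpenMP runtime a given wheel actually loads is to scan the process's mapped libraries after importing torch. The sketch below assumes a Linux host where /proc/self/maps is readable; `find_openmp_runtimes` is a hypothetical helper written for this illustration, and its filename pattern (plain and hash-suffixed libgomp, libomp, libiomp5 names) is an assumption about how the libraries are named inside the wheel.

```python
import re

# Matches names like libgomp.so.1, libomp.so, or libgomp-d22c30c5.so.1.0.0
# (hash-suffixed copies bundled in a wheel). Deliberately does not match
# unrelated libraries such as libomptarget.so.
_OPENMP_RE = re.compile(
    r"(?:^|/)(lib(?:gomp|omp|iomp5))(?:-[0-9A-Za-z]+)?\.so(?:\.\d+)*$"
)


def find_openmp_runtimes(maps_text):
    """Return {runtime name: sorted list of paths} found in maps_text.

    maps_text is the content of /proc/<pid>/maps or similar output where
    library paths appear as whitespace-separated tokens.
    """
    found = {}
    for line in maps_text.splitlines():
        for token in line.split():
            match = _OPENMP_RE.search(token)
            if match:
                found.setdefault(match.group(1), set()).add(token)
    return {name: sorted(paths) for name, paths in found.items()}


if __name__ == "__main__":
    try:
        import torch  # noqa: F401  (loads the OpenMP runtime torch links to)
        with open("/proc/self/maps") as maps_file:
            print(find_openmp_runtimes(maps_file.read()))
    except (ImportError, OSError) as exc:
        print(f"could not inspect the process: {exc}")
```

If the result lists libomp for the nightly wheel and libgomp for the locally built one, that matches the diagnosis above.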