builder: aarch64 linux: torch.compile performance is 2x slower with the nightly torch wheel than with the wheel built with the 'build_aarch64_wheel.py' script

For torchbench benchmarks with the dynamo backend, the aarch64 linux nightly wheel is 2x slower than the wheel I’ve built using the pytorch/builder/build_aarch64_wheel.py script for the same pytorch commit.

The difference seems to come from the https://github.com/pytorch/builder/blob/main/aarch64_linux/aarch64_ci_build.sh script used for nightly builds. I suspect the problem is with libomp.
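One quick way to test the libomp suspicion is to look at which OpenMP runtime each wheel actually ships. The sketch below is only illustrative: `find_openmp_libs` is a hypothetical helper, the filename pattern is a heuristic covering GNU (libgomp), LLVM (libomp) and Intel (libiomp5) runtimes, and it assumes the wheel keeps its bundled shared libraries in a single directory such as `torch/lib`.

```python
import re
from pathlib import Path

# Heuristic filename pattern for OpenMP runtimes, including names that carry a
# hex hash suffix added when a wheel is repaired (e.g. libgomp-d22c30c5.so.1).
_OPENMP_NAME = re.compile(r"^lib(?:gomp|omp|iomp5)(?:-[0-9a-f]+)?\.so")


def find_openmp_libs(lib_dir):
    """Return the sorted names of files in lib_dir that look like OpenMP runtimes."""
    lib_dir = Path(lib_dir)
    if not lib_dir.is_dir():
        # A missing directory simply means nothing was found.
        return []
    return sorted(p.name for p in lib_dir.iterdir() if _OPENMP_NAME.match(p.name))
```

With each wheel installed in its own environment, calling `find_openmp_libs(Path(torch.__file__).parent / "lib")` and comparing the two results should show whether one wheel bundles libomp and the other libgomp. If the directory layout differs, point the helper at wherever the wheel stores its shared objects.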

How to reproduce?

git clone https://github.com/pytorch/benchmark.git
cd benchmark

# apply this PR: https://github.com/pytorch/benchmark/pull/2187

# setting OMP threads to 16, because I'm using a c7g.4xl instance

OMP_NUM_THREADS=16 python3 run_benchmark.py cpu --model hf_DistilBert --test eval --torchdynamo inductor --freeze_prepack_weights --metrics="latencies,cpu_peak_mem"

About this issue

  • State: closed
  • Created 3 months ago
  • Comments: 18 (18 by maintainers)

Most upvoted comments

I would love to help get away from these conda packaged deps and instead use something more OS native (i.e. build in a container using what's provided via apt, yum, dnf, etc.)

I have upgraded the docker image to manylinux_2_28 and removed the conda dependency completely; everything is now installed from manylinux or pypi. This solves the libomp performance issue.

Here is the draft PR: https://github.com/pytorch/builder/pull/1781

I had to disable building the pytorch tests, via BUILD_TEST=0, to make it work; that’s where the GCC-12 breakage is coming from. I will look into it next.

ya, I’ll see if I can take a look - still working my way through building from source without conda for the CUDA based build

Right now the scripts are using the manylinux2014 OS, which is pretty old and comes with python 3.6; that’s why we had to rely on conda for a lot of packages. Given that its EOL is June 30th, 2024, I’m upgrading the docker image to manylinux_2_28, the latest one. This might remove the conda dependency. I’m trying it.

@bryantbiggs I think this is the general direction the build process is heading: no Anaconda, just use what’s in the pypa docker image

It’s indeed the libomp.so from conda in the nightly wheel that is causing the issue. I replaced it with the debian libgomp and the performance is restored, a 2x improvement. I’m checking how I can switch to libgomp for the aarch64 nightly and release wheel builds.
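To confirm which OpenMP runtime a process really ends up using after a swap like this, one option on Linux is to inspect the process memory map. This is a rough sketch, not part of the builder scripts: `loaded_openmp_libs` is a hypothetical helper, and the filename pattern is a heuristic for libgomp, libomp and libiomp5.

```python
import re

# Heuristic: the last path component is an OpenMP runtime such as libgomp.so.1,
# libomp.so, or a hash-suffixed copy like libgomp-d22c30c5.so.1.
_OPENMP_PATH = re.compile(r"/lib(?:gomp|omp|iomp5)(?:-[0-9a-f]+)?\.so[^/]*$")


def loaded_openmp_libs(maps_text):
    """Return sorted, de-duplicated OpenMP runtime paths found in /proc/<pid>/maps text."""
    found = set()
    for line in maps_text.splitlines():
        # Fields: address, perms, offset, dev, inode, then an optional pathname.
        fields = line.split(None, 5)
        if len(fields) < 6:
            continue  # anonymous mapping with no backing file
        path = fields[5].strip()
        if _OPENMP_PATH.search(path):
            found.add(path)
    return sorted(found)
```

Running `import torch` followed by `print(loaded_openmp_libs(open("/proc/self/maps").read()))` in each environment should list the runtime that was mapped into the process, which makes it easy to tell a conda libomp.so from a distribution libgomp. If more than one runtime shows up, that is worth investigating on its own.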