cmssw: significant slow-down of tensorflow on non-AVX machine(s)
Originally from https://mattermost.web.cern.ch/cms-o-and-c/pl/zrtbufg8zbb9jgspeuxef183rc
I learned that TF inference is much slower on an older AMD machine than on an Intel one.
Both machines were running the same inputs in a slightly older release, where I had input data and where igprof was still working fine.
One example call to mkldnn_sgemm shows a very large difference between the two cases, roughly a factor of 1000 cheaper on Intel (compare the "% total" column):
https://slava77sk.web.cern.ch/slava77sk/reco/cgi-bin/igprof-navigator/sw-112X/CMSSW_11_2_0_pre7-orig-gcc820.TTbar_14UP21+DIGIPRMX.AVE_50_BX_25ns.1000.pp.int34/2651
[From @makortel ] Some slowdown was observed e.g. in https://mathematica.stackexchange.com/questions/64645/mkl-on-intel-vs-amd
I have a suspicion that we are using https://github.com/oneapi-src/oneDNN/blob/v1.0.4/src/cpu/gemm/gemm.cpp
Here mkldnn_sgemm calls extended_sgemm, which in turn chooses between gemm_driver [igprof cost 0.02%] and ref_gemm<float> [igprof cost 30%].
If that's correct, then my analysis is that mkldnn_sgemm is common to both cases, and the difference really comes down to this one dispatch, which selects the implementation based on the SSE4.1 flag.
The resulting speed difference of close to 1000x does not look reasonable. A better understanding of what we actually compile here would help to confirm this. (It may be straightforward to modify the code and confirm more directly that ref_gemm really is that slow.)
Goals towards resolving the issue:
- understand which oneDNN is used to compile our tensorflow
- see if there is a faster solution for the older arch (pre-SSE4.1?)
About this issue
- Original URL
- State: open
- Created 3 years ago
- Comments: 31 (31 by maintainers)
Commits related to this issue
- [11.3] Build tensorflow to check mkldnn version see https://github.com/cms-sw/cmssw/issues/33442 — committed to cms-sw/cmsdist by mrodozov 3 years ago
- enable profiling