cmssw: significant slow-down of tensorflow on non-AVX machine(s)
Originally from https://mattermost.web.cern.ch/cms-o-and-c/pl/zrtbufg8zbb9jgspeuxef183rc
I learned that TF inference is much slower on an older AMD machine than on an Intel one.
Both machines were running the same inputs in a slightly older release, where I had input data and where igprof was still working fine.
One example call to mkldnn_sgemm shows a very large difference between the two cases, roughly a factor of 1000 cheaper on Intel (compare the "% total" column):
https://slava77sk.web.cern.ch/slava77sk/reco/cgi-bin/igprof-navigator/sw-112X/CMSSW_11_2_0_pre7-orig-gcc820.TTbar_14UP21+DIGIPRMX.AVE_50_BX_25ns.1000.pp.int34/2651
[From @makortel ] Some slowdown was observed e.g. in https://mathematica.stackexchange.com/questions/64645/mkl-on-intel-vs-amd
I have a suspicion that we are using https://github.com/oneapi-src/oneDNN/blob/v1.0.4/src/cpu/gemm/gemm.cpp
Here mkldnn_sgemm calls extended_sgemm, which in turn chooses between gemm_driver [igprof cost 0.02%] and ref_gemm<float> [igprof cost 30%].
If that's correct, then my analysis is that mkldnn_sgemm is common to both cases, and the difference really comes down to this one dispatch, which selects the implementation based on the SSE4.1 flag.
The resulting speed difference of close to 1000x does not look reasonable. A better understanding of what we actually compile here would help to confirm this. (It may be straightforward to modify the code and confirm more directly that ref_gemm really is that slow.)
Goals towards resolving the issue:
- understand which oneDNN is used to compile our tensorflow
- see if there is a faster solution for the older arch (pre-SSE4.1?)
About this issue
- Original URL
- State: open
- Created 3 years ago
- Comments: 31 (31 by maintainers)
Commits related to this issue
- [11.3] Build tensorflow to check mkldnn version see https://github.com/cms-sw/cmssw/issues/33442 — committed to cms-sw/cmsdist by mrodozov 3 years ago
- enable profiling