tensorflow: Performance problem with TensorFlow training
Hello, we are running a not-very-complex 3D convolution problem and we are getting extremely poor performance. Here is a summary of our problem.
Now the technical part.
I am running on a Haswell CPU in a Mac running macOS High Sierra.
- Model Name: MacBook Pro
- Model Identifier: MacBookPro11,5
- Processor Name: Intel Core i7
- Processor Speed: 2.5 GHz
- Number of Processors: 1
- Total Number of Cores: 4
- L2 Cache (per Core): 256 KB
- L3 Cache: 6 MB
- Memory: 16 GB
TensorFlow performance
1.) Memory allocation
Memory allocation seems highly unoptimized. I see ~80 GB of total allocations (78M allocation events), of which 37 GB remain persistent (corresponding to 575k permanent allocations). The memory churn is enormous and may very seriously affect performance. Most of these are very small allocations/deallocations, which happen here:
Eigen::NonBlockingThreadPoolTempl<tensorflow::thread::EigenEnvironment>::Schedule(std::__1::function<void ()>)
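Schedule() takes its task as a std::function by value, and any capturing closure too big for the small-buffer optimisation heap-allocates, so one tiny allocation per scheduled task is exactly the churn pattern above. A minimal standalone sketch (my own illustration, not TensorFlow code) that makes the per-task allocation visible:

```cpp
#include <cstdio>
#include <cstdlib>
#include <functional>
#include <new>

static long g_allocs = 0;

void* operator new(std::size_t n) {
  ++g_allocs;  // count every heap allocation
  if (void* p = std::malloc(n)) return p;
  throw std::bad_alloc();
}
void operator delete(void* p) noexcept { std::free(p); }

int main() {
  const long before = g_allocs;
  for (int i = 0; i < 1000; ++i) {
    // Capture enough state to defeat std::function's small-buffer
    // optimisation, as a closure carrying block descriptors would.
    double a = i, b = 2.0 * i, c = 3.0 * i, d = 4.0 * i;
    std::function<void()> task = [a, b, c, d] { (void)(a + b + c + d); };
    task();
  }
  std::printf("heap allocations for 1000 tasks: %ld\n", g_allocs - before);
}
```

On both libstdc++ and libc++ the four captured doubles exceed the small buffer, so this reports roughly one allocation per task.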
I have tried running with Google's tcmalloc, hoping to improve memory allocation and handling. tcmalloc complains about the following "large allocations" from TF even before the epochs start: (23+23+2+18+4+9+2) = 81 GB allocated before the first epoch. After that my disk fills up with swap files and my machine dies.
Then I went back to running with the Mac allocator, which, surprisingly, seems to be more robust.
2.) Code performance
A careful VTune analysis performed by Sofia identified Eigen as the major source of CPU consumption. All the time is wasted simply in repacking (gemm_pack_rhs).
To look at the code I attempted to compile with -g; however, the default compilation uses -g0 and I have not yet found a way to override this default in bazel. I added -g3, which according to the manual (and to a small test I made) should override -g0. However, Instruments (the Mac profiler, a poor relation of VTune) could not locate the source. The library in question should be _pywrap_tensorflow_internal.so. So I went looking for the source myself and found that gemm_pack_rhs::operator() is defined in the following files:
./bazel-tensorflow/external/eigen_archive/Eigen/src/Core/products/GeneralBlockPanelKernel.h
./bazel-tensorflow/third_party/eigen3/unsupported/Eigen/CXX11/src/FixedPoint/MatMatProductAVX2.h
./third_party/eigen3/unsupported/Eigen/CXX11/src/FixedPoint/MatMatProductAVX2.h
The last two are identical. By putting in good old printf's I discovered we are calling the GeneralBlockPanelKernel.h version. The operator works with packets of size 8, which is right for AVX2 (256-bit) and the float32 we are using. However, I am not sure the compiler manages to vectorize this procedure. Indeed, most of the time is spent on line 559 of eigen_volume_patch.h:
for (int i = 0; i < PacketSize; ++i) {
  values[i] = coeff(index + i);
}
The packet structure of the code is meant to have each packet treated as a unit; this for loop simply destroys all possibility of vectorization. There is a lot of room for optimization in TensorFlow before we get really serious about performance with a problem like ours. But who is going to pick up the tab?
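To make the contrast concrete, here is a small standalone sketch (my own, assuming Eigen is built with AVX, i.e. compiled with -mavx; the function names are mine) of the two ways a Packet8f can be filled:

```cpp
#include <Eigen/Core>

using Eigen::internal::Packet8f;  // AVX: 8 x float32 in one 256-bit register

// Contiguous case: a single unaligned 256-bit load fills the whole packet.
Packet8f load_contiguous(const float* src, long index) {
  return Eigen::internal::ploadu<Packet8f>(src + index);
}

// Scattered case (the eigen_volume_patch.h fallback): eight scalar loads
// into a staging buffer, then one aligned load -- this is the per-lane
// loop the profiler shows us stuck in.
Packet8f load_scattered(const float* src, const long offsets[8]) {
  alignas(32) float values[8];
  for (int i = 0; i < 8; ++i) values[i] = src[offsets[i]];
  return Eigen::internal::pload<Packet8f>(values);
}

int main() {
  alignas(32) float data[16] = {0};
  const long offs[8] = {0, 2, 4, 6, 8, 10, 12, 14};
  Packet8f a = load_contiguous(data, 0);
  Packet8f b = load_scattered(data, offs);
  (void)a; (void)b;
}
```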
3.) MKL or not MKL.
When I brought up this problem, I was told that TensorFlow on the Mac does not support MKL, and that therefore my findings were not entirely relevant. MKL for Mac does exist; however, clang does not support OpenMP (or rather, the default clang distributed with the Mac does not have OpenMP support enabled). So the only way to compile TensorFlow on the Mac with MKL was to change compiler.
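(A quick way to verify that claim is an OpenMP smoke test, my own check rather than part of the TF build: Apple's stock clang rejects -fopenmp outright, while gcc builds and runs this multithreaded with -fopenmp.)

```cpp
#include <cstdio>
#ifdef _OPENMP
#include <omp.h>
#endif

int main() {
#ifdef _OPENMP
  // With OpenMP enabled, each worker thread prints its id.
#pragma omp parallel
  std::printf("thread %d of %d\n", omp_get_thread_num(), omp_get_num_threads());
#else
  std::printf("compiled without OpenMP support\n");
#endif
  return 0;
}
```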
Unfortunately, changing compiler with bazel on the Mac seems a very ambitious proposition. After posting to and perusing StackOverflow, the bazel forum and the tensorflow forum, I came to the following recipe:
export BAZEL_USE_CPP_ONLY_TOOLCHAIN=1
export CC=/path/to/compiler
bazel build […]
This does indeed force bazel to use a new compiler; however, controlling the compiler switches is much more complicated. The two compiler flags -Wthread-safety and -Wself-assign, as well as the linker flags -no-as-needed and -z,relro,-z,now, are not accepted by g++ and its linker. The CROSSTOOL files are automatically generated from CROSSTOOL.tpl templates during configuration. The only occurrences of (for instance) -Wself-assign in the TF code are in:
third_party/gpus/crosstool/CROSSTOOL_nvcc.tpl
third_party/toolchains/clang6/CROSSTOOL.tpl
third_party/toolchains/cpus/arm/CROSSTOOL.tpl
but even if I comment out the lines:

-  compiler_flag: "-Wthread-safety"
-  compiler_flag: "-Wself-assign"
+# compiler_flag: "-Wthread-safety"
+# compiler_flag: "-Wself-assign"
in all three of them, "something" regenerates a CROSSTOOL with these flags in it. The hack I am trying now is to run configure and then edit the file
./bazel-tensorflow/external/local_config_cc/cc_wrapper.sh
adding the following line
/sw/bin/gcc-fsf-7 $(echo "$@" | sed -e 's/-Wself-assign//' | sed -e 's/-Wthread-safety//' | sed -e 's/-Wl,-no-as-needed//' | sed -e 's/-Wl,-z,relro,-z,now//')
which is a very poor hack.
With this I could build a version of TensorFlow using the Mac MKL, but to no avail: performance is still abysmal, with the same bottleneck.
Thanks for reading up to here…
About this issue
- Original URL
- State: closed
- Created 6 years ago
- Reactions: 3
- Comments: 27 (11 by maintainers)
@fcacarminati Unfortunately, in your case the volume patch data is scattered across all of the input buffer, and there are no contiguous blocks of 8+ scalars, so the TensorVolumePatchOp packet access has to construct each packet by loading scalars one by one.
One option to make it faster is to create a custom LHS/RHS packer, similar to what you can find in eigen_spatial_convolutions. I'll take a look at it; it seems like a relatively simple thing. That change made Conv2D ~2x faster, so I expect somewhat similar results for Conv3D.
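Roughly, the idea is that the generic packer pulls every coefficient through coeff(), while a custom packer can recognise runs that are contiguous in the input and copy them in bulk. A simplified standalone sketch of that contrast (placeholder types and names, not the actual eigen_spatial_convolutions code):

```cpp
#include <cstring>
#include <vector>

// `Patch` and both pack_* functions are illustrative placeholders, not the
// Eigen/TensorFlow API.
struct Patch {
  const float* base;  // start of the patch's innermost row in the input
  int row_len;        // number of scalars that are contiguous in memory
};

// Generic gather: one scalar copy per coefficient (today's hot loop).
void pack_scalar(float* dst, const Patch& p) {
  for (int i = 0; i < p.row_len; ++i) dst[i] = p.base[i];
}

// Specialised packer: the row is known to be contiguous, so one bulk copy
// (which the compiler lowers to wide vector loads/stores) replaces the
// per-element loop.
void pack_contiguous(float* dst, const Patch& p) {
  std::memcpy(dst, p.base, sizeof(float) * p.row_len);
}

int main() {
  std::vector<float> input(64, 1.0f), packed(8);
  Patch p{input.data() + 16, 8};
  pack_contiguous(packed.data(), p);  // fast path
  pack_scalar(packed.data(), p);      // slow path, same result
}
```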