tensorflow: Segmentation fault with TF 1.7 built from source with MKL support

System information

  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow): NO
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Amazon Linux (Linux version 4.9.85-38.58.amzn1.x86_64)
  • TensorFlow installed from (source or binary): source
  • TensorFlow version (use command below): 1.7.0
  • Python version: 2.7
  • Bazel version (if compiling from source): 0.11.1
  • GCC/Compiler version (if compiling from source): 5.3.0
  • CUDA/cuDNN version: None
  • GPU model and memory: None
  • Exact command to reproduce:
git clone -b r1.7 https://github.com/tensorflow/tensorflow.git
cd tensorflow
./configure
bazel build --jobs $(nproc) --config=opt --config=mkl --cxxopt="-D_GLIBCXX_USE_CXX11_ABI=0" --copt="-DEIGEN_USE_VML" //tensorflow/tools/pip_package:build_pip_package

git clone https://github.com/tensorflow/models.git
cd models/tutorials/image/cifar10/
python cifar10_train.py

Result:

[ec2-user@ip-xxx-xx-xx-xxx cifar10]$ python cifar10_train.py
Filling queue with 20000 CIFAR images before starting to train. This will take a few minutes.
Segmentation fault

Describe the problem

I compiled TF 1.7 from source with MKL and AVX512 support on my AWS EC2 C5.18xlarge instance. Running cifar10_train.py then crashes with the segmentation fault shown above. The same error appeared with TF 1.5, was fixed in TF 1.6, and has now returned in 1.7. When I compile TF without AVX512 using the following command, the error does not occur:

bazel build --jobs $(nproc) --cxxopt="-D_GLIBCXX_USE_CXX11_ABI=0" --config=mkl --copt=-msse4.1 --copt=-msse4.2 --copt=-mavx2 --copt=-mavx --copt=-mfma --copt="-DEIGEN_USE_VML" //tensorflow/tools/pip_package:build_pip_package

So I’m guessing this issue is related to AVX512.
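To check which AVX-512 features the build machine actually reports, one option is to read the CPU flags from /proc/cpuinfo. The following is a minimal sketch, not part of the original report: the helper name `avx512_flags` is hypothetical, and it assumes an x86 Linux host where the feature list is on a line beginning with `flags`. It only shows what the CPU advertises; it does not confirm that AVX512 code paths cause the crash.

```python
def avx512_flags(cpuinfo_text):
    """Return the sorted AVX-512 feature flags found in /proc/cpuinfo text.

    Hypothetical helper for illustration; assumes the x86 Linux layout
    where CPU features are listed on a line starting with "flags".
    """
    for line in cpuinfo_text.splitlines():
        if line.startswith("flags"):
            # Everything after the first ":" is a space-separated flag list.
            _, _, value = line.partition(":")
            return sorted(f for f in value.split() if f.startswith("avx512"))
    return []


if __name__ == "__main__":
    try:
        with open("/proc/cpuinfo") as fh:
            print(avx512_flags(fh.read()))
    except IOError:
        # Not Linux, or /proc is unavailable.
        print("could not read /proc/cpuinfo")
```

An empty list would suggest the host does not advertise AVX-512 at all, in which case the guess above would need revisiting.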

About this issue

  • State: closed
  • Created 6 years ago
  • Comments: 15 (13 by maintainers)

Most upvoted comments

Can you get a backtrace from the segfault?

ulimit -Sc unlimited
<crashy script>
gdb python
(gdb) core core
(gdb) bt
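The first two steps above (raising the core-dump limit and running the crashing script) can also be sketched in Python, which makes it easy to confirm that the child really died from SIGSEGV before opening gdb. This is an illustrative sketch under stated assumptions: the names `run_and_check_segfault` and `_enable_core_dumps` are hypothetical, it is POSIX-only, and it raises the soft limit to the current hard limit rather than to "unlimited", so it cannot exceed what the system allows.

```python
import resource
import signal
import subprocess
import sys


def _enable_core_dumps():
    # Runs in the child before exec: raise the soft core-size limit to the
    # hard limit (similar in spirit to `ulimit -Sc unlimited`).
    _, hard = resource.getrlimit(resource.RLIMIT_CORE)
    resource.setrlimit(resource.RLIMIT_CORE, (hard, hard))


def run_and_check_segfault(argv):
    """Run argv as a child process; return True if it was killed by SIGSEGV."""
    proc = subprocess.Popen(argv, preexec_fn=_enable_core_dumps)
    proc.wait()
    # On POSIX, a negative return code means the child was killed by that signal.
    return proc.returncode == -signal.SIGSEGV


if __name__ == "__main__" and len(sys.argv) > 1:
    # Example: python this_file.py cifar10_train.py
    if run_and_check_segfault([sys.executable] + sys.argv[1:]):
        print("child segfaulted; check for a core file and load it in gdb")
```

Whether and where a core file is written still depends on the system's core pattern settings, so the gdb steps above remain the way to obtain the native backtrace.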