tensorflow: Segmentation fault with TF 1.7 built from source with MKL support
System information
- Have I written custom code (as opposed to using a stock example script provided in TensorFlow): NO
- OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Amazon Linux (Linux version 4.9.85-38.58.amzn1.x86_64)
- TensorFlow installed from (source or binary): source
- TensorFlow version (use command below): 1.7.0
- Python version: 2.7
- Bazel version (if compiling from source): 0.11.1
- GCC/Compiler version (if compiling from source): 5.3.0
- CUDA/cuDNN version: None
- GPU model and memory: None
- Exact command to reproduce:
git clone -b r1.7 https://github.com/tensorflow/tensorflow.git
cd tensorflow
./configure
bazel build --jobs $(nproc) --config=opt --config=mkl --cxxopt="-D_GLIBCXX_USE_CXX11_ABI=0" --copt="-DEIGEN_USE_VML" //tensorflow/tools/pip_package:build_pip_package
git clone https://github.com/tensorflow/models.git
cd models/tutorials/image/cifar10/
python cifar10_train.py
Result:
[ec2-user@ip-xxx-xx-xx-xxx cifar10]$ python cifar10_train.py
Filling queue with 20000 CIFAR images before starting to train. This will take a few minutes.
Segmentation fault
Describe the problem
I tried to compile TF 1.7 from source with MKL and AVX512 support on my AWS EC2 C5.18xlarge instance. This Segmentation fault appeared when running cifar10_train.py. This error appeared with TF1.5 as well and TF 1.6 fixed it, but now it appears again with 1.7. I have tried to compile TF without AVX512 with following commands and this error didn’t appear again:
bazel build --jobs $(nproc) --cxxopt="-D_GLIBCXX_USE_CXX11_ABI=0" --config=mkl --copt=-msse4.1 --copt=-msse4.2 --copt=-mavx2 --copt=-mavx --copt=-mfma --copt="-DEIGEN_USE_VML" //tensorflow/tools/pip_package:build_pip_package
So I’m guessing this issue is related with AVX512.
About this issue
- Original URL
- State: closed
- Created 6 years ago
- Comments: 15 (13 by maintainers)
Can you do backtrace on segfault?