tensorflow: ROCM: Segmentation fault late in build process
Please make sure that this is a build/installation issue. As per our GitHub Policy, we only address code/doc bugs, performance issues, feature requests and build/installation issues on GitHub.
System information
- OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Manjaro, completely updated
- Mobile device (e.g. iPhone 8, Pixel 2, Samsung Galaxy) if the issue happens on mobile device:
- TensorFlow installed from (source or binary): compiling from source
- TensorFlow version: 2.5.0
- Python version: 3.9.5
- Installed using virtualenv? pip? conda?:
- Bazel version (if compiling from source): 4.0.0
- GCC/Compiler version (if compiling from source): 10.2.0-3
- CUDA/cuDNN version: CUDA 11.3.0-2, cuDNN 8.2.0.53-1
- GPU model and memory: AMD rx570 4GB
Describe the problem
The build fails with a segmentation fault late in the compilation.
Provide the exact sequence of commands / steps that you executed before running into the problem
Building this AUR package: https://github.com/rocm-arch/tensorflow-rocm
After many hours of compilation, the build fails with a segmentation fault.
Any other info / logs
compile tensorflow/core/kernels/mlir_generated/is_inf_gpu_f16_i1_kernel_generator_kernel.o [for host]; 2s local
compile tensorflow/core/kernels/mlir_generated/is_inf_gpu_f64_i1_kernel_generator_kernel.o [fERROR: /tmp/trizen-mario/tensorflow-rocm/src/tensorflow-2.5.0-rocm/tensorflow/core/kernels/mlir_generated/BUILD:957:23: compile tensorflow/core/kernels/mlir_generated/is_finite_gpu_f16_i1_kernel_generator_kernel.o [for host] failed: (Segmentation fault): tf_to_kernel failed: error executing command bazel-out/host/bin/tensorflow/compiler/mlir/tools/kernel_gen/tf_to_kernel '--unroll_factors=4' '--tile_sizes=256' '--arch=gfx701,gfx702,gfx803,gfx900,gfx904,gfx906,gfx908' ... (remaining 4 argument(s) skipped)
[20,437 / 21,317] 11 actions running
compile tensorflow/core/kernels/mlir_generated/is_finite_gpu_f64_i1_kernel_generator_kernel.o [for host]; 3s local
compile tensorflow/core/kernels/mlir_generated/is_inf_gpu_f16_i1_kernel_generator_kernel.o [for host]; 2s local
compile tensorflow/core/kernels/mlir_generated/is_inf_gpu_f64_i1_kernel_generator_kernel.o [for host]; 2s local
compile tensorflow/core/kernels/mlir_generated/is_nan_gpu_f16_i1_kernel_generator_kernel.o [f2021-06-18 15:06:05.299828: I tensorflow/compiler/mlir/tensorflow/utils/dump_mlir_util.cc:210] disabling MLIR crash reproducer, set env var `MLIR_CRASH_REPRODUCER_DIRECTORY` to enable.
0x556475d2e878: i1 = FP_CLASS 0x5564779f2e08, Constant:i32<504>TensorFlow crashed, please file a bug on https://github.com/tensorflow/tensorflow/issues with the trace below.
Stack dump:
0. Program arguments: bazel-out/host/bin/tensorflow/compiler/mlir/tools/kernel_gen/tf_to_kernel --unroll_factors=4 --tile_sizes=256 --arch=gfx701,gfx702,gfx803,gfx900,gfx904,gfx906,gfx908 --input=bazel-out/host/bin/tensorflow/core/kernels/mlir_generated/is_finite_gpu_f16_i1.mlir --output=bazel-out/host/bin/tensorflow/core/kernels/mlir_generated/is_finite_gpu_f16_i1_kernel_generator_kernel.o --enable_ftz=False --cpu_codegen=False
1. 2. Running pass 'CallGraph Pass Manager' on module 'acme'.
3. Running pass 'AMDGPU DAG->DAG Pattern Instruction Selection' on function '@IsFinite_GPU_DT_HALF_DT_BOOL_kernel'
Stack dump without symbol names (ensure you have llvm-symbolizer in your PATH or set the environment var `LLVM_SYMBOLIZER_PATH` to point to it):
bazel-out/host/bin/tensorflow/compiler/mlir/tools/kernel_gen/tf_to_kernel(+0x408d943)[0x556473344943]
bazel-out/host/bin/tensorflow/compiler/mlir/tools/kernel_gen/tf_to_kernel(+0x408bb0d)[0x556473342b0d]
bazel-out/host/bin/tensorflow/compiler/mlir/tools/kernel_gen/tf_to_kernel(+0x408bc94)[0x556473342c94]
/usr/lib/libpthread.so.0(+0x13870)[0x7f9763e9a870]
bazel-out/host/bin/tensorflow/compiler/mlir/tools/kernel_gen/tf_to_kernel(+0x2b070e8)[0x556471dbe0e8]
bazel-out/host/bin/tensorflow/compiler/mlir/tools/kernel_gen/tf_to_kernel(+0x18acc23)[0x556470b63c23]
bazel-out/host/bin/tensorflow/compiler/mlir/tools/kernel_gen/tf_to_kernel(+0x2a9adf2)[0x556471d51df2]
bazel-out/host/bin/tensorflow/compiler/mlir/tools/kernel_gen/tf_to_kernel(+0x2b6f3b6)[0x556471e263b6]
bazel-out/host/bin/tensorflow/compiler/mlir/tools/kernel_gen/tf_to_kernel(+0x2bb88c6)[0x556471e6f8c6]
bazel-out/host/bin/tensorflow/compiler/mlir/tools/kernel_gen/tf_to_kernel(+0x2b6fa1e)[0x556471e26a1e]
bazel-out/host/bin/tensorflow/compiler/mlir/tools/kernel_gen/tf_to_kernel(+0x2b6fb98)[0x556471e26b98]
bazel-out/host/bin/tensorflow/compiler/mlir/tools/kernel_gen/tf_to_kernel(+0x2a80a73)[0x556471d37a73]
bazel-out/host/bin/tensorflow/compiler/mlir/tools/kernel_gen/tf_to_kernel(+0x2a837e0)[0x556471d3a7e0]
bazel-out/host/bin/tensorflow/compiler/mlir/tools/kernel_gen/tf_to_kernel(+0x2a856a6)[0x556471d3c6a6]
bazel-out/host/bin/tensorflow/compiler/mlir/tools/kernel_gen/tf_to_kernel(+0x2e0f83f)[0x5564720c683f]
bazel-out/host/bin/tensorflow/compiler/mlir/tools/kernel_gen/tf_to_kernel(+0x3f03645)[0x5564731ba645]
bazel-out/host/bin/tensorflow/compiler/mlir/tools/kernel_gen/tf_to_kernel(+0x3bc50e7)[0x556472e7c0e7]
bazel-out/host/bin/tensorflow/compiler/mlir/tools/kernel_gen/tf_to_kernel(+0x3f030a1)[0x5564731ba0a1]
bazel-out/host/bin/tensorflow/compiler/mlir/tools/kernel_gen/tf_to_kernel(+0x169de1d)[0x556470954e1d]
bazel-out/host/bin/tensorflow/compiler/mlir/tools/kernel_gen/tf_to_kernel(+0x16a2bef)[0x556470959bef]
bazel-out/host/bin/tensorflow/compiler/mlir/tools/kernel_gen/tf_to_kernel(+0xbe3199)[0x55646fe9a199]
bazel-out/host/bin/tensorflow/compiler/mlir/tools/kernel_gen/tf_to_kernel(+0x361381d)[0x5564728ca81d]
bazel-out/host/bin/tensorflow/compiler/mlir/tools/kernel_gen/tf_to_kernel(+0x361394a)[0x5564728ca94a]
bazel-out/host/bin/tensorflow/compiler/mlir/tools/kernel_gen/tf_to_kernel(+0x361427b)[0x5564728cb27b]
bazel-out/host/bin/tensorflow/compiler/mlir/tools/kernel_gen/tf_to_kernel(+0x3612b6f)[0x5564728c9b6f]
bazel-out/host/bin/tensorflow/compiler/mlir/tools/kernel_gen/tf_to_kernel(+0x36133ac)[0x5564728ca3ac]
bazel-out/host/bin/tensorflow/compiler/mlir/tools/kernel_gen/tf_to_kernel(+0x361394a)[0x5564728ca94a]
bazel-out/host/bin/tensorflow/compiler/mlir/tools/kernel_gen/tf_to_kernel(+0x3615c06)[0x5564728ccc06]
bazel-out/host/bin/tensorflow/compiler/mlir/tools/kernel_gen/tf_to_kernel(+0x7f78dd)[0x55646faae8dd]
bazel-out/host/bin/tensorflow/compiler/mlir/tools/kernel_gen/tf_to_kernel(+0x6b5eb8)[0x55646f96ceb8]
/usr/lib/libc.so.6(__libc_start_main+0xd5)[0x7f9763348b25]
bazel-out/host/bin/tensorflow/compiler/mlir/tools/kernel_gen/tf_to_kernel(+0x7f060e)[0x55646faa760e]
[20,437 / 21,317] 11 actions running
compile tensorflow/core/kernels/mlir_generated/is_finite_gpu_f64_i1_kernel_generator_kernel.o [for host]; 3s local
compile tensorflow/core/kernels/mlir_generated/is_inf_gpu_f16_i1_kernel_generator_kernel.o [for host]; 2s local
compile tensorflow/core/kernels/mlir_generated/is_inf_gpu_f64_i1_kernel_generator_kernel.o [for host]; 2s local
compile tensorflow/core/kernels/mlir_generated/is_nan_gpu_f16_i1_kernel_generator_kernel.o [fERROR: /tmp/trizen-mario/tensorflow-rocm/src/tensorflow-2.5.0-rocm/tensorflow/tools/pip_package/BUILD:284:10 Middleman _middlemen/tensorflow_Stools_Spip_Upackage_Sbuild_Upip_Upackage-runfiles failed: (Segmentation fault): tf_to_kernel failed: error executing command bazel-out/host/bin/tensorflow/compiler/mlir/tools/kernel_gen/tf_to_kernel '--unroll_factors=4' '--tile_sizes=256' '--arch=gfx701,gfx702,gfx803,gfx900,gfx904,gfx906,gfx908' ... (remaining 4 argument(s) skipped)
INFO: Elapsed time: 11598.345s, Critical Path: 268.21s
INFO: 20448 processes: 1439 internal, 19009 local.
FAILED: Build did NOT complete successfully
==> ERROR: A failure occurred in build().
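The stack dump above prints only raw offsets because, as the log notes, `llvm-symbolizer` was not found on `PATH`. A sketch of how the frames could be resolved manually, assuming the bazel output tree from this build still exists and `llvm-symbolizer` is installed (the binary path and offset are taken from the log above):

```shell
# Resolve a raw frame offset from the stack dump into a symbol name.
# 0x408d943 is the first frame of the dump; repeat for the other offsets.
llvm-symbolizer \
  --obj=bazel-out/host/bin/tensorflow/compiler/mlir/tools/kernel_gen/tf_to_kernel \
  0x408d943

# Alternatively, point the crashing tool at the symbolizer directly and
# re-run, so future dumps are symbolized automatically:
export LLVM_SYMBOLIZER_PATH="$(command -v llvm-symbolizer)"
```

A symbolized trace would show which LLVM/MLIR function crashed during the `AMDGPU DAG->DAG Pattern Instruction Selection` pass, which makes the report much easier for maintainers to act on.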
About this issue
- Original URL
- State: closed
- Created 3 years ago
- Comments: 17 (11 by maintainers)
`--define=tensorflow_enable_mlir_generated_gpu_kernels=0` disables the new MLIR-generated kernels. This concerns the kernels that you use in TF eager mode; TF will instead fall back to the old Eigen-based kernels. I would expect this to hurt performance, and you may have fewer kernels and data types, i.e. the type coverage that we recently extended will not be available to you. Depending on the models, it is in principle possible that some would not run.
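A sketch of how that define could be passed to the build (the `--define` flag is from the comment above; `--config=rocm` and the pip-package target are the usual TF source-build invocation and may differ in the AUR package's PKGBUILD, which would need to be edited accordingly):

```shell
# Build the TF 2.5 pip package with the MLIR-generated GPU kernels
# disabled, falling back to the Eigen-based kernels instead.
bazel build \
  --config=rocm \
  --define=tensorflow_enable_mlir_generated_gpu_kernels=0 \
  //tensorflow/tools/pip_package:build_pip_package
```

With the define set, the `tf_to_kernel` actions that segfault above should no longer be part of the build graph.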