tensorflow: [Train on official Docker image & custom TF build (no AVX)] failed to query event: CUDA_ERROR_LAUNCH_FAILED

Please make sure that this is a bug. As per our GitHub Policy, we only address code/doc bugs, performance issues, feature requests and build/installation issues on GitHub. tag:bug_template

System information

Have I written custom code (as opposed to using a stock example script provided in TensorFlow): used https://github.com/minimaxir/gpt-2-simple for finetuning
OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Ubuntu 16.04 & Docker
Mobile device (e.g. iPhone 8, Pixel 2, Samsung Galaxy) if the issue happens on mobile device: No
TensorFlow installed from (source or binary): custom build TF (without AVX) from https://github.com/yaroslavvb/tensorflow-community-wheels/issues/109
TensorFlow version (use command below): b’v1.13.1-0-g6612da8’ 1.13.1
Python version: Python 3.5.2
Bazel version (if compiling from source): 0.21.0
GCC/Compiler version (if compiling from source): Same as docker tensorflow/tensorflow:latest-gpu-py3 & tensorflow/tensorflow:devel-gpu-py3 (error reproduce in both)
CUDA/cuDNN version: 10.0 / 7.4.1.5-1 ( https://github.com/tensorflow/tensorflow/blob/master/tensorflow/tools/dockerfiles/dockerfiles/devel-gpu.Dockerfile )
GPU model and memory: GeForce GTX 1080 Ti (11 Gb)

Describe the current behavior

root@63f592f02a0e:/gpt2# python imdb_reviews.py … 2019-05-14 07:05:22.669722: E tensorflow/stream_executor/cuda/cuda_event.cc:48] Error polling for event status: failed to query event: CUDA_ERROR_LAUNCH_FAILED: unspecified launch failure 2019-05-14 07:05:22.669812: F tensorflow/core/common_runtime/gpu/gpu_event_mgr.cc:274] Unexpected Event status: 1 Aborted (core dumped)

Describe the expected behavior Train process (tested in google colab)…

Code to reproduce the issue

# tested on devel-gpu-py3 (where tf wheel builded) and latest-gpu-py3
docker pull tensorflow/tensorflow:latest-gpu-py3

# deep into docker image
docker run --runtime=nvidia -it  -v ~/projects/gpt2-simple:/gpt2 tensorflow/tensorflow:latest-gpu-py3 bash

In Docker:

pip uninstall tensorflow-gpu

cd /gpt2
# from https://github.com/yaroslavvb/tensorflow-community-wheels/issues/109
pip install tensorflow-1.13.1-cp35-cp35m-linux_x86_64.whl

# install gpt2-simple
pip install gpt_2_simple


# from https://gist.github.com/saippuakauppias/4f41ce1072a04588a2bab7dae00f9bb7
python imdb_reviews.py

Other info / logs

tf_env.txt attached.
Its error looks like https://github.com/tensorflow/tensorflow/issues/22477 but I dont understand where solution in that issue.
docker run without -u $(id -u):$(id -g) because with this I cant uninstall tensorflow:

Example

docker run --runtime=nvidia -u $(id -u):$(id -g) -v ~/projects/gpt2-simple:/gpt2 -it tensorflow/tensorflow:latest-gpu-py3 bash

You are running this container as user with ID 1000 and group 1000,
which should map to the ID and group for your user on the Docker host. Great!

tf-docker / > pip uninstall tensorflow-gpu
WARNING: The directory '/.cache/pip/http' or its parent directory is not owned by the current user and the cache has been disabled. Please check the permissions and owner of that directory. If executing pip with sudo, you may want sudo's -H flag.
Uninstalling tensorflow-gpu-1.13.1:
  Would remove:
    /usr/local/bin/freeze_graph
    /usr/local/bin/saved_model_cli
    /usr/local/bin/tensorboard
    /usr/local/bin/tf_upgrade_v2
    /usr/local/bin/tflite_convert
    /usr/local/bin/toco
    /usr/local/bin/toco_from_protos
    /usr/local/lib/python3.5/dist-packages/tensorflow/*
    /usr/local/lib/python3.5/dist-packages/tensorflow_gpu-1.13.1.dist-info/*
Proceed (y/n)? y
ERROR: Exception:
Traceback (most recent call last):
  File "/usr/lib/python3.5/shutil.py", line 538, in move
    os.rename(src, real_dst)
PermissionError: [Errno 13] Permission denied: '/usr/local/bin/freeze_graph' -> '/tmp/pip-uninstall-n9t_qerr/freeze_graph'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.5/dist-packages/pip/_internal/cli/base_command.py", line 178, in main
    status = self.run(options, args)
  File "/usr/local/lib/python3.5/dist-packages/pip/_internal/commands/uninstall.py", line 75, in run
    auto_confirm=options.yes, verbose=self.verbosity > 0,
  File "/usr/local/lib/python3.5/dist-packages/pip/_internal/req/req_install.py", line 825, in uninstall
    uninstalled_pathset.remove(auto_confirm, verbose)
  File "/usr/local/lib/python3.5/dist-packages/pip/_internal/req/req_uninstall.py", line 388, in remove
    moved.stash(path)
  File "/usr/local/lib/python3.5/dist-packages/pip/_internal/req/req_uninstall.py", line 277, in stash
    renames(path, new_path)
  File "/usr/local/lib/python3.5/dist-packages/pip/_internal/utils/misc.py", line 305, in renames
    shutil.move(old, new)
  File "/usr/lib/python3.5/shutil.py", line 553, in move
    os.unlink(src)
PermissionError: [Errno 13] Permission denied: '/usr/local/bin/freeze_graph'

Tensorflow debugger with mnist works fine and use GPU (I see it in nvidia-smi).

Log

root@4697d32838f4:/gpt2# python -m tensorflow.python.debug.examples.debug_mnist
WARNING:tensorflow:From /usr/local/lib/python3.5/dist-packages/tensorflow/python/debug/examples/debug_mnist.py:46: read_data_sets (from tensorflow.contrib.learn.python.learn.datasets.mnist) is deprecated and will be removed in a future version.
Instructions for updating:
Please use alternatives such as official/mnist/dataset.py from tensorflow/models.
WARNING:tensorflow:From /usr/local/lib/python3.5/dist-packages/tensorflow/contrib/learn/python/learn/datasets/mnist.py:260: maybe_download (from tensorflow.contrib.learn.python.learn.datasets.base) is deprecated and will be removed in a future version.
Instructions for updating:
Please write your own downloading logic.
WARNING:tensorflow:From /usr/local/lib/python3.5/dist-packages/tensorflow/contrib/learn/python/learn/datasets/mnist.py:262: extract_images (from tensorflow.contrib.learn.python.learn.datasets.mnist) is deprecated and will be removed in a future version.
Instructions for updating:
Please use tf.data to implement this functionality.
Extracting /tmp/mnist_data/train-images-idx3-ubyte.gz
WARNING:tensorflow:From /usr/local/lib/python3.5/dist-packages/tensorflow/contrib/learn/python/learn/datasets/mnist.py:267: extract_labels (from tensorflow.contrib.learn.python.learn.datasets.mnist) is deprecated and will be removed in a future version.
Instructions for updating:
Please use tf.data to implement this functionality.
Extracting /tmp/mnist_data/train-labels-idx1-ubyte.gz
WARNING:tensorflow:From /usr/local/lib/python3.5/dist-packages/tensorflow/contrib/learn/python/learn/datasets/mnist.py:110: dense_to_one_hot (from tensorflow.contrib.learn.python.learn.datasets.mnist) is deprecated and will be removed in a future version.
Instructions for updating:
Please use tf.one_hot on tensors.
Extracting /tmp/mnist_data/t10k-images-idx3-ubyte.gz
Extracting /tmp/mnist_data/t10k-labels-idx1-ubyte.gz
WARNING:tensorflow:From /usr/local/lib/python3.5/dist-packages/tensorflow/contrib/learn/python/learn/datasets/mnist.py:290: DataSet.__init__ (from tensorflow.contrib.learn.python.learn.datasets.mnist) is deprecated and will be removed in a future version.
Instructions for updating:
Please use alternatives such as official/mnist/dataset.py from tensorflow/models.
2019-05-14 17:07:28.376774: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2800000000 Hz
2019-05-14 17:07:28.378926: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x6115c40 executing computations on platform Host. Devices:
2019-05-14 17:07:28.378987: I tensorflow/compiler/xla/service/service.cc:158]   StreamExecutor device (0): <undefined>, <undefined>
2019-05-14 17:07:28.598760: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:998] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-05-14 17:07:28.601785: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x61c1dd0 executing computations on platform CUDA. Devices:
2019-05-14 17:07:28.601829: I tensorflow/compiler/xla/service/service.cc:158]   StreamExecutor device (0): GeForce GTX 1080 Ti, Compute Capability 6.1
2019-05-14 17:07:28.602453: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1433] Found device 0 with properties: 
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.6575
pciBusID: 0000:0b:00.0
totalMemory: 10.92GiB freeMemory: 10.76GiB
2019-05-14 17:07:28.602502: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1512] Adding visible gpu devices: 0
2019-05-14 17:07:28.605463: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-05-14 17:07:28.605502: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990]      0 
2019-05-14 17:07:28.605525: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 0:   N 
2019-05-14 17:07:28.605978: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 10470 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:0b:00.0, compute capability: 6.1)
WARNING:tensorflow:From /usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/op_def_library.py:263: colocate_with (from tensorflow.python.framework.ops) is deprecated and will be removed in a future version.
Instructions for updating:
Colocations handled automatically by placer.
2019-05-14 17:07:29.756412: I tensorflow/stream_executor/dso_loader.cc:152] successfully opened CUDA library libcublas.so.10.0 locally
Accuracy at step 0: 0.1094
Accuracy at step 1: 0.098
Accuracy at step 2: 0.098
Accuracy at step 3: 0.098
Accuracy at step 4: 0.098
Accuracy at step 5: 0.098
Accuracy at step 6: 0.098
Accuracy at step 7: 0.098
Accuracy at step 8: 0.098
Accuracy at step 9: 0.098

tensorflow: [Train on official Docker image & custom TF build (no AVX)] failed to query event: CUDA_ERROR_LAUNCH_FAILED

About this issue

Most upvoted comments