tensorflow: [Train on official Docker image & custom TF build (no AVX)] failed to query event: CUDA_ERROR_LAUNCH_FAILED
System information
- Have I written custom code (as opposed to using a stock example script provided in TensorFlow): used https://github.com/minimaxir/gpt-2-simple for finetuning
- OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Ubuntu 16.04 & Docker
- Mobile device (e.g. iPhone 8, Pixel 2, Samsung Galaxy) if the issue happens on mobile device: No
- TensorFlow installed from (source or binary): custom build TF (without AVX) from https://github.com/yaroslavvb/tensorflow-community-wheels/issues/109
- TensorFlow version (use command below): b'v1.13.1-0-g6612da8' 1.13.1
- Python version: Python 3.5.2
- Bazel version (if compiling from source): 0.21.0
- GCC/Compiler version (if compiling from source): Same as in the Docker images tensorflow/tensorflow:latest-gpu-py3 and tensorflow/tensorflow:devel-gpu-py3 (the error reproduces in both)
- CUDA/cuDNN version: 10.0 / 7.4.1.5-1 (https://github.com/tensorflow/tensorflow/blob/master/tensorflow/tools/dockerfiles/dockerfiles/devel-gpu.Dockerfile)
- GPU model and memory: GeForce GTX 1080 Ti (11 GB)
Describe the current behavior
root@63f592f02a0e:/gpt2# python imdb_reviews.py
…
2019-05-14 07:05:22.669722: E tensorflow/stream_executor/cuda/cuda_event.cc:48] Error polling for event status: failed to query event: CUDA_ERROR_LAUNCH_FAILED: unspecified launch failure
2019-05-14 07:05:22.669812: F tensorflow/core/common_runtime/gpu/gpu_event_mgr.cc:274] Unexpected Event status: 1
Aborted (core dumped)
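CUDA kernel launches are asynchronous, so a launch failure like the one above may be reported some time after the kernel that actually faulted. Below is a minimal sketch of one way to narrow it down, assuming the standard CUDA environment variable `CUDA_LAUNCH_BLOCKING=1` takes effect with this build. The helper names `debug_env` and `run_with_blocking_launches` are hypothetical and are not part of TensorFlow or gpt-2-simple.

```python
import os
import subprocess
import sys


def debug_env(base_env=None):
    """Return a copy of the environment with synchronous CUDA launches enabled.

    CUDA_LAUNCH_BLOCKING is a CUDA runtime setting; whether it changes the
    error reported here is an assumption, not something verified in this issue.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    return env


def run_with_blocking_launches(script_args):
    """Run `python <script_args...>` in a child process with the debug environment."""
    completed = subprocess.run([sys.executable] + list(script_args), env=debug_env())
    return completed.returncode
```

For example, `run_with_blocking_launches(["imdb_reviews.py"])` would rerun the script from this report with synchronous launches. This may help localize the failing op; it is not a fix.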
Describe the expected behavior
The training process runs normally (tested in Google Colab)…
Code to reproduce the issue
# tested on devel-gpu-py3 (where the TF wheel was built) and latest-gpu-py3
docker pull tensorflow/tensorflow:latest-gpu-py3
# start a shell inside the Docker container
docker run --runtime=nvidia -it -v ~/projects/gpt2-simple:/gpt2 tensorflow/tensorflow:latest-gpu-py3 bash
In Docker:
pip uninstall tensorflow-gpu
cd /gpt2
# from https://github.com/yaroslavvb/tensorflow-community-wheels/issues/109
pip install tensorflow-1.13.1-cp35-cp35m-linux_x86_64.whl
# install gpt2-simple
pip install gpt_2_simple
# from https://gist.github.com/saippuakauppias/4f41ce1072a04588a2bab7dae00f9bb7
python imdb_reviews.py
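Before running the training script, it may help to confirm which TensorFlow build is actually imported inside the container and whether it sees the GPU. Below is a minimal sketch, assuming the TF 1.x calls `tf.test.is_built_with_cuda` and `tf.test.is_gpu_available`. The function name `tf_build_summary` is hypothetical.

```python
def tf_build_summary():
    """Summarize the importable TensorFlow build, or return None if it is missing."""
    try:
        import tensorflow as tf
    except ImportError:
        return None
    return {
        "version": tf.__version__,
        # Not every build is guaranteed to expose this attribute, so fall back to None.
        "git_version": getattr(tf, "__git_version__", None),
        "built_with_cuda": tf.test.is_built_with_cuda(),
        # Initializes the GPU devices as a side effect and logs device info.
        "gpu_available": tf.test.is_gpu_available(),
    }


if __name__ == "__main__":
    print(tf_build_summary())
```

If the custom wheel was installed correctly, the reported version should match the one listed under System information. A successful check here does not rule out the launch failure, which occurs later during training.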
Other info / logs
- tf_env.txt attached.
- The error looks like https://github.com/tensorflow/tensorflow/issues/22477, but I don't understand what the solution in that issue is.
- I run docker without -u $(id -u):$(id -g), because with that option I can't uninstall tensorflow:
Example
docker run --runtime=nvidia -u $(id -u):$(id -g) -v ~/projects/gpt2-simple:/gpt2 -it tensorflow/tensorflow:latest-gpu-py3 bash
You are running this container as user with ID 1000 and group 1000,
which should map to the ID and group for your user on the Docker host. Great!
tf-docker / > pip uninstall tensorflow-gpu
WARNING: The directory '/.cache/pip/http' or its parent directory is not owned by the current user and the cache has been disabled. Please check the permissions and owner of that directory. If executing pip with sudo, you may want sudo's -H flag.
Uninstalling tensorflow-gpu-1.13.1:
Would remove:
/usr/local/bin/freeze_graph
/usr/local/bin/saved_model_cli
/usr/local/bin/tensorboard
/usr/local/bin/tf_upgrade_v2
/usr/local/bin/tflite_convert
/usr/local/bin/toco
/usr/local/bin/toco_from_protos
/usr/local/lib/python3.5/dist-packages/tensorflow/*
/usr/local/lib/python3.5/dist-packages/tensorflow_gpu-1.13.1.dist-info/*
Proceed (y/n)? y
ERROR: Exception:
Traceback (most recent call last):
File "/usr/lib/python3.5/shutil.py", line 538, in move
os.rename(src, real_dst)
PermissionError: [Errno 13] Permission denied: '/usr/local/bin/freeze_graph' -> '/tmp/pip-uninstall-n9t_qerr/freeze_graph'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/lib/python3.5/dist-packages/pip/_internal/cli/base_command.py", line 178, in main
status = self.run(options, args)
File "/usr/local/lib/python3.5/dist-packages/pip/_internal/commands/uninstall.py", line 75, in run
auto_confirm=options.yes, verbose=self.verbosity > 0,
File "/usr/local/lib/python3.5/dist-packages/pip/_internal/req/req_install.py", line 825, in uninstall
uninstalled_pathset.remove(auto_confirm, verbose)
File "/usr/local/lib/python3.5/dist-packages/pip/_internal/req/req_uninstall.py", line 388, in remove
moved.stash(path)
File "/usr/local/lib/python3.5/dist-packages/pip/_internal/req/req_uninstall.py", line 277, in stash
renames(path, new_path)
File "/usr/local/lib/python3.5/dist-packages/pip/_internal/utils/misc.py", line 305, in renames
shutil.move(old, new)
File "/usr/lib/python3.5/shutil.py", line 553, in move
os.unlink(src)
PermissionError: [Errno 13] Permission denied: '/usr/local/bin/freeze_graph'
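The traceback above is consistent with the non-root container user lacking write permission on the directories that pip needs to modify. Below is a minimal sketch for checking this up front. The helper name `unwritable_parents` is hypothetical, and the check is an approximation: it ignores sticky bits and other filesystem restrictions.

```python
import os


def unwritable_parents(paths):
    """Return the sorted parent directories the current user cannot modify.

    Removing or renaming a file requires write and search permission on its
    parent directory, which is what os.access checks here. Directories that
    do not exist are also reported.
    """
    parents = sorted({os.path.dirname(p) for p in paths})
    return [d for d in parents if not os.access(d, os.W_OK | os.X_OK)]
```

Calling it with paths from the uninstall listing, for example `unwritable_parents(["/usr/local/bin/freeze_graph", "/usr/local/lib/python3.5/dist-packages/tensorflow"])`, lists the directories that would block the uninstall for the current user.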
- The TensorFlow debugger MNIST example works fine and uses the GPU (I can see it in nvidia-smi).
Log
root@4697d32838f4:/gpt2# python -m tensorflow.python.debug.examples.debug_mnist
WARNING:tensorflow:From /usr/local/lib/python3.5/dist-packages/tensorflow/python/debug/examples/debug_mnist.py:46: read_data_sets (from tensorflow.contrib.learn.python.learn.datasets.mnist) is deprecated and will be removed in a future version.
Instructions for updating:
Please use alternatives such as official/mnist/dataset.py from tensorflow/models.
WARNING:tensorflow:From /usr/local/lib/python3.5/dist-packages/tensorflow/contrib/learn/python/learn/datasets/mnist.py:260: maybe_download (from tensorflow.contrib.learn.python.learn.datasets.base) is deprecated and will be removed in a future version.
Instructions for updating:
Please write your own downloading logic.
WARNING:tensorflow:From /usr/local/lib/python3.5/dist-packages/tensorflow/contrib/learn/python/learn/datasets/mnist.py:262: extract_images (from tensorflow.contrib.learn.python.learn.datasets.mnist) is deprecated and will be removed in a future version.
Instructions for updating:
Please use tf.data to implement this functionality.
Extracting /tmp/mnist_data/train-images-idx3-ubyte.gz
WARNING:tensorflow:From /usr/local/lib/python3.5/dist-packages/tensorflow/contrib/learn/python/learn/datasets/mnist.py:267: extract_labels (from tensorflow.contrib.learn.python.learn.datasets.mnist) is deprecated and will be removed in a future version.
Instructions for updating:
Please use tf.data to implement this functionality.
Extracting /tmp/mnist_data/train-labels-idx1-ubyte.gz
WARNING:tensorflow:From /usr/local/lib/python3.5/dist-packages/tensorflow/contrib/learn/python/learn/datasets/mnist.py:110: dense_to_one_hot (from tensorflow.contrib.learn.python.learn.datasets.mnist) is deprecated and will be removed in a future version.
Instructions for updating:
Please use tf.one_hot on tensors.
Extracting /tmp/mnist_data/t10k-images-idx3-ubyte.gz
Extracting /tmp/mnist_data/t10k-labels-idx1-ubyte.gz
WARNING:tensorflow:From /usr/local/lib/python3.5/dist-packages/tensorflow/contrib/learn/python/learn/datasets/mnist.py:290: DataSet.__init__ (from tensorflow.contrib.learn.python.learn.datasets.mnist) is deprecated and will be removed in a future version.
Instructions for updating:
Please use alternatives such as official/mnist/dataset.py from tensorflow/models.
2019-05-14 17:07:28.376774: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2800000000 Hz
2019-05-14 17:07:28.378926: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x6115c40 executing computations on platform Host. Devices:
2019-05-14 17:07:28.378987: I tensorflow/compiler/xla/service/service.cc:158] StreamExecutor device (0): <undefined>, <undefined>
2019-05-14 17:07:28.598760: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:998] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-05-14 17:07:28.601785: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x61c1dd0 executing computations on platform CUDA. Devices:
2019-05-14 17:07:28.601829: I tensorflow/compiler/xla/service/service.cc:158] StreamExecutor device (0): GeForce GTX 1080 Ti, Compute Capability 6.1
2019-05-14 17:07:28.602453: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1433] Found device 0 with properties:
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.6575
pciBusID: 0000:0b:00.0
totalMemory: 10.92GiB freeMemory: 10.76GiB
2019-05-14 17:07:28.602502: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1512] Adding visible gpu devices: 0
2019-05-14 17:07:28.605463: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-05-14 17:07:28.605502: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990] 0
2019-05-14 17:07:28.605525: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 0: N
2019-05-14 17:07:28.605978: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 10470 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:0b:00.0, compute capability: 6.1)
WARNING:tensorflow:From /usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/op_def_library.py:263: colocate_with (from tensorflow.python.framework.ops) is deprecated and will be removed in a future version.
Instructions for updating:
Colocations handled automatically by placer.
2019-05-14 17:07:29.756412: I tensorflow/stream_executor/dso_loader.cc:152] successfully opened CUDA library libcublas.so.10.0 locally
Accuracy at step 0: 0.1094
Accuracy at step 1: 0.098
Accuracy at step 2: 0.098
Accuracy at step 3: 0.098
Accuracy at step 4: 0.098
Accuracy at step 5: 0.098
Accuracy at step 6: 0.098
Accuracy at step 7: 0.098
Accuracy at step 8: 0.098
Accuracy at step 9: 0.098
About this issue
- State: closed
- Created 5 years ago
- Comments: 15 (5 by maintainers)
I am closing this issue, since we don’t support non-official packages.