tensorflow: tensorflow training stuck after "Successfully opened dynamic library libcublas.so.10.0"

I’m using TF 1.13 with CUDA 10. I’m training a model, but the code freezes after printing `Successfully opened dynamic library libcublas.so.10.0`. GPU memory is fully allocated and the CPU is also fully busy, so I don’t think it’s a deadlock (my problem seems different from issue #32017). The training script:

CUDA_VISIBLE_DEVICES=0,1,2,3 t2t-trainer \
  --data_dir=~/t2t_data \
  --output_dir=train_outputs/div2k \
  --problem=img2img_div2k \
  --model=img2img_transformer \
  --hparams_set=img2img_transformer \
  --train_steps=2000000 \
  --eval_steps=5000 --local_eval_frequency=5000 --worker_gpu=4

Output:

INFO:tensorflow:Building model body
INFO:tensorflow:Transforming body output with identity_modality.top
INFO:tensorflow:Transforming feature 'inputs' with identity_modality.bottom
INFO:tensorflow:Transforming feature 'targets' with identity_modality.targets_bottom
INFO:tensorflow:Building model body
INFO:tensorflow:Transforming body output with identity_modality.top
INFO:tensorflow:Base learning rate: 0.200000
INFO:tensorflow:Trainable Variables Total size: 47118592
INFO:tensorflow:Non-trainable variables Total size: 5
INFO:tensorflow:Using optimizer Adam
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Graph was finalized.
2020-04-01 05:48:23.753790: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2020-04-01 05:48:25.319599: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x13dc3370 executing computations on platform CUDA. Devices:
2020-04-01 05:48:25.319671: I tensorflow/compiler/xla/service/service.cc:158]   StreamExecutor device (0): Tesla P40, Compute Capability 6.1
2020-04-01 05:48:25.319694: I tensorflow/compiler/xla/service/service.cc:158]   StreamExecutor device (1): Tesla P40, Compute Capability 6.1
2020-04-01 05:48:25.319713: I tensorflow/compiler/xla/service/service.cc:158]   StreamExecutor device (2): Tesla P40, Compute Capability 6.1
2020-04-01 05:48:25.319732: I tensorflow/compiler/xla/service/service.cc:158]   StreamExecutor device (3): Tesla P40, Compute Capability 6.1
2020-04-01 05:48:25.325520: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2593990000 Hz
2020-04-01 05:48:25.330211: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x197d480 executing computations on platform Host. Devices:
2020-04-01 05:48:25.330255: I tensorflow/compiler/xla/service/service.cc:158]   StreamExecutor device (0): <undefined>, <undefined>
2020-04-01 05:48:25.330426: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1433] Found device 0 with properties:
name: Tesla P40 major: 6 minor: 1 memoryClockRate(GHz): 1.531
pciBusID: 0b1c:00:00.0
totalMemory: 22.38GiB freeMemory: 22.23GiB
2020-04-01 05:48:25.330494: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1433] Found device 1 with properties:
name: Tesla P40 major: 6 minor: 1 memoryClockRate(GHz): 1.531
pciBusID: 246c:00:00.0
totalMemory: 22.38GiB freeMemory: 22.23GiB
2020-04-01 05:48:25.330551: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1433] Found device 2 with properties:
name: Tesla P40 major: 6 minor: 1 memoryClockRate(GHz): 1.531
pciBusID: 4de1:00:00.0
totalMemory: 22.38GiB freeMemory: 22.23GiB
2020-04-01 05:48:25.330608: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1433] Found device 3 with properties:
name: Tesla P40 major: 6 minor: 1 memoryClockRate(GHz): 1.531
pciBusID: 61fa:00:00.0
totalMemory: 22.38GiB freeMemory: 22.23GiB
2020-04-01 05:48:25.330925: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1512] Adding visible gpu devices: 0, 1, 2, 3
2020-04-01 05:48:25.337584: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-04-01 05:48:25.337621: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990]      0 1 2 3
2020-04-01 05:48:25.337644: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 0:   N N N N
2020-04-01 05:48:25.337663: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 1:   N N N N
2020-04-01 05:48:25.337681: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 2:   N N N N
2020-04-01 05:48:25.337700: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 3:   N N N N
2020-04-01 05:48:25.337915: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 21773 MB memory) -> physical GPU (device: 0, name: Tesla P40, pci bus id: 0b1c:00:00.0, compute capability: 6.1)
2020-04-01 05:48:25.338239: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 21773 MB memory) -> physical GPU (device: 1, name: Tesla P40, pci bus id: 246c:00:00.0, compute capability: 6.1)
2020-04-01 05:48:25.338604: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:2 with 21773 MB memory) -> physical GPU (device: 2, name: Tesla P40, pci bus id: 4de1:00:00.0, compute capability: 6.1)
2020-04-01 05:48:25.339518: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:3 with 21773 MB memory) -> physical GPU (device: 3, name: Tesla P40, pci bus id: 61fa:00:00.0, compute capability: 6.1)
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Saving checkpoints for 0 into train_outputs/model.ckpt.
2020-04-01 05:50:09.675984: I tensorflow/stream_executor/dso_loader.cc:152] successfully opened CUDA library libcublas.so.10.0 locally

I enabled `TF_CPP_MIN_VLOG_LEVEL=2`; here is a part of the log that recurs, with different values every time:

2020-04-01 05:30:33.932882: I tensorflow/core/common_runtime/executor.cc:1804] Synchronous kernel done: 8 step -361974947767986987 {{node ToInt64}} = Cast[DstT=DT_INT64, SrcT=DT_FLOAT, Truncate=false, _device="/device:CPU:0"](resize/Squeeze) device: /job:localhost/replica:0/task:0/device:CPU:0
2020-04-01 05:30:33.933436: I tensorflow/core/framework/log_memory.cc:35] __LOG_MEMORY__ MemoryLogTensorOutput { step_id: -882416251439127608 kernel_name: "resize/Squeeze" tensor { dtype: DT_FLOAT shape { dim {
size: 8 } dim { size: 8 } dim { size: 3 } } allocation_description { requested_bytes: 768 allocator_name: "cpu" ptr: 140093260055424 } } }
2020-04-01 05:30:33.933454: I tensorflow/core/framework/log_memory.cc:35] __LOG_MEMORY__ MemoryLogTensorDeallocation { allocator_name: "cpu" }
2020-04-01 05:30:33.931442: I tensorflow/core/common_runtime/executor.cc:1804] Synchronous kernel done: 4 step -2826364024402348450 {{node ParseSingleExample/Reshape}} = Const[dtype=DT_INT64, value=Tensor<type:
int64 shape: [1] values: 0>, _device="/device:CPU:0"]() device: /job:localhost/replica:0/task:0/device:CPU:0
2020-04-01 05:30:33.931993: I tensorflow/core/common_runtime/executor.cc:1661] Process node: 12 step -9020318172928929357 {{node LogicalOr}} = LogicalOr[_device="/device:CPU:0"](Equal, Equal_1) device: /job:localhost/replica:0/task:0/device:CPU:0
2020-04-01 05:30:33.932017: I tensorflow/core/framework/log_memory.cc:35] __LOG_MEMORY__ MemoryLogTensorOutput { step_id: -7271963003795482065 kernel_name: "TensorDataset" tensor { dtype: DT_VARIANT shape { } al
location_description { requested_bytes: 8 allocator_name: "cpu" ptr: 140092371056192 } } }
2020-04-01 05:30:33.933553: I tensorflow/core/framework/log_memory.cc:35] __LOG_MEMORY__ MemoryLogTensorOutput { step_id: -9020318172928929357 kernel_name: "LogicalOr" tensor { dtype: DT_BOOL shape { } allocation_description { requested_bytes: 1 allocator_name: "cpu" ptr: 140093042046528 } } }
2020-04-01 05:30:33.931544: I tensorflow/core/common_runtime/executor.cc:1804] Synchronous kernel done: 7 step -1925456608919473175 {{node resize/Squeeze}} = Squeeze[T=DT_FLOAT, squeeze_dims=[0], _device="/device:CPU:0"](resize/ResizeArea) device: /job:localhost/replica:0/task:0/device:CPU:0
2020-04-01 05:30:33.933059: I tensorflow/core/framework/log_memory.cc:35] __LOG_MEMORY__ MemoryLogTensorDeallocation { allocator_name: "cpu" }
2020-04-01 05:30:33.933089: I tensorflow/core/framework/log_memory.cc:35] __LOG_MEMORY__ MemoryLogTensorDeallocation { allocator_name: "cpu" }
2020-04-01 05:30:33.933094: I tensorflow/core/common_runtime/executor.cc:1804] Synchronous kernel done: 10 step -7022465578876499613 {{node Equal}} = Equal[T=DT_STRING, _device="/device:CPU:0"](ParseSingleExample/ParseSingleExample:2, Equal/y) device: /job:localhost/replica:0/task:0/device:CPU:0
2020-04-01 05:30:33.933676: I tensorflow/core/framework/log_memory.cc:35] __LOG_MEMORY__ MemoryLogTensorOutput { step_id: -3410101109738072632 kernel_name: "ParseSingleExample/ParseSingleExample" tensor { dtype: DT_INT64 shape { dim { size: 1 } } allocation_description { requested_bytes: 8 allocator_name: "cpu" ptr: 140092035786688 } } }
2020-04-01 05:30:33.932257: I tensorflow/core/framework/log_memory.cc:35] __LOG_MEMORY__ MemoryLogTensorAllocation { step_id: -4867250882158238528 kernel_name: "TensorDataset" tensor { dtype: DT_VARIANT shape {
} allocation_description { requested_bytes: 8 allocator_name: "cpu" ptr: 140092975893504 } } }
2020-04-01 05:30:33.933150: I tensorflow/core/common_runtime/executor.cc:1804] Synchronous kernel done: 4 step -8238964452958482503 {{node ParseSingleExample/Reshape}} = Const[dtype=DT_INT64, value=Tensor<type:
int64 shape: [1] values: 0>, _device="/device:CPU:0"]() device: /job:localhost/replica:0/task:0/device:CPU:0
2020-04-01 05:30:33.933757: I tensorflow/core/framework/log_memory.cc:35] __LOG_MEMORY__ MemoryLogTensorOutput { step_id: -4867250882158238528 kernel_name: "TensorDataset" tensor { dtype: DT_VARIANT shape { } allocation_description { requested_bytes: 8 allocator_name: "cpu" ptr: 140092975893504 } } }
2020-04-01 05:30:33.933777: I tensorflow/core/common_runtime/executor.cc:1661] Process node: 5 step -8238964452958482503 {{node ParseSingleExample/Const_1}} = Const[dtype=DT_STRING, value=Tensor<type: string shape: [0] values: >, _device="/device:CPU:0"]() device: /job:localhost/replica:0/task:0/device:CPU:0
2020-04-01 05:30:33.933793: I tensorflow/core/common_runtime/executor.cc:1804] Synchronous kernel done: 9 step -4867250882158238528 {{node TensorDataset}} = TensorDataset[Toutput_types=[DT_UINT8, DT_INT64], outp
ut_shapes=[[?,?,3], [8,8,3]], _device="/device:CPU:0"](arg0, ToInt64) device: /job:localhost/replica:0/task:0/device:CPU:0
2020-04-01 05:30:33.933814: I tensorflow/core/framework/log_memory.cc:35] __LOG_MEMORY__ MemoryLogTensorOutput { step_id: -8238964452958482503 kernel_name: "ParseSingleExample/Const_1" tensor { dtype: DT_STRING
shape { dim { } } } }
2020-04-01 05:30:33.932821: I tensorflow/core/framework/log_memory.cc:35] __LOG_MEMORY__ MemoryLogTensorDeallocation { allocator_name: "cpu" }
2020-04-01 05:30:33.933305: I tensorflow/core/common_runtime/executor.cc:1661] Process node: 8 step -5799046217055003931 {{node ToInt64}} = Cast[DstT=DT_INT64, SrcT=DT_FLOAT, Truncate=false, _device="/device:CPU
:0"](resize/Squeeze) device: /job:localhost/replica:0/task:0/device:CPU:0
2020-04-01 05:30:33.933380: I tensorflow/core/common_runtime/executor.cc:1804] Synchronous kernel done: 6 step -200002097097137980 {{node resize/ResizeArea}} = ResizeArea[T=DT_UINT8, align_corners=false, _device
="/device:CPU:0"](resize/ExpandDims, resize/size) device: /job:localhost/replica:0/task:0/device:CPU:0
2020-04-01 05:30:33.933917: I tensorflow/core/framework/log_memory.cc:35] __LOG_MEMORY__ MemoryLogTensorAllocation { step_id: -5799046217055003931 kernel_name: "ToInt64" tensor { dtype: DT_INT64 shape { dim { si
ze: 8 } dim { size: 8 } dim { size: 3 } } allocation_description { requested_bytes: 1536 allocator_name: "cpu" ptr: 140093916398656 } } }
2020-04-01 05:30:33.933939: I tensorflow/core/common_runtime/executor.cc:1661] Process node: 7 step -200002097097137980 {{node resize/Squeeze}} = Squeeze[T=DT_FLOAT, squeeze_dims=[0], _device="/device:CPU:0"](re
size/ResizeArea) device: /job:localhost/replica:0/task:0/device:CPU:0
2020-04-01 05:30:33.933954: I tensorflow/core/framework/log_memory.cc:35] __LOG_MEMORY__ MemoryLogTensorOutput { step_id: -5799046217055003931 kernel_name: "ToInt64" tensor { dtype: DT_INT64 shape { dim { size:
8 } dim { size: 8 } dim { size: 3 } } allocation_description { requested_bytes: 1536 allocator_name: "cpu" ptr: 140093916398656 } } }
2020-04-01 05:30:33.933479: I tensorflow/core/common_runtime/executor.cc:1804] Synchronous kernel done: 7 step -882416251439127608 {{node resize/Squeeze}} = Squeeze[T=DT_FLOAT, squeeze_dims=[0], _device="/device:CPU:0"](resize/ResizeArea) device: /job:localhost/replica:0/task:0/device:CPU:0
2020-04-01 05:30:33.933489: I tensorflow/core/common_runtime/executor.cc:1661] Process node: 9 step -361974947767986987 {{node TensorDataset}} = TensorDataset[Toutput_types=[DT_UINT8, DT_INT64], output_shapes=[[
?,?,3], [8,8,3]], _device="/device:CPU:0"](arg0, ToInt64) device: /job:localhost/replica:0/task:0/device:CPU:0
2020-04-01 05:30:33.933519: I tensorflow/core/common_runtime/executor.cc:1661] Process node: 5 step -2826364024402348450 {{node ParseSingleExample/Const_1}} = Const[dtype=DT_STRING, value=Tensor<type: string sha
pe: [0] values: >, _device="/device:CPU:0"]() device: /job:localhost/replica:0/task:0/device:CPU:0
2020-04-01 05:30:33.934058: I tensorflow/core/framework/log_memory.cc:35] __LOG_MEMORY__ MemoryLogTensorAllocation { step_id: -361974947767986987 kernel_name: "TensorDataset" tensor { dtype: DT_VARIANT shape { } allocation_description { requested_bytes: 8 allocator_name: "cpu" ptr: 140092242034688 } } }
2020-04-01 05:30:33.934073: I tensorflow/core/framework/log_memory.cc:35] __LOG_MEMORY__ MemoryLogTensorOutput { step_id: -2826364024402348450 kernel_name: "ParseSingleExample/Const_1" tensor { dtype: DT_STRING
shape { dim { } } } }

I literally have no idea what the problem is. I tried another dataset and the model started training, but with this particular dataset the process is constantly looping through executor.cc:1661 and executor.cc:1804. Is it maybe doing the computation on the CPU rather than the GPU? But GPU memory is fully allocated, and TF has recognized all the GPU devices.
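
(Editor’s note, a hedged aside: fully allocated GPU memory proves little on its own, since TF 1.x reserves nearly all device memory at session start regardless of activity. GPU *utilization* is the more telling signal; if it sits near 0% while the CPU is pegged, the CPU-side input pipeline — the `ParseSingleExample`/`ResizeArea` ops in the VLOG above — is the likely bottleneck or hang. A minimal check, guarded in case no NVIDIA driver is present:)

```shell
# Sample GPU utilization and memory once (append `-l 1` to poll every second).
# Near-0% utilization with full memory suggests the GPUs are idle and
# waiting on the CPU input pipeline.
if command -v nvidia-smi >/dev/null 2>&1; then
  nvidia-smi --query-gpu=index,utilization.gpu,memory.used --format=csv
else
  echo "nvidia-smi not found (no NVIDIA driver on this machine)"
fi
```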

P.S.: I also added a signal handler to my Python code to print the stack trace upon receiving a specific signal (to see where the Python code is spending its time), but after a certain point the handler is apparently overridden, since the process stops responding to my manual signal. It no longer responds to SIGINT either.
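
(Editor’s note: a plain `signal.signal` handler is itself Python code, so it can only run when the interpreter next acquires the GIL; if the process is stuck inside a native C++/CUDA call, the handler never fires — which would also explain SIGINT appearing dead. The stdlib `faulthandler` module installs a C-level handler that can still dump every thread’s stack in that situation. A minimal sketch, assuming a Unix system; the choice of SIGUSR1 is mine:)

```python
import faulthandler
import signal
import sys

# Dump the Python stack of every thread to stderr whenever the process
# receives SIGUSR1. faulthandler's handler is implemented in C, so it
# fires even while a native call is blocking the interpreter.
faulthandler.register(signal.SIGUSR1, file=sys.stderr, all_threads=True)

# ... start training here, then from another shell:
#   kill -USR1 <pid>
```

`faulthandler.dump_traceback(all_threads=True)` can also be called directly to print the current stacks on demand.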

Any ideas?

About this issue

  • Original URL
  • State: closed
  • Created 4 years ago
  • Reactions: 4
  • Comments: 16 (4 by maintainers)

Most upvoted comments

I’m trying to run a TF object detection model and am getting the same issue: stuck after `Successfully opened dynamic library libcuda.so.1`.

@py4, can you try @jetjodh’s suggestion and let us know if that helps resolve your issue? Thanks.

Reducing the batch_size to 1 didn’t work for me:

HPARAMS_SET=transformer_tpu
MODEL=transformer

HPARAMS="batch_size=1,num_hidden_layers=16"