ctrl: Could not allocate pinned host memory of size: 2147483648
Running !python2 generation.py --model_dir "/content/ctrl/seqlen256_v1.ckpt" in Colab outputs this:
WARNING: Logging before flag parsing goes to stderr.
W0912 03:52:40.595153 139689530402688 deprecation_wrapper.py:119] From generation.py:6: The name tf.enable_eager_execution is deprecated. Please use tf.compat.v1.enable_eager_execution instead.
W0912 03:52:40.605669 139689530402688 deprecation_wrapper.py:119] From generation.py:35: The name tf.random.set_random_seed is deprecated. Please use tf.compat.v1.random.set_random_seed instead.
246534 unique words
2019-09-12 03:52:40.930801: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcuda.so.1
2019-09-12 03:52:40.971309: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1005] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-09-12 03:52:40.971914: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1640] Found device 0 with properties:
name: Tesla T4 major: 7 minor: 5 memoryClockRate(GHz): 1.59
pciBusID: 0000:00:04.0
2019-09-12 03:52:40.972273: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudart.so.10.0
2019-09-12 03:52:40.973635: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcublas.so.10.0
2019-09-12 03:52:40.975007: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcufft.so.10.0
2019-09-12 03:52:40.975404: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcurand.so.10.0
2019-09-12 03:52:40.976992: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcusolver.so.10.0
2019-09-12 03:52:40.978135: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcusparse.so.10.0
2019-09-12 03:52:40.981770: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudnn.so.7
2019-09-12 03:52:40.981927: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1005] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-09-12 03:52:40.982547: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1005] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-09-12 03:52:40.983109: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1763] Adding visible gpu devices: 0
2019-09-12 03:52:40.983494: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 AVX512F FMA
2019-09-12 03:52:41.114324: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1005] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-09-12 03:52:41.115113: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x5574d0e20d80 executing computations on platform CUDA. Devices:
2019-09-12 03:52:41.115150: I tensorflow/compiler/xla/service/service.cc:175] StreamExecutor device (0): Tesla T4, Compute Capability 7.5
2019-09-12 03:52:41.117511: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2000170000 Hz
2019-09-12 03:52:41.117862: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x5574d0e212c0 executing computations on platform Host. Devices:
2019-09-12 03:52:41.117916: I tensorflow/compiler/xla/service/service.cc:175] StreamExecutor device (0): <undefined>, <undefined>
2019-09-12 03:52:41.118114: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1005] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-09-12 03:52:41.118668: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1640] Found device 0 with properties:
name: Tesla T4 major: 7 minor: 5 memoryClockRate(GHz): 1.59
pciBusID: 0000:00:04.0
2019-09-12 03:52:41.118728: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudart.so.10.0
2019-09-12 03:52:41.118748: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcublas.so.10.0
2019-09-12 03:52:41.118766: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcufft.so.10.0
2019-09-12 03:52:41.118784: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcurand.so.10.0
2019-09-12 03:52:41.118811: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcusolver.so.10.0
2019-09-12 03:52:41.118840: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcusparse.so.10.0
2019-09-12 03:52:41.118858: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudnn.so.7
2019-09-12 03:52:41.118934: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1005] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-09-12 03:52:41.119479: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1005] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-09-12 03:52:41.120052: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1763] Adding visible gpu devices: 0
2019-09-12 03:52:41.120121: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudart.so.10.0
2019-09-12 03:52:41.121241: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1181] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-09-12 03:52:41.121268: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1187] 0
2019-09-12 03:52:41.121280: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1200] 0: N
2019-09-12 03:52:41.121403: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1005] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-09-12 03:52:41.121995: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1005] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-09-12 03:52:41.122491: W tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:40] Overriding allow_growth setting because the TF_FORCE_GPU_ALLOW_GROWTH environment variable is set. Original config value was 0.
2019-09-12 03:52:41.122537: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1326] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 14221 MB memory) -> physical GPU (device: 0, name: Tesla T4, pci bus id: 0000:00:04.0, compute capability: 7.5)
W0912 03:52:58.330300 139689530402688 lazy_loader.py:50]
The TensorFlow contrib module will not be included in TensorFlow 2.0.
For more information, please see:
* https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md
* https://github.com/tensorflow/addons
* https://github.com/tensorflow/io (for I/O related ops)
If you depend on functionality not listed there, please file an issue.
W0912 03:52:58.330642 139689530402688 deprecation_wrapper.py:119] From generation.py:124: The name tf.train.AdagradOptimizer is deprecated. Please use tf.compat.v1.train.AdagradOptimizer instead.
Model: "model"
__________________________________________________________________________________________________
Layer (type) Output Shape Param # Connected to
==================================================================================================
input_1 (InputLayer) [(None, 256)] 0
__________________________________________________________________________________________________
tied_embedding_softmax (TiedEmb multiple 315810054 input_1[0][0]
encoder[0][0]
__________________________________________________________________________________________________
encoder (Encoder) (None, 256, 1280) 1322154496 tied_embedding_softmax[0][0]
==================================================================================================
Total params: 1,637,964,550
Trainable params: 1,637,964,550
Non-trainable params: 0
__________________________________________________________________________________________________
None
2019-09-12 03:52:58.496625: W tensorflow/core/framework/allocator.cc:107] Allocation of 1262254080 exceeds 10% of system memory.
tcmalloc: large alloc 1262256128 bytes == 0x557523406000 @ 0x7f0c00918b6b 0x7f0c00938379 0x7f0bbd80d754 0x7f0bbd7c8c8a 0x7f0bbd505f11 0x7f0bbd518f08 0x7f0bc366a00c 0x7f0bc3660298 0x7f0bc10448c7 0x7f0bc0fbc97c 0x7f0bc0fbed9d 0x5574cfe6af6e 0x5574cfe6152a 0x5574cfe68fce 0x5574cfe6152a 0x5574cfe68fce 0x5574cfe6152a 0x5574cfe7d03c 0x5574cfe4cf1e 0x5574cfe662d5 0x5574cfe6152a 0x5574cfe695d6 0x5574cfe6152a 0x5574cfe695d6 0x5574cfe6152a 0x5574cfe695d6 0x5574cfe6152a 0x5574cfe695d6 0x5574cfe6152a 0x5574cfe695d6 0x5574cfe6152a
tcmalloc: large alloc 1262256128 bytes == 0x55756e7ce000 @ 0x7f0c009361e7 0x7f0bfe37c771 0x7f0bfe3e4028 0x7f0bfe3d90d5 0x7f0bfe46ff77 0x5574cfe63e8a 0x5574cfe6152a 0x5574cfe695d6 0x5574cfe6152a 0x5574cfe695d6 0x5574cfe6152a 0x5574cfe695d6 0x5574cfe6152a 0x5574cfe695d6 0x5574cfe6152a 0x5574cfe695d6 0x5574cfe6152a 0x5574cfe695d6 0x5574cfe6152a 0x5574cfe68fce 0x5574cfe6152a 0x5574cfe68fce 0x5574cfe6152a 0x5574cfe60fb9 0x5574cfe91e7f 0x5574cfe8cc12 0x5574cfe8c09d 0x5574cfe3ad6b 0x7f0c00533b97 0x5574cfe3a5ea
W0912 03:53:06.230777 139689530402688 deprecation.py:506] From /usr/local/lib/python2.7/dist-packages/tensorflow/python/keras/initializers.py:143: calling __init__ (from tensorflow.python.ops.init_ops) with dtype is deprecated and will be removed in a future version.
Instructions for updating:
Call initializer instance with the dtype argument instead of passing it to the constructor
W0912 03:53:11.251795 139689530402688 deprecation.py:506] From /usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/init_ops.py:1251: calling __init__ (from tensorflow.python.ops.init_ops) with dtype is deprecated and will be removed in a future version.
Instructions for updating:
Call initializer instance with the dtype argument instead of passing it to the constructor
2019-09-12 03:53:24.403230: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1005] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-09-12 03:53:24.403729: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1640] Found device 0 with properties:
name: Tesla T4 major: 7 minor: 5 memoryClockRate(GHz): 1.59
pciBusID: 0000:00:04.0
2019-09-12 03:53:24.403847: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudart.so.10.0
2019-09-12 03:53:24.403869: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcublas.so.10.0
2019-09-12 03:53:24.403910: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcufft.so.10.0
2019-09-12 03:53:24.403931: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcurand.so.10.0
2019-09-12 03:53:24.403952: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcusolver.so.10.0
2019-09-12 03:53:24.403975: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcusparse.so.10.0
2019-09-12 03:53:24.403994: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudnn.so.7
2019-09-12 03:53:24.404096: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1005] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-09-12 03:53:24.404475: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1005] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-09-12 03:53:24.404802: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1763] Adding visible gpu devices: 0
2019-09-12 03:53:24.404864: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1181] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-09-12 03:53:24.404878: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1187] 0
2019-09-12 03:53:24.404901: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1200] 0: N
2019-09-12 03:53:24.405005: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1005] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-09-12 03:53:24.405377: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1005] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-09-12 03:53:24.405756: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1326] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 14221 MB memory) -> physical GPU (device: 0, name: Tesla T4, pci bus id: 0000:00:04.0, compute capability: 7.5)
2019-09-12 03:53:32.494371: E tensorflow/stream_executor/cuda/cuda_driver.cc:890] failed to alloc 2147483648 bytes on host: CUDA_ERROR_INVALID_VALUE: invalid argument
2019-09-12 03:53:32.511468: W ./tensorflow/core/common_runtime/gpu/gpu_host_allocator.h:44] could not allocate pinned host memory of size: 2147483648
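Two details in the log are worth pinning down. The failed pinned-host allocation in the last line is exactly 2 GiB, and the gpu_bfc_allocator warning earlier shows the allow_growth override comes from an environment variable, which (as general TF 1.x behavior, not anything specific to this repo) must be set before TensorFlow initializes the GPU. A minimal sketch:

```python
import os

# The failed pinned host allocation above is exactly 2 GiB (2**31 bytes).
assert 2147483648 == 2 ** 31 == 2 * 1024 ** 3

# TF_FORCE_GPU_ALLOW_GROWTH has to be exported before TensorFlow touches the
# GPU; the gpu_bfc_allocator warning in the log shows it was already set in
# this Colab session, overriding the config's allow_growth value of 0.
os.environ["TF_FORCE_GPU_ALLOW_GROWTH"] = "true"
```

In a notebook this assignment only takes effect if it runs before `import tensorflow` (or before the first GPU op), since the allocator reads the variable once at device creation.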
About this issue
- Original URL

- State: closed
- Created 5 years ago
- Reactions: 1
- Comments: 16 (2 by maintainers)
There might be a way to hack a version of the code with (slightly) smaller memory requirement. Let me explore and update here.
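For scale, one common way to shave weight memory (purely illustrative; not necessarily what the author's hack does) is storing parameters in half precision. Using the 1,637,964,550-parameter count from the Keras model summary above:

```python
# Parameter count taken from the model summary earlier in this thread.
params = 1637964550

# Weight storage alone, ignoring activations, the softmax workspace,
# and any optimizer state (all of which add more on top).
fp32_gib = params * 4 / float(1024 ** 3)  # float32: 4 bytes per parameter
fp16_gib = params * 2 / float(1024 ** 3)  # float16: 2 bytes per parameter

print(round(fp32_gib, 2), round(fp16_gib, 2))  # roughly 6.1 vs 3.05 GiB
```

This only accounts for the weights themselves, which is why the full model needs far more than 6 GiB in practice.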
@GrahamboJangles
@minimaxir It is strange. I’m not sure what makes the difference, but the amount of memory shown as available from a T4 (in TensorFlow logs or nvidia-smi) is less than with a V100. Right now nvidia-smi shows me:
Bryan McCann tweeted that the model needs 15458 MiB, so this seems to explain why the T4 is the only “16 GB” GPU that can’t fit it. I also noticed this in one of my own projects: a batch size that worked on a V100 would be too much for a T4.
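Plugging in the two figures from this thread (the 15458 MiB from the tweet and the 14221 MB that the TensorFlow log above reports for this T4), the shortfall is easy to check:

```python
required_mib = 15458    # memory CTRL reportedly needs (per the tweet)
t4_usable_mib = 14221   # what the TF log above reports for this T4

shortfall_mib = required_mib - t4_usable_mib
print(shortfall_mib)  # 1237 MiB short of fitting the model
```

A V100's 16 GB card exposes slightly more usable memory than the T4 here, which would explain why the same workload fits on one but not the other.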
I added a new branch which allows for inference on GPUs with lower available memory. I tested it on K80s on Colaboratory here https://colab.research.google.com/drive/1hVveBQShDru1Mjnhe4C21uQv4A2eH1tV
The details on how to use it can be found at the top of the README (Update @ Sep 19, 2019 subsection).
This is still in the testing phase, so expect a few bumps. I will merge it into master once it stabilizes. Closing this for now; please reopen if there are issues.
Given that this app only has a CLI at the moment, using a local runtime for Colab seems redundant; might as well run it directly on the VM by SSHing into the instance if we’re going to have one up.
The VMs can be built as preemptible: for the config described it’ll be about $0.50/hr, which is reasonable. I also believe that new GCP projects come with some GPU quota by default now; I’ll double check.
Additionally, the VMs must be launched with full GCP API access in order for gsutil to be able to get the model. I can write up a guide once I get things working.