tfx: Transform and BulkInferrer dynamic batch size grows too large, causing OOM on 16GB GPU

System information

  • Have I specified the code to reproduce the issue (Yes, No): No
  • Environment in which the code is executed: Dataflow on Google Cloud. n1-highmem-8, Nvidia T4 or P100 (both give the same error).
  • TensorFlow version: 2.11.0
  • TFX Version: 1.12.0
  • Python version: 3.7
  • Python dependencies (Dockerfile submitted to TFX):
FROM tensorflow/tfx:1.12.0

RUN pip3 install --upgrade --no-cache-dir pip \
    tensorflow-text==2.11.0 \
    tensorflow-recommenders==0.7.2 \
    scann==1.2.9

Describe the current behavior

I am using the TFX BulkInferrer to apply a model with Xception and BERT transform layers to a dataset of 2.5 million Examples containing image and text features. After running for 7 hours on Dataflow, an OOM error is triggered.

ResourceExhaustedError: Graph execution error: OOM when allocating tensor with shape[512,128,167,167] and type float on /job:localhost/replica:0/task:0/device:GPU:0 
by allocator GPU_0_bfc [[{{node xception/block2_sepconv1/separable_conv2d}}]] 
…
OOM when allocating tensor with shape[448,128,167,167] and 
type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc

The error occurs on the GPU (device:GPU:0) in the Xception model (node xception/block2_sepconv1/separable_conv2d) when it tries to process large batches (shape[512,...] and shape[448,...]).

512 * 128 * 167 * 167 = 1,827,733,504

That is a tensor with roughly 1.8 billion floating-point values; at 32-bit precision (4 bytes each) it comes to about 1.8e9 * 4 bytes ≈ 7.3 GB. A single allocation attempt of that size can fail on a GPU with 16 GB of memory.
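
The same back-of-the-envelope check in Python (the 4-byte element size assumes float32, which matches the "type float" in the log):

    # Size of the single failing allocation reported in the OOM message.
    elements = 512 * 128 * 167 * 167        # shape[512,128,167,167]
    bytes_total = elements * 4              # float32 = 4 bytes per element
    print(elements)                         # 1827733504
    print(f"{bytes_total / 1e9:.1f} GB")    # ~7.3 GB for one tensor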

Describe the expected behavior

The Beam BatchElements batching logic should constrain the dynamic batch size to values below 512 or 448 so that batches fit into the 16 GB of GPU RAM. The OOM happens on the “train” split (80% of the data) after hours of processing; on the smaller “eval” split (10%), the BulkInferrer succeeds. From the Dataflow metrics, batchsize_MAX was 256.
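
For illustration, Beam's BatchElements transform already accepts an explicit upper bound, so the kind of cap described above would look roughly like the sketch below. As far as I can tell, BulkInferrer/RunInference in 1.12 does not expose these kwargs, and the 256 value is only an assumption taken from the observed batchsize_MAX.

    import apache_beam as beam

    # Sketch only: an explicit cap on Beam's dynamic batching. For this to help
    # BulkInferrer, RunInference would need to plumb these kwargs through to its
    # internal BatchElements step; max_batch_size=256 is an assumption based on
    # the batch size that succeeded on the eval split.
    with beam.Pipeline() as p:
        _ = (
            p
            | beam.Create(range(10_000))              # stand-in for serialized Examples
            | beam.BatchElements(min_batch_size=32,
                                 max_batch_size=256)  # hard upper bound per batch
            | beam.Map(len)                           # batch sizes, for inspection
        )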

Standalone code to reproduce the issue

The issue is data dependent. It is a basic BulkInferrer with imported examples and an imported model; a sketch of the component wiring follows the args list below. Relevant Beam args:

    "--runner=DataflowRunner",
    "--disk_size_gb=50",
    "--machine_type=n1-highmem-8", 
    "--experiments=use_runner_v2",
    "--experiments=worker_accelerator=type:nvidia-tesla-t4;count:1;install-nvidia-driver",
    # "--experiments=worker_accelerator=type:nvidia-tesla-p100;count:1;install-nvidia-driver",
    "--experiments=no_use_multiple_sdk_containers",

Other info / logs

Here are the logs.

Using a bottom-up search of all the Python virtual-env source files, I searched for the function names that appear in the failed step highlighted in the Dataflow job graph (RunInference[train]/RunInference/RunInferenceImpl/BulkInference/BatchElements/ParDo(_GlobalWindowsBatchingDoFn)).
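
That search amounts to a plain text scan of the installed packages for the DoFn name in the failing step, along the lines of the sketch below.

    # Sketch: scan installed packages for the DoFn named in the failed step.
    import pathlib
    import site

    NEEDLE = "_GlobalWindowsBatchingDoFn"

    for root in site.getsitepackages():
        for path in pathlib.Path(root).rglob("*.py"):
            try:
                if NEEDLE in path.read_text(errors="ignore"):
                    print(path)
            except OSError:
                pass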

About this issue

  • Original URL
  • State: open
  • Created a year ago
  • Comments: 22 (17 by maintainers)

Most upvoted comments

The upgrade to TFX 1.14.0 was held back by https://github.com/tensorflow/tfx/issues/6386. I am now applying the workaround mentioned there and should have results after the next scheduled run at the start of February.

In TFX 1.13 we introduced a new batching mode that tries to deserialize data in batches of ~100 MB. It can be enabled with the tfxio_use_byte_size_batching flag. Could you try updating to 1.13 and setting the flag to True?

My bad… I will raise this issue on our side and try to figure out a solution. Sorry for the inconvenience.

Ack. Thanks for your request and for providing such a thorough investigation!