tfx: Transform and BulkInferrer dynamic batch size grows too large, causing OOM on 16GB GPU

System information

  • Have I specified the code to reproduce the issue (Yes, No): No
  • Environment in which the code is executed: Dataflow on Google Cloud. n1-highmem-8, Nvidia T4 or P100 (both give the same error).
  • TensorFlow version: 2.11.0
  • TFX Version: 1.12.0
  • Python version: 3.7
  • Python dependencies (Dockerfile submitted to TFX):
FROM tensorflow/tfx:1.12.0

RUN pip3 install --upgrade --no-cache-dir pip \
    tensorflow-text==2.11.0 \
    tensorflow-recommenders==0.7.2 \
    scann==1.2.9

Describe the current behavior

I am using the TFX BulkInferrer to apply a model with Xception and BERT transform layers to a dataset of 2.5 million Examples containing image and text features. After running for 7 hours on Dataflow, an OOM error is triggered.

ResourceExhaustedError: Graph execution error: OOM when allocating tensor with shape[512,128,167,167] and type float on /job:localhost/replica:0/task:0/device:GPU:0 
by allocator GPU_0_bfc [[{{node xception/block2_sepconv1/separable_conv2d}}]] 
…
OOM when allocating tensor with shape[448,128,167,167] and 
type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc

The error occurs on the GPU (device:GPU:0) in the Xception model (node xception/block2_sepconv1/separable_conv2d) when it tries to process large batches (shape[512,...] and shape[448,...]).

512 * 128 * 167 * 167 = 1,827,733,504

That is a tensor with roughly 1.8 billion floating-point values; at 32-bit precision (4 bytes each) it comes to about 1.8e9 * 4 bytes ≈ 7.3 GB. A single allocation attempt of that size can fail on a GPU with 16 GB of memory.
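
The same back-of-the-envelope check in Python (the 4-byte element size assumes float32, which matches the "type float" in the log):

    # Size of the single failing allocation reported in the OOM message.
    elements = 512 * 128 * 167 * 167        # shape[512,128,167,167]
    bytes_total = elements * 4              # float32 = 4 bytes per element
    print(elements)                         # 1827733504
    print(f"{bytes_total / 1e9:.1f} GB")    # ~7.3 GB for one tensor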

Describe the expected behavior

The Beam BatchElements batching logic should constrain the dynamic batch size to values below 512 or 448 so that batches fit into the 16 GB of GPU RAM. The OOM happens on the “train” split (80% of the data) after hours of processing; on the smaller “eval” split (10%), the BulkInferrer succeeds. From the Dataflow metrics, batchsize_MAX was 256.
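
For illustration, Beam's BatchElements transform already accepts an explicit upper bound, so the kind of cap described above would look roughly like the sketch below. As far as I can tell, BulkInferrer/RunInference in 1.12 does not expose these kwargs, and the 256 value is only an assumption taken from the observed batchsize_MAX.

    import apache_beam as beam

    # Sketch only: an explicit cap on Beam's dynamic batching. For this to help
    # BulkInferrer, RunInference would need to plumb these kwargs through to its
    # internal BatchElements step; max_batch_size=256 is an assumption based on
    # the batch size that succeeded on the eval split.
    with beam.Pipeline() as p:
        _ = (
            p
            | beam.Create(range(10_000))              # stand-in for serialized Examples
            | beam.BatchElements(min_batch_size=32,
                                 max_batch_size=256)  # hard upper bound per batch
            | beam.Map(len)                           # batch sizes, for inspection
        )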

Standalone code to reproduce the issue

The issue is data dependent. It is a basic BulkInferrer with imported examples and an imported model; a sketch of the component wiring follows the args list below. Relevant Beam args:

    "--runner=DataflowRunner",
    "--disk_size_gb=50",
    "--machine_type=n1-highmem-8", 
    "--experiments=use_runner_v2",
    "--experiments=worker_accelerator=type:nvidia-tesla-t4;count:1;install-nvidia-driver",
    # "--experiments=worker_accelerator=type:nvidia-tesla-p100;count:1;install-nvidia-driver",
    "--experiments=no_use_multiple_sdk_containers",

Other info / logs

Here are the logs.

Using a bottom-up search of all the Python virtual-env source files, I searched for the function names that appear in the failed step highlighted in the Dataflow job graph (RunInference[train]/RunInference/RunInferenceImpl/BulkInference/BatchElements/ParDo(_GlobalWindowsBatchingDoFn)).
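
That search amounts to a plain text scan of the installed packages for the DoFn name in the failing step, along the lines of the sketch below.

    # Sketch: scan installed packages for the DoFn named in the failed step.
    import pathlib
    import site

    NEEDLE = "_GlobalWindowsBatchingDoFn"

    for root in site.getsitepackages():
        for path in pathlib.Path(root).rglob("*.py"):
            try:
                if NEEDLE in path.read_text(errors="ignore"):
                    print(path)
            except OSError:
                pass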

About this issue

  • Original URL
  • State: open
  • Created a year ago
  • Comments: 22 (17 by maintainers)

Most upvoted comments

The upgrade to TFX 1.14.0 was held back by https://github.com/tensorflow/tfx/issues/6386. I am now applying the workaround mentioned there and should have results after the next scheduled run at the start of February.

In TFX 1.13 we introduced a new batching mode that tries to deserialize data in batches of ~100 MB. It can be enabled with the tfxio_use_byte_size_batching flag. Could you try updating to 1.13 and setting the flag to True?

My bad… I will raise this issue on our side and try to figure out a solution. Sorry for the inconvenience.

Ack. Thanks for your request and for providing such a thorough investigation!