tfx: Transform and BulkInferrer dynamic batch size grows too large, causing OOM on 16GB GPU
System information
- Have I specified the code to reproduce the issue (Yes, No): No
- Environment in which the code is executed: Dataflow on Google Cloud. n1-highmem-8, Nvidia T4 or P100 (both give the same error).
- TensorFlow version: 2.11.0
- TFX Version: 1.12.0
- Python version: 3.7
- Python dependencies (Dockerfile submitted to TFX):
```dockerfile
FROM tensorflow/tfx:1.12.0
RUN pip3 install --upgrade --no-cache-dir pip \
    tensorflow-text==2.11.0 \
    tensorflow-recommenders==0.7.2 \
    scann==1.2.9
```
Describe the current behavior
I am using the TFX BulkInferrer to apply a model with an Xception and a BERT transform layer to a dataset of 2.5 million Examples with image and text features. After about 7 hours of processing on Dataflow, an OOM error is triggered:
```
ResourceExhaustedError: Graph execution error: OOM when allocating tensor with shape[512,128,167,167] and type float on /job:localhost/replica:0/task:0/device:GPU:0
by allocator GPU_0_bfc [[{{node xception/block2_sepconv1/separable_conv2d}}]]
…
OOM when allocating tensor with shape[448,128,167,167] and
type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
```
The error happens on the GPU (device:GPU:0) in the Xception model (node xception/block2_sepconv1/separable_conv2d) when it tries to process large batches (shape[512,...] and shape[448,...]).
512 * 128 * 167 * 167 = 1,827,733,504
That is a tensor with roughly 1.8 billion floating-point values; at 32-bit precision (4 bytes each) it requires about 1.8e9 * 4 bytes ≈ 7.3 GB. A single allocation attempt of that size can fail on a GPU with 16 GB of memory.
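A quick back-of-the-envelope check of that allocation (illustrative only):

```python
# Size of the failing activation tensor reported in the OOM error (float32 = 4 bytes).
batch, channels, height, width = 512, 128, 167, 167
num_values = batch * channels * height * width   # 1,827,733,504 values
num_bytes = num_values * 4                       # ~7.31e9 bytes
print(f"{num_values:,} floats -> {num_bytes / 1e9:.2f} GB")  # ~7.31 GB on a 16 GB GPU
```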
Describe the expected behavior
The Beam BatchElements logic should constrain the dynamic batch size to values smaller than 512 or 448 so that a batch fits in the 16 GB of GPU RAM. The OOM happens on the "train" split (80% of the data) after hours of processing; on the smaller "eval" split (10%) the BulkInferrer succeeds. From the Dataflow metrics, batchsize_MAX was 256.
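For illustration, `beam.BatchElements` already exposes an explicit cap; a minimal sketch of what a hard limit would look like (the cap of 256 is only an example value, not a verified fix):

```python
import apache_beam as beam

with beam.Pipeline() as pipeline:
    _ = (
        pipeline
        | beam.Create(range(10_000))  # stand-in for the stream of Examples
        # Adaptive batching still applies, but no batch will exceed 256 elements.
        | beam.BatchElements(min_batch_size=1, max_batch_size=256)
    )
```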
Standalone code to reproduce the issue
The issue is data-dependent. It is a basic BulkInferrer with imported Examples and an imported model. Relevant Beam args:
"--runner=DataflowRunner",
"--disk_size_gb=50",
"--machine_type=n1-highmem-8",
"--experiments=use_runner_v2",
"--experiments=worker_accelerator=type:nvidia-tesla-t4;count:1;install-nvidia-driver",
# "--experiments=worker_accelerator=type:nvidia-tesla-p100;count:1;install-nvidia-driver",
"--experiments=no_use_multiple_sdk_containers",
Other info / logs
Here are the logs.
Using a bottom-up search of all the Python virtual-env source files, I traced the function names in the failed step name highlighted in the Dataflow job graph: RunInference[train]/RunInference/RunInferenceImpl/BulkInference/BatchElements/ParDo(_GlobalWindowsBatchingDoFn):
- _GlobalWindowsBatchingDoFn is only used inside BatchElements
- The BatchElements docs show it has a max_batch_size parameter
- BatchElements is called in ml/inference/base.py::RunInference. Directly above that call is an interesting TODO to add a batch_size back-off, with a link to an open GitHub issue. It mentions "Add batch_size back off in the case there are functional reasons large batch sizes cannot be handled." This looks like my problem too.
- bulk_inferrer/executor.py calls RunInference from tfx_bsl.public.beam, which delegates to RunInferenceImpl
- This calls the previously identified base.RunInference, but adds 'BulkInference' as a text description
- It also passes in a ModelHandler, which is responsible for providing the BatchElements kwargs
- Since we configure the bulk_inferrer for Prediction, it will create a model_handler for in_process_inference using _get_saved_model_handler(), which should select PREDICTION from our inference spec
- A _PredictModelHandler will be created. Neither it nor its two TFX base classes (_BaseSavedModelHandler, _BaseModelHandler) overrides base.ModelHandler.batch_elements_kwargs(), so the default empty dictionary is provided.
- This means the default max batch size of 10000 will be used, in combination with whatever adaptive batch-size logic beam.BatchElements() applies.
- This adaptive logic presumably has a bug that can let the batch size grow too large, causing the OOM. The previously mentioned open GitHub issue confirms this suspicion. (A sketch of where a cap could hook in follows this list.)
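To illustrate where a cap could be applied, here is a minimal sketch of a wrapper that overrides batch_elements_kwargs() on the Beam ModelHandler interface. The CappedModelHandler name and the limit of 64 are hypothetical; this is not tfx_bsl code, just a sketch of the hook:

```python
from apache_beam.ml.inference import base


class CappedModelHandler(base.ModelHandler):
    """Hypothetical wrapper that delegates to an existing handler but caps the batch size."""

    def __init__(self, wrapped: base.ModelHandler, max_batch_size: int = 64):
        self._wrapped = wrapped
        self._max_batch_size = max_batch_size

    def load_model(self):
        return self._wrapped.load_model()

    def run_inference(self, batch, model, inference_args=None):
        return self._wrapped.run_inference(batch, model, inference_args)

    def batch_elements_kwargs(self):
        # Returning a non-empty dict here is what forwards kwargs to beam.BatchElements.
        return {"max_batch_size": self._max_batch_size}
```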
About this issue
- State: open
- Created a year ago
- Comments: 22 (17 by maintainers)
The upgrade to TFX 1.14.0 was held back by https://github.com/tensorflow/tfx/issues/6386. I am now applying the workaround mentioned there and should have results after the next scheduled run at the start of February.
In TFX 1.13 we introduced a new batching mode that tries to deserialize data in batches of roughly 100 MB. It can be enabled with the tfxio_use_byte_size_batching flag. Could you try updating to 1.13 and setting the flag to True?
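For readers trying this, a hedged sketch of enabling the flag; the assumption here is that tfxio_use_byte_size_batching is registered as an absl flag by tfx_bsl >= 1.13 and that the module defining it has already been imported, and the exact way to propagate it to Dataflow workers may differ:

```python
# Hedged sketch: assumes tfx_bsl >= 1.13 registers tfxio_use_byte_size_batching as an
# absl flag and that the defining module has been imported before this point.
from absl import flags

flags.FLAGS.tfxio_use_byte_size_batching = True  # set before constructing the pipeline
```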
My bad… I will bump this issue up on our side and try to figure out a solution. Sorry for the inconvenience.
Ack. Thanks for your request and for providing such a thorough investigation!!