pytorch-lightning: Trying to infer the `batch_size` from an ambiguous collection

šŸ› Bug

Yesterday my code was working perfectly, running with no errors. But today I keep getting this UserWarning saying that Lightning is trying to infer the `batch_size` from an ambiguous collection. I am not sure where it comes from; I did not make any changes to my code. The warning keeps reporting a different batch size each time:

/usr/local/lib/python3.7/dist-packages/pytorch_lightning/utilities/data.py:57: UserWarning: Trying to infer the `batch_size` from an ambiguous collection. The batch size we found is 55. To avoid any miscalculations, use `self.log(..., batch_size=batch_size)`.
  "Trying to infer the `batch_size` from an ambiguous collection. The batch size we"
/usr/local/lib/python3.7/dist-packages/pytorch_lightning/utilities/data.py:57: UserWarning: Trying to infer the `batch_size` from an ambiguous collection. The batch size we found is 1. To avoid any miscalculations, use `self.log(..., batch_size=batch_size)`.
  "Trying to infer the `batch_size` from an ambiguous collection. The batch size we"
/usr/local/lib/python3.7/dist-packages/pytorch_lightning/utilities/data.py:57: UserWarning: Trying to infer the `batch_size` from an ambiguous collection. The batch size we found is 28. To avoid any miscalculations, use `self.log(..., batch_size=batch_size)`.
  "Trying to infer the `batch_size` from an ambiguous collection. The batch size we"
/usr/local/lib/python3.7/dist-packages/pytorch_lightning/utilities/data.py:57: UserWarning: Trying to infer the `batch_size` from an ambiguous collection. The batch size we found is 51. To avoid any miscalculations, use `self.log(..., batch_size=batch_size)`.
  "Trying to infer the `batch_size` from an ambiguous collection. The batch size we"

Expected behavior

Running epochs smoothly.

Environment

  • PyTorch Lightning Version: 1.5
  • PyTorch Version: 1.9.0
  • Python version: 3.7.12
  • OS (e.g., Linux): Linux (Colab)
  • CUDA/cuDNN version: 11.2
  • GPU models and configuration:
  • How you installed PyTorch (conda, pip, source): pip
  • If compiling from source, the output of torch.__config__.show(): PyTorch built with:
    • GCC 7.3
    • C++ Version: 201402
    • IntelĀ® Math Kernel Library Version 2020.0.0 Product Build 20191122 for IntelĀ® 64 architecture applications
    • IntelĀ® MKL-DNN v2.1.2 (Git Hash 98be7e8afa711dc9b66c8ff3504129cb82013cdb)
    • OpenMP 201511 (a.k.a. OpenMP 4.5)
    • NNPACK is enabled
    • CPU capability usage: AVX2
    • CUDA Runtime 11.1
    • NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86
    • CuDNN 8.0.5
    • Magma 2.5.2
    • Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=11.1, CUDNN_VERSION=8.0.5, CXX_COMPILER=/opt/rh/devtoolset-7/root/usr/bin/c++, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_KINETO -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=1.9.0, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON,

About this issue

  • State: closed
  • Created 3 years ago
  • Comments: 27 (13 by maintainers)

Most upvoted comments

Great! So basically, what we do internally is iterate over the batch that is passed into training_step. It doesn’t matter whether any of its components are actually used to compute the loss/metric; we use all of them to extract the batch size, because we don’t know how the user is computing the loss/metric. In your example, even though comment_text (a string) is not used to compute the loss/metric, we still consider it when extracting the batch size.

Depending on the batch type, which is a dict in your case, we look for tensors to find the batch_size of the dataset. Basically, we look at the first dimension of each tensor, and in the case of a string it’s just the len of the string. If, while iterating over this dict, we find multiple possible batch sizes, we emit this warning to make sure the internal accumulation of the logged metrics uses the correct batch size.

Now for your case: since comment_text is the first component of the dict, PyTorch is probably not collating it the way you expect, because the string lengths are not constant and stacking them into a batch is not really possible there. So inside training_step the comment_text component varies most of the time, and since its first-dimension size does not match that of the other tensors, this warning is raised and we suggest specifying the batch size manually to make sure accumulation happens correctly.
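As a rough illustration of that inference logic (a minimal sketch of the idea only, with made-up names — not Lightning’s actual implementation), it behaves roughly like this:

import warnings

import torch


def possible_batch_sizes(batch):
    """Recursively yield candidate batch sizes from a batch collection (illustrative)."""
    if isinstance(batch, torch.Tensor):
        yield batch.shape[0]  # first dimension of a tensor
    elif isinstance(batch, str):
        yield len(batch)  # a string's "size" is just its length
    elif isinstance(batch, dict):
        for value in batch.values():
            yield from possible_batch_sizes(value)
    elif isinstance(batch, (list, tuple)):
        for item in batch:
            yield from possible_batch_sizes(item)
    else:
        yield 1  # fallback for unrecognized types


def infer_batch_size(batch):
    sizes = list(possible_batch_sizes(batch))
    if len(set(sizes)) > 1:  # more than one candidate -> ambiguous
        warnings.warn(
            "Trying to infer the `batch_size` from an ambiguous collection. "
            f"The batch size we found is {sizes[0]}."
        )
    return sizes[0]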

Hi there, I’m also carrying some lists of strings in my data, where the length of the list is the batch size, and the current implementation raises a warning in this situation. I would definitely prefer what @Nesqulck is suggesting. I feel that the current implementation favors some applications but punishes others that are also valid. I’m commenting on this issue even though it has been closed, as I’m unsure whether it is still in focus. Cheers āœŒļø

edit: I’m actually even transporting pathlib paths in lists.
edit2: And, by the way, it also appears that you cannot fix this by passing the batch_size to every log call…

The warning is not triggered by self.log but when loading a batch from the dataloaders. The trainer infers the batch size in every step, which is needed for accumulating metrics when using self.log(on_epoch=True). It does so regardless of whether anything is logged at all, so using self.log(batch_size=…) does not suppress this warning. The warning is rather a reminder for the user to take care of this so that the accumulated metrics are calculated correctly.

I assume the reason you see this warning is similar to mine where there is a custom batch structure including special data types which are not necessarily needed for training or accumulating metrics, e.g. strings.

For now, a simple workaround would be to filter this warning manually by:

import warnings

warnings.filterwarnings(
    "ignore", ".*Trying to infer the `batch_size` from an ambiguous collection.*"
)

comment_text might also be coming in within the batch of the training_step. I don’t remember exactly how PyTorch handles collating string values, but if you are using automatic logging with on_step=True, we try to extract the batch_size from the batch itself by iterating over all of its components. If the batch is ambiguous and we find more than one candidate batch size, we log this warning along with the batch size we picked, to make sure the user is aware of the batch_size being used to accumulate the logged values.

If I’m accumulating gradients, should I use the batch size of the dataloader, or the batch size of the dataloader times the number of batches being accumulated?

@rohitgr7 Can you elaborate why you consider strings to hold information about the batch size?

Maybe yielding None for types that are not tensors would be less confusing, e.g. by removing the string comparison and changing the else branch to yield None instead of 1 in:

https://github.com/PyTorchLightning/pytorch-lightning/blob/348fc4b49f0e74acb9785b5179abb8fd01beb45a/pytorch_lightning/utilities/data.py#L30-L42

and handling this None case in:

https://github.com/PyTorchLightning/pytorch-lightning/blob/348fc4b49f0e74acb9785b5179abb8fd01beb45a/pytorch_lightning/utilities/data.py#L45-L62

Also, if no batch_size is found at all (batch_size is still None after the loop), a similar warning could be logged.
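Put together, the suggested change might look something like this (a sketch of the proposal only, not a tested patch against the linked code):

import warnings

import torch


def _extract_batch_size(batch):
    if isinstance(batch, torch.Tensor):
        yield batch.size(0)
    elif isinstance(batch, dict):
        for sample in batch.values():
            yield from _extract_batch_size(sample)
    elif isinstance(batch, (list, tuple)):
        for sample in batch:
            yield from _extract_batch_size(sample)
    else:
        # the suggestion: yield None for non-tensor types instead of
        # treating strings specially or hard-coding 1
        yield None


def extract_batch_size(batch):
    batch_size = None
    for size in _extract_batch_size(batch):
        if size is None:
            continue  # components without batch-size information are skipped
        if batch_size is None:
            batch_size = size
        elif size != batch_size:
            warnings.warn(
                "Trying to infer the `batch_size` from an ambiguous collection."
            )
            break
    if batch_size is None:
        # no tensor found at all -> a similar warning, then fall back to 1
        warnings.warn("Could not infer the `batch_size` from the batch.")
        batch_size = 1
    return batch_size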

Here is the code if you want to look further into where and how this warning is raised: https://github.com/PyTorchLightning/pytorch-lightning/blob/master/pytorch_lightning/utilities/data.py#L45

Hi, sorry! I’m afk for a few days and trying my best to resolve this from my phone šŸ˜…. These changes were added in the recent 1.5 release. In the dataset above you are returning the comment text, but you are not using it inside training_step. Can you try removing the comment text from the dictionary returned by your dataset and check whether you still get this warning?
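For reference, the change being suggested would look something like this in the dataset (the field names here are assumptions based on the thread, not the reporter’s actual code):

from torch.utils.data import Dataset


class CommentDataset(Dataset):
    """Hypothetical dataset mirroring the issue's setup; names are assumptions."""

    def __init__(self, input_ids, attention_masks, labels, comment_texts):
        self.input_ids = input_ids
        self.attention_masks = attention_masks
        self.labels = labels
        self.comment_texts = comment_texts  # kept on the dataset, just no longer returned

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        return {
            # "comment_text": self.comment_texts[idx],  # <- drop the raw string
            "input_ids": self.input_ids[idx],
            "attention_mask": self.attention_masks[idx],
            "labels": self.labels[idx],
        }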

@yipliu since one of the components is a list, which is not really a PyTorch tensor, Lightning doesn’t rely on it to determine the batch_size and thus raises a warning so the user can confirm the batch_size being used. If you want to silence the warning you can set

self.log(..., batch_size=...)
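For example, something along these lines (a sketch; the key name and loss helper are assumptions):

def training_step(self, batch, batch_idx):
    loss = self.compute_loss(batch)  # hypothetical helper
    # pass the true batch size explicitly so metric accumulation is correct
    self.log(
        "train_loss",
        loss,
        on_epoch=True,
        batch_size=batch["input_ids"].size(0),  # "input_ids" is an assumed key
    )
    return loss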


The detailed explanation above of how the batch size is inferred is very clear! Thank you!

Just one more question: why does PL need to know the batch_size in each training_step?

@rohitgr7 I totally get the motivation of batch_size sanity check, but what if I have such return type:

class MyDataset(Dataset):
    def __getitem__(self, idx):
        # some processing here

        # return results (Tensor here is a placeholder for an actual tensor value)
        return {'input_1': {'token_ids': Tensor, 'attention_mask': Tensor},
                'input_2': {'token_ids': Tensor, 'attention_mask': Tensor}}
Then, under the current PL logic, the inferred size of each component will be 2, right?

The motivation for returning two dicts (input_1 and input_2 in my case) is that I usually need to test many different tokenization methods, raw data formats, etc. To keep the code flexible, I decided to return all of them and use a configuration to select a subset later (e.g., in the DataLoader or LightningModule), as in the sketch below.
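If it helps, one way to keep that flexibility while side-stepping the ambiguity is to select the configured sub-dict inside the module and pass the batch size explicitly (a sketch; self.hparams.input_key and compute_loss are assumptions, not existing API beyond the standard self.log arguments):

def training_step(self, batch, batch_idx):
    # pick the configured tokenization variant, e.g. "input_1" or "input_2"
    inputs = batch[self.hparams.input_key]  # assumed config entry
    loss = self.compute_loss(inputs)  # hypothetical helper
    self.log(
        "train_loss",
        loss,
        on_epoch=True,
        batch_size=inputs["token_ids"].size(0),  # explicit, so no inference needed
    )
    return loss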