DALI: VideoReader blocking while loading videos

Hi everybody,

I’m opening an issue since I am encountering several problems writing a Pipeline for loading video files. I’m not sure of wheter DALI is the best tool for my task and neither if I am using it properly, so I would start by first explaining my goal. I have a very huge dataset consisting of hundreds of thousands of videos, and I would like to use DALI’s VideoReader to build a PyTorch DataLoader since, according to the documentation, DALI’s VideoReader uses NVIDIA GPU’s hardware-accelerated video decoding, so I would like to speed-up and eventually parallelize the training of a CNN using the GPU for the data loading operations. I took the Video Super Resolution example (https://docs.nvidia.com/deeplearning/sdk/dali-developer-guide/docs/examples/video/superres_pytorch/README.html) and wrote my personal DALILoader as it follows:

"""
Dataset class for wrapping the DALI Pipeline in PyTorch
"""

import sys
import copy
from glob import glob
import math
import os

import torch
from torch.utils.data import DataLoader

from nvidia.dali.pipeline import Pipeline
from nvidia.dali.plugin import pytorch
import nvidia.dali.ops as ops
import nvidia.dali.types as types
import datetime

class VideoReaderPipeline(Pipeline):
    """
    DALI Pipeline for opening a video, normalizing it and randomly crop it
    """
    def __init__(self, batch_size, sequence_length, num_threads, device_id, files, crop_size, shuffle=False,
                 isGray=False):
        super(VideoReaderPipeline, self).__init__(batch_size, num_threads, device_id, seed=12)
        if isGray:
            self.num_channels = 1
        else:
            self.num_channels = 3
        # Video reader
        self.reader = ops.VideoReader(device="gpu", file_list=files, sequence_length=sequence_length, normalized=False,
                                      random_shuffle=shuffle, image_type=types.RGB, dtype=types.UINT8, initial_fill=16,
                                      channels=self.num_channels)
        # CropMirrorNormalize allows for cropping, mirroring, normalizing and finally transposing the output tensor
        # (defalut is CHW, so we don't insert an explicit transpose operation in the pipeline)
        self.crop = ops.CropMirrorNormalize(device="gpu", crop=crop_size, mean=[127.0],
                                            std=[127.0], mirror=0, output_dtype=types.FLOAT)
        # Random number generator for specifying the cropping position (for now crop each frame singularly without
        # looking into the temporal dimension)
        self.uniform = ops.Uniform(range=(0.0, 1.0))
        self.uniform1 = ops.Uniform(range=(0.0, 0.0))

    def define_graph(self):
        input = self.reader(name="Reader")
        output = self.crop(input[0], crop_pos_z=self.uniform1(), crop_pos_x=self.uniform(), crop_pos_y=self.uniform())
        return output, input[1]

class DALILoader():
    def __init__(self, batch_size, file_list, sequence_length, crop_size, device):
        self.pipeline = VideoReaderPipeline(batch_size=batch_size,
                                            sequence_length=sequence_length,
                                            num_threads=2,
                                            device_id=device,
                                            files=file_list,
                                            crop_size=crop_size)
        self.pipeline.build()
        self.epoch_size = self.pipeline.epoch_size("Reader")
        self.dali_iterator = pytorch.DALIGenericIterator(self.pipeline,
                                                         ["file", "label"],
                                                         self.epoch_size,
                                                         auto_reset=True)
    def __len__(self):
        return int(self.epoch_size)
    def __iter__(self):
        return self.dali_iterator.__iter__()

I created a file_list.csv where I have written all the paths and the labels of the videos (my task is a simple binary classification), and this simple test script:

if __name__=='__main__':
    print('Starting test...')
    batch_size = 1
    seq_length = 100
    file_list = 'path/to/file_list.csv'
    loader = DALILoader(batch_size, file_list, seq_length, [0.0, 256.0, 256.0], 0)
    print('Loading videos at {}...'.format(datetime.datetime.now()))
    iterator = loader.__iter__()
    while iterator:
        item = iterator.__next__()
        for label in item[0]["label"]:
            print('Video is positive!') if label == 1 else print('Video is negative!')
    print('Videos loaded at {}'.format(datetime.datetime.now()))
    print('Finishing test!')

I want simply to load 100 frames of each video, then crop them randomly in the height and width dimension. As a first test, I didn’t want to use the whole dataset, so I used just a portion of it (we are talking about 4000/5000 videos in any case), but when I run the code I have encountered three major errors. I report them in “discovery order”, as after I have encountered the first one I have simplified my code reducing the task complexity too for doing a little debugging. I have DALI 0.16.0 installed, running the code on an Ubuntu machine with an E5-2630 CPU, 128GB of RAM and a single NVIDIA Quadro P6000 GPU.

The first error appears by simply running the script above as it is:

Traceback (most recent call last):
  File "DALILoader.py", line 76, in <module>
    loader = DALILoader(batch_size, file_list, seq_length, [0.0, 256.0, 256.0], 0)
  File "DALILoader.py", line 59, in __init__
    self.pipeline.build()
  File "/nas/home/ecannas/miniconda3/lib/python3.7/site-packages/nvidia/dali/pipeline.py", line 308, in build
    self._pipe.Build(self._names_and_devices)
RuntimeError: [/opt/dali/dali/operators/reader/loader/video_loader.cc:190] Could not open file /nas/public/dataset/1848521_1441897_A_000.mp4 because of Too many open files
Stacktrace (32 entries):
[frame 0]: /nas/home/ecannas/miniconda3/lib/python3.7/site-packages/nvidia/dali/libdali_operators.so(+0x1434ae) [0x7f03e72794ae]
[frame 1]: /nas/home/ecannas/miniconda3/lib/python3.7/site-packages/nvidia/dali/libdali_operators.so(+0x40c8db) [0x7f03e75428db]
[frame 2]: /nas/home/ecannas/miniconda3/lib/python3.7/site-packages/nvidia/dali/libdali_operators.so(+0x412c72) [0x7f03e7548c72]
[frame 3]: /nas/home/ecannas/miniconda3/lib/python3.7/site-packages/nvidia/dali/libdali_operators.so(+0x43730f) [0x7f03e756d30f]
[frame 4]: /nas/home/ecannas/miniconda3/lib/python3.7/site-packages/nvidia/dali/libdali_operators.so(+0x438202) [0x7f03e756e202]
[frame 5]: /nas/home/ecannas/miniconda3/lib/python3.7/site-packages/nvidia/dali/libdali_operators.so(std::_Function_handler<std::unique_ptr<dali::OperatorBase, std::default_delete<dali::OperatorBase> > (dali::OpSpec const&), std::unique_ptr<dali::OperatorBase, std::default_delete<dali::OperatorBase> > (*)(dali::OpSpec const&)>::_M_invoke(std::_Any_data const&, dali::OpSpec const&)+0xc) [0x7f03e727476c]
[frame 6]: /nas/home/ecannas/miniconda3/lib/python3.7/site-packages/nvidia/dali/libdali.so(+0x131284) [0x7f03e5cd9284]
[frame 7]: /nas/home/ecannas/miniconda3/lib/python3.7/site-packages/nvidia/dali/libdali.so(dali::InstantiateOperator(dali::OpSpec const&)+0x34e) [0x7f03e5cd87ce]
[frame 8]: /nas/home/ecannas/miniconda3/lib/python3.7/site-packages/nvidia/dali/libdali.so(dali::OpGraph::InstantiateOperators()+0xa7) [0x7f03e5c91267]
[frame 9]: /nas/home/ecannas/miniconda3/lib/python3.7/site-packages/nvidia/dali/libdali.so(dali::Pipeline::Build(std::vector<std::pair<std::string, std::string>, std::allocator<std::pair<std::string, std::string> > >)+0xad8) [0x7f03e5cf7858]
[frame 10]: /nas/home/ecannas/miniconda3/lib/python3.7/site-packages/nvidia/dali/backend_impl.cpython-37m-x86_64-linux-gnu.so(+0x3758f) [0x7f03ed52e58f]
[frame 11]: /nas/home/ecannas/miniconda3/lib/python3.7/site-packages/nvidia/dali/backend_impl.cpython-37m-x86_64-linux-gnu.so(+0x1fe03) [0x7f03ed516e03]
[frame 12]: python3(_PyMethodDef_RawFastCallKeywords+0x264) [0x55d62a49e6e4]
[frame 13]: python3(_PyCFunction_FastCallKeywords+0x21) [0x55d62a49e801]
[frame 14]: python3(_PyEval_EvalFrameDefault+0x537e) [0x55d62a4fa7ae]
[frame 15]: python3(_PyFunction_FastCallKeywords+0xfb) [0x55d62a49d79b]
[frame 16]: python3(_PyEval_EvalFrameDefault+0x6a0) [0x55d62a4f5ad0]
[frame 17]: python3(_PyFunction_FastCallDict+0x10b) [0x55d62a43c50b]
[frame 18]: python3(_PyObject_Call_Prepend+0xde) [0x55d62a453cbe]
[frame 19]: python3(+0x1710aa) [0x55d62a4960aa]
[frame 20]: python3(_PyObject_FastCallKeywords+0x128) [0x55d62a49e9b8]
[frame 21]: python3(_PyEval_EvalFrameDefault+0x4bf6) [0x55d62a4fa026]
[frame 22]: python3(_PyEval_EvalCodeWithName+0x2f9) [0x55d62a43b4f9]
[frame 23]: python3(PyEval_EvalCodeEx+0x44) [0x55d62a43c3c4]
[frame 24]: python3(PyEval_EvalCode+0x1c) [0x55d62a43c3ec]
[frame 25]: python3(+0x22f874) [0x55d62a554874]
[frame 26]: python3(PyRun_FileExFlags+0xa1) [0x55d62a55eb81]
[frame 27]: python3(PyRun_SimpleFileExFlags+0x1c3) [0x55d62a55ed73]
[frame 28]: python3(+0x23ae5f) [0x55d62a55fe5f]
[frame 29]: python3(_Py_UnixMain+0x3c) [0x55d62a55ff7c]
[frame 30]: /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xe7) [0x7f044837fb97]
[frame 31]: python3(+0x1e0122) [0x55d62a505122]

My first question therefore is:

Is there a limit for the number of videos to be opened by a VideoReader? Obviously 100 frames of 4000 videos cannot fit on a GPU memory, but I have imagined that each video would be loaded singularly only at the next() call of the DALIGenericIterator, so the frames would be loaded only when needed. Am I wrong? Moreover, for taking 100 frames of each video, is it right to have batch_size=1 and seq_length=100?

As a second experiment, I reduced the number of videos to 100. This time it seems that DALI is able to load the videos, but I got another error instead:

Traceback (most recent call last):
  File "DALILoader.py", line 76, in <module>
    loader = DALILoader(batch_size, file_list, seq_length, [0.0, 256.0, 256.0], 0)
  File "DALILoader.py", line 64, in __init__
    auto_reset=True)
  File "/nas/home/ecannas/miniconda3/lib/python3.7/site-packages/nvidia/dali/plugin/pytorch.py", line 147, in __init__
    self._first_batch = self.next()
  File "/nas/home/ecannas/miniconda3/lib/python3.7/site-packages/nvidia/dali/plugin/pytorch.py", line 244, in next
    return self.__next__()
  File "/nas/home/ecannas/miniconda3/lib/python3.7/site-packages/nvidia/dali/plugin/pytorch.py", line 162, in __next__
    outputs.append(p.share_outputs())
  File "/nas/home/ecannas/miniconda3/lib/python3.7/site-packages/nvidia/dali/pipeline.py", line 399, in share_outputs
    return self._pipe.ShareOutputs()
RuntimeError: Critical error in pipeline: [/opt/dali/dali/operators/fused/crop_mirror_normalize.h:155] Assert on "output_layout_.is_permutation_of(input_layout_)" failed: The requested output layout is not a permutation of input layout.
Stacktrace (11 entries):
[frame 0]: /nas/home/ecannas/miniconda3/lib/python3.7/site-packages/nvidia/dali/libdali_operators.so(+0x1434ae) [0x7fed0a7c34ae]
[frame 1]: /nas/home/ecannas/miniconda3/lib/python3.7/site-packages/nvidia/dali/libdali_operators.so(+0x79aa25) [0x7fed0ae1aa25]
[frame 2]: /nas/home/ecannas/miniconda3/lib/python3.7/site-packages/nvidia/dali/libdali_operators.so(+0x760f00) [0x7fed0ade0f00]
[frame 3]: /nas/home/ecannas/miniconda3/lib/python3.7/site-packages/nvidia/dali/libdali_operators.so(+0x3dcead) [0x7fed0aa5cead]
[frame 4]: /nas/home/ecannas/miniconda3/lib/python3.7/site-packages/nvidia/dali/libdali.so(+0xc3c6d) [0x7fed091b5c6d]
[frame 5]: /nas/home/ecannas/miniconda3/lib/python3.7/site-packages/nvidia/dali/libdali.so(+0xc4637) [0x7fed091b6637]
[frame 6]: /nas/home/ecannas/miniconda3/lib/python3.7/site-packages/nvidia/dali/libdali.so(+0x960e3) [0x7fed091880e3]
[frame 7]: /nas/home/ecannas/miniconda3/lib/python3.7/site-packages/nvidia/dali/libdali.so(+0x1139c6) [0x7fed092059c6]
[frame 8]: /nas/home/ecannas/miniconda3/lib/python3.7/site-packages/nvidia/dali/libdali.so(+0x6f6c90) [0x7fed097e8c90]
[frame 9]: /lib/x86_64-linux-gnu/libpthread.so.0(+0x76db) [0x7fed6bca06db]
[frame 10]: /lib/x86_64-linux-gnu/libc.so.6(clone+0x3f) [0x7fed6b9c988f]

Current pipeline object is no longer valid.

I am probably using the crop operation wrong, so

Is it right to have the CropMirrorNormalize working on the input[0] element? I expect that element to be the 100 frames batch tensor, with the input[1] element being the label instead. Am I guessing right? Is something wrong in my code or in the way I am using the CropMirrorNormalize operation?

Finally, as a last experiment I have removed the CropMirrorNormalize operation and built the pipeline using the VideoReader only. This time the code runs with no error, but it seems to “stop” after loading 3 videos only. The terminal stayed “freezed” for several minutes, and I had to kill the process. So, I am wondering

Do you have any guess for this behaviour?

I hope that my post is comprehensible and I apologize in advance for asking maybe too many non-related questions altogether, but I could not find any answer in the docs or in other issues here on GitHub.

Thank you in advance!

About this issue

Original URL
State: closed
Created 5 years ago
Reactions: 2
Comments: 48 (27 by maintainers)

Most upvoted comments

Hey @JanuszL ,

I think I finally found the root of the problem.

when a bigger image at the output of some random operator (like random resize or just decoder) appear memory need to be additionally allocated.

You were right! As you were suggesting, I found out that some of the videos present a resolution greater than the 1920x1080 of the full HD! What happened here

As I said now the code runs, but at video number 67 it seems to fail to allocate the memory for the GPU

is that the successive video in the list (video number 68) has a resolution of 3840x2160 pixels; while prefetching the successive batch with DALI the GPU runs out of memory and therefore from here the CUDA allocation failed error pops out. Reducing the sequence_length allowed me to see the allocation of the biggest frames in the GPU and the “spike” in the memory consumption; until #1643 is merged, I will probably work with shorter sequences.

Speaking of this, I would like to use the stride argument of the VideoReader. If I have a video of, let’s say, 10 (numbered) frames, like this [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10], and I would like to have a sequence of sequence_length=5 and stride=2, this means that the resulting sequence will contain one frame every two right? Resulting in something like this [0, 2, 4, 6, 8]?

Thank you really really much for your help! This code had kept me busy for weeks now, without your assistance I could never make it work!

CrohnEngineer on Jan 15, 2020

Thank You for your response. I issue is resolved right now.

0Rutuja28-97 on Oct 20, 2022

https://github.com/NVIDIA/DALI/pull/1643 should reduce memory consumption

JanuszL on Jan 10, 2020

@CrohnEngineer - I see one incomplete implementation in DALI. Even you ask the VideoReader for dtype=types.UINT8 it internally allocates memory for float32 data. I will fix that soon, it should reduce memory occupation 4 times (I hope).

JanuszL on Jan 10, 2020