DALI: VideoReader blocking while loading videos
Hi everybody,
I’m opening an issue since I am encountering several problems while writing a Pipeline for loading video files. I’m not sure whether DALI is the best tool for my task, nor whether I am using it properly, so I’ll start by explaining my goal.
I have a huge dataset consisting of hundreds of thousands of videos, and I would like to use DALI’s VideoReader to build a PyTorch DataLoader since, according to the documentation, DALI’s VideoReader uses the NVIDIA GPU’s hardware-accelerated video decoding. This way I hope to speed up, and eventually parallelize, the training of a CNN by using the GPU for the data loading operations.
I took the Video Super Resolution example (https://docs.nvidia.com/deeplearning/sdk/dali-developer-guide/docs/examples/video/superres_pytorch/README.html) and wrote my personal DALILoader as follows:
"""
Dataset class for wrapping the DALI Pipeline in PyTorch
"""
import sys
import copy
from glob import glob
import math
import os
import torch
from torch.utils.data import DataLoader
from nvidia.dali.pipeline import Pipeline
from nvidia.dali.plugin import pytorch
import nvidia.dali.ops as ops
import nvidia.dali.types as types
import datetime
class VideoReaderPipeline(Pipeline):
"""
DALI Pipeline for opening a video, normalizing it and randomly crop it
"""
def __init__(self, batch_size, sequence_length, num_threads, device_id, files, crop_size, shuffle=False,
isGray=False):
super(VideoReaderPipeline, self).__init__(batch_size, num_threads, device_id, seed=12)
if isGray:
self.num_channels = 1
else:
self.num_channels = 3
# Video reader
self.reader = ops.VideoReader(device="gpu", file_list=files, sequence_length=sequence_length, normalized=False,
random_shuffle=shuffle, image_type=types.RGB, dtype=types.UINT8, initial_fill=16,
channels=self.num_channels)
# CropMirrorNormalize allows for cropping, mirroring, normalizing and finally transposing the output tensor
# (defalut is CHW, so we don't insert an explicit transpose operation in the pipeline)
self.crop = ops.CropMirrorNormalize(device="gpu", crop=crop_size, mean=[127.0],
std=[127.0], mirror=0, output_dtype=types.FLOAT)
# Random number generator for specifying the cropping position (for now crop each frame singularly without
# looking into the temporal dimension)
self.uniform = ops.Uniform(range=(0.0, 1.0))
self.uniform1 = ops.Uniform(range=(0.0, 0.0))
def define_graph(self):
input = self.reader(name="Reader")
output = self.crop(input[0], crop_pos_z=self.uniform1(), crop_pos_x=self.uniform(), crop_pos_y=self.uniform())
return output, input[1]
class DALILoader():
def __init__(self, batch_size, file_list, sequence_length, crop_size, device):
self.pipeline = VideoReaderPipeline(batch_size=batch_size,
sequence_length=sequence_length,
num_threads=2,
device_id=device,
files=file_list,
crop_size=crop_size)
self.pipeline.build()
self.epoch_size = self.pipeline.epoch_size("Reader")
self.dali_iterator = pytorch.DALIGenericIterator(self.pipeline,
["file", "label"],
self.epoch_size,
auto_reset=True)
def __len__(self):
return int(self.epoch_size)
def __iter__(self):
return self.dali_iterator.__iter__()
I created a file_list.csv file in which I have written all the paths and the labels of the videos (my task is a simple binary classification).
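As far as I understand, VideoReader's file_list expects one "path label" pair per line, so the file looks roughly like this (the paths below are just placeholders, not my real data):

/path/to/videos/video_000.mp4 0
/path/to/videos/video_001.mp4 1

I then wrote this simple test script: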
if __name__ == '__main__':
    print('Starting test...')
    batch_size = 1
    seq_length = 100
    file_list = 'path/to/file_list.csv'
    loader = DALILoader(batch_size, file_list, seq_length, [0.0, 256.0, 256.0], 0)
    print('Loading videos at {}...'.format(datetime.datetime.now()))
    iterator = loader.__iter__()
    while iterator:
        item = iterator.__next__()
        for label in item[0]["label"]:
            print('Video is positive!') if label == 1 else print('Video is negative!')
    print('Videos loaded at {}'.format(datetime.datetime.now()))
    print('Finishing test!')
I simply want to load 100 frames of each video and then crop them randomly along the height and width dimensions. As a first test, I didn’t want to use the whole dataset, so I used just a portion of it (around 4000-5000 videos in any case), but when I ran the code I encountered three major errors. I report them in “discovery order”: after hitting the first one, I simplified my code and reduced the task complexity to do a little debugging. I have DALI 0.16.0 installed, and I am running the code on an Ubuntu machine with an E5-2630 CPU, 128 GB of RAM and a single NVIDIA Quadro P6000 GPU.
The first error appears when simply running the script above as it is:
Traceback (most recent call last):
File "DALILoader.py", line 76, in <module>
loader = DALILoader(batch_size, file_list, seq_length, [0.0, 256.0, 256.0], 0)
File "DALILoader.py", line 59, in __init__
self.pipeline.build()
File "/nas/home/ecannas/miniconda3/lib/python3.7/site-packages/nvidia/dali/pipeline.py", line 308, in build
self._pipe.Build(self._names_and_devices)
RuntimeError: [/opt/dali/dali/operators/reader/loader/video_loader.cc:190] Could not open file /nas/public/dataset/1848521_1441897_A_000.mp4 because of Too many open files
Stacktrace (32 entries):
[frame 0]: /nas/home/ecannas/miniconda3/lib/python3.7/site-packages/nvidia/dali/libdali_operators.so(+0x1434ae) [0x7f03e72794ae]
[frame 1]: /nas/home/ecannas/miniconda3/lib/python3.7/site-packages/nvidia/dali/libdali_operators.so(+0x40c8db) [0x7f03e75428db]
[frame 2]: /nas/home/ecannas/miniconda3/lib/python3.7/site-packages/nvidia/dali/libdali_operators.so(+0x412c72) [0x7f03e7548c72]
[frame 3]: /nas/home/ecannas/miniconda3/lib/python3.7/site-packages/nvidia/dali/libdali_operators.so(+0x43730f) [0x7f03e756d30f]
[frame 4]: /nas/home/ecannas/miniconda3/lib/python3.7/site-packages/nvidia/dali/libdali_operators.so(+0x438202) [0x7f03e756e202]
[frame 5]: /nas/home/ecannas/miniconda3/lib/python3.7/site-packages/nvidia/dali/libdali_operators.so(std::_Function_handler<std::unique_ptr<dali::OperatorBase, std::default_delete<dali::OperatorBase> > (dali::OpSpec const&), std::unique_ptr<dali::OperatorBase, std::default_delete<dali::OperatorBase> > (*)(dali::OpSpec const&)>::_M_invoke(std::_Any_data const&, dali::OpSpec const&)+0xc) [0x7f03e727476c]
[frame 6]: /nas/home/ecannas/miniconda3/lib/python3.7/site-packages/nvidia/dali/libdali.so(+0x131284) [0x7f03e5cd9284]
[frame 7]: /nas/home/ecannas/miniconda3/lib/python3.7/site-packages/nvidia/dali/libdali.so(dali::InstantiateOperator(dali::OpSpec const&)+0x34e) [0x7f03e5cd87ce]
[frame 8]: /nas/home/ecannas/miniconda3/lib/python3.7/site-packages/nvidia/dali/libdali.so(dali::OpGraph::InstantiateOperators()+0xa7) [0x7f03e5c91267]
[frame 9]: /nas/home/ecannas/miniconda3/lib/python3.7/site-packages/nvidia/dali/libdali.so(dali::Pipeline::Build(std::vector<std::pair<std::string, std::string>, std::allocator<std::pair<std::string, std::string> > >)+0xad8) [0x7f03e5cf7858]
[frame 10]: /nas/home/ecannas/miniconda3/lib/python3.7/site-packages/nvidia/dali/backend_impl.cpython-37m-x86_64-linux-gnu.so(+0x3758f) [0x7f03ed52e58f]
[frame 11]: /nas/home/ecannas/miniconda3/lib/python3.7/site-packages/nvidia/dali/backend_impl.cpython-37m-x86_64-linux-gnu.so(+0x1fe03) [0x7f03ed516e03]
[frame 12]: python3(_PyMethodDef_RawFastCallKeywords+0x264) [0x55d62a49e6e4]
[frame 13]: python3(_PyCFunction_FastCallKeywords+0x21) [0x55d62a49e801]
[frame 14]: python3(_PyEval_EvalFrameDefault+0x537e) [0x55d62a4fa7ae]
[frame 15]: python3(_PyFunction_FastCallKeywords+0xfb) [0x55d62a49d79b]
[frame 16]: python3(_PyEval_EvalFrameDefault+0x6a0) [0x55d62a4f5ad0]
[frame 17]: python3(_PyFunction_FastCallDict+0x10b) [0x55d62a43c50b]
[frame 18]: python3(_PyObject_Call_Prepend+0xde) [0x55d62a453cbe]
[frame 19]: python3(+0x1710aa) [0x55d62a4960aa]
[frame 20]: python3(_PyObject_FastCallKeywords+0x128) [0x55d62a49e9b8]
[frame 21]: python3(_PyEval_EvalFrameDefault+0x4bf6) [0x55d62a4fa026]
[frame 22]: python3(_PyEval_EvalCodeWithName+0x2f9) [0x55d62a43b4f9]
[frame 23]: python3(PyEval_EvalCodeEx+0x44) [0x55d62a43c3c4]
[frame 24]: python3(PyEval_EvalCode+0x1c) [0x55d62a43c3ec]
[frame 25]: python3(+0x22f874) [0x55d62a554874]
[frame 26]: python3(PyRun_FileExFlags+0xa1) [0x55d62a55eb81]
[frame 27]: python3(PyRun_SimpleFileExFlags+0x1c3) [0x55d62a55ed73]
[frame 28]: python3(+0x23ae5f) [0x55d62a55fe5f]
[frame 29]: python3(_Py_UnixMain+0x3c) [0x55d62a55ff7c]
[frame 30]: /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xe7) [0x7f044837fb97]
[frame 31]: python3(+0x1e0122) [0x55d62a505122]
My first question therefore is:
- Is there a limit on the number of videos a VideoReader can open? Obviously 100 frames of 4000 videos cannot fit in GPU memory, but I imagined that each video would be loaded individually only at the next() call of the DALIGenericIterator, so that the frames would be loaded only when needed. Am I wrong? Moreover, to take 100 frames of each video, is it right to have batch_size=1 and seq_length=100?
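In case it is relevant, this is a quick way to check (and raise) the per-process open-file limit from Python; I am only guessing that this is the limit the error message refers to:

import resource

# Current soft/hard limits on the number of open file descriptors for this process
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print('soft limit: {}, hard limit: {}'.format(soft, hard))

# Raise the soft limit up to the hard limit (this may still not be enough if the
# reader opens every video listed in file_list when the pipeline is built)
resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))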
As a second experiment, I reduced the number of videos to 100. This time it seems that DALI is able to load the videos, but I got another error instead:
Traceback (most recent call last):
File "DALILoader.py", line 76, in <module>
loader = DALILoader(batch_size, file_list, seq_length, [0.0, 256.0, 256.0], 0)
File "DALILoader.py", line 64, in __init__
auto_reset=True)
File "/nas/home/ecannas/miniconda3/lib/python3.7/site-packages/nvidia/dali/plugin/pytorch.py", line 147, in __init__
self._first_batch = self.next()
File "/nas/home/ecannas/miniconda3/lib/python3.7/site-packages/nvidia/dali/plugin/pytorch.py", line 244, in next
return self.__next__()
File "/nas/home/ecannas/miniconda3/lib/python3.7/site-packages/nvidia/dali/plugin/pytorch.py", line 162, in __next__
outputs.append(p.share_outputs())
File "/nas/home/ecannas/miniconda3/lib/python3.7/site-packages/nvidia/dali/pipeline.py", line 399, in share_outputs
return self._pipe.ShareOutputs()
RuntimeError: Critical error in pipeline: [/opt/dali/dali/operators/fused/crop_mirror_normalize.h:155] Assert on "output_layout_.is_permutation_of(input_layout_)" failed: The requested output layout is not a permutation of input layout.
Stacktrace (11 entries):
[frame 0]: /nas/home/ecannas/miniconda3/lib/python3.7/site-packages/nvidia/dali/libdali_operators.so(+0x1434ae) [0x7fed0a7c34ae]
[frame 1]: /nas/home/ecannas/miniconda3/lib/python3.7/site-packages/nvidia/dali/libdali_operators.so(+0x79aa25) [0x7fed0ae1aa25]
[frame 2]: /nas/home/ecannas/miniconda3/lib/python3.7/site-packages/nvidia/dali/libdali_operators.so(+0x760f00) [0x7fed0ade0f00]
[frame 3]: /nas/home/ecannas/miniconda3/lib/python3.7/site-packages/nvidia/dali/libdali_operators.so(+0x3dcead) [0x7fed0aa5cead]
[frame 4]: /nas/home/ecannas/miniconda3/lib/python3.7/site-packages/nvidia/dali/libdali.so(+0xc3c6d) [0x7fed091b5c6d]
[frame 5]: /nas/home/ecannas/miniconda3/lib/python3.7/site-packages/nvidia/dali/libdali.so(+0xc4637) [0x7fed091b6637]
[frame 6]: /nas/home/ecannas/miniconda3/lib/python3.7/site-packages/nvidia/dali/libdali.so(+0x960e3) [0x7fed091880e3]
[frame 7]: /nas/home/ecannas/miniconda3/lib/python3.7/site-packages/nvidia/dali/libdali.so(+0x1139c6) [0x7fed092059c6]
[frame 8]: /nas/home/ecannas/miniconda3/lib/python3.7/site-packages/nvidia/dali/libdali.so(+0x6f6c90) [0x7fed097e8c90]
[frame 9]: /lib/x86_64-linux-gnu/libpthread.so.0(+0x76db) [0x7fed6bca06db]
[frame 10]: /lib/x86_64-linux-gnu/libc.so.6(clone+0x3f) [0x7fed6b9c988f]
Current pipeline object is no longer valid.
I am probably using the crop operation incorrectly, so:
- Is it right to have CropMirrorNormalize working on the input[0] element? I expect that element to be the 100-frame batch tensor, with input[1] being the label instead. Am I guessing right? Is something wrong in my code or in the way I am using the CropMirrorNormalize operation?
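If it helps, this is the variant of the crop operator I was planning to try next; the output_layout value is just my guess at how to keep the frame dimension of the sequence, and I have not verified that it is correct for this DALI version:

# Guess (not verified): explicitly request a layout that is a permutation of the
# sequence input layout "FHWC" (frames, height, width, channels), e.g. "FCHW",
# instead of the default image layout
self.crop = ops.CropMirrorNormalize(device="gpu", crop=crop_size, mean=[127.0],
                                    std=[127.0], mirror=0, output_dtype=types.FLOAT,
                                    output_layout="FCHW")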
Finally, as a last experiment I removed the CropMirrorNormalize operation and built the pipeline with the VideoReader only. This time the code runs with no errors, but it seems to “stop” after loading only 3 videos. The terminal stayed “frozen” for several minutes, and I had to kill the process. So, I am wondering:
- Do you have any guess about this behaviour?
I hope my post is comprehensible, and I apologize in advance for asking perhaps too many unrelated questions at once, but I could not find any answer in the docs or in other issues here on GitHub.
Thank you in advance!
About this issue
- Original URL
- State: closed
- Created 5 years ago
- Reactions: 2
- Comments: 48 (27 by maintainers)
Hey @JanuszL,
I think I finally found the root of the problem.
You were right! As you suggested, I found out that some of the videos have a resolution greater than the 1920x1080 of Full HD! What happened here is that the next video in the list (video number 68) has a resolution of 3840x2160 pixels; while prefetching the next batch with DALI, the GPU runs out of memory, and that is where the CUDA allocation failed error pops out. Reducing the sequence_length allowed me to see the allocation of the biggest frames on the GPU and the “spike” in memory consumption; until #1643 is merged, I will probably work with shorter sequences.
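Just to put some numbers on it, here is the rough back-of-the-envelope estimate I made for a single 4K sequence (my own arithmetic, assuming 3 channels and float32 buffers):

# Back-of-the-envelope estimate, assuming 3 channels and float32 buffers
frame_bytes = 3840 * 2160 * 3 * 4      # ~99.5 MB per decoded 4K frame
sequence_bytes = frame_bytes * 100     # sequence_length=100
print(sequence_bytes / 2**30)          # ~9.3 GiB per sequence, before prefetching the next batch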
Speaking of this, I would like to use the stride argument of the VideoReader. If I have a video of, let’s say, 10 (numbered) frames, like this [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10], and I would like a sequence with sequence_length=5 and stride=2, does this mean that the resulting sequence will contain one frame every two, resulting in something like [0, 2, 4, 6, 8]?
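In other words, I would set up the reader roughly like this (only stride=2 changes with respect to my pipeline above, assuming I understand the argument correctly):

# Assumed behaviour: with sequence_length=5 and stride=2 the returned sequence
# should contain every other frame, e.g. [0, 2, 4, 6, 8]
self.reader = ops.VideoReader(device="gpu", file_list=files, sequence_length=5,
                              stride=2, normalized=False, random_shuffle=shuffle,
                              image_type=types.RGB, dtype=types.UINT8, initial_fill=16)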
Thank you very much for your help! This code has kept me busy for weeks now; without your assistance, I could never have made it work!
Thank you for your response. My issue is resolved now.
https://github.com/NVIDIA/DALI/pull/1643 should reduce memory consumption
@CrohnEngineer - I see one incomplete implementation in DALI. Even if you ask the VideoReader for dtype=types.UINT8, it internally allocates memory for float32 data. I will fix that soon; it should reduce memory occupation 4 times (I hope).
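(For reference, the factor of 4 simply comes from float32 using 4 bytes per value versus 1 byte for uint8; a rough illustration for a Full HD RGB frame:)

# float32 uses 4 bytes per value, uint8 uses 1, hence the expected ~4x reduction
full_hd_uint8 = 1920 * 1080 * 3 * 1     # ~6.2 MB per frame
full_hd_float32 = 1920 * 1080 * 3 * 4   # ~24.9 MB per frame
print(full_hd_float32 / full_hd_uint8)  # 4.0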