vision: VideoClips Assertion Error

Hello,

I’m trying to load a big video. Following https://github.com/pytorch/vision/issues/1446 I used a VideoClips object, but it’s crashing when trying to get clips with certain ids with this error:

AssertionError                            Traceback (most recent call last)
<ipython-input-9-6e97949ad7f5> in <module>()
----> 1 x = video_clips.get_clip(1)

/usr/local/lib/python3.6/dist-packages/torchvision/datasets/video_utils.py in get_clip(self, idx)
    324             video = video[resampling_idx]
    325             info["video_fps"] = self.frame_rate
--> 326         assert len(video) == self.num_frames, "{} x {}".format(video.shape, self.num_frames)
    327         return video, audio, info, video_idx

AssertionError: torch.Size([0, 1, 1, 3]) x 32

The code I use is just this:

from torchvision.datasets.video_utils import VideoClips
video_clips = VideoClips(["test_video.mp4"], clip_length_in_frames=32, frames_between_clips=32)
for i in range(video_clips.num_clips()):
    x = video_clips.get_clip(i)

video_clips.num_clips() is much bigger than the ids that fail. Changing clip_length_in_frames or frames_between_clips doesn't help.

Checking the code, I see [0, 1, 1, 3] is returned by read_video when no vframes are read: https://github.com/pytorch/vision/blob/85b8fbfd31e9324e64e24ca25410284ef238bcb3/torchvision/io/video.py#L251-L254 But for some clip ids and clip_lengths it's just that the sizes don't match, as the assertion error looks like this: AssertionError: torch.Size([19, 360, 640, 3]) x 128

I followed the issue down to _read_from_stream and checked that no PyAV exceptions were raised. Running this part of the function: https://github.com/pytorch/vision/blob/85b8fbfd31e9324e64e24ca25410284ef238bcb3/torchvision/io/video.py#L144-L150 I saw that for start_pts=32032, end_pts=63063 it returned just one frame, with pts=237237, which is later discarded as it's a lot bigger than end_pts.
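The discarding step can be sketched in plain Python (the dict-based frame store is an illustrative stand-in for what _read_from_stream collects, using the pts values above):

```python
# Hypothetical stand-in for the frames gathered by _read_from_stream:
# the seek returned a single frame with pts=237237.
frames = {237237: "frame_data"}

start_pts, end_pts = 32032, 63063

# Only frames whose pts falls inside [start_pts, end_pts] are kept;
# 237237 > end_pts, so nothing survives and the clip comes back empty.
kept = [frames[p] for p in sorted(frames) if start_pts <= p <= end_pts]
print(len(kept))  # 0 -> read_video then returns the empty (0, 1, 1, 3) tensor
```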

Also, stream.time_base is Fraction(1, 24000), which doesn't match the start and end pts provided by VideoClips.
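As a quick sanity check (this is not torchvision code), with time_base = Fraction(1, 24000) the pts values above translate to seconds like this:

```python
from fractions import Fraction

time_base = Fraction(1, 24000)  # stream.time_base reported for this video

# Converting the pts values seen above into seconds:
for pts in (32032, 63063, 237237):
    seconds = float(pts * time_base)
    print(pts, "->", seconds, "s")
```

So the requested window is roughly 1.33 s to 2.63 s, while the one frame returned sits near 9.88 s, well past the requested range.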

So it seems there is a problem with seeking in my video. But it has standard H.264 encoding, and I have no problem reading it sequentially with PyAV. I'm wondering whether I'm doing something wrong or there might be an issue with read_video's seeking (as the warning says, it should be using seconds?).

This is the video info according to ffmpeg:

Metadata:
  major_brand     : mp42
  minor_version   : 0
  compatible_brands: mp42isom
  creation_time   : 2016-10-10T15:36:46.000000Z
Duration: 00:21:24.37, start: 0.000000, bitrate: 1002 kb/s
  Stream #0:0(und): Video: h264 (Main) (avc1 / 0x31637661), yuv420p, 640x360 [SAR 1:1 DAR 16:9], 900 kb/s, 23.98 fps, 23.98 tbr, 24k tbn, 47.95 tbc (default)
    Metadata:
      handler_name    : Telestream Inc. Telestream Media Framework - Release TXGP 2016.42.192059
      encoder         : AVC
  Stream #0:1(eng): Audio: aac (LC) (mp4a / 0x6134706D), 48000 Hz, stereo, fltp, 93 kb/s (default)
    Metadata:
      handler_name    : Telestream Inc. Telestream Media Framework - Release TXGP 2016.42.192059

Thanks!

About this issue

  • Original URL
  • State: closed
  • Created 4 years ago
  • Reactions: 1
  • Comments: 20 (15 by maintainers)

Most upvoted comments

I am running into a similar issue, where the VideoClips instance returns exactly one more frame than expected (tested with several values).

I am using PyAV as a backend on torch=1.12.1 and torchvision=0.12.0. The dataset is Kinetics, downloaded from the S3 bucket referenced in the Kinetics dataset class.

I have no idea how to solve this, or whether it's even a problem. I could just drop the last frame, but that doesn't seem like what I should do.

This is still an issue that was recently re-introduced in https://github.com/pytorch/vision/pull/3791

This is the same problem as https://github.com/pytorch/vision/issues/4839 and https://github.com/pytorch/vision/issues/4112

Raising the priority to high because it’s been broken for several months already

I still run into this issue

Hello,

We dug a bit more into this and found that setting should_buffer to True fixes the issue: https://github.com/pytorch/vision/blob/85b8fbfd31e9324e64e24ca25410284ef238bcb3/torchvision/io/video.py#L110

The problem is in this section that reads the frames: https://github.com/pytorch/vision/blob/85b8fbfd31e9324e64e24ca25410284ef238bcb3/torchvision/io/video.py#L144-L150

Frames might not arrive in PTS order, and this causes the break to happen before all the relevant frames have been read.

For example, in our case end_offset is 15, but first a frame with PTS 15 is received and then one with PTS 14. So we hit the break without ever reading frame 14 and crash later on the size assert.
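The failure mode can be reproduced with a plain-Python model of the read loop (the pts list and end_offset are made up to match the 15/14 example above; this is a sketch, not the torchvision code itself):

```python
# Decoded frames can arrive out of pts order; here pts 15 comes before 14.
decoded_pts = [12, 13, 15, 14]
end_offset = 15

frames = {}
for pts in decoded_pts:
    frames[pts] = "frame"
    if pts >= end_offset:
        break  # stops before the out-of-order pts 14 is ever seen

print(sorted(frames))  # [12, 13, 15] -- frame 14 is missing
```

The clip now has one frame fewer than VideoClips expects, which is exactly what the size assert catches.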

It seems this can happen with AVI videos; I found this discussion on PyAV relevant: https://github.com/PyAV-Org/PyAV/issues/534. We confirmed we are in a similar case: our AVI video has frames without PTS, as it is not strictly required.

Setting should_buffer to True seems like a good solution. Is there any reason why it is set to False and not exposed as a parameter? Another solution could be a hard compare, frame.pts == end_offset. I'm not fully sure this always holds, but if end_offset is chosen as in VideoClips (selecting keyframes) it should work too.
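In the same toy model, the buffered variant keeps decoding a few extra frames past end_offset so late out-of-order frames are still captured (the buffer size here is arbitrary and only illustrates the idea behind should_buffer=True):

```python
# Same out-of-order stream as before: pts 15 arrives before 14.
decoded_pts = [12, 13, 15, 14]
end_offset = 15
max_buffer_size = 4  # arbitrary; how many extra frames to read past end_offset

frames = {}
buffered = 0
for pts in decoded_pts:
    frames[pts] = "frame"
    if pts >= end_offset:
        if buffered < max_buffer_size:
            buffered += 1
            continue  # keep reading: a smaller pts may still arrive
        break

print(sorted(frames))  # [12, 13, 14, 15] -- frame 14 is recovered
```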