vision: VideoClips Assertion Error
Hello,
I’m trying to load a big video. Following https://github.com/pytorch/vision/issues/1446 I used a VideoClips object, but it crashes when getting clips with certain ids, with this error:
```
AssertionError                            Traceback (most recent call last)
<ipython-input-9-6e97949ad7f5> in <module>()
----> 1 x = video_clips.get_clip(1)

/usr/local/lib/python3.6/dist-packages/torchvision/datasets/video_utils.py in get_clip(self, idx)
    324             video = video[resampling_idx]
    325             info["video_fps"] = self.frame_rate
--> 326         assert len(video) == self.num_frames, "{} x {}".format(video.shape, self.num_frames)
    327         return video, audio, info, video_idx

AssertionError: torch.Size([0, 1, 1, 3]) x 32
```
The code I use is just this:

```python
from torchvision.datasets.video_utils import VideoClips

video_clips = VideoClips(["test_video.mp4"], clip_length_in_frames=32, frames_between_clips=32)
for i in range(video_clips.num_clips()):
    x = video_clips.get_clip(i)
```
`video_clips.num_clips()` is much bigger than the ids that are failing. Changing the `clip_length_in_frames` or `frames_between_clips` doesn’t help.
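As a hedged workaround sketch (not a fix), one can at least record which clip indices trip the internal assertion and skip them; `collect_bad_clips` is a hypothetical helper, not part of torchvision, and only assumes the object exposes `num_clips()` and `get_clip(i)` like VideoClips does:

```python
def collect_bad_clips(clips):
    """Return the clip indices for which get_clip raises the size AssertionError.

    `clips` is expected to expose num_clips() and get_clip(i),
    like torchvision's VideoClips.
    """
    bad = []
    for i in range(clips.num_clips()):
        try:
            clips.get_clip(i)
        except AssertionError:
            bad.append(i)
    return bad
```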
Checking the code, I see `[0, 1, 1, 3]` is returned by `read_video` when no vframes are read:
https://github.com/pytorch/vision/blob/85b8fbfd31e9324e64e24ca25410284ef238bcb3/torchvision/io/video.py#L251-L254
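A pure-Python sketch of why the assertion then fires (the real code works on torch tensors; this only mirrors the `len(video) == num_frames` check in `get_clip`, as an illustration):

```python
def clip_length_ok(video_shape, num_frames):
    # get_clip asserts len(video) == num_frames; for a (T, H, W, C) tensor,
    # len(video) is the first dimension T.
    return video_shape[0] == num_frames

clip_length_ok((0, 1, 1, 3), 32)        # False: the empty placeholder from read_video
clip_length_ok((32, 360, 640, 3), 32)   # True: a correctly decoded clip
```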
But for some clip ids and clip lengths it’s simply that the sizes don’t match, as the assertion error is something like `AssertionError: torch.Size([19, 360, 640, 3]) x 128`.
I followed the issue to `_read_from_stream` and checked that no AV exceptions were raised. Running this part of the function:
https://github.com/pytorch/vision/blob/85b8fbfd31e9324e64e24ca25410284ef238bcb3/torchvision/io/video.py#L144-L150
I saw that for `start_pts=32032`, `end_pts=63063` it returned just one frame in `frames`, with `pts=237237`, which is later discarded as it is much bigger than `end_pts`.
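To illustrate the discard step just described (a sketch of the windowing behaviour, not the exact torchvision code): frames whose pts falls outside `[start_pts, end_pts]` are dropped after decoding, so the lone frame at pts 237237 never makes it into the clip:

```python
def keep_in_window(frame_pts, start_pts, end_pts):
    # Keep only frames whose pts lies inside the requested window.
    return [pts for pts in frame_pts if start_pts <= pts <= end_pts]

keep_in_window([237237], 32032, 63063)  # -> [] : the only decoded frame is dropped
```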
Also, the `stream.time_base` is `Fraction(1, 24000)`, which doesn’t match the start and end pts provided by VideoClips.
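For reference, pts values are expressed in units of the stream’s `time_base`, so they can be converted to seconds like this (a minimal sketch using the values from this issue):

```python
from fractions import Fraction

time_base = Fraction(1, 24000)  # stream.time_base reported above

def pts_to_seconds(pts, time_base):
    # A pts is a count of time_base ticks; multiplying gives seconds.
    return float(pts * time_base)

pts_to_seconds(32032, time_base)   # ~1.335 s (the clip's start)
pts_to_seconds(237237, time_base)  # ~9.885 s (the stray frame that was returned)
```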
So it seems there is a problem with the seeking on my video, but it has a standard h264 encoding and I have no problem reading it sequentially with pyav. I’m wondering if I’m doing something wrong, or if there might be an issue with the `read_video` seeking (as the warning says it should be using seconds?).
This is the video info according to ffmpeg:
```
  Metadata:
    major_brand     : mp42
    minor_version   : 0
    compatible_brands: mp42isom
    creation_time   : 2016-10-10T15:36:46.000000Z
  Duration: 00:21:24.37, start: 0.000000, bitrate: 1002 kb/s
    Stream #0:0(und): Video: h264 (Main) (avc1 / 0x31637661), yuv420p, 640x360 [SAR 1:1 DAR 16:9], 900 kb/s, 23.98 fps, 23.98 tbr, 24k tbn, 47.95 tbc (default)
    Metadata:
      handler_name    : Telestream Inc. Telestream Media Framework - Release TXGP 2016.42.192059
      encoder         : AVC
    Stream #0:1(eng): Audio: aac (LC) (mp4a / 0x6134706D), 48000 Hz, stereo, fltp, 93 kb/s (default)
    Metadata:
      handler_name    : Telestream Inc. Telestream Media Framework - Release TXGP 2016.42.192059
```
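One sanity check the dump allows (a sketch, under the assumption that 23.98 fps here is exactly 24000/1001, the usual NTSC rate): with tbn = 24k, each frame spans 1001 pts ticks, which is consistent with the `start_pts=32032` observed earlier for a clip starting at frame 32:

```python
from fractions import Fraction

time_base = Fraction(1, 24000)      # "24k tbn" from the ffmpeg dump
frame_rate = Fraction(24000, 1001)  # 23.98 fps, assumed to be exactly 24000/1001

# Duration of one frame, measured in time_base ticks.
pts_per_frame = 1 / (frame_rate * time_base)  # Fraction(1001, 1)

32 * int(pts_per_frame)  # -> 32032, matching the failing clip's start_pts
```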
Thanks!
About this issue
- Original URL
- State: closed
- Created 4 years ago
- Reactions: 1
- Comments: 20 (15 by maintainers)
I am running into a similar issue, where the VideoClips instance returns exactly one more frame than expected (tested with several values).
I am using PyAV as a backend on `torch=1.12.1` and `torchvision=0.12.0`. The dataset is Kinetics, downloaded from the S3 bucket referenced in the Kinetics dataset class. I have no idea how to solve this, or if it’s even a problem. I could just drop the last frame, but that doesn’t seem like what I should do.
This is still an issue that was recently re-introduced in https://github.com/pytorch/vision/pull/3791
This is the same problem as https://github.com/pytorch/vision/issues/4839 and https://github.com/pytorch/vision/issues/4112
Raising the priority to high because it’s been broken for several months already
I still run into this issue
Hello,
We dug a bit more into this and found that setting `should_buffer` to True fixes the issue: https://github.com/pytorch/vision/blob/85b8fbfd31e9324e64e24ca25410284ef238bcb3/torchvision/io/video.py#L110
The problem is in this section that reads the frames: https://github.com/pytorch/vision/blob/85b8fbfd31e9324e64e24ca25410284ef238bcb3/torchvision/io/video.py#L144-L150
PTS might not be read in order, and this causes the break to happen before all the relevant frames have been read.
For example, in our case the `end_offset` is 15, but first a frame with PTS 15 is received and then one with PTS 14. So we hit the break without reading frame 14, and we crash later on the assert for size. It seems this can happen with AVI videos; I found this PyAV discussion relevant: https://github.com/PyAV-Org/PyAV/issues/534. We confirm we are in a similar case: our AVI video has frames without PTS, as it is not strictly required.
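A minimal sketch of this failure mode and of what buffering effectively changes (pure Python, using pts integers in place of real frames; the function names are illustrative, not torchvision’s, and it assumes the read loop breaks once a pts >= `end_offset` is seen, as described above):

```python
def read_frames_no_buffer(decoded_pts, end_offset):
    # Break as soon as the end of the window is reached in decode order;
    # any frame arriving out of pts order after that point is never read.
    out = []
    for pts in decoded_pts:
        out.append(pts)
        if pts >= end_offset:
            break
    return out

def read_frames_buffered(decoded_pts, end_offset):
    # With buffering, all decoded frames are collected first,
    # then filtered and sorted by pts.
    return sorted(pts for pts in decoded_pts if pts <= end_offset)

decoded = [13, 15, 14]  # decode order differs from pts order
read_frames_no_buffer(decoded, 15)  # -> [13, 15] : frame 14 is lost
read_frames_buffered(decoded, 15)   # -> [13, 14, 15]
```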
Setting `should_buffer` to True seems like a good solution; is there any reason why it is set to False, or not exposed as a parameter? Another solution could be doing a hard compare `frame.pts == end_offset`. I’m not fully sure if this always works, but if `end_offset` is chosen as in VideoClips (selecting keyframes), it should work too.