mmaction2: Training process stuck at the beginning
I am training on my own dataset of 4 action classes; this is the dataset for reproducing the error.
At first, I trained on Colab, but training got stuck at the 70th iteration of the 1st epoch. I then changed cfg.data.videos_per_gpu from 8 to 4, and it got stuck at the 140th iteration of the 1st epoch. I am wondering why it gets stuck at the same place. Is this a problem with my video dataset or with the code?
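The two stall points above can be compared by converting iterations into the number of videos consumed. This is only a back-of-the-envelope sketch: `samples_consumed` is a hypothetical helper, and it assumes a single GPU and that both runs visit the videos in the same order (for example, the same shuffle seed).

```python
def samples_consumed(iteration: int, videos_per_gpu: int, num_gpus: int = 1) -> int:
    """Number of videos loaded by the time the given iteration finishes."""
    return iteration * videos_per_gpu * num_gpus


# Both runs stall after roughly the same number of videos.
print(samples_consumed(70, 8))   # 560
print(samples_consumed(140, 4))  # 560
```

If the assumption about ordering holds, both runs stall around the same videos, which points at specific samples in the dataset rather than at the batch size.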

Then I trained on my own computer, which has a GTX 1080 Ti GPU. This time training got stuck right at the beginning and stayed stuck for several hours. It is similar to this issue, and it still occurred even when I set workers_per_gpu=0.

About this issue
- State: closed
- Created 3 years ago
- Comments: 16 (13 by maintainers)
Hi @rlleshi, afterwards I used subclip instead of ffmpeg_extract_subclip. The script is:
Although the videos output by this script no longer make training freeze, the width and height of some videos are changed…
Yes, the videos could be a cause of the training freeze. When people prepare their own video datasets, unpredictable faults can occur along the way. In my case, the video metadata did not match the real length: I expected to cut each video to 5 seconds, but the output video showed 13 seconds, of which the last 8 seconds were empty. This particular issue may only have happened to me, but other kinds of video problems may happen to other people.
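A mismatch like the one described above can be looked for before training by comparing the frame count reported in the metadata with the number of frames that actually decode. The following is a minimal sketch, not the check used in this thread: it assumes OpenCV (`cv2`) is installed, and `count_decodable_frames`, `is_truncated`, and `check_video` are hypothetical helper names.

```python
from typing import Callable, Tuple


def count_decodable_frames(read_frame: Callable[[], bool], limit: int) -> int:
    """Call read_frame() until it fails or `limit` frames have been decoded."""
    decoded = 0
    while decoded < limit and read_frame():
        decoded += 1
    return decoded


def is_truncated(reported: int, decoded: int, tolerance: int = 1) -> bool:
    """True when the metadata claims noticeably more frames than decode."""
    return reported - decoded > tolerance


def check_video(path: str, tolerance: int = 1) -> Tuple[int, int, bool]:
    """Return (reported frames, decoded frames, suspicious?) for one file."""
    import cv2  # imported lazily so the helpers above work without OpenCV

    cap = cv2.VideoCapture(path)
    if not cap.isOpened():
        raise IOError(f"cannot open video: {path}")
    try:
        reported = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
        decoded = count_decodable_frames(lambda: cap.read()[0], limit=reported)
    finally:
        cap.release()
    return reported, decoded, is_truncated(reported, decoded, tolerance)
```

Running `check_video` over every file in the dataset and re-cutting the flagged ones is one way to narrow down which videos to inspect. Whether it catches every kind of broken file is not guaranteed.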
Maybe I have found the reason: it is very likely a problem with the videos. While running build_rawframes.py, I found that some videos triggered the following warning: ‘Early stop with {i + 1} out of {len(vr)} frames.’ These problematic videos were cut out of some long videos with moviepy, so I modified the Python script and regenerated the videos with moviepy. This time all videos are normal and the warning no longer appears. Now I have successfully trained 4 epochs without getting stuck.
I have tried decord == 0.4.1, but it didn’t work. I will try RawFrames these days.
Could you try training on RawFrames? This can rule out errors caused by the decoders. BTW, if you are using decord, please use version == 0.4.1; bugs were introduced in 0.4.2.
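For reference, switching from video decoding to raw frames might look roughly like the config fragment below. It is a sketch that assumes the older mmaction2 config style used in this thread (cfg.data.videos_per_gpu); the paths and sampling values are placeholders, and the remaining pipeline steps are left to whatever the existing config already uses.

```python
# Placeholder paths: point these at the frames extracted by build_rawframes.py
# and at an annotation list with one "frame_dir total_frames label" row per video.
dataset_type = 'RawframeDataset'
data_root = 'data/my_dataset/rawframes_train'
ann_file_train = 'data/my_dataset/train_list_rawframes.txt'

train_pipeline = [
    # Placeholder sampling values; keep the ones from your existing config.
    dict(type='SampleFrames', clip_len=1, frame_interval=1, num_clips=8),
    # Reads image files from disk, so no video decoder is involved.
    dict(type='RawFrameDecode'),
    # Append the remaining steps of your existing pipeline here.
]

data = dict(
    videos_per_gpu=4,
    workers_per_gpu=0,
    train=dict(
        type=dataset_type,
        ann_file=ann_file_train,
        data_prefix=data_root,
        pipeline=train_pipeline,
    ),
)
```

If training still freezes with raw frames, the decoder is probably not the cause.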