dlib: Training mmod detector crashes when using bounding box regression
I am training a normal mmod detector like the ones in the examples, but I enable bounding box regression with:
net.subnet().layer_details().set_num_filters(5 * options.detector_windows.size());
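For reference, here is a minimal sketch of the setup I mean. It uses the stock training network from dnn_mmod_ex.cpp; the dataset path and the detector window sizes are just placeholders:

```cpp
#include <dlib/dnn.h>
#include <dlib/data_io.h>

using namespace dlib;

// The usual mmod example network (training variant with bn_con) from dnn_mmod_ex.cpp.
template <long num_filters, typename SUBNET> using con5d = con<num_filters,5,5,2,2,SUBNET>;
template <long num_filters, typename SUBNET> using con5  = con<num_filters,5,5,1,1,SUBNET>;
template <typename SUBNET> using downsampler = relu<bn_con<con5d<32, relu<bn_con<con5d<32, relu<bn_con<con5d<16,SUBNET>>>>>>>>>;
template <typename SUBNET> using rcon5 = relu<bn_con<con5<45,SUBNET>>>;
using net_type = loss_mmod<con<1,9,9,1,1,rcon5<rcon5<rcon5<downsampler<input_rgb_image_pyramid<pyramid_down<6>>>>>>>>;

int main()
{
    std::vector<matrix<rgb_pixel>> images;
    std::vector<std::vector<mmod_rect>> boxes;
    load_image_dataset(images, boxes, "training.xml");  // placeholder dataset

    mmod_options options(boxes, 70, 30);                // placeholder window sizes
    options.use_bounding_box_regression = true;

    net_type net(options);
    // With bounding box regression enabled, the final con layer needs 5 outputs per
    // detector window: 1 detection channel plus 4 box offset channels.
    net.subnet().layer_details().set_num_filters(5 * options.detector_windows.size());
}
```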
Current Behavior
I get the following error really often:
Error detected at line 1591.
Error detected in file ../../external/dlib/dlib/../dlib/dnn/loss.h.
which refers to: https://github.com/davisking/dlib/blob/b401185aa5a59bfff8eb5f4675a7e4802c37b070/dlib/dnn/loss.h#L1591-L1592
I was not getting this error with the same code a while ago, but many things have changed since then (especially CUDA versions). Most of the time, this error happens at the beginning of training, when everything is very chaotic. However, sometimes I also get the error after the loss has stabilized.
I tried updating the gradient for h and w only when they are positive and letting the gradient be 0 otherwise.
This avoids the crash, but it messes up the training (the loss goes to inf).
I’ve also noticed this didn’t happen when I changed the lambda value to something much lower, such as 1, instead of the default 100. Maybe the lambda is just too big? EDIT: it also happens with lower lambda values.
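For clarity, the lambda in question is the bounding box regression weight in mmod_options; assuming the member is bbr_lambda (which defaults to 100), the change I tried looks like this:

```cpp
// Assumed member name: bbr_lambda, the weight on the bounding box regression
// part of the mmod loss (default 100). 1 is the much lower value I tried.
options.bbr_lambda = 1;
```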
How would you proceed?
- Version: dlib master (19.21.99)
- Where did you get dlib: github
- Platform: Linux 64 bit, CUDA 11.0.2, CUDNN 8.0.2.39
- Compiler: GCC-9.3.0 for C (in order to enable CUDA) and GCC-10.2 for C++
About this issue
- State: closed
- Created 4 years ago
- Comments: 19 (19 by maintainers)
I just pushed a change to dlib to use a robust statistic for loss explosion detection. So the next training run that sees the kind of loss spike you posted should automatically backtrack and not require any user intervention.
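To illustrate the idea (a sketch of the general technique, not dlib’s actual dnn_trainer code): a median/MAD-based check stays sane even when the recent loss history contains one absurd spike, whereas a mean/stddev threshold gets dragged upward by that same spike and stops firing.

```cpp
#include <algorithm>
#include <cmath>
#include <iostream>
#include <vector>

// Median of a copy of the values (the copy lets nth_element shuffle freely).
static double median(std::vector<double> v)
{
    std::nth_element(v.begin(), v.begin() + v.size()/2, v.end());
    return v[v.size()/2];
}

// Robust loss explosion check: compare the new loss against the median of the
// recent history plus a multiple of the median absolute deviation (MAD).
bool loss_looks_exploded(const std::vector<double>& recent_losses, double new_loss)
{
    if (recent_losses.empty())
        return false;
    const double med = median(recent_losses);
    std::vector<double> deviations;
    deviations.reserve(recent_losses.size());
    for (double l : recent_losses)
        deviations.push_back(std::abs(l - med));
    const double mad = median(deviations);
    // Flag only values wildly far from the robust center of the recent history.
    return new_loss > med + 10*(mad + 1e-9);
}

int main()
{
    // One absurd spike sits in the history but barely moves the median/MAD.
    std::vector<double> history = {2.1, 1.9, 2.0, 2.2, 1e9, 2.0};
    std::cout << std::boolalpha
              << loss_looks_exploded(history, 2.3) << "\n"   // false: a normal step
              << loss_looks_exploded(history, 5e8) << "\n";  // true: an explosion
}
```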
That kind of loss spike can and does happen normally, so I wouldn’t worry about that with regard to this thread. However, the `dnn_trainer` should have automatically reloaded from the previous state when it happened. However, it appears that that initial insanely large spike threw off the loss slope confidence interval check. I’ll post a change that will make the test robust to such things in a bit.

Ha, definitely happens to everyone 😃
You are the proof that these things happen to the best of us 😉
Yeah, I’m playing with this myself as well and fairly strongly leaning towards switching to `rect_bbr`. Using `rect_bbr` is mathematically right while `rect` is wrong, so I expect it to be better, as you noticed. I’m going to let a few more things run, though, to make sure it’s not unstable when using `rect_bbr`.

Training now with the exact same settings as before. The only change is the NMS on the `rect_bbr`. I’ll let you know how it goes 😃

Yeah, we are using it in the output you get as the user, but the loss computation isn’t looking at it. You would need to make this change to have the loss use `rect_bbr` too.

Na, you shouldn’t have to do any of that hacky stuff.
I just looked at the code and oops, it’s totally wrong. It should be like this:

Either that or it’s too late and I’m tired and should go to sleep. But unless I’m just missing or forgetting something that’s happening elsewhere in the code that somehow makes what’s there right, it should be using `rect`, not `rect_bbr`, there. I’m launching a training session now to see if this works right/better. I’ll look more tomorrow. But try the above and see if it works better.