dlib: Training mmod detector crashes when using bounding box regression

I am training a normal mmod detector like the ones in the examples, but I enable bounding box regression with:

net.subnet().layer_details().set_num_filters(5 * options.detector_windows.size());
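
For context, the setup looks roughly like this. It is a minimal sketch based on dlib's dnn_mmod examples; the window sizes and net_type are illustrative, while use_bounding_box_regression and bbr_lambda are the relevant mmod_options fields:

mmod_options options(boxes_train, 70, 30);   // illustrative target window sizes
options.use_bounding_box_regression = true;  // emit 4 box offsets per detection
// options.bbr_lambda = 100;                 // weight of the regression loss (default 100)

net_type net(options);

// With regression enabled, each detector window needs 5 output channels
// (1 detection score + 4 offsets) instead of 1:
net.subnet().layer_details().set_num_filters(5 * options.detector_windows.size());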

Current Behavior

I get the following error really often:

Error detected at line 1591.                                       
Error detected in file ../../external/dlib/dlib/../dlib/dnn/loss.h.

which refers to: https://github.com/davisking/dlib/blob/b401185aa5a59bfff8eb5f4675a7e4802c37b070/dlib/dnn/loss.h#L1591-L1592

I was not getting this error with the same code a while ago, but many things have changed since then (especially CUDA versions). Most of the time this error happens at the beginning of training, when everything is very chaotic. However, sometimes I also get it after the loss has stabilized.

I tried updating the gradient for h and w only when they are positive and letting the gradient be 0 otherwise. That avoids the crash, but it messes up the training (the loss goes to inf).
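
For reference, this is roughly the parameterization involved, paraphrased from the loss code rather than copied verbatim (the same p, w, h, and truth_box variables show up in the diffs below):

dpoint p = dcenter(det_rect);                 // center of the detector's output box
double w = det_rect.width() - 1;
double h = det_rect.height() - 1;
dpoint p_truth = dcenter(truth_box);          // center of the matched truth box

double target_dx = (p_truth.x() - p.x()) / w;               // needs w != 0
double target_dy = (p_truth.y() - p.y()) / h;               // needs h != 0
double target_dw = std::log((truth_box.width() - 1) / w);   // needs w > 0
double target_dh = std::log((truth_box.height() - 1) / h);  // needs h > 0

If w or h come out non-positive, the division and the log blow up, which is presumably what the assert at line 1591 is guarding against.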

I’ve also noticed this didn’t happen when I changed the lambda value to something much lower, such as 1, instead of the default 100. Maybe it’s just that the lambda is too big? EDIT: it also happens with the lower lambda.

How would you proceed?

  • Version: dlib master (19.21.99)
  • Where did you get dlib: github
  • Platform: Linux 64 bit, CUDA 11.0.2, CUDNN 8.0.2.39
  • Compiler: GCC-9.3.0 for C (in order to enable CUDA) and GCC-10.2 for C++

About this issue

  • Original URL
  • State: closed
  • Created 4 years ago
  • Comments: 19 (19 by maintainers)

Most upvoted comments

I just pushed a change to dlib to use a robust statistic for loss explosion detection. So the next training run that sees the kind of loss spike you posted should automatically backtrack and not require any user intervention.
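
For anyone who wants to play with the idea, dlib already ships trend statistics in this flavor; the following is illustrative of the robustness difference, not necessarily the exact statistic the trainer now uses:

#include <dlib/statistics/running_gradient.h>
#include <iostream>
#include <vector>

int main()
{
    // A loss curve with one absurd spike of the kind posted above.
    std::vector<double> losses = {2.0, 1.9, 1.8, 1e9, 1.7, 1.6, 1.5};

    // The plain test fits a trend to all the values, so the spike dominates it.
    std::cout << dlib::count_steps_without_decrease(losses) << "\n";

    // The robust variant discards the largest ~10% of values before fitting,
    // so a single spike can't throw off the estimate.
    std::cout << dlib::count_steps_without_decrease_robust(losses) << "\n";
}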

That kind of loss spike can and does happen normally, so I wouldn’t worry about it with regard to this thread. The dnn_trainer should have automatically reloaded from the previous state when it happened, though. It appears that the initial, insanely large spike threw off the loss slope confidence interval check. I’ll post a change that makes the test robust to such things in a bit.

Ha, definitely happens to everyone 😃

You are the proof that these things happen to the best of us 😉

Yeah, I’m playing with this myself as well and leaning fairly strongly towards switching to rect_bbr. Using rect_bbr is mathematically right while rect is wrong, so I expect it to be better, as you noticed. I’m going to let a few more things run to make sure it’s not unstable when using rect_bbr, though.

Training now with the exact same settings as before. The only change is the NMS on the rect_bbr. I’ll let you know how it goes 😃

Yeah, we are using it in the output you get as the user, but the loss computation isn’t looking at it. You would need to make this change to have the loss use rect_bbr too.

diff --git a/dlib/dnn/loss.h b/dlib/dnn/loss.h
index a2fb0790..7ac40a7b 100644
--- a/dlib/dnn/loss.h
+++ b/dlib/dnn/loss.h
@@ -1487,7 +1487,7 @@ namespace dlib
                 // The point of this loop is to fill out the truth_score_hits array. 
                 for (size_t i = 0; i < dets.size() && final_dets.size() < max_num_dets; ++i)
                 {
-                    if (overlaps_any_box_nms(final_dets, dets[i].rect))
+                    if (overlaps_any_box_nms(final_dets, dets[i].rect_bbr))
                         continue;
 
                     const auto& det_label = options.detector_windows[dets[i].tensor_channel].label;
@@ -1556,7 +1556,7 @@ namespace dlib
                 // detections.
                 for (unsigned long i = 0; i < dets.size() && final_dets.size() < max_num_dets; ++i)
                 {
-                    if (overlaps_any_box_nms(final_dets, dets[i].rect))
+                    if (overlaps_any_box_nms(final_dets, dets[i].rect_bbr))
                         continue;
 
                     const auto& det_label = options.detector_windows[dets[i].tensor_channel].label;

Na, you shouldn’t have to do any of that hacky stuff.

I just looked at the code and oops, it’s totally wrong. It should be like this:

diff --git a/dlib/dnn/loss.h b/dlib/dnn/loss.h
index e4b913a3..a2fb0790 100644
--- a/dlib/dnn/loss.h
+++ b/dlib/dnn/loss.h
@@ -1582,9 +1582,9 @@ namespace dlib
                                     double dw = out_data[dets[i].tensor_offset_dw];
                                     double dh = out_data[dets[i].tensor_offset_dh];
 
-                                    dpoint p = dcenter(dets[i].rect_bbr); 
-                                    double w = dets[i].rect_bbr.width()-1;
-                                    double h = dets[i].rect_bbr.height()-1;
+                                    dpoint p = dcenter(dets[i].rect);
+                                    double w = dets[i].rect.width()-1;
+                                    double h = dets[i].rect.height()-1;
                                     drectangle truth_box = (*truth)[hittruth.second].rect;
                                     dpoint p_truth = dcenter(truth_box); 

Either that or it’s too late and I’m tired and should go to sleep. But unless I’m just missing or forgetting something elsewhere in the code that somehow makes what’s there right, it should be using rect, not rect_bbr, there. I’m launching a training session now to see if this works right/better. I’ll look more tomorrow. But try the above and see if it works better.