DIGITS: Error running inference on CPU for network with BatchNorm

Hey guys!

I was using Batch Normalization in my network with DIGITS 3.0.

This was the end of my network (which was working fine):

layer {
  bottom: "loss1/fc/bn"
  top: "loss1/classifier"
  name: "loss1/classifier"
  type: "InnerProduct"
  param {
    lr_mult: 1
    decay_mult: 1
  }
  param {
    lr_mult: 2
    decay_mult: 0
  }
  inner_product_param {
    #num_output: 1000
    weight_filler {
      type: "xavier"
    }
    bias_filler {
      type: "constant"
      value: 0
    }
  }
}
layer {
  name: "loss1/loss"
  type: "SoftmaxWithLoss"
  bottom: "loss1/classifier"
  bottom: "label"
  top: "loss"
  loss_weight: 1
}
layer {
  name: "loss1/top-1"
  type: "Accuracy"
  bottom: "loss1/classifier"
  bottom: "label"
  top: "accuracy"
  include { stage: "val" }
}
layer {
  name: "loss1/top-5"
  type: "Accuracy"
  bottom: "loss1/classifier"
  bottom: "label"
  top: "accuracy-top5"
  accuracy_param {
    top_k: 5
  }
}

Now, after updating to the 3.3 master branch, I had to change the end of my network for the new include { stage: "deploy" } definitions.

Thus, the end of the network now looks like this:

layer {
  bottom: "loss1/fc/bn"
  top: "loss1/classifier"
  name: "loss1/classifier"
  type: "InnerProduct"
  param {
    lr_mult: 1
    decay_mult: 1
  }
  param {
    lr_mult: 2
    decay_mult: 0
  }
  inner_product_param {
    #num_output: 1000
    weight_filler {
      type: "xavier"
    }
    bias_filler {
      type: "constant"
      value: 0
    }
  }
}
layer {
  name: "loss1/loss"
  type: "SoftmaxWithLoss"
  bottom: "loss1/classifier"
  bottom: "label"
  top: "loss"
  loss_weight: 1
  exclude { stage: "deploy" }
}
layer {
  name: "loss1/top-1"
  type: "Accuracy"
  bottom: "loss1/classifier"
  bottom: "label"
  top: "accuracy"
  include { stage: "val" }
}
layer {
  name: "loss1/top-5"
  type: "Accuracy"
  bottom: "loss1/classifier"
  bottom: "label"
  top: "accuracy-top5"
  include { stage: "val" }
  accuracy_param {
    top_k: 5
  }
}
layer {
  name: "softmax"
  type: "Softmax"
  bottom: "loss1/classifier"
  top: "softmax"
  include { stage: "deploy" }
}
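As a side note on how that filtering works: only the layers whose include / exclude rules match the active stage are instantiated, so in the deploy view the SoftmaxWithLoss and Accuracy layers drop out and only the plain Softmax remains. A rough sketch of exercising that filtering from pycaffe (file names are placeholders, and the stages argument is only available in newer Caffe builds; older setups rely on a separately generated deploy prototxt instead):

import caffe

# Sketch only: newer pycaffe builds can filter an all-in-one prototxt by
# NetState stage when constructing the Net.
net = caffe.Net('all_in_one.prototxt',          # placeholder file name
                caffe.TEST,
                stages=['deploy'],              # keep only deploy-stage layers
                weights='snapshot_iter_1000.caffemodel')
print(net.outputs)  # should list 'softmax' when the deploy stage is active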

The issue is that the Classify One result no longer seems to use the softmax output: I can no longer classify an image that I could classify before.

I cannot pinpoint exactly where, but it seems to be related to the include { stage: "deploy" } rule, which is not mandatory.

It also seems to be a significant issue when using Batch Normalization…

About this issue

  • State: closed
  • Created 8 years ago
  • Comments: 16 (7 by maintainers)

Most upvoted comments

If all your GPUs are used for training then Caffe will resort to using the CPU for inference. That might be why you are getting this error.
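One way to rule that out when testing outside of DIGITS is to pin inference to a free GPU explicitly from pycaffe. A minimal sketch, with placeholder file names and device id:

import caffe

# Pin inference to a GPU that is not occupied by the training job,
# so Caffe does not silently fall back to the CPU.
caffe.set_device(0)   # placeholder device id
caffe.set_mode_gpu()

# Deploy prototxt and snapshot paths are placeholders for your own files.
net = caffe.Net('deploy.prototxt', 'snapshot_iter_1000.caffemodel', caffe.TEST)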

About the change introduced in #573, I fully understand the reason (trainings were crashing in 3.0), but is it possible to implement an option marked “non-default”, “unsafe”, “deprecated”, “use at your own risk” that restores the previous behavior? Sometimes it’s really handy to run a quick test while training a network, to see how well (or, in my case, how badly) things are going, and one can decide to take that chance when the training run uses about half of the device memory or less.

It doesn’t look like BatchNorm is unavailable on the CPU. It looks to me like this is the issue:

[WARNING] Infer Model unrecognized output: /usr/lib/python2.7/dist-packages/numpy/core/_methods.py:102: RuntimeWarning: overflow encountered in multiply
[WARNING] Infer Model unrecognized output: x = um.multiply(x, x, out=x)

I don’t know why the CPU would encounter an overflow when the GPU doesn’t though…
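For reference, that warning appears to come from numpy squaring values in-place while computing a variance or standard deviation; if the activations are large, the square exceeds the float32 range and overflows. A minimal sketch that reproduces the same RuntimeWarning (the values are made up purely to force the overflow):

import numpy as np

# Deviations of ±1e30 get squared to 1e60, which exceeds the float32
# range (~3.4e38), so numpy emits "overflow encountered in multiply".
x = np.array([1e30, -1e30], dtype=np.float32)
print(np.std(x))  # inf, plus the same RuntimeWarning seen in the log above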