DIGITS: Error running inference on CPU for network with BatchNorm
Hey guys!
I was using Batch Normalization in my network with DIGITS 3.0.
This was the end of my network (which was working fine):
layer {
  bottom: "loss1/fc/bn"
  top: "loss1/classifier"
  name: "loss1/classifier"
  type: "InnerProduct"
  param {
    lr_mult: 1
    decay_mult: 1
  }
  param {
    lr_mult: 2
    decay_mult: 0
  }
  inner_product_param {
    #num_output: 1000
    weight_filler {
      type: "xavier"
    }
    bias_filler {
      type: "constant"
      value: 0
    }
  }
}
layer {
  name: "loss1/loss"
  type: "SoftmaxWithLoss"
  bottom: "loss1/classifier"
  bottom: "label"
  top: "loss"
  loss_weight: 1
}
layer {
  name: "loss1/top-1"
  type: "Accuracy"
  bottom: "loss1/classifier"
  bottom: "label"
  top: "accuracy"
  include { stage: "val" }
}
layer {
  name: "loss1/top-5"
  type: "Accuracy"
  bottom: "loss1/classifier"
  bottom: "label"
  top: "accuracy-top5"
  accuracy_param {
    top_k: 5
  }
}
Now, after updating to the 3.3 master branch, I had to change the end of my network for the new include { stage: "deploy" } definitions.
Thus, the end of the network now looks like this:
layer {
  bottom: "loss1/fc/bn"
  top: "loss1/classifier"
  name: "loss1/classifier"
  type: "InnerProduct"
  param {
    lr_mult: 1
    decay_mult: 1
  }
  param {
    lr_mult: 2
    decay_mult: 0
  }
  inner_product_param {
    #num_output: 1000
    weight_filler {
      type: "xavier"
    }
    bias_filler {
      type: "constant"
      value: 0
    }
  }
}
layer {
  name: "loss1/loss"
  type: "SoftmaxWithLoss"
  bottom: "loss1/classifier"
  bottom: "label"
  top: "loss"
  loss_weight: 1
  exclude { stage: "deploy" }
}
layer {
  name: "loss1/top-1"
  type: "Accuracy"
  bottom: "loss1/classifier"
  bottom: "label"
  top: "accuracy"
  include { stage: "val" }
}
layer {
  name: "loss1/top-5"
  type: "Accuracy"
  bottom: "loss1/classifier"
  bottom: "label"
  top: "accuracy-top5"
  include { stage: "val" }
  accuracy_param {
    top_k: 5
  }
}
layer {
  name: "softmax"
  type: "Softmax"
  bottom: "loss1/classifier"
  top: "softmax"
  include { stage: "deploy" }
}
The issue is that the Classify One page no longer seems to be using the outputs of the softmax: I cannot classify an image that I could classify before.
I can't point my finger at exactly where it goes wrong, but it seems to be related to the include { stage: "deploy" } rule, which is not mandatory.
It also seems to be much more of an issue when using Batch Normalization…
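For context, this is roughly how a BVLC-style BatchNorm layer is usually written so that the test/deploy network uses the stored statistics. It is only a sketch: the names below are placeholders rather than a copy of my actual layer, and the NVcaffe fork that ships with DIGITS may expect different batch_norm_param fields.
layer {
  # placeholder names, not copied from my network
  name: "loss1/fc/bn"
  type: "BatchNorm"
  bottom: "loss1/fc"
  top: "loss1/fc/bn"
  # the three internal blobs (running mean, running variance and the
  # moving-average correction factor) are accumulated during training,
  # so the solver must not update them
  param { lr_mult: 0 decay_mult: 0 }
  param { lr_mult: 0 decay_mult: 0 }
  param { lr_mult: 0 decay_mult: 0 }
  batch_norm_param {
    # false during training (per-batch statistics), true at test/deploy time;
    # Caffe normally infers this from the phase when it is left unset
    use_global_stats: true
  }
}
My understanding is that if the deploy network built from the stage rules ran this layer with per-image statistics instead of the stored ones, the softmax outputs would be meaningless, which would be consistent with what I see on the Classify One page.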
About this issue
- State: closed
- Created 8 years ago
- Comments: 16 (7 by maintainers)
If all your GPUs are used for training, then Caffe will resort to using the CPU for inference. That might be why you are getting this error.
About the change introduced in #573: I perfectly see the reason for it (trainings were crashing in 3.0), but would it be possible to add an option marked “non-default”, “unsafe”, “deprecated”, or “use at your own risk” that restores the previous behavior? Sometimes it’s really handy to run a quick test while training a network, to see how well (or, in my case, how badly) things are going, and one can decide to take that chance when the training run is using about half of the device memory or less.
It doesn’t look like BatchNorm is unavailable on the CPU. It looks to me like this is the issue:
I don’t know why the CPU would encounter an overflow when the GPU doesn’t, though…