coremltools: std::runtime_error: BNNS error when trying to do training

❓Question

I am trying to move a TensorFlow Keras model from running on a server to running on the device.

I started off converting the model with createml, and it worked like a charm. I could make predictions in no time; thanks for the great framework.

My issue started yesterday when I also wanted to do personalization/training on the device.

I updated the createml script so the model would also be trainable. This worked fine, and I can see all the new info in Xcode under Update and Parameters for my model. All looks fine.

However, when I try to use MLUpdateTask, it always crashes, and the only output I get is: std::runtime_error: BNNS error

I have created a minimal example, in theory removing anything specific to my use case, but it keeps producing the same error.

I have tried the emoji drawing example and it runs fine on the same device, so either I am doing something wrong or something in the converted model is not compatible. Everything compiles fine, however.

I am not using MLFeatureValue directly but the generated iemocapTrainingInput: MLFeatureProvider, assuming this is OK.
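For reference, the same training sample could also be built by hand with MLDictionaryFeatureProvider instead of the generated class. This is a sketch only, assuming the feature names "input1" and "output1_true" that the generated iemocapTrainingInput implies:

```swift
import CoreML

// Sketch (assumed feature names "input1" / "output1_true"): build the same
// training sample without the Xcode-generated iemocapTrainingInput class.
func makeTrainingSample(features: MLMultiArray,
                        label: MLMultiArray) throws -> MLFeatureProvider {
    try MLDictionaryFeatureProvider(dictionary: [
        "input1": MLFeatureValue(multiArray: features),
        "output1_true": MLFeatureValue(multiArray: label)
    ])
}
```

Either approach should produce equivalent input to MLUpdateTask, so the generated class is fine to use.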

My minimal example looks like this (it fails in the same way as when running with my real training data):

static func minExample() {
    // Input: an empty 1x251x168 MLMultiArray
    guard let features = try? MLMultiArray(shape: [1, 251, 168], dataType: .float32) else {
        fatalError("Unexpected runtime error when creating MLMultiArray")
    }

    // Output: a single value (2) for testing
    guard let trainLabel = try? MLMultiArray([2]) else {
        fatalError("Unexpected error when creating trainLabel")
    }

    let trainingSample = iemocapTrainingInput(input1: features, output1_true: trainLabel)

    var trainingSamples = [MLFeatureProvider]()
    trainingSamples.append(trainingSample)
    let updatableModelURL = Bundle.main.url(forResource: "iemocap",
                                            withExtension: "mlmodelc")!

    do {
        let updateTask = try MLUpdateTask(forModelAt: updatableModelURL,
                                          trainingData: MLArrayBatchProvider(array: trainingSamples),
                                          configuration: nil) { _ in
            print("Completed training")
        }
        updateTask.resume()
    } catch {
        // Surface the underlying error instead of swallowing it
        print("Failed to start update task: \(error)")
    }
}

The model description looks like this: (screenshot omitted)

and the crash always occurs here: (screenshot omitted)

When I connected a ProgressHandler as shown in your example here: https://github.com/apple/coremltools/blob/master/examples/updatable_models/OnDeviceTraining_API_Usage.md

I got one callback for .trainingBegin, but then it crashed, so it seems it finds the model and starts doing something before the BNNS error.
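To narrow down which phase fails, the update can be wired with MLUpdateProgressHandlers so every event and any error captured by the task are printed before the abort. A minimal sketch, reusing the updatableModelURL and trainingSamples from the example above:

```swift
import CoreML

// Sketch: subscribe to progress events and log metrics plus any task error.
let handlers = MLUpdateProgressHandlers(
    forEvents: [.trainingBegin, .miniBatchEnd, .epochEnd],
    progressHandler: { context in
        // Fires for each subscribed event; metrics include loss where available
        print("event:", context.event, "metrics:", context.metrics)
    },
    completionHandler: { context in
        if let error = context.task.error {
            print("Update failed:", error)
        } else {
            print("Completed, loss:", context.metrics[.lossValue] ?? "n/a")
        }
    })

let updateTask = try MLUpdateTask(forModelAt: updatableModelURL,
                                  trainingData: MLArrayBatchProvider(array: trainingSamples),
                                  configuration: nil,
                                  progressHandlers: handlers)
updateTask.resume()
```

If only .trainingBegin arrives before the crash, the failure is inside the first mini-batch, which points at a layer configuration the BNNS backend cannot handle rather than at the input data.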

I have been at this for two days now and am running out of ideas, so all suggestions are welcome.

Thanks in advance

System Information

  • iOS 13.2

About this issue

  • Original URL
  • State: closed
  • Created 5 years ago
  • Comments: 40

Most upvoted comments

Thank you so much @mrfarhadi. I’m now able to train and validate performance for my real scenario. Btw, I can’t wait for the ‘next OS release’!!

@JacopoMangiavacchi Glad to hear that!

1. It seems to me that you probably need to change the kernel size of the second conv layer to (2, 2). Here is a snippet to create a similar model in Keras:

from keras.models import Sequential
from keras.layers import Dense, Flatten
from keras.layers import Conv2D, MaxPooling2D

input_shape = (28, 28, 1)
model = Sequential()
model.add(Conv2D(32, kernel_size=(3, 3), activation='relu', input_shape=input_shape))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Conv2D(32, kernel_size=(2, 2), activation='relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Flatten())
model.add(Dense(500, activation='relu'))
model.add(Dense(10, activation='softmax'))

model.summary()

2. No, LSTM layers are not trainable. Sorry for the confusion in the code.

@JacopoMangiavacchi Thanks for the modification. I looked at your commit and it looks good; we are past the first issue. I tried your app on my end with the ‘current’ OS and got the same issue you are facing. However, I can offer a workaround. First of all, keep in mind that this issue will be gone in the next OS release. For now, you can cut the first conv layer, and your model becomes:

                            Convolution(name: "conv2",
                                        input: ["image"],
                                        output: ["outConv2"],
                                        outputChannels: 32,
                                        kernelChannels: 1,
                                        nGroups: 1,
                                        kernelSize: [3, 3],
                                        stride: [1, 1],
                                        dilationFactor: [1, 1],
                                        paddingType: .valid(borderAmounts: [EdgeSizes(startEdgeSize: 0, endEdgeSize: 0),
                                                                            EdgeSizes(startEdgeSize: 0, endEdgeSize: 0)]),
                                        outputShape: [],
                                        deconvolution: false,
                                        updatable: true)
                            ReLu(name: "relu2",
                                 input: ["outConv2"],
                                 output: ["outRelu2"])
                            Pooling(name: "pooling2",
                                    input: ["outRelu2"],
                                    output: ["outPooling2"],
                                    poolingType: .max,
                                    kernelSize: [2, 2],
                                    stride: [2, 2],
                                    paddingType: .valid(borderAmounts: [EdgeSizes(startEdgeSize: 0, endEdgeSize: 0),
                                                                        EdgeSizes(startEdgeSize: 0, endEdgeSize: 0)]),
                                    avgPoolExcludePadding: true,
                                    globalPooling: false)
                            Convolution(name: "conv3",
                                        input: ["outPooling2"],
                                        output: ["outConv3"],
                                        outputChannels: 32,
                                        kernelChannels: 32,
                                        nGroups: 1,
                                        kernelSize: [2, 2],
                                        stride: [1, 1],
                                        dilationFactor: [1, 1],
                                        paddingType: .valid(borderAmounts: [EdgeSizes(startEdgeSize: 0, endEdgeSize: 0),
                                                                            EdgeSizes(startEdgeSize: 0, endEdgeSize: 0)]),
                                        outputShape: [],
                                        deconvolution: false,
                                        updatable: true)
                            ReLu(name: "relu3",
                                 input: ["outConv3"],
                                 output: ["outRelu3"])
                            Pooling(name: "pooling3",
                                    input: ["outRelu3"],
                                    output: ["outPooling3"],
                                    poolingType: .max,
                                    kernelSize: [2, 2],
                                    stride: [2, 2],
                                    paddingType: .valid(borderAmounts: [EdgeSizes(startEdgeSize: 0, endEdgeSize: 0),
                                                                        EdgeSizes(startEdgeSize: 0, endEdgeSize: 0)]),
                                    avgPoolExcludePadding: true,
                                    globalPooling: false)
                            Flatten(name: "flatten1",
                                    input: ["outPooling3"],
                                    output: ["outFlatten1"],
                                    mode: .last)
                            InnerProduct(name: "hidden1",
                                         input: ["outFlatten1"],
                                         output: ["outHidden1"],
                                         inputChannels: 1152,
                                         outputChannels: 500,
                                         updatable: true)
                            ReLu(name: "relu4",
                                 input: ["outHidden1"],
                                 output: ["outRelu4"])
                            InnerProduct(name: "hidden2",
                                         input: ["outRelu4"],
                                         output: ["outHidden2"],
                                         inputChannels: 500,
                                         outputChannels: 10,
                                         updatable: true)
                            Softmax(name: "softmax",
                                    input: ["outHidden2"],
                                    output: ["output"])

You should be able to train the above model with the current OS.
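The inputChannels: 1152 on hidden1 follows from the valid-padding shape arithmetic of the layers above. A quick standalone check (plain Swift, no Core ML needed), walking a 28x28 input through the conv/pool stack:

```swift
// Output side length of a valid-padding conv/pool: floor((n - k) / stride) + 1
func validOut(_ n: Int, kernel k: Int, stride s: Int = 1) -> Int {
    (n - k) / s + 1
}

var side = 28                                  // 28x28x1 input image
side = validOut(side, kernel: 3)               // conv2 3x3        -> 26
side = validOut(side, kernel: 2, stride: 2)    // pooling2 2x2/2   -> 13
side = validOut(side, kernel: 2)               // conv3 2x2        -> 12
side = validOut(side, kernel: 2, stride: 2)    // pooling3 2x2/2   -> 6

// 32 output channels * 6 * 6 spatial positions feed the flatten layer
print(32 * side * side)                        // 1152, matching hidden1
```

The same arithmetic explains the earlier kernel-size suggestion: with a (2, 2) kernel on the second conv, both the Keras model and this Swift model description flatten to 1152 features ahead of the 500-unit dense layer.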

Fantastic, that totally makes sense. Thank you @mrfarhadi. I’ll update the sample according to your suggestions and keep you updated.

@JacopoMangiavacchi I am sorry, I confused your issue with something else. It seems the training stops in your case.

Looking at your model, it should be OK to mark the conv layer as updatable as well.

Can you file a bug report (bugreport.apple.com) and include some sample data and the code you use to update the model?