tensorflow: [TFLite C++] Signature calculating CategoricalCrossentropy loss produces wrong result
Issue Type
Bug
Have you reproduced the bug with TF nightly?
Yes
Source
source
Tensorflow Version
2.13
Custom Code
Yes
OS Platform and Distribution
Windows 10
Mobile device
No response
Python version
No response
Bazel version
5.3.0
GCC/Compiler version
No response
CUDA/cuDNN version
No response
GPU model and memory
No response
Current Behaviour?
I’ve created a simple model in Python (TF version 2.10) and converted it to tflite. The model has two signatures, one for inference and the other for training. When I run those signatures in Python, everything works correctly: I get good inference results and a sensible training loss. When I load the converted tflite model with the C++ TFLite API (built from source, from branch r2.13) and run those signatures, inference works as intended and training works as intended (the accuracy on the test set rises steadily), but the reported loss is totally random. At first I thought the loss might be accumulated, since it climbs to five digits, but that is not the case: it rises and falls in a random fashion. It looks like there is a bug in the C++ TFLite implementation of the ops used for the CategoricalCrossentropy calculation.
I’ve tried building TensorFlow from r2.12 and r2.13 and get the same behavior. I also tried r2.10, but then I couldn’t even run the signatures with the C++ TFLite API; I was getting a bunch of segmentation faults. I couldn’t find any documentation on which backward-pass ops are available in the C++ TFLite API; maybe some of the ops used in the CategoricalCrossentropy loss calculation are not yet available, or there is a bug in their implementation.
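For reference, the set of ops that actually ends up in the converted graph (including the gradient ops behind the train signature) can be listed with TFLite's model analyzer. This is only a minimal sketch; "model.tflite" is a placeholder for my converted file:

import tensorflow as tf

# Print the operator breakdown of the converted model.
# "model.tflite" is a placeholder path, not part of the original report.
tf.lite.experimental.Analyzer.analyze(model_path="model.tflite")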
Standalone code to reproduce the issue
Here is the Python code I am using to create the model with signatures:
import tensorflow as tf

IMG_SIZE = 28

class Model(tf.Module):
    def __init__(self):
        self.model = tf.keras.Sequential([
            tf.keras.layers.Flatten(input_shape=(IMG_SIZE, IMG_SIZE), name='flatten'),
            tf.keras.layers.Dense(
                units=10,
                kernel_initializer=tf.keras.initializers.RandomNormal(mean=0.0, stddev=0.05),
                bias_initializer=tf.keras.initializers.Ones(),
                name='dense'
            ),
        ])
        opt = tf.keras.optimizers.SGD(learning_rate=0.1)
        loss_fn = tf.keras.losses.CategoricalCrossentropy(from_logits=True)
        self.model.compile(optimizer=opt, loss=loss_fn, metrics=['accuracy'])

    # The `train` function takes a batch of input images and labels.
    @tf.function(input_signature=[
        tf.TensorSpec([32, IMG_SIZE, IMG_SIZE], tf.float32),
        tf.TensorSpec([32, 10], tf.float32),
    ])
    def train(self, x, y):
        with tf.GradientTape() as tape:
            prediction = self.model(x)
            loss = self.model.loss(y, prediction)
        gradients = tape.gradient(loss, self.model.trainable_variables)
        self.model.optimizer.apply_gradients(
            zip(gradients, self.model.trainable_variables))
        result = {"loss": loss}
        return result

    @tf.function(input_signature=[
        tf.TensorSpec([1, IMG_SIZE, IMG_SIZE], tf.float32),
    ])
    def infer(self, x):
        logits = self.model(x)
        probabilities = tf.nn.softmax(logits, axis=-1)
        return {
            "output": probabilities,
            "logits": logits
        }
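For completeness, the conversion follows the standard on-device-training recipe. This is only a minimal sketch: the saved-model directory and output path are placeholders, and the exact flag set in my script may differ slightly:

m = Model()
tf.saved_model.save(
    m, "saved_model_dir",  # placeholder directory
    signatures={
        'train': m.train.get_concrete_function(),
        'infer': m.infer.get_concrete_function(),
    })

converter = tf.lite.TFLiteConverter.from_saved_model("saved_model_dir")
converter.target_spec.supported_ops = [
    tf.lite.OpsSet.TFLITE_BUILTINS,  # TFLite builtin ops
    tf.lite.OpsSet.SELECT_TF_OPS,    # fall back to TF ops where needed
]
converter.experimental_enable_resource_variables = True  # needed for the trainable variables
tflite_model = converter.convert()
with open("model.tflite", "wb") as f:  # placeholder output path
    f.write(tflite_model)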
And here is the C++ code I am using to run the tflite model:
#include <iostream>
#include <memory>

#include "tensorflow/lite/interpreter.h"
#include "tensorflow/lite/kernels/register.h"
#include "tensorflow/lite/model.h"

std::unique_ptr<tflite::FlatBufferModel> model =
    tflite::FlatBufferModel::BuildFromFile(tflite_model_path);
if (model == nullptr)
{
    std::cout << "Failed to load model" << std::endl;
    return;
}
tflite::ops::builtin::BuiltinOpResolver resolver;
tflite::InterpreterBuilder builder(*model, resolver);
std::unique_ptr<tflite::Interpreter> interpreter;
builder(&interpreter);
if (interpreter == nullptr)
{
    std::cout << "Failed to create interpreter" << std::endl;
    return;
}
if (interpreter->AllocateTensors() != kTfLiteOk)
{
    std::cout << "Failed to allocate interpreter tensors" << std::endl;
    return;
}
tflite::SignatureRunner* train_runner = interpreter->GetSignatureRunner("train");
TfLiteTensor* input_data_tensor = train_runner->input_tensor(train_runner->input_names()[0]);
float* input_data = input_data_tensor->data.f;
TfLiteTensor* input_labels_tensor = train_runner->input_tensor(train_runner->input_names()[1]);
float* input_labels = input_labels_tensor->data.f;
// Here I fill in the input data and labels, code redacted for brevity.
if (train_runner->Invoke() != kTfLiteOk)
{
    std::cout << "Error invoking train interpreter signature" << std::endl;
    return;
}
const TfLiteTensor* output_tensor = train_runner->output_tensor(train_runner->output_names()[0]);
float* output = output_tensor->data.f;
std::cout << "Training finished with loss: " << output[0] << std::endl;
Please let me know if you need more details or the full source code.
Relevant log output
Here are the losses from batch to batch; as you can see, they are far too high and essentially random. I repeat: the model is training correctly, which I can see because the accuracy on the test set is steadily rising, so these loss values do not make sense.
Training of batch 1 finished with loss: 172.813
Training of batch 2 finished with loss: 30406.2
Training of batch 3 finished with loss: 35372.7
Training of batch 4 finished with loss: 30955.9
Training of batch 5 finished with loss: 30645.5
Training of batch 6 finished with loss: 39069.4
Training of batch 7 finished with loss: 25181.5
Training of batch 8 finished with loss: 28106.7
Training of batch 9 finished with loss: 12969.1
Training of batch 10 finished with loss: 3079.69
Training of batch 11 finished with loss: 3693.12
Training of batch 12 finished with loss: 3314.77
Training of batch 13 finished with loss: 4591.12
Training of batch 14 finished with loss: 5880.76
Training of batch 15 finished with loss: 5654.75
Training of batch 16 finished with loss: 10133.1
Training of batch 17 finished with loss: 9301.94
Training of batch 18 finished with loss: 11654.5
Training of batch 19 finished with loss: 11827.8
Training of batch 20 finished with loss: 22028.1
Training of batch 21 finished with loss: 8553.58
About this issue
- State: open
- Created a year ago
- Comments: 16
Do you even know how cross-entropy loss is calculated, the math behind it? Are you aware how big the mistakes a model would have to make to push the loss into the thousands, how badly it would have to diverge instead of giving 87% accuracy on the whole test set? Have you ever seen a loss larger than two digits in the successful training of any ML model known to mankind?
That aside, don’t you find it suspicious that the loss is so much smaller only in the first batch?
Have you heard of the MNIST dataset? Are you just ignoring the fact that this is the classic MNIST dataset and not some random dataset, and that the model is a simple one-layer neural network used in all of the TensorFlow examples, with well-known expected loss/accuracy results? I’ve purposely used the simplest model here for ease of demonstration, but you keep pretending we are talking about training GPT… You say “It’s hard to say w/o more context” when I’ve given you literally all the context possible, yet you still talk in hypotheticals.
The accuracy is measured on a whole separate dataset used for testing (the MNIST test set). Can you explain how the model can achieve 87% accuracy on the test set but have a four-digit loss on the training set?
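Just to spell out the magnitudes, here is a back-of-the-envelope check (rough numbers, nothing specific to my model):

import numpy as np

# Cross-entropy of a 10-class model that predicts a uniform distribution:
print(-np.log(1.0 / 10.0))  # ~2.30 per example
# Loss of a correct prediction made with 87% confidence:
print(-np.log(0.87))        # ~0.14 per example
# A mean loss around 30000 over a batch of 32 would mean the model assigns
# probabilities on the order of exp(-30000) to the true classes, which is
# absurd for a model hitting 87% accuracy on the test set.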
I am creating an untrained model, converting it to tflite, and then training it side by side in both Python and C++, on the same dataset, calling the same signature functions. Of course, I am training the Python model with Python and the tflite model with the C++ TFLite API, because how else would I do it? You can’t train a tflite model in Python, and I know you will now say “gotcha, those are different models!”, but more on that below.
Yes, I am aware of that. I’ve worked on a couple of ML tool implementations, and I know it should not give exactly the same results, but they should be close, not differ by a factor of 1000, for Christ’s sake. I still cannot believe what I am reading; how can someone so confidently ignore the obvious… You must just be waiting for me to give up so you can close this issue. The scale of the difference in loss values between Python and C++ TFLite doesn’t bother you at all? How do you even test the C++ TFLite implementation, what are you comparing it against if not the Python TensorFlow results?