keras: Writing a metric is not easy.

Recently a friend asked me to write a MAP (mean average precision) metric. This metric requires a sorting operation.

At first, I planned to use K.get_value() to get the values of y_true and y_pred, then use the sorting methods provided by numpy or plain Python to compute the metric value, like this:

import numpy as np
from keras import backend as K

def MAP(y_true, y_pred):

    # this is the step that fails: the tensors have no values yet
    np_y_true = K.get_value(y_true)
    np_y_pred = K.get_value(y_pred)

    # sort samples by predicted score, descending
    zipped = zip(np_y_true, np_y_pred)
    zipped.sort(key=lambda x: x[1], reverse=True)

    np_y_true, np_y_pred = zip(*zipped)
    # positions of the positive samples in the sorted list
    k_list = [i for i in range(len(np_y_true)) if int(np_y_true[i]) == 1]
    score = 0.
    r = np.sum(np_y_true).astype(np.int64)
    for k in k_list:
        Yk = np.sum(np_y_true[:k+1])
        score += Yk/(k+1)

    score /= r

    return K.variable(score)

However, this function doesn't work, because a metric is part of the computation graph, which means it must be a pure "tensor operation". I cannot get the values of y_true and y_pred because they are still "empty" at this point: only when real data is fed into the input tensors of the computation graph can we obtain the values of these tensors.
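For contrast, a metric that uses only backend tensor ops does work inside the graph; this is essentially Keras' built-in binary accuracy:

from keras import backend as K

def binary_accuracy(y_true, y_pred):
    # pure tensor operations: this only builds graph nodes and never needs
    # the actual values of y_true/y_pred at definition time
    return K.mean(K.equal(y_true, K.round(y_pred)), axis=-1)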

I think this is perhaps not a good fit for metric functions, because a metric function is not really part of the model: it is only there for evaluation, not for producing losses, gradients, or anything else that matters to the training process.

There are many different metrics, and it is not easy for users to define them in pure tensor language. Perhaps we should rethink the implementation of metrics: decouple them from the computation graph and make it possible to use numpy/Python for such tasks.

What do you think? @fchollet

BTW, in the end I wrote a callback that calls model.predict to compute MAP. It works well, but this approach only works for a fixed validation set and cannot compute the metric on the training data. If more information were exposed to Callback, this could be a viable solution. Here is the code:

import numpy as np
from keras.callbacks import Callback

class MAP_eval(Callback):
    def __init__(self, validation_data):
        self.validation_data = validation_data
        self.maps = []

    def eval_map(self):
        x_val, y_true = self.validation_data
        # predictions are recomputed here with an extra model.predict call
        y_pred = self.model.predict(x_val)
        y_pred = list(np.squeeze(y_pred))
        # sort samples by predicted score, descending
        zipped = zip(y_true, y_pred)
        zipped.sort(key=lambda x: x[1], reverse=True)

        y_true, y_pred = zip(*zipped)
        k_list = [i for i in range(len(y_true)) if int(y_true[i]) == 1]
        score = 0.
        r = np.sum(y_true).astype(np.int64)
        for k in k_list:
            Yk = np.sum(y_true[:k+1])
            score += Yk/(k+1)
        score /= r
        return score

    def on_epoch_end(self, epoch, logs={}):
        score = self.eval_map()
        print "MAP for epoch %d is %f" % (epoch, score)
        self.maps.append(score)


Most upvoted comments

Maybe we could add something like model.add_metric(tensor) like we do with losses. Thoughts?

Bringing this up again now that some of the metrics were removed from the latest release. It would be really great if we could come up with a good callback class that enables evaluation outside of the computation graph. Has anyone already written something like this?

If you want to use numpy metrics, you don’t really need keras-level support. You can just have your training process dump saved models, then have a different eval process load them, call predict on your validation data, and call a numpy metric on the output. Or any other similar workflow.
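For example, a minimal sketch of that workflow (the checkpoint and data file names here are just placeholders):

import numpy as np
from keras.models import load_model
from sklearn.metrics import average_precision_score

# load a checkpoint dumped by the training process
model = load_model('checkpoints/model_epoch_05.h5')

# load the held-out validation data saved to disk
x_val = np.load('x_val.npy')
y_val = np.load('y_val.npy')

# call predict once, then evaluate with any numpy/scikit-learn metric
y_pred = model.predict(x_val)
print(average_precision_score(y_val.ravel(), y_pred.ravel()))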

On 14 December 2017 at 14:40, Zvezdin notifications@github.com wrote:

Wasn’t the whole point of the issue about detaching the custom metrics from the graph and hence be able to use y_pred and y_true as numpy arrays to have the freedom of and compatibility to the numpy ecosystem?


Calling predict is a very expensive operation that should not have to be performed again—even just one extra time—in order to calculate metrics when predictions have already been made and are just inaccessible for architectural reasons.

For my use case an extra call to predict is not feasible because the data are ginormous and the additional predict would bloat train time considerably.

I think a good solution would be to somehow expose the latest predictions obtained from the model during training. If those were accessible from the model interface, then you could write a callback to perform whatever “fancy” (non-graph, Python-based) metric calculations you need without the expensive extra predict.

Maybe if I can figure out how to do this I will submit a PR, but I can’t imagine it would be that much work to expose this…

Any update on this? I would love a Callback that gives access to the predictions. Having access to the predictions already made (train or validation) would avoid recomputing them with .predict, which can be very costly, as @evictor pointed out.

Like many others, I would also love a way to write metrics “outside of the graph” that take numpy ndarrays as input instead of Tensors.

@cbaziotis Well, if you are using fit_generator or validation_split, it doesn’t work. I mean, the callback should be able to access the training samples in each forward pass, not have them passed in by hand. To do this, the code in _fit_loop and the Callback class would have to be modified properly, and the metrics part of .compile removed.

If @fchollet agrees to adjust the structure, I can help write some PRs.

Using scikit-learn metrics is a good idea; there are also lots of other modules that could be used in Keras, K-fold cross-validation for example. Although it would add another dependency to Keras, I think it is worth it. However, @fchollet is the boss, so it’s up to him.

I wanted to do a similar thing, and I decided to explicitly pass the validation and test sets to my MetricsCallback. I don’t like it, but it works for me.

MetricsCallback.py

from keras.callbacks import Callback

class MetricsCallback(Callback):
    def __init__(self, metrics, validation_data, test_data):
        super().__init__()
        self.validation_data = validation_data
        self.test_data = test_data
        self.metrics = metrics
        ...

and then do

metrics_callback = MetricsCallback(validation_data=(X_val, y_val),
                                   test_data=(X_test, y_test),
                                   metrics=["macro_f1", "macro_recall"])

nn_model.fit(X_train, y_train, validation_data=(X_val, y_val),
             nb_epoch=150, batch_size=128,
             callbacks=[metrics_callback])

Wasn’t the whole point of the issue about detaching the custom metrics from the graph and hence be able to use y_pred and y_true as numpy arrays to have the freedom of and compatibility to the numpy ecosystem?

@lebavarois Well, this is the metric given in some data mining competition documents. It is a binary classification problem here~

I really appreciate your response; it’s very clear and helpful. Thank you very much for the explanation.

Perhaps we’ve drifted a little off subject; let’s come back to the main topic~ My point is that although we can implement our own metrics by defining a callback, we shouldn’t need to. Here are the reasons:

  • Metrics are there to evaluate the performance of our model, not to produce gradients, so they don’t need to be part of the computation graph; that’s why we can remove them from the compile process.

  • Removing metrics from the computation graph will make it easier for users to define metrics with more complicated logic, or to use functions provided by other packages such as scikit-learn; that’s why we want to do this.

Please feel free to correct me if I am wrong.

I would like to use opencv in my metrics and track them while I run the graph. It is unnecessary to force users to use callbacks for this. There is no benefit at all to having metrics in the graph. Plain Python is usually fine and fast enough for metrics.

While this PR might help for some problems, I think fredtcaroli’s suggestion might work best for a larger set of them. I also needed to compute a custom accuracy during training, but it needs extra information (not extra tensors besides pred/true, but, as in the image_ocr example, a dictionary for decoding purposes). Sending all that information to the tf session is kind of pointless, and it makes things really hard for beginners.

Though I am a beginner, I think evaluating metrics outside of graph execution might be the best choice. Edit: at least optionally; obviously it’s better to keep them in the graph for computational cost reasons.

And if extra tensors are needed to evaluate such a metric, they should be mapped so that Keras can extract their values from the session and pass them to whatever does the metric evaluation.

Btw, I came across several questions/problems while searching; most solutions are just workarounds for a specific problem and not applicable to others. So this issue might need a little more attention on the design side.

Well… if we had something like model.add_metric(tensor), then tensorflow.py_func would be an option for folks wanting to use third-party libs to compute metrics.

It’s kinda limited, but should cover 90% of the use cases, I guess
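For illustration, a minimal sketch of that route, assuming the TensorFlow backend, binary labels, and a scikit-learn metric (ap_metric and _np_ap are just names made up for this sketch):

import numpy as np
import tensorflow as tf
from sklearn.metrics import average_precision_score

def _np_ap(y_true, y_pred):
    # plain numpy/scikit-learn code, executed outside the graph;
    # metrics never need gradients, so py_func having no gradient is fine
    return np.float32(average_precision_score(y_true.ravel(), y_pred.ravel()))

def ap_metric(y_true, y_pred):
    # wrap the numpy function as a graph op so it can be passed to
    # model.compile(..., metrics=[ap_metric])
    return tf.py_func(_np_ap, [y_true, y_pred], tf.float32)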

I still think that unnecessarily exposing the last batch predictions is bad, but that’s only my 2 cents

I would also like to be able to implement that kind of custom metric at the expense of a bit of performance. I don’t see how the workaround by @cbaziotis solves the issue, because the custom callback never receives y_pred; it only keeps references to the network input data and expected output.

@fredtcaroli you can just append the metric name to self.params['metrics'] and it will show in the progbar: self.params['metrics'].append('my_custom_metric')
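A minimal sketch of that trick (the class and metric names are made up here), assuming a Keras version where callbacks share a params dict containing a 'metrics' list and where the progress-bar logger runs after user callbacks:

from keras.callbacks import Callback

class ProgbarMetric(Callback):
    def on_train_begin(self, logs=None):
        # register the name so the progress bar knows to display it
        self.params['metrics'].append('my_custom_metric')

    def on_epoch_end(self, epoch, logs=None):
        logs = logs or {}
        # any Python/numpy computation can go here; writing the value into
        # `logs` under the registered name makes it show up in the progbar
        logs['my_custom_metric'] = 0.0  # placeholder value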

@lebavarois I see, yes, it would be faster. I know theano.tensor.sort; it’s… hard to understand and even harder to verify that you’re using it correctly.

Anyway, speed is a reasonable concern, although compared with the training cost I don’t think the evaluation step is very time-consuming. In my opinion, feasibility is more important than efficiency; besides, computation speed has never been Keras’ strong suit.

Assuming that, for the sake of evaluation efficiency, we want to retain the current implementation, I still think we should at least add another callback class (one that just uses scikit-learn metrics) so that users have the option to choose. If a metric is very complicated and we don’t really care about training time, they could choose the callback version.
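Something like the sketch below would already cover the scikit-learn case (the class name is made up here, and it assumes a binary classifier with the validation data passed in explicitly):

import numpy as np
from keras.callbacks import Callback
from sklearn.metrics import f1_score, recall_score

class SklearnMetricsCallback(Callback):
    def __init__(self, validation_data):
        super(SklearnMetricsCallback, self).__init__()
        self.x_val, self.y_val = validation_data
        self.history = []

    def on_epoch_end(self, epoch, logs=None):
        # one extra predict per epoch, then plain scikit-learn metrics
        y_pred = (self.model.predict(self.x_val) > 0.5).astype(int).ravel()
        scores = {'macro_f1': f1_score(self.y_val, y_pred, average='macro'),
                  'macro_recall': recall_score(self.y_val, y_pred, average='macro')}
        self.history.append(scores)
        print("epoch %d: %s" % (epoch, scores))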

@MoyanZitto I found out why the values did not match in my previous post.

  1. I got the wrong value in Python 2; in Python 3 it is correct. score += Yk/(k+1) should be score += Yk/float(k+1) in Python 2.

  2. There seems to be a problem in the latest sklearn version. With the code of this pull request, I get the same value as in your code.

import numpy as np
from sklearn.metrics import average_precision_score

y_true = np.array([0, 0, 1, 1])
y_pred = np.array([0.1, 0.4, 0.35, 0.8])

def AP(np_y_true, np_y_pred):

    zipped = zip(np_y_true, np_y_pred)
    zipped.sort(key=lambda x:x[1],reverse=True)

    np_y_true, np_y_pred = zip(*zipped)
    k_list = [i for i in range(len(np_y_true)) if int(np_y_true[i])==1]
    score = 0.
    r = np.sum(np_y_true).astype(np.int64)
    for k in k_list:
        Yk = np.sum(np_y_true[:k+1])
        score += Yk/float(k+1)

    score/=r
    return score


print AP(y_true, y_pred)
#---> 0.8333
print average_precision_score(y_true, y_pred)
#---> 0.8333

Still, I think this is AP and not MAP. If you have a multi-label classification problem (e.g. one image with multiple class labels), AP gives you the evaluation for just one test datapoint (e.g. one test image). If you then have multiple test datapoints (e.g. multiple images), you can compute the mean over the whole test set, which is then MAP. So in this case y_true and y_pred should be 2-dimensional (multiple outputs), the AP function should be applied along the second dimension, and finally the mean is taken over the first dimension (which corresponds to the datapoints, e.g. multiple images).
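In code, using the AP function defined above, that would be roughly the following (a sketch, assuming Y_true and Y_pred are (n_datapoints, n_labels) numpy arrays):

import numpy as np

def MAP(Y_true, Y_pred):
    # AP per datapoint (e.g. per image), then the mean over the whole test set
    return np.mean([AP(y_t, y_p) for y_t, y_p in zip(Y_true, Y_pred)])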

Maybe for other use cases it makes more sense to implement it the way you did (if the class label is part of the model input and you have just one model output, e.g. a recommendation system). What kind of data are you evaluating?