keras: Using outputs with missing values

I’m trying to train a multi-task regression model, but my outputs are incomplete (on average, fewer than 1% of the output values are present per training instance). I expected mean squared error over the non-null outputs to be a reasonable objective; however, with Keras’ built-in mean squared error objective the cost comes out as NaN, since the NaNs propagate.

Are there any plans to support this sort of thing (or is it already supported somehow and I missed it)?

If not, does anyone have an idea for a hack? I tried writing a new cost function, like:

from keras import backend as K
import theano.tensor as T

def mean_squared_error(y_true, y_pred):
    # average the squared error over the non-NaN positions only
    return K.mean(K.square(y_pred - y_true)[(1 - T.isnan(y_true)).nonzero()], axis=-1)

(sorry for mixing keras and theano!)

This evaluates correctly on 1-D vectors with NaNs in y_true, but it doesn’t work with Keras, even with the batch size set to 1. My next plan was to set the NaNs in y_true equal to y_pred, so those positions contribute zero error, but I’m not experienced with Theano. Roughly, I had in mind the untested sketch below.
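Something like this (untested, assuming the Theano backend; mse_patch_nan is just a placeholder name):

import theano.tensor as T
from keras import backend as K

def mse_patch_nan(y_true, y_pred):
    # substitute y_pred wherever y_true is NaN, so those positions
    # contribute exactly zero to the error and to its gradient
    y_true_patched = T.switch(T.isnan(y_true), y_pred, y_true)
    return K.mean(K.square(y_pred - y_true_patched), axis=-1)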

About this issue

  • State: closed
  • Created 8 years ago
  • Comments: 15 (4 by maintainers)

Most upvoted comments

Did anyone try using something like this:

import tensorflow as tf
from keras import backend as K

def custom_error_function(y_true, y_pred):
    bool_finite = tf.is_finite(y_true)  # mask of the non-NaN (and non-inf) targets
    return K.mean(K.square(tf.boolean_mask(y_pred, bool_finite) - tf.boolean_mask(y_true, bool_finite)), axis=-1)

It works for me on a small test dataset with np.nan values in y_true (this assumes the TensorFlow backend).
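A quick sanity check along those lines (again assuming the TensorFlow backend and the TF 1.x API):

import numpy as np
import tensorflow as tf
from keras import backend as K

y_true = tf.constant([[1., np.nan], [3., 4.]])
y_pred = tf.constant([[1., 2.], [5., 4.]])

# the mask drops the NaN entry and flattens, so the mean runs over the three
# surviving elements: ((1-1)**2 + (5-3)**2 + (4-4)**2) / 3 ≈ 1.3333
print(K.eval(custom_error_function(y_true, y_pred)))

One thing to note: tf.boolean_mask flattens its input, so the mean runs over every surviving element in the batch rather than per sample, which may or may not be what you want for a multi-output model.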

Hey @cpury, sorry for not replying until now!

In the end I did get this working with TF (back in 2016!). I don’t have access to the code right now, but I think it worked mostly like the above:

>>> import tensorflow as tf
>>> from numpy import array, nan

>>> def mse_mv(y_true, y_pred):
...     # zero the squared error wherever y_true is NaN, then average
...     # over the batch axis (note: this still divides by the full
...     # batch size, NaN positions included)
...     per_instance = tf.where(tf.is_nan(y_true),
...                             tf.zeros_like(y_true),
...                             tf.square(tf.subtract(y_pred, y_true)))
...     return tf.reduce_mean(per_instance, axis=0)

>>> y_true = array([[ 1.,  2.],
...                 [ 2.,  3.],
...                 [nan,  4.],
...                 [ 5.,  6.]])

>>> y_pred = array([[ 1.,  2.],
...                 [ 2.,  4.],
...                 [42.,  4.],
...                 [ 3.,  7.]])

>>> loss = mse_mv(y_true, y_pred)
>>> with tf.Session().as_default():
...    loss.eval()
array([1. , 0.5])

Dunno if this will work with Keras, but I guess so.

Edit: fixed it to be mean squared
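For anyone wondering how to wire this in: Keras accepts any callable with a (y_true, y_pred) signature as a loss, so it should just be a matter of (sketch only; model is assumed to be an existing Keras model):

model.compile(optimizer='adam', loss=mse_mv)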

Hi @lewisacidic, I have two questions about your answer. First, when you replace the NaNs with 0: in cases where 0 has a meaning like any other value (for instance in climatic time series), is it still correct to do so? Second, when we average the new sum of squares, shouldn’t we divide by the number of non-NaN values? Many thanks!
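Regarding the first question: in mse_mv the zeros replace the squared-error terms, not the targets themselves, so a 0 that is meaningful in the data is never touched. Regarding the second: the reduce_mean above does divide by the full batch size, NaN positions included. A sketch of a count-normalized variant (same TF 1.x API as above; mse_mv_normalized is just an illustrative name):

import tensorflow as tf

def mse_mv_normalized(y_true, y_pred):
    mask = tf.logical_not(tf.is_nan(y_true))
    # patch the NaN targets *before* squaring so no NaN leaks into the gradient
    y_true_safe = tf.where(mask, y_true, tf.zeros_like(y_true))
    sq_err = tf.square(y_pred - y_true_safe) * tf.cast(mask, y_pred.dtype)
    # divide by the per-output count of observed targets (at least 1)
    n_observed = tf.maximum(tf.reduce_sum(tf.cast(mask, y_pred.dtype), axis=0), 1.0)
    return tf.reduce_sum(sq_err, axis=0) / n_observed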