fairlearn: MetricFrame should support metrics that don't require y_true and y_pred

Right now MetricFrame only supports metrics with the signature metric(y_true, y_pred), but its disaggregation functionality should be much more broadly applicable. Two use cases of interest:

  1. Dataset-only metrics or prediction-only metrics. These are the metrics of the form metric(y_true) and metric(y_pred). While in principle these could be handled by the current API, it’s a bit confusing.
  2. Metrics for settings beyond classification / regression. For example, to evaluate a contextual bandit algorithm, the metrics have the signature metric(actions, rewards, propensities). This use case comes up in settings with partial observations, e.g., in lending: we only observe whether the person repays a loan (reward) for the specific loan type we provide (action), including “no loan”.

I think regardless of how the API is tweaked, there’s an obvious first step. And then hopefully a not-too-controversial second step.

Step 1: Make MetricFrame arguments keyword-only

The suggestion is to change the current API, which is this:

MetricFrame(metric, y_true, y_pred, *, sensitive_features, control_features=None, sample_params=None)

To something like this:

MetricFrame(*, metrics, y_true=None, y_pred=None, sensitive_features, control_features=None, sample_params=None)

So all arguments would be keyword-only and metric would become metrics (for consistency with other arguments and with columns in pandas). The functionality wouldn’t change.

Step 2: Allow flexible names of shared sample parameters

Change the signature to:

MetricFrame(*, metrics, sensitive_features, control_features=None, sample_params=None, **shared_sample_params)

The idea is that any of the shared_sample_params are passed on to all metrics, whereas sample_params is (as before) a dictionary. With the new signature, it would still be possible to write things like

mf = MetricFrame(metrics=skm.accuracy_score, y_true=y_true, y_pred=y_pred, sensitive_features=sf)

But it would be simple to use other kinds of metrics that do not work with y_true and y_pred.
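
For illustration, here’s a rough sketch (untested, and hypothetical, since none of this exists yet) of what the two use cases above might look like under the Step 2 signature. The metric implementations and the way the extra columns are forwarded are assumptions made for the sake of the example:

import numpy as np
from fairlearn.metrics import MetricFrame  # the calls below assume the *proposed* signature

# toy data: predictions, a sensitive feature, and logged bandit feedback
y_pred = np.array([1, 0, 1, 1])
sf = np.array(['a', 'a', 'b', 'b'])
rewards = np.array([1.0, 0.0, 1.0, 0.0])
propensities = np.array([0.5, 0.25, 0.5, 0.25])

# Use case 1: a prediction-only metric; y_true is simply omitted.
def positive_prediction_rate(y_pred):
    return np.mean(y_pred)

mf1 = MetricFrame(metrics=positive_prediction_rate, y_pred=y_pred, sensitive_features=sf)

# Use case 2 (simplified from metric(actions, rewards, propensities)): a bandit-style
# metric that needs neither y_true nor y_pred; the extra columns are passed as shared
# sample parameters and would be sliced per group like everything else.
def ips_weighted_reward(rewards, propensities):
    # inverse-propensity-weighted average of the observed rewards (illustrative only)
    return np.average(rewards, weights=1.0 / propensities)

mf2 = MetricFrame(metrics=ips_weighted_reward, rewards=rewards,
                  propensities=propensities, sensitive_features=sf)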

About this issue

  • State: open
  • Created 3 years ago
  • Reactions: 1
  • Comments: 74 (71 by maintainers)

Most upvoted comments

Is there resistance even to Step 1, which would just modify the current API to:

MetricFrame(*, metrics, y_true=None, y_pred=None, sensitive_features, control_features=None, sample_params=None)

I don’t think that it would make the API more confusing, would it?

I think that there are good use cases that have nothing to do with reinforcement learning and that we have already run into (dataset-only metrics like demographic parity, streaming metrics which are currently being implemented, and word-error-rate-like settings which have been requested as well). But let me create some short examples to demonstrate.

Note that the metric functions themselves are dispatched using positional arguments - the basic signature is assumed to be metric_fn(y_true, y_pred). Now that we are requiring named arguments, this is an inconsistency.

I personally don’t mind this inconsistency.

As for the rest of the proposal, I find it way too complicated, not very user friendly, and hard to explain to users.

I would rather settle for a middle ground where we leave MetricFrame unchanged and have a DatasetMetricFrame which accepts only one y. An API which supports all possible cases is an overly complicated one designed to handle 100% of the cases; we could settle for 95%, have a much simpler API, and let the other 5% of use cases do the appropriate workarounds.

We should of course accept extra input required by the metric. For example, in the case of inverse_propensity_score, if I understand correctly, we can have something like:

inverse_propensity_score(
    y_pred=actions_taken,
    y_true=actions_target,
    rewards=rewards,
    propensities=propensities
)
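
For concreteness, a rough sketch of what such an inverse_propensity_score could compute, using the standard inverse-propensity-scoring estimate of a target policy’s value (the function itself is hypothetical; the keyword mapping follows the call above, with y_true=actions_target and y_pred=actions_taken):

import numpy as np

def inverse_propensity_score(y_true, y_pred, *, rewards, propensities):
    # IPS estimate of the value of the policy whose actions are y_true (actions_target),
    # computed from the logged actions y_pred (actions_taken), the observed rewards,
    # and the logging policy's propensities for those logged actions
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    rewards, propensities = np.asarray(rewards), np.asarray(propensities)
    matches = (y_true == y_pred)
    return np.mean(matches * rewards / propensities)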

And with SLEP006 sklearn could accept something like:

make_scorer(inverse_propensity_score, score_params=["rewards", "propensities"])

to get a scorer. Note that this has removed the need for us to do any introspection, but we have a scorer object which is a callable but also has state.

An alternative here, since we don’t actually do slicing to re-call a scorer, is to have an object which only accepts precomputed scores, like

PrecomputedMetricFrame(
    auc=auc_score(...),
    propensity=propensity_score(...),
    ...,
    sensitive_attributes=...
)

Or something like that.

Summary

Per @adrinjalali 's request 😃

I think that we’ve got a few things going on here:

  1. It seems that when users see y_true and y_pred they may think that MetricFrame only works for classification metrics, rather than more generally (really, they are column_from_data and column_from_model and even that meaning is only really imposed by the underlying metric function)
  2. Sometimes the two ‘columns’ might actually need to be multiple columns
  3. Some metrics only take one of these (e.g. selection_rate()) and users might not want to write a wrapper for this

@kstohr has also suggested that the ‘split up by sensitive and conditional features’ part of MetricFrame may well work better as a separate piece of functionality.

Re. static initializer from_frame(): That works for me. I like it. However, some comments:

  • What do we do about the streaming API here? I think that this is actually a very nice use case for streaming metrics, because aggregation is super easy, but it would lead to a more complicated add_data(). This makes me think that rather than a static initializer, we should actually have a subclass MetricFrameFromValues or something like that, which would have a different constructor and a different add_data(). I’m generally not a big fan of subclassing in Python, but this might be a perfect use case.
  • Re. @hildeweerts question, why would a data scientist want this. I can think of the following:
    • There’s some extra functionality: difference(), ratio(), etc., and (hopefully soon), we will have streaming, and possibly error bars.
    • If they want to build on some of our examples that use MetricFrame, this would be an easy drop-in option.
    • This feature is actually being requested by a team that’s looking to use Fairlearn, so there are some data scientists that are interested 😃

@riedgar-ms : one of the reasons to have keyword-only parameters is so that you can skip them when they’re not needed. For example, if we went with the signature:

MetricFrame(*, metrics, y_true=None, y_pred=None, sensitive_features, control_features=None, sample_params=None)

Then it would be okay to drop y_true and y_pred and provide all the metric parameters via sample_params.

If the use case is “we can drop the current parameters, and have a different set of parameters instead” then we should probably be creating a new class to handle the second set of parameters. Right now, I feel that MetricFrame does the fairly simple task of taking an sklearn-style (y_true, y_pred) metric, and turning it into one with grouping (via the sensitive features). I’d prefer to keep it that way, and work out the best API for a new use case from a clean sheet.

It may be that there is a more general API hiding within, and that we end up changing MetricFrame to wrap (or subclass) that. But I think it would still be better to keep the existing MetricFrame API around, doing its simple job.

I don’t have a lot to add to this discussion, but I agree that leaving MetricFrame as is and adding an alternative more flexible version would be preferable.

pinging @fairlearn/fairlearn-maintainers to move the discussion forward

@hildeweerts that sort of leads to the question “Then what is y as opposed to y_true and y_pred?”. Scanning through the link to sklearn, it appears that it’s used for cases where the ground truth isn’t known? If so, would you always have either y_true and y_pred (s/y/labels/ as appropriate) OR just y?

I’m not sure if I understand your question correctly so sorry in advance if this is all obvious to you. But generally y is used for ground-truth during training, my guess is that the authors decided to not call it y_true here because there is no explicit counterpart y_pred. For evaluation we need to be more specific, hence y_true and y_pred (or y_score). Similarly, they use labels to refer to cluster labels and only append _true and _pred in cases where there is an explicit ground-truth.

The factory function would be something like:

import functools

def expand_y_pred(y_true, y_pred, f):
    # accept the standard (y_true, y_pred) signature, but evaluate f on y_pred only
    assert len(y_true) == len(y_pred)
    return f(y_pred)

def make_metric_from_y_pred_function(function):
    return functools.partial(expand_y_pred, f=function)

in which case we could define:

selection_rate = make_metric_from_y_pred_function(np.mean)

(note I have only written the code above, not actually tried it). There could be a similar make_metric_from_y_true_function() as well.
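
For completeness, the y_true counterpart mentioned above could be a mirror image (also untested; it reuses the functools import from the snippet above):

def expand_y_true(y_true, y_pred, f):
    # accept the standard (y_true, y_pred) signature, but evaluate f on y_true only
    assert len(y_true) == len(y_pred)
    return f(y_true)

def make_metric_from_y_true_function(function):
    return functools.partial(expand_y_true, f=function)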

Does that make more sense @hildeweerts ?

We’ve got two things going on here:

  • Where we have a precomputed metric where we just need to slice it up by the sensitive features
  • Where there are multiple ‘columns’ in y_pred and/or y_true

I definitely agree that not adding more arguments to the constructor is preferable - the static from_frame() approach @romanlutz suggests would be better if there’s a precomputed metric to be sliced.

For the second point, I can see that supporting multicolumn DataFrame arguments for both could be useful. Would that do enough to support @michaelamoako 's use cases?

@MiroDudik I like this, but rather than adding lots and lots of possible arguments to MetricFrame that only really work for some precomputed use cases, I would suggest having a function, let’s call it from_frame, where you specify the columns:

import pandas as pd

df = pd.DataFrame({'metric1': [0.1, 0.2, 0.3, 0.1], 'metric2': [1, 0, 0.5, 0.5], 'sf': ['a', 'b', 'a', 'b']})
mf = MetricFrame.from_frame(df, metric_value_cols=['metric1', 'metric2'], sensitive_feature_cols=['sf'])

As a user I certainly would be confused about which args to provide if we keep on adding more and more. It’s already quite a few that our average user probably doesn’t care about (such as control features).

@riedgar-ms : what do you think about Alternative A? I think it would go a long way towards addressing the main issues and it doesn’t introduce **kwargs.

@hildeweerts : correct, our current functionality is supported by both Alternative A and Alternative B by just requiring to name y_true, y_pred, and metrics.

I know… examples… soonish?

@riedgar-ms : re. swift release, my suggestion re. staging is

1. backward-compatible implementation with a deprecation warning (a rough sketch follows this list):

MetricFrame(*args, metrics, y_true=None, y_pred=None, sensitive_features, control_features=None, sample_params=None)

2. move to keyword-only arguments:

MetricFrame(*, metrics, y_true=None, y_pred=None, sensitive_features, control_features=None, sample_params=None)

3. addition of any other arguments like shared_sample_params (but that could happen later or together with 2)
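
As a rough illustration of stage 1 (a sketch only, not an actual implementation), the constructor could keep accepting leftover positional arguments just long enough to map them onto the old order and emit a deprecation warning:

import warnings

class MetricFrame:
    def __init__(self, *args, metrics=None, y_true=None, y_pred=None,
                 sensitive_features=None, control_features=None,
                 sample_params=None):
        if args:
            warnings.warn(
                "Passing MetricFrame arguments positionally is deprecated; "
                "please switch to keyword arguments.",
                FutureWarning,
            )
            # the old positional order was (metric, y_true, y_pred)
            old_order = dict(zip(["metrics", "y_true", "y_pred"], args))
            metrics = old_order.get("metrics", metrics)
            y_true = old_order.get("y_true", y_true)
            y_pred = old_order.get("y_pred", y_pred)
        # ... rest of the existing constructor, unchanged ...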

Examples are coming next week… promise!

It may be that there is a more general API hiding within, and that we end up changing MetricFrame to wrap (or subclass) that. But I think it would still be better to keep the existing MetricFrame API around, doing its simple job.

I was thinking the same!

I think the majority of users will be looking for classification/regression metrics and may not even be familiar with reinforcement learning. Having to look at examples to understand how to use an API even in the ‘simplest’ scenario, signals that it is not intuitive.

I like @romanlutz idea of writing out some examples!

I feel like this discussion could benefit from some concrete examples. Like @riedgar-ms (and @hildeweerts I think) I’m currently struggling to see the benefit of complicating the currently very intuitive API, but perhaps with 3-5 examples of what this would look like for actual metrics (say, one of the existing ones we support, and then whichever new ones we might be able to accommodate through this) we might see the benefits (?)

@riedgar-ms : one of the reasons to have keyword-only parameters is so that you can skip them when they’re not needed. For example, if we went with the signature:

MetricFrame(*, metrics, y_true=None, y_pred=None, sensitive_features, control_features=None, sample_params=None)

Then it would be okay to drop y_true and y_pred and provide all the metric parameters via sample_params.

I think that there are a few different things going on here, and I’d like to separate the different threads.

Firstly, I think that the idea behind shared_sample_params is a good one. Indeed, I considered putting something like that in last autumn when I was writing MetricFrame. I didn’t because it wasn’t a ‘must have’ feature. However, as I said above, I think it should be a Dict[str,vector] and not **kwargs. Going with the latter makes it harder for us to add any other named arguments of our own (since whatever name we pick, we’re bound to break someone) at a later date. We ‘explode’ the Dict when the metric functions themselves are invoked.
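
To illustrate that last point, under a Dict[str, vector] design the per-group dispatch could look roughly like this (hypothetical internals, not the actual fairlearn code):

def evaluate_on_group(metric_fn, y_true, y_pred, shared_sample_params, group_mask):
    # slice every shared sample parameter the same way as y_true / y_pred,
    # then expand the dict into keyword arguments for the metric call
    sliced = {name: values[group_mask] for name, values in shared_sample_params.items()}
    return metric_fn(y_true[group_mask], y_pred[group_mask], **sliced)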

Secondly, I certainly don’t think that MetricFrame is the last word on these sort of metrics. Indeed, I’m sure that areas like text or vision may well need similar but different functionality (I don’t know enough about them to say). But like @hildeweerts , I think that it may well be better to take each by itself, and figure out what works best for that domain, rather than trying to shoehorn them into MetricFrame.

I can see the logic in renaming the metric argument to metrics (it is a potential break, but only if people are naming their arguments). I’m not sure I understand the motivation behind forcing named arguments. I think that would make the interface much less sklearn-like.

I’m not terribly opinionated on this issue, but if we move to something like MetricFrame(*, metrics, sensitive_features, control_features, sample_params, **shared_sample_params) then we had better have extremely good documentation. Reading this in the API documentation, I would have no clue what to do with that. In contrast, the current signature is very intuitive to me. Obviously that can be fixed with a couple of nice examples right in the API doc and updating the user guide, but I want to make sure this is on our radar (i.e., no code PR without updating the docs).

I would also suggest providing a few examples of what that would look like if we have different kinds of metrics (e.g. accuracy needs y_true and y_pred, selection rate doesn’t need a second set of ys) in the same set of metrics passed in.
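
For example, under the proposed keyword-only signature such a mixed set might look roughly like this (hypothetical; how MetricFrame decides which arguments each metric actually receives is exactly the open question, and y_true, y_pred, sf are assumed to already exist):

import sklearn.metrics as skm
from fairlearn.metrics import MetricFrame, selection_rate

mf = MetricFrame(
    metrics={
        'accuracy': skm.accuracy_score,    # needs y_true and y_pred
        'selection_rate': selection_rate,  # only uses y_pred
    },
    y_true=y_true,
    y_pred=y_pred,
    sensitive_features=sf,
)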

Finally, I’m wondering whether the right thing to do here would be to add a warning that MetricFrame will have keyword-only args starting with the next version so people should adjust their scripts. [If so, we should make the switch to metrics instead of metric right away, otherwise it breaks people again.]

Re. metric -> metrics: the rationale is that we have sensitive_features and control_features. I’ve also noticed that in the tutorials I keep writing things like:

import sklearn.metrics as skm
import fairlearn.metrics as flm

# Define metrics
metrics = {
    'accuracy': skm.accuracy_score,
    'precision': skm.precision_score,
    'recall': skm.recall_score,
    'false_positive_rate': flm.false_positive_rate,
}

mf = MetricFrame(metrics, ...)

So it seems that metrics is the more appropriate name; it makes no sense to change the name in isolation, but if we decide to implement (1), then it would be an opportunity to change this as well.

@riedgar-ms : the transitional API (i.e., for the purposes of deprecating the old API) would be something like: MetricFrame(*args, metrics, y_true, y_pred, sensitive_features, control_features, sample_params)