fairlearn: MetricFrame should support metrics that don't require y_true and y_pred
Right now `MetricFrame` only works with metrics with the signature `metric(y_true, y_pred)`, but its disaggregation functionality should be much more broadly applicable. Two use cases of interest:
- Dataset-only metrics or prediction-only metrics. These are the metrics of the form `metric(y_true)` and `metric(y_pred)`. While in principle these could be handled by the current API, it’s a bit confusing.
- Metrics for settings beyond classification / regression. For example, to evaluate a contextual bandit algorithm, the metrics have the signature `metric(actions, rewards, propensities)`. This use case comes up in settings with partial observations, e.g., in lending: we only observe whether the person repays a loan (reward) for the specific loan type we provide (action), including “no loan” (see the sketch below).
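For illustration, here is a minimal sketch of the two kinds of metrics in question (editor's examples; the bandit-style metric is a toy and fairlearn's own `selection_rate` has a different signature):

```python
import numpy as np

def selection_rate(y_pred):
    """Prediction-only metric: fraction of samples predicted positive."""
    return np.mean(np.asarray(y_pred) == 1)

def ips_reward(actions, rewards, propensities):
    """Toy bandit-style metric: inverse-propensity-weighted average of the
    observed rewards. `actions` is part of the signature discussed above;
    a real estimator would use it."""
    rewards = np.asarray(rewards, dtype=float)
    propensities = np.asarray(propensities, dtype=float)
    return np.mean(rewards / propensities)
```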
I think regardless of how the API is tweaked, there’s an obvious first step. And then hopefully a not-too-controversial second step.
Step 1: Make MetricFrame arguments keyword-only
The suggestion is to change the current API, which is this:
`MetricFrame(metric, y_true, y_pred, *, sensitive_features, control_features=None, sample_params=None)`
To something like this:
`MetricFrame(*, metrics, y_true=None, y_pred=None, sensitive_features, control_features=None, sample_params=None)`
So all arguments would be keyword-only and `metric` would become `metrics` (for consistency with other arguments and with `columns` in pandas). The functionality wouldn’t change.
Step 2: Allow flexible names of shared sample parameters
Change the signature to:
`MetricFrame(*, metrics, sensitive_features, control_features=None, sample_params=None, **shared_sample_params)`
The idea is that any of the `shared_sample_params` are passed on to all metrics, whereas `sample_params` is (as before) a dictionary. With the new signature, it would still be possible to write things like
`mf = MetricFrame(metrics=skm.accuracy_score, y_true=y_true, y_pred=y_pred, sensitive_features=sf)`
But it would be simple to use other kinds of metrics that do not work with `y_true` and `y_pred`.
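As an editor's sketch of what Step 2 would enable (written against the proposed signature, not current fairlearn; `ips_reward` is the toy bandit metric sketched above):

```python
# Hypothetical usage under the proposed Step 2 signature -- not runnable
# against current fairlearn, where y_true and y_pred are required.
mf = MetricFrame(
    metrics=ips_reward,
    sensitive_features=sf,
    # shared sample parameters, forwarded to every metric:
    actions=actions,
    rewards=rewards,
    propensities=propensities,
)
print(mf.by_group)
```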
Is there resistance even to Step 1, which would just modify the current API to the keyword-only signature above? I don’t think that it would make the API more confusing, would it?
I think that there are good use cases that have nothing to do with reinforcement learning and that we have already run into (dataset-only metrics like demographic parity, streaming metrics which are currently being implemented, and word-error rate-like settings which are requested as well). But let me create some short examples to demonstrate.
I personally don’t mind this inconsistency.
As for the rest of the proposal, I find it way too complicated, not very user friendly, and hard to explain to users.
I would rather settle for a middle ground where we leave `MetricFrame` unchanged and have a `DatasetMetricFrame` which accepts only one `y`. An API which supports all possible cases is an overly complicated one designed to handle 100% of the cases; we could settle for 95%, have a much simpler API, and let the other 5% of use cases do the appropriate workarounds.

We should of course accept extra input required by the metric. For example, in the case of `inverse_propensity_score`, if I understand correctly, we can have something along the lines of a scorer that carries the extra inputs, and with SLEP006 sklearn could accept a similar construct to get a scorer. Note that this removes the need for us to do any introspection, but we end up with a scorer object which is a callable but also has state. An alternative here, since we don’t actually do slicing to re-call a scorer, is to have an object which only accepts precomputed scores, or something like that.
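A minimal sketch of what such a precomputed-scores object might look like (editor's illustration with hypothetical class and attribute names, not part of fairlearn):

```python
import pandas as pd

class PrecomputedMetricFrame:
    """Hypothetical: disaggregate already-computed per-sample scores by group."""

    def __init__(self, *, scores, sensitive_features):
        self._df = pd.DataFrame({
            "score": scores,
            "sensitive_feature": sensitive_features,
        })

    @property
    def overall(self):
        return self._df["score"].mean()

    @property
    def by_group(self):
        # Mean of the precomputed per-sample scores within each group.
        return self._df.groupby("sensitive_feature")["score"].mean()
```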
Summary
Per @adrinjalali 's request 😃
I think that we’ve got a few things going on here:
- Because of the names `y_true` and `y_pred`, users may think that `MetricFrame` only works for classification metrics, rather than more generally (really, they are `column_from_data` and `column_from_model`, and even that meaning is only really imposed by the underlying metric function).
- Some metrics only need one of the two arrays (e.g. `selection_rate()`), and users might not want to write a wrapper for this.
- @kstohr has also suggested that the ‘split up by sensitive and conditional features’ part of `MetricFrame` may well work better as a separate piece of functionality.

Re. static initializer `from_frame()`: that works for me. I like it. However, some comments: one concerns `add_data()`. This makes me think that rather than a static initializer, we should actually have a subclass `MetricFrameFromValues` or something like that, which would have a different constructor and a different `add_data()`. I’m generally not a big fan of subclassing in Python, but this might be a perfect use case. And since it would behave like `MetricFrame`, this would be an easy drop-in option.

If the use case is “we can drop the current parameters, and have a different set of parameters instead” then we should probably be creating a new class to handle the second set of parameters. Right now, I feel that `MetricFrame` does the fairly simple task of taking an `sklearn`-style `(y_true, y_pred)` metric and turning it into one with grouping (via the sensitive features). I’d prefer to keep it that way, and work out the best API for a new use case from a clean sheet. It may be that there is a more general API hiding within, and that we end up changing `MetricFrame` to wrap (or subclass) that. But I think it would still be better to keep the existing `MetricFrame` API around, doing its simple job.

I don’t have a lot to add to this discussion, but I agree that leaving `MetricFrame` as is and adding an alternative, more flexible version would be preferable.

pinging @fairlearn/fairlearn-maintainers to move the discussion forward
I’m not sure if I understand your question correctly, so sorry in advance if this is all obvious to you. But generally `y` is used for ground truth during training; my guess is that the authors decided not to call it `y_true` here because there is no explicit counterpart `y_pred`. For evaluation we need to be more specific, hence `y_true` and `y_pred` (or `y_score`). Similarly, they use `labels` to refer to cluster labels and only append `_true` and `_pred` in cases where there is an explicit ground truth.

The factory function would be something like the sketch below (note I have only written the code, not actually tried it). There could be a similar `make_metric_from_y_true_function()` as well. Does that make more sense @hildeweerts ?
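A rough sketch of what such a factory might look like (editor's illustration; the name `make_metric_from_y_pred_function` is assumed, since only the `y_true` variant is named in the comment):

```python
import numpy as np

def make_metric_from_y_pred_function(metric):
    """Wrap a metric(y_pred) so it matches the (y_true, y_pred) signature
    that MetricFrame currently expects; y_true is accepted but ignored."""
    def wrapped(y_true, y_pred, **sample_params):
        return metric(y_pred, **sample_params)
    return wrapped

# in which case we could define, for example:
selection_rate_metric = make_metric_from_y_pred_function(
    lambda y_pred: np.mean(np.asarray(y_pred) == 1)
)
```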
We’ve got two things going on here: metrics that don’t use `y_pred` and/or `y_true`, and multicolumn DataFrame arguments. I definitely agree that not adding more arguments to the constructor is preferable; the static `from_frame()` approach @romanlutz suggests would be better if there’s a precomputed metric to be sliced. For the second point, I can see that supporting multicolumn DataFrame arguments for both could be useful. Would that do enough to support @michaelamoako ’s use cases?
@MiroDudik I like this, but rather than adding lots and lots of possible arguments to `MetricFrame` that only really work for some precomputed use cases, I would suggest having a function, let’s call it `from_frame`, where you specify the columns. As a user I certainly would be confused about which args to provide if we keep on adding more and more. It’s already quite a few that our average user probably doesn’t care about (such as control features).
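A possible shape for such a helper (an editor's sketch with assumed names and parameters, shown as a free function rather than a static initializer for brevity; not an agreed-upon design):

```python
import pandas as pd

def from_frame(data, *, metric_columns, sensitive_feature_columns):
    """Sketch: disaggregate precomputed per-sample metric columns of a
    DataFrame by the specified sensitive-feature columns."""
    grouped = data.groupby(sensitive_feature_columns)
    return grouped[metric_columns].mean()

# Hypothetical usage with a DataFrame of precomputed per-sample scores:
# by_group = from_frame(df, metric_columns=["score"],
#                       sensitive_feature_columns=["race"])
```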
@riedgar-ms : what do you think about Alternative A? I think it would go a long way towards addressing the main issues and it doesn’t introduce `**kwargs`.

@hildeweerts : correct, our current functionality is supported by both Alternative A and Alternative B by just requiring to name `y_true`, `y_pred`, and `metrics`. I know… examples… soonish?
@riedgar-ms : re. swift release, my suggestion re. staging is:
1. backward-compatible implementation with a deprecation warning
2. move to keyword-only arguments
3. addition of any other arguments like `shared_sample_params` (but that could happen later or together with 2)

Examples are coming next week… promise!
I was thinking the same!
I think the majority of users will be looking for classification/regression metrics and may not even be familiar with reinforcement learning. Having to look at examples to understand how to use an API even in the ‘simplest’ scenario signals that it is not intuitive.
I like @romanlutz ’s idea of writing out some examples!
I feel like this discussion could benefit from some concrete examples. Like @riedgar-ms (and @hildeweerts I think) I’m currently struggling to see the benefit of complicating the currently very intuitive API, but perhaps with 3-5 examples of what this would look like for actual metrics (say, one of the existing ones we support, and then whichever new ones we might be able to accommodate through this) we might see the benefits (?)
@riedgar-ms : one of the reasons to have keyword-only parameters is so that you can skip them when they’re not needed. For example, with a keyword-only signature it would be okay to drop `y_true` and `y_pred` and provide all the metric parameters via `sample_params`.
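For illustration, a sketch of what that could look like (editor's example against the proposed keyword-only signature; `ips_reward` is the toy bandit metric from earlier, and this is not runnable against current fairlearn):

```python
# Sketch: with keyword-only arguments, y_true / y_pred can simply be omitted
# and the metric's inputs routed through sample_params instead.
mf = MetricFrame(
    metrics={"ips_reward": ips_reward},
    sensitive_features=sf,
    sample_params={
        "ips_reward": {
            "actions": actions,
            "rewards": rewards,
            "propensities": propensities,
        }
    },
)
```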
I think that there are a few different things going on here, and I’d like to separate the different threads.

Firstly, I think that the idea behind `shared_sample_params` is a good one. Indeed, I considered putting something like that in last autumn when I was writing `MetricFrame`. I didn’t because it wasn’t a ‘must have’ feature. However, as I said above, I think it should be a `Dict[str, vector]` and not `**kwargs`. Going with the latter makes it harder for us to add any other named arguments of our own at a later date (since whatever name we pick, we’re bound to break someone). We ‘explode’ the `Dict` when the metric functions themselves are invoked.

Secondly, I certainly don’t think that `MetricFrame` is the last word on these sorts of metrics. Indeed, I’m sure that areas like text or vision may well need similar but different functionality (I don’t know enough about them to say). But like @hildeweerts , I think that it may well be better to take each by itself and figure out what works best for that domain, rather than trying to shoehorn them into `MetricFrame`.

I can see the logic in renaming the `metric` argument to `metrics` (it is a potential break, but only if people are naming their arguments). I’m not sure I understand the motivation behind forcing named arguments. I think that would make the interface much less `sklearn`-like.

I’m not terribly opinionated on this issue, but if we move to something like
`MetricFrame(*, metrics, sensitive_features, control_features, sample_params, **shared_sample_params)`
then we better have extremely good documentation. Reading this in API documentation, I would have no clue what to do with it. In contrast, the current signature is very intuitive to me. Obviously that can be fixed with a couple of nice examples right in the API doc and by updating the user guide, but I want to make sure this is on our radar (i.e., no code PR without updating the docs). I would also suggest providing a few examples of what that would look like if we have different kinds of metrics (e.g. accuracy needs `y_true` and `y_pred`, selection rate doesn’t need a second set of `y`s) in the same set of `metrics` passed in.

Finally, I’m wondering whether the right thing to do here would be to add a warning that `MetricFrame` will have keyword-only args starting with the next version, so people should adjust their scripts. [If so, we should make the switch to `metrics` instead of `metric` right away, otherwise it breaks people again.]

Re. `metric` -> `metrics`: the rationale is that we have `sensitive_features` and `control_features`. I’ve also noticed that in the tutorials I keep passing several metrics at once, so it seems that `metrics` is the more appropriate name. It makes no sense to change the name in isolation, but if we decide to implement (1), then it would be an opportunity to change this as well.

@riedgar-ms : the transitional API (i.e., for the purposes of deprecating the old API) would be something like:
`MetricFrame(*args, metrics, y_true, y_pred, sensitive_features, control_features, sample_params)`
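A rough sketch of how such a transitional constructor could emit a deprecation warning for positional calls (an editor's illustration, not the actual fairlearn implementation):

```python
import warnings

class MetricFrame:
    def __init__(self, *args, metrics=None, y_true=None, y_pred=None,
                 sensitive_features=None, control_features=None, sample_params=None):
        if args:
            warnings.warn(
                "Passing MetricFrame arguments positionally is deprecated; "
                "please switch to keyword arguments.",
                FutureWarning,
            )
            # The old positional order was (metric, y_true, y_pred).
            provided = dict(zip(("metrics", "y_true", "y_pred"), args))
            metrics = provided.get("metrics", metrics)
            y_true = provided.get("y_true", y_true)
            y_pred = provided.get("y_pred", y_pred)
        # ... remaining construction logic unchanged ...
```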