scikit-learn: RFC: Breaking Changes for Version 2

A while ago we talked about the possibility of a version 2 with breaking changes. Specifically, this came up in the context of SLEP006: metadata routing, but there are other things we’ve wanted to break, all of which we could do in a single release. In this issue we can talk about the possibility of such a release, and maybe a few things we’d like to include in it.

I’ll try to summarize the challenges of keeping backward compatibility (BC) in the context of SLEP006:

  • MetaEstimator(Estimator()).fit(X, y, sample_weight) will raise according to SLEP006 since sample_weight is not requested by any object. For backward compatibility, during the transition period, it’ll only warn and assume sample_weight is requested by Estimator().
  • What should the behavior of GridSearchCV(Estimator(), scoring=scorer).fit(X, y, sample_weight) be during the transition period? If we keep backward compatibility, it should warn (whatever the warning may be) and route sample_weight to Estimator, but not to the scorer. And that’s only the case because we’re keeping BC; from the user’s perspective, it’s very weird (see the sketch after this list for the intended end state).
  • Pipeline ATM, in certain methods like fit, fit_transform, and predict, routes *_params only to the fit/etc. of the last estimator, and decision_function, transform, and score_samples have no routing at all. Keeping this backward compatible has proven to be challenging, and a bit nasty. ref: #24270
  • In quite a few meta-estimators, we check whether the sub-estimator has sample_weight in the signature of its fit, and pass the given sample weight along only if it does. With routing we would have a much better idea of when to pass sample weight and when not to, but if we’re keeping BC, it’s really hard to decide whether we should forward sample weights or not. Both AdaBoost (#24026) and Bagging (#24250) have been very challenging, to the point that, to keep BC, we might end up having to check whether a sub-estimator is itself a meta-estimator, inspect the signature of the sub-estimator’s fit, and consult the given routing info.
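
To make the end state concrete, here’s a minimal sketch of what explicit SLEP006-style routing looks like, using the set_fit_request / set_score_request API behind the experimental enable_metadata_routing config flag in recent releases (which meta-estimators honor it may differ between versions):

# Hedged sketch: nothing is routed implicitly; each consumer states
# whether it wants sample_weight.
import numpy as np
from sklearn import set_config
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, make_scorer
from sklearn.model_selection import GridSearchCV

set_config(enable_metadata_routing=True)  # experimental opt-in flag

rng = np.random.RandomState(0)
X, y = rng.rand(40, 3), rng.randint(0, 2, 40)
sw = rng.rand(40)

# Both the estimator and the scorer explicitly request sample_weight;
# without these requests, passing sample_weight to fit raises instead of
# being silently routed to one of them.
est = LogisticRegression().set_fit_request(sample_weight=True)
scorer = make_scorer(accuracy_score).set_score_request(sample_weight=True)

GridSearchCV(est, param_grid={"C": [0.1, 1.0]}, scoring=scorer).fit(
    X, y, sample_weight=sw
)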

I’m pretty sure there are other issues we haven’t foreseen, and they’ll come up as we work on this project. We thought it’d be easy once we got to this point, but BC has proven to make it very tricky and has stalled the progress.

All of the above adds substantial complexity which is probably error-prone and buggy. We have had to spend quite a bit of time, every time we go back to review those bits, to understand why we’ve done what we’ve done. Even as temporary solutions, they’re not very maintainable.

My proposal here is to have a version 2 which is not backward compatible with previous releases in certain ways, and to target a few breaking changes for it. We could aim to release version 2 in October 2023 or April/May 2024.

cc @scikit-learn/core-devs

About this issue

  • State: closed
  • Created a year ago
  • Reactions: 2
  • Comments: 35 (35 by maintainers)

Most upvoted comments

@betatim What I learned is that I should go into the mountains with you 🗻

@adrinjalali To better weigh the consequences of a decision, could you summarize (or point to a summary if one already exists) what a breaking change for SLEP006 will actually break, both for scikit-learn users and for 3rd party library developers?

from sklearn.experimental import sleep6

I really think y’all are underestimating the complexity this brings.

I’d put it the other way around: I might be overestimating the complexity of the change, and as a result thinking that doing something complicated like modifying code through an import statement is “a good trade-off”. The reason I am thinking that this change is a complex one, both in terms of code and for users, is that I’ve read the SLEP, some of the discussions, peeked at the related PRs, and was left confused/thinking “not sure I can wrap my head around this”.
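
For reference, the “modifying code through an import statement” pattern already exists for other experimental features; a minimal, real instance looks like this (any enable_* module for SLEP006 is purely hypothetical at this point):

# Existing "enable via import" precedent: importing the enable_* module
# patches the package so the experimental estimator becomes importable.
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

imp = IterativeImputer()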

Also, I do understand the challenges for the users when it comes to breaking changes, but I don’t really see a feasible way forward at this point which doesn’t involve that, and this is keeping us behind. I like @lorentzenchr’s expression of “5 years too late”.

On a personal level, I’ve been working on this topic for over 4 years now, and I have been under the impression that we care about it. I think it’d be a pity to shelve this feature.

It is hard, especially because so much work and thought has gone into it. For me as a bystander it is amazing how long this work was under discussion and then how long people have worked on it. It also means that as an outsider I look at it as a project with a huge cost and unclear benefits. But I think a large part of the unclearness is because I’ve not been involved a lot. There are also plenty of things that I’ve bet against in the past that turned out to be winners, so what do I know about the future.

The reason I wanted to explicitly bring up the idea of “stop working on this” is that I see the same warning signs I’ve learnt to look for when out and about in the mountains. They include things like “we’ve invested so much” (e.g. we spent a day travelling here, then paid for a hotel, permits, …), “we are nearly there, so let’s push” (we’ve done 95% of the ascent and there are only 50 meters left), “let’s make an exception just this one time” (don’t ski slopes over 30 degrees when the avalanche risk is Considerable, but today is different!), lots of people in the group going quiet (“I had a bad feeling, but no one else said anything, so I didn’t say anything either”), etc. In a lot of cases where people end up in trouble, it is factors like these that lead to it. The only way I know to help deal with this “get-there-itis” is to explicitly talk about it, even when you don’t think you are in a situation where you are affected by it, and to make it a normal thing that you always talk about. People don’t get in trouble 100m from the start, people get in trouble 100m from the top.

So my goal is not to suggest to stop this, but to have it as an explicit option that is discussed and then decided upon.

With all the points raised here having their own separate issue, and us not doing a breaking release now, I think we can close this one. Thanks everybody for the fruitful discussions.

For maintainers’ information and as previously mentioned to @adrinjalali, I prefer abstaining from participating in discussions on SLEP006: I am unaware of past discussions’ conclusions, and I think there are already enough and better-informed maintainers participating.

I trust you to make the best decisions.

A little add-on to @adrinjalali’s summary above. We agree on:

  • The current state of routing, in particular for sample_weight, is broken. Example: #25906
  • Metadata routing according to SLEP006 will solve that problem in a general way.
  • sample_weight has a special place and needs more attention. According to our tradition and claim, we should provide ways to make simple things simple, e.g. with good defaults, global config options, context managers, etc., and by fixing bugs.

I also believe that if there are sample weights involved then everything in the “chain” (be that a Pipeline or an estimator that does internal CV or an estimator that uses a metric to do early stopping) has to support weights. Here “supports” means “does something sensible when weights are involved”. This sensible thing could be “take them into account when re-sampling samples” (say for some kind of stratified sampler) or “ignore the weights” (say for something like MinMaxScaler).

To me it also feels like a bug that not everything in the chain has to be sample weight aware. I think that if we want to handle sample weights properly and consistently in scikit-learn, we first need to be clear about what a sample weight is. There are already discussions about that (which I can’t retrieve right now), but the way I see sample weights is that if an observation point has a weight w, then everything should act as if there were w copies of this point in the dataset.
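
One minimal way to make that interpretation concrete is the check below (a sketch; the equivalence holds for estimators whose objective is a weighted sum over samples, such as LogisticRegression, and only up to solver tolerance):

# Check that sample_weight=2 on a point behaves like duplicating that point.
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 1, 1])

weighted = LogisticRegression().fit(X, y, sample_weight=[1, 2, 1, 1])
duplicated = LogisticRegression().fit(
    np.vstack([X, X[[1]]]),  # append a copy of the point that has weight 2
    np.append(y, y[1]),
)

print(np.allclose(weighted.coef_, duplicated.coef_, atol=1e-3))  # expected: True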

Something to avoid are cases like https://github.com/scikit-learn/scikit-learn/issues/25906. To me this is a bug. If a user passes sample weights in, the metric inside LogisticRegressionCV has to support that. Otherwise the user has selected an invalid combination and should receive an exception.

Thinking about sample weights as I described above, any step that silently ignores sample weights is a bug in my mind. Thus, even if adding support for sample weights to an estimator changes the behavior when it’s used in a meta-estimator, I consider it a bug fix. This is why I tend to think that meta-estimators should raise if any of the steps does not support sample weights.
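
A rough sketch of that “raise instead of silently ignoring” policy on top of today’s API (has_fit_parameter is the existing signature-inspection helper; the raising wrapper itself is hypothetical):

# Hypothetical helper: refuse to fit a weighted Pipeline if any step would
# silently drop the weights.
from sklearn.pipeline import Pipeline
from sklearn.utils.validation import has_fit_parameter

def fit_weighted(pipeline: Pipeline, X, y, sample_weight):
    real_steps = [
        (name, step)
        for name, step in pipeline.steps
        if step is not None and step != "passthrough"
    ]
    unsupported = [
        name for name, step in real_steps
        if not has_fit_parameter(step, "sample_weight")
    ]
    if unsupported:
        raise ValueError(
            f"steps {unsupported} do not accept sample_weight and would ignore it"
        )
    # legacy routing: '<step>__<param>' keys are forwarded to each step's fit
    fit_params = {f"{name}__sample_weight": sample_weight for name, _ in real_steps}
    return pipeline.fit(X, y, **fit_params)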

I understand however that metadata routing is not just about sample weights and that all meta-estimators have different ways of doing things for technical reasons, and I haven’t spent enough time on this subject, unlike @adrinjalali 😃, to have a strong opinion on how to design the best API to handle that, so my thoughts are not very deep.

I can’t make it to the dedicated meeting today (🛩️). After some thinking and discussing I think my main worry is about making the “I don’t know either scikit-learn, please just make it happen, you are the expert after all” user story harder for little gain. To personify the project for a moment: scikit-learn is an expert in machine-learning, people do and should delegate decisions to it. Delegation is convenient and for the vast majority of people the outcomes are better (day trading vs passive index investing, wiring your own house vs a professional electrician doing it, using Python to do science vs C++).

For me this means that as a user I’ve expressed everything that there is to express with

from sklearn.ensemble import AdaBoostClassifier
from sklearn.linear_model import LogisticRegression
est = AdaBoostClassifier(LogisticRegression())
est.fit(X, y, sample_weight=sw)

Most likely the average user doesn’t even know that there are options and what the trade-offs between them are. As an expert who does know, most of the time I am not using this in “expert mode” either, so it is inconvenient to have to make choices.

I also believe that if there are sample weights involved then everything in the “chain” (be that a Pipeline or an estimator that does internal CV or an estimator that uses a metric to do early stopping) has to support weights. Here “supports” means “does something sensible when weights are involved”. This sensible thing could be “take them into account when re-sampling samples” (say for some kind of stratified sampler) or “ignore the weights” (say for something like MinMaxScaler).

Something to avoid are cases like https://github.com/scikit-learn/scikit-learn/issues/25906. To me this is a bug. If a user passes sample weights in, the metric inside LogisticRegressionCV has to support that. Otherwise the user has selected an invalid combination and should receive an exception.

This does not mean we should forbid, or make it impossible for, experts to re-jig things so that they can use LogisticRegressionCV with weights for the estimator but ignore them during metric calculation. However, this is a case where extra effort from the user should be required.
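
For illustration, under SLEP006-style requests that extra effort could look roughly like the sketch below (it assumes LogisticRegressionCV participates in routing, which may depend on the release, hence the commented-out fit):

# Expert case: weights are used for fitting but explicitly NOT routed to the
# internal scoring metric.
from sklearn import set_config
from sklearn.linear_model import LogisticRegressionCV
from sklearn.metrics import accuracy_score, make_scorer

set_config(enable_metadata_routing=True)

scorer = make_scorer(accuracy_score).set_score_request(sample_weight=False)
clf = LogisticRegressionCV(scoring=scorer)
# clf.fit(X, y, sample_weight=sw)  # the solver sees the weights, the metric does not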

I wrote more about this in https://github.com/scikit-learn/scikit-learn/issues/23928#issuecomment-1491443254

It feels like introducing “your chain of tools has to be consistent when it comes to using/not using sample weights” is something that can be done in a backwards-incompatible way with a deprecation cycle. For additional sample-aligned properties like protected group membership, I think we have fewer constraints from current users’ code because AFAIK it is virtually impossible to do this now anyway. This means I am less worried about requiring explicit routing setup for these use-cases. It would still be nice if there was a way to make the “happy path” work without much extra code from the user.

Most users have no sample weights, fewer have sample weights, even fewer have additional sample-aligned properties, and nearly no users are experts doing expert things. The defaults and convenience of using scikit-learn should reflect that (least amount of extra code for no weights, most extra code and “know what you are doing” for the experts doing expert things).

One thing we could consider, though: if we’re doing a breaking-change release, would we break things in a different way for this? As in, we wanted to do a smooth-transition kind of solution, but if we are breaking things anyway, would we do something different? @jnothman reminded me of this.

Do we need a SLEP for making a BaseEstimator / scikit-learn-core package?

I prefer not to introduce multiple breaking changes in a single release. This would make it much harder to update to v2. If we need to release a breaking v2, I prefer to only break for SLEP006 and nothing else. From reviewing the PRs in SLEP006, I agree it is much clearer to release SLEP006 with breaking changes.

Assuming we adopt SLEP006 with breaking changes in v2, my remaining concerns are:

  1. Silently changing behavior. Are there cases where existing code works, but will have new behavior under SLEP006 without raising an error?
  2. Migration guide for third party estimator developers. Given that v2 is a breaking change, it will take a little longer for users to move to v2, so it’s reasonable to say that third party estimator developers would want to support v1 and v2 for a little while. How much work does a third party developer need to do to make sure their code works with both v1 and v2? (A sketch of one approach follows below.)
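
On point 2, one low-effort pattern (a sketch, not an official recommendation) is to feature-detect the routing machinery instead of pinning a scikit-learn version; plain consumers that inherit from BaseEstimator and accept sample_weight in fit typically need no routing-specific code, since their set_fit_request methods are generated from the fit signature:

# Hedged sketch: detect whether the installed scikit-learn ships the SLEP006
# routing utilities, and only touch them when it does.
def routing_available() -> bool:
    try:
        from sklearn.utils import metadata_routing  # noqa: F401
    except ImportError:
        return False
    return True

if routing_available():
    # e.g. build a MetadataRouter for a third-party meta-estimator here;
    # simple consumers usually need nothing extra.
    from sklearn.utils.metadata_routing import MetadataRouter  # noqa: F401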