scikit-learn: Handle np.nan / missing values in SplineTransformer

I think it would be quite natural to add an option to SplineTransformer to accept inputs with missing values as follows:

handle_missing="error": the default (keeps the current behavior),
handle_missing="zero"/"constant": encode missing values by setting all output features for that input column to 0 (or some other constant, see discussion below),
handle_missing="indicator": append an extra binary feature as a missingness indicator and encode missing values as 0 on the remaining output features.

Note that handle_missing="indicator" would differ from, and be statistically more meaningful than, SimpleImputer(strategy="mean", add_indicator=True) followed by SplineTransformer, and it would also make for leaner ML pipelines (better UX).

I am not sure if we need to add the handle_missing="zero" option. It would break the property that the output values of a given SplineTransformer encoding always sum to 1, while handle_missing="indicator" would preserve this property (in addition to making missingness more explicit to the downstream model in case missingness is informative one way or another).

If we want to preserve the “sum to 1” property without adding an explicit missingness indicator feature, maybe we could instead provide handle_missing="constant" (not sure about the name) that would encode missing values as 1 / n_outputs. I am not entirely sure whether this would result in a more interesting prior than the zero encoding.

About this issue

State: open
Created a year ago
Comments: 27 (27 by maintainers)

Most upvoted comments

I will take this issue.

StefanieSenger on Dec 9, 2023

Yes @lorentzenchr, I am working on this in the PR, where (to address your concerns) I have now removed the code that concatenates the indicator column to the rest of X_transform. We can still add this later if needed.

StefanieSenger on Mar 14, 2024

But I think there are two ways to go about adding “missing value support” here:

SplineTransformer(handle_missing="indicator") adds an indicator for missing values and returns 0 for all the splines if that’s the case

SplineTransformer(handle_missing="ignore") returns 0 on all splines and we add a MissingIndicator in the pipeline which would be equivalent to the former implementation. Then we could document how this can be implemented by the user.

I personally don’t know where I stand between the two options. But I think we need at least one of the two.

I think we should definitely implement at least option 1 (handle_missing="indicator"), because it makes it possible to explicitly model missingness as target-predictive conditional information.

That being said, making it possible to not add those explicit features (via handle_missing="ignore") could also be helpful to nudge the model towards not using the missingness pattern, and to avoid overfitting it when the modeler has prior knowledge that it should not be informative.

So, in hindsight, +1 for implementing both. Also, handle_missing="ignore" should be very easy to add once handle_missing="indicator" is implemented. @StefanieSenger feel free to focus only on handle_missing="indicator" in your current PR; we can defer the implementation of handle_missing="ignore" to follow-up work.

ogrisel on Mar 1, 2024

Ok, then let’s add missing value support.

lorentzenchr on Feb 29, 2024

I’m not sure if I understand your point here @lorentzenchr

The splines move the input into a very different space. The coefficient for what used to be 0 is now meaningful, because it’s not 0 anymore. But even if that were true, it would only hold for linear models. Other models can pick up signal from 0 and use it, so @ogrisel’s argument holds.

I read the whole thread with a fresh eye now, and I’m quite convinced that adding missing value support here makes a lot of sense.

But I think there are two ways to go about adding “missing value support” here:

SplineTransformer(handle_missing="indicator") adds an indicator for missing values and returns 0 for all the splines if that’s the case
SplineTransformer(handle_missing="ignore") returns 0 on all splines and we add a MissingIndicator in the pipeline which would be equivalent to the former implementation. Then we could document how this can be implemented by the user.

I personally don’t know where I stand between the two options. But I think we need at least one of the two.

adrinjalali on Feb 29, 2024