scikit-learn: Handle np.nan / missing values in SplineTransformer
I think it would be quite natural to add an option to SplineTransformer to accept inputs with missing values, as follows:
- handle_missing="error": the default (keep current behavior),
- handle_missing="zero" / "constant": encode missing values by setting all output features for that input column to 0 (or some other constant, see discussion below),
- handle_missing="indicator": append an extra binary feature as missingness indicator and encode missing values as 0 on the remaining output features.
Note that handle_missing="indicator" would be different and statistically more meaningful than SimpleImputer(strategy="mean", add_indicator=True) with SplineTransformer, and furthermore would make for leaner ML pipelines (better UX).
I am not sure if we need to add the handle_missing="zero" option. It would break the property that the output values of a given SplineTransformer encoding always sum to 1, while handle_missing="indicator" would preserve this property (in addition to making missingness more explicit to the downstream model in case missingness is informative one way or another).
If we want to preserve the “sum to 1” property while not adding an explicit missingness indicator feature, maybe we could instead provide handle_missing="constant" (not sure about the name) that would encode missing values as 1 / n_outputs to preserve the “sum to 1” property. Not entirely sure if this would result in a more interesting prior than the zero encoding.
About this issue
- State: open
- Created a year ago
- Comments: 27 (27 by maintainers)
I will take this issue.
Yes @lorentzenchr, I am working on this in the PR, where (to address your concerns) I have now removed the code that concatenated the indicator column to the rest of X_transform. We can still add this later if needed.
I think we should definitely implement at least 1 (handle_missing="indicator"), because it makes it possible to explicitly model missingness as target-predictive conditional information.
That being said, making it possible to not add those explicit features (via handle_missing="ignore") could also be helpful to nudge the model towards not using the missingness pattern and to avoid overfitting it when the modeler has prior knowledge that it should not be informative.
So +1 for implementing both in hindsight. Also, handle_missing="ignore" should be very easy to add once handle_missing="indicator" is implemented. @StefanieSenger feel free to only focus on handle_missing="indicator" in your current PR and we can defer the implementation of handle_missing="ignore" to follow-up work.
Ok, then let’s add missing value support.
I’m not sure if I understand your point here @lorentzenchr
The splines move the input into a very different space. The coefficient for what used to be 0 is now meaningful, because it’s not 0 anymore. But even if that were true, it would only hold for linear models. Other models can pick up signal from 0 and use that, so @ogrisel’s argument holds.
I read the whole thread with a fresh eye now, and I’m quite convinced that adding missing value support here makes a lot of sense.
But I think there are two ways to go about adding “missing value support” here:
1. SplineTransformer(handle_missing="indicator") adds an indicator for missing values and returns 0 for all the splines if that’s the case.
2. SplineTransformer(handle_missing="ignore") returns 0 on all splines and we add a MissingIndicator in the pipeline, which would be equivalent to the former implementation. Then we could document how this can be implemented by the user.
I personally don’t know where I stand between the two options. But I think we need at least one of the two.