scikit-learn: Potential error caused by different column order
Description
Sometimes it is convenient to first build a model on a recent dataset, save it as a .pkl file, and later apply the model to new data. However, in a recent project my friends and I found that the results turned quite weird after applying the .pkl file to a new dataset. We had implemented a binary classifier, and the predicted probability distribution changed from unimodal to bimodal. Eventually we found the cause: the column order of the new dataset differed from that of the old one, so the predictions were totally wrong. I checked the source code and discovered that sklearn's fit function does not save the column names during model training, so there is no way to check whether the columns are consistent at prediction time. We think it would be better if the column names were saved during training and then used to validate the columns during prediction.
Steps/Code to Reproduce
# For simplicity, consider a very simple case
from sklearn.datasets import load_iris
import pandas as pd
#make a dataframe
iris = load_iris()
X, y = iris.data[:-1,:], iris.target[:-1]
iris_pd = pd.DataFrame(X)
iris_pd.columns = iris.feature_names
iris_pd['target'] = y
from sklearn.cross_validation import train_test_split
train, test = train_test_split(iris_pd, test_size=0.3)
feature_columns_train = ['sepal length (cm)','sepal width (cm)','petal length (cm)','petal width (cm)']
feature_columns_test = ['sepal length (cm)','sepal width (cm)','petal width (cm)','petal length (cm)']
from sklearn.linear_model import LogisticRegression
lg = LogisticRegression(n_jobs=4, random_state=123, verbose=0, penalty='l1', C=1.0)
lg.fit(train[feature_columns_train], train['target'])
prob1 = lg.predict_proba(test[feature_columns_train])
prob2 = lg.predict_proba(test[feature_columns_test])
Expected Results
Because feature_columns_test differs from feature_columns_train, it is not surprising that prob1 is totally different from prob2; prob1 is the correct result.
prob1[:5] =
array([[ 3.89507414e-04, 3.20099743e-01, 6.79510750e-01],
[ 4.63256526e-04, 4.65385156e-01, 5.34151587e-01],
[ 8.79704318e-01, 1.20295572e-01, 1.10268420e-07],
[ 7.80611983e-01, 2.19385827e-01, 2.19046022e-06],
[ 2.78898454e-02, 7.77243988e-01, 1.94866167e-01]])
Actual Results
prob2[:5] =
array([[ 4.36321678e-01, 2.25057553e-04, 5.63453265e-01],
[ 4.92513658e-01, 1.76391882e-05, 5.07468703e-01],
[ 9.92946715e-01, 7.05167151e-03, 1.61346947e-06],
[ 9.83726756e-01, 1.62387090e-02, 3.45348884e-05],
[ 5.01392274e-01, 5.37144591e-04, 4.98070581e-01]])
Versions
Linux-2.6.32-642.1.1.el6.x86_64-x86_64-with-redhat-6.7-Santiago
('Python', '2.7.11 |Anaconda 2.4.1 (64-bit)| (default, Dec 6 2015, 18:08:32) \n[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)]')
('NumPy', '1.10.1')
('SciPy', '0.16.0')
('Scikit-Learn', '0.17')
The probable solution
I also implemented a very simple solution. Hope this helps. 😃
class SafeLogisticRegression(LogisticRegression):
    def fit(self, X, y, sample_weight=None):
        # Remember the training column order for later checks.
        self.columns = X.columns
        LogisticRegression.fit(self, X, y, sample_weight=sample_weight)
        return self

    def predict_proba(self, X):
        new_columns = list(X.columns)
        old_columns = list(self.columns)
        if new_columns != old_columns:
            if len(new_columns) == len(old_columns):
                try:
                    X = X[old_columns]
                    print "The order of columns has changed. Fixed."
                except:
                    raise ValueError('The columns has changed. Please check.')
            else:
                raise ValueError('The number of columns has changed.')
        return LogisticRegression.predict_proba(self, X)
Then apply this new class:
slg = SafeLogisticRegression(n_jobs=4, random_state=123, verbose=0, penalty='l1', C=1.0)
slg.fit(train[feature_columns_train], train['target'])
Test one: if the column order is changed
prob1 = slg.predict_proba(test[feature_columns_train])
prob2 = slg.predict_proba(test[feature_columns_test])
#The order of columns has changed. Fixed.
Result for test one:
prob1[:5] =
array([[ 3.89507414e-04, 3.20099743e-01, 6.79510750e-01],
[ 4.63256526e-04, 4.65385156e-01, 5.34151587e-01],
[ 8.79704318e-01, 1.20295572e-01, 1.10268420e-07],
[ 7.80611983e-01, 2.19385827e-01, 2.19046022e-06],
[ 2.78898454e-02, 7.77243988e-01, 1.94866167e-01]])
prob2[:5] =
array([[ 3.89507414e-04, 3.20099743e-01, 6.79510750e-01],
[ 4.63256526e-04, 4.65385156e-01, 5.34151587e-01],
[ 8.79704318e-01, 1.20295572e-01, 1.10268420e-07],
[ 7.80611983e-01, 2.19385827e-01, 2.19046022e-06],
[ 2.78898454e-02, 7.77243988e-01, 1.94866167e-01]])
Test two: if the column names are different
Simulate by changing one of the column names
prob3 = slg.predict_proba(test[feature_columns_train].rename(columns={'sepal width (cm)': 'sepal wid (cm)'}))
error message:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-47-84cea68536fe> in <module>()
----> 1 prob3 = slg.predict_proba(test[feature_columns_train].rename(columns={'sepal width (cm)': 'sepal wid (cm)'}))
<ipython-input-21-c3000b030a21> in predict_proba(self, X)
12 print "The order of columns has changed. Fixed."
13 except:
---> 14 raise ValueError('The columns has changed. Please check.')
15 else:
16 raise ValueError('The number of columns has changed.')
ValueError: The columns has changed. Please check.
Test three: if the number of columns changes
Simulate by dropping one column
prob4 = slg.predict_proba(test[feature_columns_train].drop(['sepal width (cm)'], axis=1))
error message:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-48-47c63ae1ac22> in <module>()
----> 1 prob4 = slg.predict_proba(test[feature_columns_train].drop(['sepal width (cm)'], axis=1))
<ipython-input-21-c3000b030a21> in predict_proba(self, X)
14 raise ValueError('The columns has changed. Please check.')
15 else:
---> 16 raise ValueError('The number of columns has changed.')
17 return LogisticRegression.predict_proba(self, X)
ValueError: The number of columns has changed.
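Until scikit-learn performs such a check itself, a lightweight workaround that needs no subclassing is to keep the training column list alongside the model and reindex incoming frames before predicting. A minimal sketch (the explicit `train_columns` list stands in for whatever you persisted with the .pkl file):

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

iris = load_iris()
train_columns = list(iris.feature_names)  # remembered from training time
X = pd.DataFrame(iris.data, columns=train_columns)
clf = LogisticRegression(max_iter=1000).fit(X, iris.target)

# At prediction time, the incoming frame may arrive with its columns shuffled.
shuffled = X[train_columns[::-1]]

# Reindex back to the training order; a missing column would become all-NaN,
# so check for that before predicting.
aligned = shuffled.reindex(columns=train_columns)
if aligned.isnull().values.any():
    raise ValueError("Some training columns are missing at prediction time.")

probs = clf.predict_proba(aligned)
```

This catches both the reordered-columns and the missing-column cases from the tests above, at the cost of having to carry `train_columns` around yourself.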
About this issue
- Original URL
- State: closed
- Created 8 years ago
- Reactions: 19
- Comments: 46 (29 by maintainers)
I just wanted to add that this feature is extremely important out in the real-world where the model building process and the prediction process (exposed to users) often are completely separate.
The prediction process should not need to know about feature selection, feature order, and so forth. It should just take all the features for the specific problem, and the estimator knows which ones it needs and in what order. This is how it works in other tools; it's extremely convenient and, in fact, the reason I prefer those tools over sklearn. Sklearn, however, offers more estimator types and better performance (in both speed and prediction quality, even for the same model types), so it would be great to have this here too.
I would imagine this happens either through pandas column names or, if you pass in numpy arrays, fit and predict could take an additional optional parameter column_names (not that it's great to have to create this list). If no column names are present, it works as it does now.
@pjcpjc warning is an option, but this would be very, very loud. Many people are passing dataframes. I guess it depends a bit on whether we want them to pass dataframes or not. One problem is that if we warn, that basically tells users to use `.values` everywhere to convert to numpy arrays so they don't get a warning. But then this error becomes silent again, so we have gained nothing; we just made users change their code so that we can never hope to catch the error. Also, I think having the estimator know the feature names is helpful for model inspection.
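As a sketch of the model-inspection benefit: if the estimator remembered its training columns, coefficients could be reported by feature name instead of by position. Here the column list is kept by hand purely to illustrate the idea; nothing below relies on scikit-learn storing it.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

iris = load_iris()
X = pd.DataFrame(iris.data, columns=iris.feature_names)
clf = LogisticRegression(max_iter=1000).fit(X, iris.target)

# The list the estimator would ideally remember for us.
columns = list(X.columns)

# One row of named coefficients per class, instead of a bare ndarray.
coef_by_name = pd.DataFrame(clf.coef_, columns=columns)
```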
I think a wrapper to ensure column order is maintained between fit and predict/transform would be a good idea.
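A generic version of that wrapper idea, applicable to any estimator rather than just LogisticRegression, might look like the following. `ColumnOrderWrapper` is a hypothetical name, not part of scikit-learn; it records the DataFrame columns at fit time, silently restores the order when only the order differs, and raises when the column sets differ.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

class ColumnOrderWrapper(object):
    """Hypothetical wrapper: remembers the DataFrame columns seen at fit
    time and realigns (or rejects) the columns of frames passed later."""

    def __init__(self, estimator):
        self.estimator = estimator

    def fit(self, X, y=None, **fit_params):
        self.columns_ = list(X.columns)
        self.estimator.fit(X, y, **fit_params)
        return self

    def _align(self, X):
        if list(X.columns) == self.columns_:
            return X
        if sorted(X.columns) != sorted(self.columns_):
            raise ValueError("Columns differ from those seen at fit time.")
        # Same columns, different order: restore the fit-time order.
        return X[self.columns_]

    def predict(self, X):
        return self.estimator.predict(self._align(X))

    def predict_proba(self, X):
        return self.estimator.predict_proba(self._align(X))

iris = load_iris()
X = pd.DataFrame(iris.data, columns=iris.feature_names)
wrapped = ColumnOrderWrapper(LogisticRegression(max_iter=1000)).fit(X, iris.target)

# Shuffled column order now yields the same probabilities as the original.
p_ordered = wrapped.predict_proba(X)
p_shuffled = wrapped.predict_proba(X[list(X.columns)[::-1]])
```

The same `_align` check would apply to a `transform` method for transformers.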
Scikit-learn currently has no concept of column labels. I agree this would be a nice-to-have feature, but don't hold your breath for it. Storing the column names at all would be an important first step, which I hope to achieve soonish. This will not be available in 0.18, and the consistency check probably not in 0.19 either. Scikit-learn operates on numpy arrays, which don't (currently) have column names. This might change in the future, and we might add better integration with pandas.