Boruta-Shap: [BUG] BorutaSHAP.py load Boston Import Error
Describe the bug
Calling load_boston in Boruta.py leads to an import error when importing the package. This is because scikit-learn version 1.2 and above removed the Boston dataset.
To Reproduce
Steps to reproduce the behavior: from BorutaShap import BorutaShap
Expected behavior
Package would import normally
Output
ImportError                               Traceback (most recent call last)
<ipython-input-17-f722359accd6> in <module>
----> 1 from BorutaShap import BorutaShap

/usr/local/lib/python3.8/dist-packages/sklearn/datasets/__init__.py in __getattr__(name)
    154         """
    155     )
--> 156     raise ImportError(msg)
    157 try:
    158     return globals()[name]

ImportError: `load_boston` has been removed from scikit-learn since version 1.2.
The Boston housing prices dataset has an ethical problem: as investigated in [1], the authors of this dataset engineered a non-invertible variable "B" assuming that racial self-segregation had a positive impact on house prices [2]. Furthermore, the goal of the research that led to the creation of this dataset was to study the impact of air quality, but it did not give adequate demonstration of the validity of this assumption. The scikit-learn maintainers therefore strongly discourage the use of this dataset unless the purpose of the code is to study and educate about ethical issues in data science and machine learning.
In this special case, you can fetch the dataset from the original source::
import pandas as pd
import numpy as np
data_url = "http://lib.stat.cmu.edu/datasets/boston"
raw_df = pd.read_csv(data_url, sep="\s+", skiprows=22, header=None)
data = np.hstack([raw_df.values[::2, :], raw_df.values[1::2, :2]])
target = raw_df.values[1::2, 2]
Alternative datasets include the California housing dataset and the Ames housing dataset. You can load the datasets as follows::
from sklearn.datasets import fetch_california_housing
housing = fetch_california_housing()
for the California housing dataset and::
from sklearn.datasets import fetch_openml
housing = fetch_openml(name="house_prices", as_frame=True)
for the Ames housing dataset.
[1] M Carlisle. “Racist data destruction?” https://medium.com/@docintangible/racist-data-destruction-113e3eff54a8
[2] Harrison Jr, David, and Daniel L. Rubinfeld. “Hedonic housing prices and the demand for clean air.” Journal of environmental economics and management 5.1 (1978): 81-102. https://www.researchgate.net/publication/4974606_Hedonic_housing_prices_and_the_demand_for_clean_air
About this issue
- Original URL
- State: closed
- Created 2 years ago
- Reactions: 8
- Comments: 24 (4 by maintainers)
Commits related to this issue
- in response to [BUG] BorutaSHAP.py load Boston Import Error #111 Scikit-learn >1.2 do not support the use of the Boston dataset from sklearn.datasets. This problem was raised back in december: https:... — committed to IanWord/Boruta-Shap by IanWord a year ago
- Merge pull request #114 from IanWord/patch-1 in response to [BUG] BorutaSHAP.py load Boston Import Error #111 — committed to Ekeany/Boruta-Shap by Ekeany a year ago
@IanWord, you mentioned above that the issue is solved, but when I install the version that is currently in PyPI, the bug appears. Could you update the version in PyPI, please? Thank you!
I am doing my master's thesis and would not mind using the functionality of this package combined with newer scikit-learn capabilities. I have added a proposed solution to the problem, merely replacing load_boston() with load_diabetes(). Hopefully the maintainers will react to the issue, whether they accept my suggestion or do something else.
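The proposed one-line swap can be sketched as below. This is an illustrative example, not the exact patch: `load_regression_demo` is a hypothetical helper name, but `load_diabetes` is a real scikit-learn loader that still ships in 1.2+ and returns a regression dataset in the same Bunch format `load_boston` used to.

```python
from sklearn.datasets import load_diabetes  # drop-in replacement for the removed load_boston
import pandas as pd

def load_regression_demo():
    # Bunch with .data (features), .target (continuous outcome), .feature_names
    bunch = load_diabetes()
    X = pd.DataFrame(bunch.data, columns=bunch.feature_names)
    y = bunch.target
    return X, y

X, y = load_regression_demo()
print(X.shape)  # (442, 10)
```

Any code that only needs "some regression dataset" for demos or tests should work unchanged after the swap.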
New version 1.0.17 uploaded to PyPI with the changes discussed here.
The workaround is to install the tool from a local clone of the repo.
First, git clone the repo. Then, enter the main folder BorutaShap
and install using pip install -e .
At that point the package works fine. I'm on an M1 Mac, by the way, and have had no problems so far.
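The steps above can be sketched as the following shell session. The repository URL is assumed from the Ekeany/Boruta-Shap references elsewhere in this thread; adjust it if you are installing from a fork.

```shell
# Clone the repo (URL assumed from this thread's commit references)
git clone https://github.com/Ekeany/Boruta-Shap.git
cd Boruta-Shap

# Editable install: local edits to the source take effect immediately
pip install -e .

# Sanity check: the import that previously failed should now succeed
python -c "from BorutaShap import BorutaShap"
```

An editable install is useful here because you can patch load_boston out of the local source yourself if the cloned revision still references it.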
@IanWord, this code is from GitHub. During the class definition, the Boston data part was changed to the diabetes data.
!pip install shap
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor, IsolationForest
from sklearn.datasets import load_breast_cancer, load_diabetes
from statsmodels.stats.multitest import multipletests
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.cluster import KMeans
from scipy.sparse import issparse
from scipy.stats import binom_test, ks_2samp
import matplotlib.pyplot as plt
from tqdm.auto import tqdm
import random
import pandas as pd
import numpy as np
from numpy.random import choice
import seaborn as sns
import shap
import os
import re
import warnings
warnings.filterwarnings("ignore")
class BorutaShap:

    def load_data(data_type='classification'):
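The snippet above is truncated, so here is a hedged, module-level sketch of how the patched load_data helper might look after the fix discussed in this thread: the classification branch keeps load_breast_cancer, while the regression branch uses load_diabetes in place of the removed load_boston. The exact body in the merged patch may differ.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer, load_diabetes

def load_data(data_type='classification'):
    """Return (X, y) for a built-in demo dataset.

    Sketch of the patched helper: 'regression' formerly mapped to
    load_boston, which scikit-learn >= 1.2 removed.
    """
    if data_type == 'classification':
        bunch = load_breast_cancer()
    elif data_type == 'regression':
        bunch = load_diabetes()  # replaces the removed load_boston
    else:
        raise ValueError("data_type must be 'classification' or 'regression'")
    X = pd.DataFrame(bunch.data, columns=bunch.feature_names)
    y = bunch.target
    return X, y
```

Both loaders return a Bunch with .data, .target, and .feature_names, so the surrounding BorutaShap code needs no other changes.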