ydata-profiling: Potential incompatibility with Pandas 1.4.0
Describe the bug
Pandas version 1.4.0 was released a few days ago and some tests started failing. I was able to reproduce the failure with a minimal example that fails with Pandas 1.4.0 and works with Pandas 1.3.5.
To Reproduce
import pandas as pd
import pandas_profiling
data = {"col1": [1, 2], "col2": [3, 4]}
dataframe = pd.DataFrame(data=data)
profile = pandas_profiling.ProfileReport(dataframe, minimal=False)
profile.to_html()
When running with Pandas 1.4.0, I get the following traceback:
Traceback (most recent call last):
File "/tmp/bug.py", line 8, in <module>
profile.to_html()
File "/vemv/lib/python3.9/site-packages/pandas_profiling/profile_report.py", line 368, in to_html
return self.html
File "/vemv/lib/python3.9/site-packages/pandas_profiling/profile_report.py", line 185, in html
self._html = self._render_html()
File "/vemv/lib/python3.9/site-packages/pandas_profiling/profile_report.py", line 287, in _render_html
report = self.report
File "/vemv/lib/python3.9/site-packages/pandas_profiling/profile_report.py", line 179, in report
self._report = get_report_structure(self.config, self.description_set)
File "/vemv/lib/python3.9/site-packages/pandas_profiling/profile_report.py", line 161, in description_set
self._description_set = describe_df(
File "/vemv/lib/python3.9/site-packages/pandas_profiling/model/describe.py", line 71, in describe
series_description = get_series_descriptions(
File "/vemv/lib/python3.9/site-packages/multimethod/__init__.py", line 303, in __call__
return func(*args, **kwargs)
File "/vemv/lib/python3.9/site-packages/pandas_profiling/model/pandas/summary_pandas.py", line 92, in pandas_get_series_descriptions
for i, (column, description) in enumerate(
File "/home/lothiraldan/.pyenv/versions/3.9.1/lib/python3.9/multiprocessing/pool.py", line 870, in next
raise value
File "/home/lothiraldan/.pyenv/versions/3.9.1/lib/python3.9/multiprocessing/pool.py", line 125, in worker
result = (True, func(*args, **kwds))
File "/vemv/lib/python3.9/site-packages/pandas_profiling/model/pandas/summary_pandas.py", line 72, in multiprocess_1d
return column, describe_1d(config, series, summarizer, typeset)
File "/vemv/lib/python3.9/site-packages/multimethod/__init__.py", line 303, in __call__
return func(*args, **kwargs)
File "/vemv/lib/python3.9/site-packages/pandas_profiling/model/pandas/summary_pandas.py", line 50, in pandas_describe_1d
return summarizer.summarize(config, series, dtype=vtype)
File "/vemv/lib/python3.9/site-packages/pandas_profiling/model/summarizer.py", line 37, in summarize
_, _, summary = self.handle(str(dtype), config, series, {"type": str(dtype)})
File "/vemv/lib/python3.9/site-packages/pandas_profiling/model/handler.py", line 62, in handle
return op(*args)
File "/vemv/lib/python3.9/site-packages/pandas_profiling/model/handler.py", line 21, in func2
return f(*res)
File "/vemv/lib/python3.9/site-packages/pandas_profiling/model/handler.py", line 21, in func2
return f(*res)
File "/vemv/lib/python3.9/site-packages/pandas_profiling/model/handler.py", line 21, in func2
return f(*res)
File "/vemv/lib/python3.9/site-packages/pandas_profiling/model/handler.py", line 17, in func2
res = g(*x)
File "/vemv/lib/python3.9/site-packages/multimethod/__init__.py", line 303, in __call__
return func(*args, **kwargs)
File "/vemv/lib/python3.9/site-packages/pandas_profiling/model/summary_algorithms.py", line 65, in inner
return fn(config, series, summary)
File "/vemv/lib/python3.9/site-packages/pandas_profiling/model/summary_algorithms.py", line 82, in inner
return fn(config, series, summary)
File "/vemv/lib/python3.9/site-packages/pandas_profiling/model/pandas/describe_categorical_pandas.py", line 205, in pandas_describe_categorical_1d
summary.update(length_summary_vc(value_counts))
File "/vemv/lib/python3.9/site-packages/pandas_profiling/model/pandas/describe_categorical_pandas.py", line 162, in length_summary_vc
"median_length": weighted_median(
File "/vemv/lib/python3.9/site-packages/pandas_profiling/model/pandas/utils_pandas.py", line 13, in weighted_median
w_median = (data[weights == np.max(weights)])[0]
IndexError: only integers, slices (`:`), ellipsis (`...`), numpy.newaxis (`None`) and integer or boolean arrays are valid indices
If I change minimal from False to True, the script passes.
Version information:
Failing environment
Python version: Python 3.9.1
Pip version: pip 21.3.1
Pandas and pandas-profiling versions: 1.4.0 | 3.1.0
Full pip list:
Package Version
--------------------- ---------
attrs 21.4.0
certifi 2021.10.8
charset-normalizer 2.0.10
cycler 0.11.0
fonttools 4.28.5
htmlmin 0.1.12
idna 3.3
ImageHash 4.2.1
Jinja2 3.0.3
joblib 1.0.1
kiwisolver 1.3.2
MarkupSafe 2.0.1
matplotlib 3.5.1
missingno 0.5.0
multimethod 1.6
networkx 2.6.3
numpy 1.22.1
packaging 21.3
pandas 1.4.0
pandas-profiling 3.1.0
phik 0.12.0
Pillow 9.0.0
pip 21.3.1
pydantic 1.9.0
pyparsing 3.0.7
python-dateutil 2.8.2
pytz 2021.3
PyWavelets 1.2.0
PyYAML 6.0
requests 2.27.1
scipy 1.7.3
seaborn 0.11.2
setuptools 60.0.5
six 1.16.0
tangled-up-in-unicode 0.1.0
tqdm 4.62.3
typing_extensions 4.0.1
urllib3 1.26.8
visions 0.7.4
wheel 0.37.1
Working environment
Python version: Python 3.9.1
Pip version: pip 21.3.1
Pandas and pandas-profiling versions: 1.3.5 | 3.1.0
Full pip list:
Package Version
--------------------- ---------
attrs 21.4.0
certifi 2021.10.8
charset-normalizer 2.0.10
cycler 0.11.0
fonttools 4.28.5
htmlmin 0.1.12
idna 3.3
ImageHash 4.2.1
Jinja2 3.0.3
joblib 1.0.1
kiwisolver 1.3.2
MarkupSafe 2.0.1
matplotlib 3.5.1
missingno 0.5.0
multimethod 1.6
networkx 2.6.3
numpy 1.22.1
packaging 21.3
pandas 1.3.5
pandas-profiling 3.1.0
phik 0.12.0
Pillow 9.0.0
pip 21.3.1
pydantic 1.9.0
pyparsing 3.0.7
python-dateutil 2.8.2
pytz 2021.3
PyWavelets 1.2.0
PyYAML 6.0
requests 2.27.1
scipy 1.7.3
seaborn 0.11.2
setuptools 60.0.5
six 1.16.0
tangled-up-in-unicode 0.1.0
tqdm 4.62.3
typing_extensions 4.0.1
urllib3 1.26.8
visions 0.7.4
wheel 0.37.1
Let me know if I can provide more details and thank you for your good work!
About this issue
- State: closed
- Created 2 years ago
- Reactions: 25
- Comments: 15 (1 by maintainers)
Commits related to this issue
- fix: pandas 1.4.1 compatibility #911 fix (#945) — committed to ydataai/ydata-profiling by endremborza 2 years ago
I investigated a bit and I think I identified a behavior change on the following line: https://github.com/pandas-profiling/pandas-profiling/blob/eac60a0b4e9a278a0ca44d8c712a599bdb41ec71/src/pandas_profiling/model/pandas/describe_categorical_pandas.py#L175
With pandas 1.3.5, length.index is Index([1, 1], dtype='object', name='col1'), while with pandas 1.4.0 it is Index([1, 1], dtype='Int64', name='col1').
Later in the code, at https://github.com/pandas-profiling/pandas-profiling/blob/eac60a0b4e9a278a0ca44d8c712a599bdb41ec71/src/pandas_profiling/model/pandas/utils_pandas.py#L13, weights == np.max(weights) is an instance of <class 'numpy.ndarray'> with pandas 1.3.5, while with pandas 1.4.0 it is now an instance of <class 'pandas.core.arrays.boolean.BooleanArray'>.
Taking a look at the release notes, it might be related to https://pandas.pydata.org/pandas-docs/stable/whatsnew/v1.4.0.html#index-can-hold-arbitrary-extensionarrays.
I am not sure what the correct fix is, or whether you plan to support both versions of pandas. Let me know if I can provide more help.
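The type difference described above can be checked in isolation. This is a minimal sketch, not code from pandas-profiling: the weights and data arrays are hypothetical stand-ins, and it assumes the weights reach the comparison as a nullable Int64 extension array. Whether NumPy accepts the resulting BooleanArray directly as an index seems to depend on the pandas and NumPy versions installed, so the sketch only shows the result type and an explicit conversion.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the weights in the report: a nullable Int64
# extension array rather than a plain NumPy array.
weights = pd.array([1, 1], dtype="Int64")

# Comparing a nullable array yields a pandas BooleanArray, not an np.ndarray.
mask = weights == 1
print(type(mask).__name__)

# Converting explicitly gives NumPy an ordinary boolean mask to index with.
np_mask = np.asarray(mask, dtype=bool)
data = np.array([10, 20])
print(data[np_mask][0])
```

The explicit np.asarray(..., dtype=bool) call only works when the mask contains no missing values, which holds for this example.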
This is a weird one… A simple fix is to modify line 13 identified by @Lothiraldan to use np.where (as is done in the else condition below) i.e.
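The snippet from this comment is not preserved here. As a rough illustration only, an np.where-based lookup of the kind described might look like the sketch below. The function name first_value_at_max_weight is hypothetical and this is not the actual patch applied to weighted_median; it only shows selecting by integer position instead of by boolean mask.

```python
import numpy as np


def first_value_at_max_weight(data, weights):
    """Return the first entry of `data` whose weight equals the maximum weight."""
    data = np.asarray(data)
    weights = np.asarray(weights)
    # np.where returns integer positions, so the lookup does not rely on the
    # comparison result being usable directly as a boolean index.
    positions = np.where(weights == np.max(weights))[0]
    return data[positions[0]]


print(first_value_at_max_weight([3, 5, 4], [1, 7, 2]))  # 5
```

Indexing by position sidesteps the question of what type the == comparison returns, which is the part that changed between pandas versions in this report.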
I haven’t investigated too deeply, but with the above change tests appear to pass on pandas 1.4.0+. For some reason the result of the == comparison in this case isn’t producing a boolean ndarray.
I am also encountering this issue with pandas 1.4.1, numpy 1.21.5, pandas-profiling 3.1.0 in a Python 3.8 venv on a MacBook running Monterey.
I am having the same problem with pandas 1.4.1, numpy 1.22.1, pandas-profiling 3.1.0.
Encountering the same issue with pandas_profiling 3.1.0, pandas 1.4.1, numpy 1.21.2 on Monterey with a conda environment running Python 3.10.2.
I am encountering this issue with pandas 1.2.4, pandas_profiling 3.1.0, numpy 1.20.1 in Python 3.8.8 on Ubuntu 20.04 LTS.