ydata-profiling: Potential incompatibility with Pandas 1.4.0

Describe the bug

Pandas version 1.4.0 was released a few days ago and some tests started failing. I was able to reproduce the failure with a minimal example that fails with Pandas 1.4.0 and works with Pandas 1.3.5.

To Reproduce

import pandas as pd
import pandas_profiling

data = {"col1": [1, 2], "col2": [3, 4]}
dataframe = pd.DataFrame(data=data)

profile = pandas_profiling.ProfileReport(dataframe, minimal=False)
profile.to_html()

When running with Pandas 1.4.0, I get the following traceback:

Traceback (most recent call last):
  File "/tmp/bug.py", line 8, in <module>
    profile.to_html()
  File "/vemv/lib/python3.9/site-packages/pandas_profiling/profile_report.py", line 368, in to_html
    return self.html
  File "/vemv/lib/python3.9/site-packages/pandas_profiling/profile_report.py", line 185, in html
    self._html = self._render_html()
  File "/vemv/lib/python3.9/site-packages/pandas_profiling/profile_report.py", line 287, in _render_html
    report = self.report
  File "/vemv/lib/python3.9/site-packages/pandas_profiling/profile_report.py", line 179, in report
    self._report = get_report_structure(self.config, self.description_set)
  File "/vemv/lib/python3.9/site-packages/pandas_profiling/profile_report.py", line 161, in description_set
    self._description_set = describe_df(
  File "/vemv/lib/python3.9/site-packages/pandas_profiling/model/describe.py", line 71, in describe
    series_description = get_series_descriptions(
  File "/vemv/lib/python3.9/site-packages/multimethod/__init__.py", line 303, in __call__
    return func(*args, **kwargs)
  File "/vemv/lib/python3.9/site-packages/pandas_profiling/model/pandas/summary_pandas.py", line 92, in pandas_get_series_descriptions
    for i, (column, description) in enumerate(
  File "/home/lothiraldan/.pyenv/versions/3.9.1/lib/python3.9/multiprocessing/pool.py", line 870, in next
    raise value
  File "/home/lothiraldan/.pyenv/versions/3.9.1/lib/python3.9/multiprocessing/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "/vemv/lib/python3.9/site-packages/pandas_profiling/model/pandas/summary_pandas.py", line 72, in multiprocess_1d
    return column, describe_1d(config, series, summarizer, typeset)
  File "/vemv/lib/python3.9/site-packages/multimethod/__init__.py", line 303, in __call__
    return func(*args, **kwargs)
  File "/vemv/lib/python3.9/site-packages/pandas_profiling/model/pandas/summary_pandas.py", line 50, in pandas_describe_1d
    return summarizer.summarize(config, series, dtype=vtype)
  File "/vemv/lib/python3.9/site-packages/pandas_profiling/model/summarizer.py", line 37, in summarize
    _, _, summary = self.handle(str(dtype), config, series, {"type": str(dtype)})
  File "/vemv/lib/python3.9/site-packages/pandas_profiling/model/handler.py", line 62, in handle
    return op(*args)
  File "/vemv/lib/python3.9/site-packages/pandas_profiling/model/handler.py", line 21, in func2
    return f(*res)
  File "/vemv/lib/python3.9/site-packages/pandas_profiling/model/handler.py", line 21, in func2
    return f(*res)
  File "/vemv/lib/python3.9/site-packages/pandas_profiling/model/handler.py", line 21, in func2
    return f(*res)
  File "/vemv/lib/python3.9/site-packages/pandas_profiling/model/handler.py", line 17, in func2
    res = g(*x)
  File "/vemv/lib/python3.9/site-packages/multimethod/__init__.py", line 303, in __call__
    return func(*args, **kwargs)
  File "/vemv/lib/python3.9/site-packages/pandas_profiling/model/summary_algorithms.py", line 65, in inner
    return fn(config, series, summary)
  File "/vemv/lib/python3.9/site-packages/pandas_profiling/model/summary_algorithms.py", line 82, in inner
    return fn(config, series, summary)
  File "/vemv/lib/python3.9/site-packages/pandas_profiling/model/pandas/describe_categorical_pandas.py", line 205, in pandas_describe_categorical_1d
    summary.update(length_summary_vc(value_counts))
  File "/vemv/lib/python3.9/site-packages/pandas_profiling/model/pandas/describe_categorical_pandas.py", line 162, in length_summary_vc
    "median_length": weighted_median(
  File "/vemv/lib/python3.9/site-packages/pandas_profiling/model/pandas/utils_pandas.py", line 13, in weighted_median
    w_median = (data[weights == np.max(weights)])[0]
IndexError: only integers, slices (`:`), ellipsis (`...`), numpy.newaxis (`None`) and integer or boolean arrays are valid indices

If I change minimal from False to True, the script runs without error.

Version information:

Failing environment

Python version: Python 3.9.1
Pip version: pip 21.3.1
Pandas and pandas-profiling versions: 1.4.0 | 3.1.0
Full pip list:

Package               Version
--------------------- ---------
attrs                 21.4.0
certifi               2021.10.8
charset-normalizer    2.0.10
cycler                0.11.0
fonttools             4.28.5
htmlmin               0.1.12
idna                  3.3
ImageHash             4.2.1
Jinja2                3.0.3
joblib                1.0.1
kiwisolver            1.3.2
MarkupSafe            2.0.1
matplotlib            3.5.1
missingno             0.5.0
multimethod           1.6
networkx              2.6.3
numpy                 1.22.1
packaging             21.3
pandas                1.4.0
pandas-profiling      3.1.0
phik                  0.12.0
Pillow                9.0.0
pip                   21.3.1
pydantic              1.9.0
pyparsing             3.0.7
python-dateutil       2.8.2
pytz                  2021.3
PyWavelets            1.2.0
PyYAML                6.0
requests              2.27.1
scipy                 1.7.3
seaborn               0.11.2
setuptools            60.0.5
six                   1.16.0
tangled-up-in-unicode 0.1.0
tqdm                  4.62.3
typing_extensions     4.0.1
urllib3               1.26.8
visions               0.7.4
wheel                 0.37.1

Working environment

Python version: Python 3.9.1
Pip version: pip 21.3.1
Pandas and pandas-profiling versions: 1.3.5 | 3.1.0
Full pip list:

Package               Version
--------------------- ---------
attrs                 21.4.0
certifi               2021.10.8
charset-normalizer    2.0.10
cycler                0.11.0
fonttools             4.28.5
htmlmin               0.1.12
idna                  3.3
ImageHash             4.2.1
Jinja2                3.0.3
joblib                1.0.1
kiwisolver            1.3.2
MarkupSafe            2.0.1
matplotlib            3.5.1
missingno             0.5.0
multimethod           1.6
networkx              2.6.3
numpy                 1.22.1
packaging             21.3
pandas                1.3.5
pandas-profiling      3.1.0
phik                  0.12.0
Pillow                9.0.0
pip                   21.3.1
pydantic              1.9.0
pyparsing             3.0.7
python-dateutil       2.8.2
pytz                  2021.3
PyWavelets            1.2.0
PyYAML                6.0
requests              2.27.1
scipy                 1.7.3
seaborn               0.11.2
setuptools            60.0.5
six                   1.16.0
tangled-up-in-unicode 0.1.0
tqdm                  4.62.3
typing_extensions     4.0.1
urllib3               1.26.8
visions               0.7.4
wheel                 0.37.1

Let me know if I can provide more details and thank you for your good work!

About this issue

  • State: closed
  • Created 2 years ago
  • Reactions: 25
  • Comments: 15 (1 by maintainers)

Most upvoted comments

I investigated a bit and I think I identified a behavior change on the following line: https://github.com/pandas-profiling/pandas-profiling/blob/eac60a0b4e9a278a0ca44d8c712a599bdb41ec71/src/pandas_profiling/model/pandas/describe_categorical_pandas.py#L175

With pandas 1.3.5, length.index is Index([1, 1], dtype='object', name='col1') while with pandas 1.4.0, it is Index([1, 1], dtype='Int64', name='col1').

Later in the code, at https://github.com/pandas-profiling/pandas-profiling/blob/eac60a0b4e9a278a0ca44d8c712a599bdb41ec71/src/pandas_profiling/model/pandas/utils_pandas.py#L13, weights == np.max(weights) is an instance of <class 'numpy.ndarray'> with pandas 1.3.5, while with pandas 1.4.0 it is an instance of <class 'pandas.core.arrays.boolean.BooleanArray'>.

Looking at the release notes, it might be related to https://pandas.pydata.org/pandas-docs/stable/whatsnew/v1.4.0.html#index-can-hold-arbitrary-extensionarrays.

I am not sure what the correct fix is, or whether you plan to support both versions of pandas. Let me know if I can provide more help.
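The dtype difference described above can be observed in isolation. This is only a sketch: it uses a Series as a stand-in for the Index in the report (an assumption made so it runs on pandas versions before and after 1.4.0), and the variable names are illustrative, not taken from pandas-profiling.

```python
import pandas as pd

# Illustrative values only; a Series stands in for the Index from the report.
object_lengths = pd.Series([1, 1], dtype="object")
nullable_lengths = pd.Series([1, 1], dtype="Int64")

# Comparing an object-dtype Series gives a NumPy-backed boolean result.
object_mask = object_lengths == 1

# Comparing a nullable Int64 Series gives the nullable "boolean" dtype,
# which is backed by a pandas BooleanArray rather than a NumPy ndarray.
nullable_mask = nullable_lengths == 1

print(object_mask.dtype)
print(nullable_mask.dtype)
print(isinstance(nullable_mask.array, pd.arrays.BooleanArray))
```

Checking the dtype of the comparison result this way is a quick test for whether a mask is safe to hand to plain NumPy indexing.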

This is a weird one… A simple fix is to modify line 13 identified by @Lothiraldan to use np.where (as is done in the else condition below) i.e.

w_median = (data[np.where(weights == np.max(weights))])[0]

I haven’t investigated much beyond the fix, but with the above change the tests appear to pass on pandas 1.4.0+. For some reason the result of the == comparison in this case isn’t a boolean ndarray.
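For anyone who wants to try the idea outside the library, here is a minimal standalone sketch of the np.where approach. The helper name first_value_at_max_weight is hypothetical and not part of pandas-profiling. Coercing the weights to a float64 NumPy array first is an extra assumption beyond what was suggested above, and it only holds when the weights contain no missing values.

```python
import numpy as np
import pandas as pd


def first_value_at_max_weight(data, weights):
    """Return the first element of `data` whose weight equals the max weight.

    Hypothetical helper for illustration; not the pandas-profiling code.
    """
    data = np.asarray(data)
    # Assumption: weights have no missing values, so a nullable pandas dtype
    # can be converted to a plain float64 NumPy array.
    weights = np.asarray(weights, dtype="float64")
    # np.where turns the boolean mask into integer positions, which NumPy
    # always accepts as an index.
    positions = np.where(weights == weights.max())[0]
    return data[positions][0]


values = np.array([10, 20, 30])
nullable_weights = pd.Series([1, 5, 2], dtype="Int64")
print(first_value_at_max_weight(values, nullable_weights))
```

Because the mask is built from a plain NumPy array, the result does not depend on which array type pandas returns for the comparison.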

I am also encountering this issue with pandas 1.4.1, numpy 1.21.5, and pandas-profiling 3.1.0 in a Python 3.8 venv on a MacBook running Monterey.

I am having the same problem with pandas 1.4.1, numpy 1.22.1, and pandas-profiling 3.1.0.

Encountering the same issue with pandas_profiling 3.1.0, pandas 1.4.1, numpy 1.21.2 on Monterey with a conda environment running Python 3.10.2

I am encountering this issue with pandas 1.2.4, pandas_profiling 3.1.0, numpy 1.20.1 in python 3.8.8 on Ubuntu 20.04 LTS.