pandas: BUG: interchange bitmasks not supported in interchange/from_dataframe.py
Pandas version checks
- I have checked that this issue has not already been reported.
- I have confirmed this bug exists on the latest version of pandas.
- I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
Sorry, this is not easily reproducible, as the dataframe interchange protocol for pyarrow is still a work in progress, but I think the error is quite clear:
import pyarrow as pa
table = pa.table({"a": [1, 2, 3, None]})
exchange_df = table.__dataframe__()
from pandas.core.interchange.from_dataframe import from_dataframe
from_dataframe(exchange_df)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/alenkafrim/repos/pyarrow-dev-9/lib/python3.9/site-packages/pandas/core/interchange/from_dataframe.py", line 53, in from_dataframe
    return _from_dataframe(df.__dataframe__(allow_copy=allow_copy))
  File "/Users/alenkafrim/repos/pyarrow-dev-9/lib/python3.9/site-packages/pandas/core/interchange/from_dataframe.py", line 74, in _from_dataframe
    pandas_df = protocol_df_chunk_to_pandas(chunk)
  File "/Users/alenkafrim/repos/pyarrow-dev-9/lib/python3.9/site-packages/pandas/core/interchange/from_dataframe.py", line 122, in protocol_df_chunk_to_pandas
    columns[name], buf = primitive_column_to_ndarray(col)
  File "/Users/alenkafrim/repos/pyarrow-dev-9/lib/python3.9/site-packages/pandas/core/interchange/from_dataframe.py", line 160, in primitive_column_to_ndarray
    data = set_nulls(data, col, buffers["validity"])
  File "/Users/alenkafrim/repos/pyarrow-dev-9/lib/python3.9/site-packages/pandas/core/interchange/from_dataframe.py", line 504, in set_nulls
    null_pos = buffer_to_ndarray(valid_buff, valid_dtype, col.offset, col.size)
  File "/Users/alenkafrim/repos/pyarrow-dev-9/lib/python3.9/site-packages/pandas/core/interchange/from_dataframe.py", line 395, in buffer_to_ndarray
    raise NotImplementedError(f"Conversion for {dtype} is not yet supported.")
NotImplementedError: Conversion for (<DtypeKind.BOOL: 20>, 1, 'b', '=') is not yet supported.
Issue Description
I am currently working on implementing the dataframe interchange protocol for pyarrow.Table in the Apache Arrow project (https://github.com/apache/arrow/pull/14613).
I am using the pandas implementation to test that the produced __dataframe__ object can be consumed correctly.
When consuming a pyarrow.Table with missing values, I get a NotImplementedError: the bitmasks that PyArrow uses to represent nulls in a given column cannot be converted.
But if I look at the code in from_dataframe.py: https://github.com/pandas-dev/pandas/blob/70121c75a0e2a42e31746b6c205c7bb9e4b9b930/pandas/core/interchange/from_dataframe.py#L405-L415
I would think this is not intentional and that _NP_DTYPES should include {1: bool}:
https://github.com/pandas-dev/pandas/blob/70121c75a0e2a42e31746b6c205c7bb9e4b9b930/pandas/core/interchange/from_dataframe.py#L393
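For readers unfamiliar with bitmasks, the following is a small NumPy-only sketch of what such a validity buffer looks like for the column [1, 2, 3, None]. It assumes the Arrow convention (one bit per value, least-significant bit first, a set bit meaning "valid") and does not use pyarrow or any pandas internals; the variable names are illustrative only.

```python
import numpy as np

# Validity of the four values in the example column [1, 2, 3, None]:
# True marks a valid entry, False marks a null (assumed Arrow convention).
validity = np.array([True, True, True, False])

# A bitmask packs eight flags into each byte, least-significant bit first.
bitmask = np.packbits(validity, bitorder="little")

print(bitmask.tolist())   # [7], i.e. 0b00000111
print(bitmask.nbytes)     # 1: a single byte holds all four flags
print(validity.nbytes)    # 4: a NumPy bool array spends one byte per flag
```

The reported dtype tuple in the traceback has a bit width of 1, which is exactly this packed form rather than a byte-per-value boolean array.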
Expected Behavior
The current pandas implementation of the dataframe interchange protocol should be able to convert the bitmask to an ndarray, so that the code below, shown here for a table without missing values, also works when missing values are present:
>>> import pyarrow as pa
>>> table = pa.table({"a": [1, 2, 3, 4]})
>>> exchange_df = table.__dataframe__()
>>> exchange_df._df
pyarrow.Table
a: int64
----
a: [[1,2,3,4]]
>>> from pandas.core.interchange.from_dataframe import from_dataframe
>>> from_dataframe(exchange_df)
   a
0  1
1  2
2  3
3  4
Installed Versions
INSTALLED VERSIONS
commit : 87cfe4e38bafe7300a6003a1d18bd80f3f77c763
python : 3.9.14.final.0
python-bits : 64
OS : Darwin
OS-release : 21.6.0
Version : Darwin Kernel Version 21.6.0: Thu Sep 29 20:13:46 PDT 2022; root:xnu-8020.240.7~1/RELEASE_ARM64_T8101
machine : arm64
processor : arm
byteorder : little
LC_ALL : None
LANG : en_GB.UTF-8
LOCALE : en_GB.UTF-8
pandas : 1.5.0
numpy : 1.22.3
pytz : 2022.1
dateutil : 2.8.2
setuptools : 65.5.1
pip : 22.3.1
Cython : 0.29.28
pytest : 7.1.3
hypothesis : 6.39.4
sphinx : 4.3.2
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.0.3
IPython : 8.1.1
pandas_datareader: None
bs4 : 4.10.0
bottleneck : None
brotli : None
fastparquet : None
fsspec : 2022.02.0
gcsfs : 2022.02.0
matplotlib : 3.6.2
numba : 0.56.4
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 11.0.0.dev117+geeca8a4e3.d20221122
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : None
snappy : None
sqlalchemy : None
tables : None
tabulate : 0.9.0
xarray : 2022.11.0
xlrd : None
xlwt : None
zstandard : None
tzdata : None
About this issue
- Original URL
- State: closed
- Created 2 years ago
- Comments: 17 (12 by maintainers)
thanks! yeah if I use
then it all seems to work as-expected
Please ignore my comment. I thought it would be a good idea, but it would not work for a general dataframe library that uses bitmasks.
This needs some custom code, since numpy only supports arrays whose elements occupy a whole number of bytes, not bits (the minimum element size is 1 byte). So you can’t just view this bitmap as a numpy array without some custom processing.
We do have that processing in our arrow_utils (for converting pyarrow arrays to masked arrays). However, it assumes that pyarrow is installed, so accepting bitmasks in the dataframe interchange protocol would become dependent on pyarrow (which is probably fine, though).
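As a rough sketch of the kind of custom processing described above, the bits can be unpacked with plain NumPy, without pyarrow. This is not the code in arrow_utils or the fix that pandas adopted; bitmask_to_bool_ndarray is a hypothetical helper, and it assumes an LSB-first bitmask in which a set bit marks a valid value.

```python
import numpy as np


def bitmask_to_bool_ndarray(buf, length, offset=0):
    """Unpack an LSB-first bitmask into a boolean array (hypothetical helper).

    buf    -- bytes-like object holding the packed bits
    length -- number of logical values in the column
    offset -- number of leading bits to skip
    """
    packed = np.frombuffer(buf, dtype=np.uint8)
    # Expand every byte into eight 0/1 entries, least-significant bit first.
    bits = np.unpackbits(packed, bitorder="little")
    # Drop the offset and the padding bits at the end of the last byte.
    return bits[offset:offset + length].astype(bool)


# One byte, 0b00000111: the first three values are valid, the fourth is null.
valid = bitmask_to_bool_ndarray(bytes([0b00000111]), length=4)
print(valid.tolist())  # [True, True, True, False]

# Apply the mask to the column data, using NaN as the missing-value sentinel.
data = np.array([1, 2, 3, 0], dtype="float64")
data[~valid] = np.nan
```

Unpacking allocates one byte per value, so it trades memory for simplicity; that seems acceptable for a conversion step that materialises a NumPy array anyway.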