ibis: bug: UTF-8 decode error for parquet file during pandas conversion

What happened?

I have a parquet file on S3 that I’m working with through ibis. It is loaded first as a pyarrow dataset, registered as a table with the DuckDB backend, and then converted to pandas.

import ibis
import pyarrow.dataset as ds

# s3_file and s3_filesystem (an S3 path and a pyarrow-compatible filesystem) are defined elsewhere
dataset = ds.dataset(
    source=s3_file,
    format="parquet",
    filesystem=s3_filesystem,
    partitioning="hive",
)
con = ibis.duckdb.connect()
ibis_table = con.register(dataset)
df = ibis_table.to_pandas()

This raises: UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf0 in position 1: invalid continuation byte

However, if I read each column individually, every column loads correctly:

for col in ibis_table.schema():
    df = ibis_table.select(col).to_pandas()

Both pyarrow and duckdb can read the parquet file without issue.
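For reference, here is roughly how I checked both readers outside of ibis. This is only a sketch: it reuses the `dataset` object from above and relies on DuckDB’s Python replacement scan to pick up the local `dataset` variable.

import duckdb

# pyarrow reads the full dataset without error
arrow_table = dataset.to_table()

# duckdb also reads it without error; the query resolves the local
# `dataset` variable through DuckDB's replacement scan
duck_table = duckdb.sql("SELECT * FROM dataset").arrow()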

What version of ibis are you using?

ibis==6.0.0 pandas==2.0.1 pyarrow==12.0.1 duckdb==0.8.1

What backend(s) are you using, if any?

DuckDB

Relevant log output

UnicodeDecodeError                        Traceback (most recent call last)
Cell In[35], line 1
----> 1 df = ibis_table.to_pandas()
File ~/opt/anaconda3/envs/fts/lib/python3.10/site-packages/ibis/expr/types/relations.py:2693, in Table.to_pandas(self=r0 := DatabaseTable: _ibis_read_in_memory_uc4anu...ion[r0]
   2685 def to_pandas(self, **kwargs) -> pd.DataFrame:
   2686     """Convert a table expression to a pandas DataFrame.
   2687
   2688     Parameters
   (...)
   2691         Same as keyword arguments to [`execute`][ibis.expr.types.core.Expr.execute]
   2692     """
-> 2693     return self.execute(**kwargs)

File ~/opt/anaconda3/envs/fts/lib/python3.10/site-packages/ibis/expr/types/core.py:296, in Expr.execute(self=r0 := DatabaseTable: _ibis_read_in_memory_uc4anu...ion[r0]
    269 def execute(
    270     self,
    271     limit: int | str | None = 'default',
   (...)
    274     **kwargs: Any,
    275 ):
    276     """Execute an expression against its backend if one exists.
    277
    278     Parameters
   (...)
    294         Keyword arguments
    295     """
--> 296     return self._find_backend(use_default=True).execute(
    297         self, limit=limit, timecontext=timecontext, params=params, **kwargs
    298     )

File ~/opt/anaconda3/envs/fts/lib/python3.10/site-packages/ibis/backends/base/sql/__init__.py:263, in BaseSQLBackend.execute(self=<ibis.backends.duckdb.Backend object>, expr=r0 := DatabaseTable: _ibis_read_in_memory_uc4anu...ion[r0]
    260 schema = expr.as_table().schema()
    262 with self._safe_raw_sql(sql, **kwargs) as cursor:
--> 263     result = self.fetch_from_cursor(cursor, schema)
    265 if hasattr(getattr(query_ast, 'dml', query_ast), 'result_handler'):
    266     result = query_ast.dml.result_handler(result)

File ~/opt/anaconda3/envs/fts/lib/python3.10/site-packages/ibis/backends/duckdb/__init__.py:881, in Backend.fetch_from_cursor(self=<ibis.backends.duckdb.Backend object>, cursor=<sqlalchemy.engine.cursor.LegacyCursorResult object>, schema=ibis.Schema {
    876 import pyarrow.types as pat
    878 table = cursor.cursor.fetch_arrow_table()
    880 df = pd.DataFrame(
--> 881     {
    882         name: (
    883             col.to_pylist()
    884             if (
    885                 pat.is_nested(col.type)
    886                 or
    887                 # pyarrow / duckdb type null literals columns as int32?
    888                 # but calling `to_pylist()` will render it as None
    889                 col.null_count
    890             )
    891             else col.to_pandas(timestamp_as_object=True)
    892         )
    893         for name, col in zip(table.column_names, table.columns)
    894     }
    895 )
    896 return PandasData.convert_table(df, schema)

File ~/opt/anaconda3/envs/fts/lib/python3.10/site-packages/ibis/backends/duckdb/__init__.py:883, in <dictcomp>(.0=<zip object>)
    876 import pyarrow.types as pat
    878 table = cursor.cursor.fetch_arrow_table()
    880 df = pd.DataFrame(
    881     {
    882         name: (
--> 883             col.to_pylist()
    884             if (
    885                 pat.is_nested(col.type)
    886                 or
    887                 # pyarrow / duckdb type null literals columns as int32?
    888                 # but calling `to_pylist()` will render it as None
    889                 col.null_count
    890             )
    891             else col.to_pandas(timestamp_as_object=True)
    892         )
    893         for name, col in zip(table.column_names, table.columns)
    894     }
    895 )
    896 return PandasData.convert_table(df, schema)

File ~/opt/anaconda3/envs/fts/lib/python3.10/site-packages/pyarrow/table.pxi:1312, in pyarrow.lib.ChunkedArray.to_pylist()

File ~/opt/anaconda3/envs/fts/lib/python3.10/site-packages/pyarrow/array.pxi:1521, in pyarrow.lib.Array.to_pylist()

File ~/opt/anaconda3/envs/fts/lib/python3.10/site-packages/pyarrow/scalar.pxi:632, in pyarrow.lib.StringScalar.as_py()

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xdc in position 0: invalid continuation byte
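The failure is raised from StringScalar.as_py() inside fetch_from_cursor, i.e. while to_pylist() materializes string values as Python str. Below is a rough sketch of how one might locate the offending values without decoding them; it assumes `table` is the Arrow table fetched from the cursor, and "col_name" is a placeholder for the string column in question.

import pyarrow as pa

def invalid_utf8_rows(chunked, encoding="utf-8"):
    # cast the string column to binary so pyarrow hands back raw bytes
    # instead of trying to decode them
    raw = chunked.cast(pa.binary()).to_pylist()
    bad = []
    for i, value in enumerate(raw):
        if value is None:
            continue
        try:
            value.decode(encoding)
        except UnicodeDecodeError:
            bad.append(i)
    return bad

# "col_name" is a placeholder column name
print(invalid_utf8_rows(table["col_name"]))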

Code of Conduct

  • I agree to follow this project’s Code of Conduct

About this issue

  • State: closed
  • Created a year ago
  • Comments: 19 (13 by maintainers)

Most upvoted comments

Thank you!

Thanks for powering through this @caleboverman and @gforsyth!

I suspect it’s actually a DuckDB issue. I haven’t opened an upstream issue yet; I want to confirm which project is the main culprit here first.
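If that’s the case, a DuckDB-only reproduction along the lines of what fetch_from_cursor does should trigger the same error without ibis in the loop. A rough sketch (the parquet path is a placeholder; reading directly from S3 would additionally need the httpfs extension):

import duckdb

con = duckdb.connect()
# 'data.parquet' is a placeholder for the problematic file
tbl = con.execute("SELECT * FROM read_parquet('data.parquet')").fetch_arrow_table()

# mirror what ibis does: force Python-level decoding of every column
for name, col in zip(tbl.column_names, tbl.columns):
    col.to_pylist()  # should raise UnicodeDecodeError on the bad column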