ibis: bug: UTF-8 decode error for parquet file during pandas conversion
What happened?
I have a parquet file on S3 that I’m working with through ibis. It is loaded first as a pyarrow dataset, registered as a table with the DuckDB backend, and then converted to pandas.
import ibis
import pyarrow.dataset as ds

# s3_file and s3_filesystem are defined elsewhere (the S3 path and filesystem object)
dataset = ds.dataset(
    source=s3_file,
    format="parquet",
    filesystem=s3_filesystem,
    partitioning="hive",
)

con = ibis.duckdb.connect()
ibis_table = con.register(dataset)
df = ibis_table.to_pandas()
This raises:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf0 in position 1: invalid continuation byte
However, if I read each column individually, they all load correctly:
for col in ibis_table.schema():
    df = ibis_table.select(col).to_pandas()
Both pyarrow and duckdb can read the parquet file without issue.
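A sketch of that kind of check, reusing the dataset object from the snippet above (the exact calls are illustrative; here duckdb scans the registered Arrow dataset rather than the S3 path directly):

import duckdb

# pyarrow alone: materialize the whole dataset
pa_table = dataset.to_table()

# duckdb alone: scan the same dataset and fetch the result as an Arrow table
duck_con = duckdb.connect()
duck_con.register("parquet_data", dataset)
duck_table = duck_con.execute("SELECT * FROM parquet_data").fetch_arrow_table()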
What version of ibis are you using?
ibis==6.0.0 pandas==2.0.1 pyarrow==12.0.1 duckdb==0.8.1
What backend(s) are you using, if any?
DuckDB
Relevant log output
UnicodeDecodeError Traceback (most recent call last)
Cell In[35], line 1
----> 1 df = ibis_table.to_pandas()
File ~/opt/anaconda3/envs/fts/lib/python3.10/site-packages/ibis/expr/types/relations.py:2693, in Table.to_pandas(self=r0 := DatabaseTable: _ibis_read_in_memory_uc4anu...ion[r0]
2685 def to_pandas(self, **kwargs) -> pd.DataFrame:
2686 """Convert a table expression to a pandas DataFrame.
2687
2688 Parameters
(...)
2691 Same as keyword arguments to [`execute`][ibis.expr.types.core.Expr.execute]
2692 """
-> 2693 return self.execute(**kwargs)
File ~/opt/anaconda3/envs/fts/lib/python3.10/site-packages/ibis/expr/types/core.py:296, in Expr.execute(self=r0 := DatabaseTable: _ibis_read_in_memory_uc4anu...ion[r0]
269 def execute(
270 self,
271 limit: int | str | None = 'default',
(...)
274 **kwargs: Any,
275 ):
276 """Execute an expression against its backend if one exists.
277
278 Parameters
(...)
294 Keyword arguments
295 """
--> 296 return self._find_backend(use_default=True).execute(
297 self, limit=limit, timecontext=timecontext, params=params, **kwargs
298 )
File ~/opt/anaconda3/envs/fts/lib/python3.10/site-packages/ibis/backends/base/sql/__init__.py:263, in BaseSQLBackend.execute(self=<ibis.backends.duckdb.Backend object>, expr=r0 := DatabaseTable: _ibis_read_in_memory_uc4anu...ion[r0]
260 schema = expr.as_table().schema()
262 with self._safe_raw_sql(sql, **kwargs) as cursor:
--> 263 result = self.fetch_from_cursor(cursor, schema)
265 if hasattr(getattr(query_ast, 'dml', query_ast), 'result_handler'):
266 result = query_ast.dml.result_handler(result)
File ~/opt/anaconda3/envs/fts/lib/python3.10/site-packages/ibis/backends/duckdb/__init__.py:881, in Backend.fetch_from_cursor(self=<ibis.backends.duckdb.Backend object>, cursor=<sqlalchemy.engine.cursor.LegacyCursorResult object>, schema=ibis.Schema {
876 import pyarrow.types as pat
878 table = cursor.cursor.fetch_arrow_table()
880 df = pd.DataFrame(
--> 881 {
882 name: (
883 col.to_pylist()
884 if (
885 pat.is_nested(col.type)
886 or
887 # pyarrow / duckdb type null literals columns as int32?
888 # but calling `to_pylist()` will render it as None
889 col.null_count
890 )
891 else col.to_pandas(timestamp_as_object=True)
892 )
893 for name, col in zip(table.column_names, table.columns)
894 }
895 )
896 return PandasData.convert_table(df, schema)
File ~/opt/anaconda3/envs/fts/lib/python3.10/site-packages/ibis/backends/duckdb/__init__.py:883, in <dictcomp>(.0=<zip object>)
876 import pyarrow.types as pat
878 table = cursor.cursor.fetch_arrow_table()
880 df = pd.DataFrame(
881 {
882 name: (
--> 883 col.to_pylist()
884 if (
885 pat.is_nested(col.type)
886 or
887 # pyarrow / duckdb type null literals columns as int32?
888 # but calling `to_pylist()` will render it as None
889 col.null_count
890 )
891 else col.to_pandas(timestamp_as_object=True)
892 )
893 for name, col in zip(table.column_names, table.columns)
894 }
895 )
896 return PandasData.convert_table(df, schema)
File ~/opt/anaconda3/envs/fts/lib/python3.10/site-packages/pyarrow/table.pxi:1312, in pyarrow.lib.ChunkedArray.to_pylist()
File ~/opt/anaconda3/envs/fts/lib/python3.10/site-packages/pyarrow/array.pxi:1521, in pyarrow.lib.Array.to_pylist()
File ~/opt/anaconda3/envs/fts/lib/python3.10/site-packages/pyarrow/scalar.pxi:632, in pyarrow.lib.StringScalar.as_py()
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xdc in position 0: invalid continuation byte
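The branch in fetch_from_cursor shown above routes any column that is nested or has nulls through ChunkedArray.to_pylist(), and it is the per-value decode to a Python str (StringScalar.as_py) that trips over the non-UTF-8 byte. The same class of failure can be reproduced in isolation; this is a minimal sketch, assuming Array.view() reinterprets a binary array as utf8 without re-validating the bytes:

import pyarrow as pa

# A binary array containing 0xdc (not valid UTF-8), with a null so that it
# takes the same to_pylist() branch as above; view() is assumed here not to
# re-validate the bytes when reinterpreting binary as utf8.
bad = pa.array([b"\xdc", None]).view(pa.string())
bad.to_pylist()  # raises UnicodeDecodeError from StringScalar.as_py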
Code of Conduct
- I agree to follow this project’s Code of Conduct
About this issue
- State: closed
- Created a year ago
- Comments: 19 (13 by maintainers)
Thank you!
Thanks for powering through this @caleboverman and @gforsyth!
I suspect it’s actually a DuckDB issue. I haven’t opened an upstream issue yet because I want to confirm which project is the main culprit here.
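One way to narrow that down is to drive DuckDB directly and run the same per-value decoding that ibis’ fetch_from_cursor does (a sketch, reusing the dataset registration from the report): if to_pylist() on the raw fetch_arrow_table() output still raises, the invalid bytes are coming out of DuckDB rather than being introduced by ibis.

import duckdb

con = duckdb.connect()
con.register("t", dataset)  # the pyarrow dataset from the original report
tbl = con.execute("SELECT * FROM t").fetch_arrow_table()

# Mirror ibis' conversion path: decode every value of every column to Python
for name in tbl.column_names:
    tbl[name].to_pylist()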