pandas: BUG: read_parquet no longer supports file-like objects

Code Sample, a copy-pastable example

from io import BytesIO
import pandas as pd

buffer = BytesIO()

df = pd.DataFrame([1, 2, 3], columns=["a"])
df.to_parquet(buffer)

df2 = pd.read_parquet(buffer)

Problem description

The current behavior of read_parquet(buffer) is that it raises the following exception:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "./working_dir/tvenv/lib/python3.7/site-packages/pandas/io/parquet.py", line 315, in read_parquet
    return impl.read(path, columns=columns, **kwargs)
  File "./working_dir/tvenv/lib/python3.7/site-packages/pandas/io/parquet.py", line 131, in read
    path, filesystem=get_fs_for_path(path), **kwargs
  File "./working_dir/tvenv/lib/python3.7/site-packages/pyarrow/parquet.py", line 1162, in __init__
    self.paths = _parse_uri(path_or_paths)
  File "./working_dir/tvenv/lib/python3.7/site-packages/pyarrow/parquet.py", line 47, in _parse_uri
    path = _stringify_path(path)
  File "./working_dir/tvenv/lib/python3.7/site-packages/pyarrow/util.py", line 67, in _stringify_path
    raise TypeError("not a path-like object")
TypeError: not a path-like object

Expected Output

Instead, read_parquet(buffer) should return a new DataFrame with the same contents as the serialized DataFrame stored in buffer.
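Concretely, the round trip should hand back an equal frame, as it does on pandas 1.0.3 according to the comments below:

assert df2.equals(df)
print(df2)
#    a
# 0  1
# 1  2
# 2  3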

Output of pd.show_versions()

INSTALLED VERSIONS

commit : None
python : 3.7.5.final.0
python-bits : 64
OS : Linux
OS-release : 4.15.0-99-generic
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.0.4
numpy : 1.18.4
pytz : 2020.1
dateutil : 2.8.1
pip : 9.0.1
setuptools : 39.0.1
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : 0.999999999
pymysql : None
psycopg2 : None
jinja2 : None
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : None
fastparquet : None
gcsfs : None
lxml.etree : None
matplotlib : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 0.17.1
pytables : None
pytest : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
xlsxwriter : None
numba : None

About this issue

  • State: closed
  • Created 4 years ago
  • Reactions: 8
  • Comments: 26 (14 by maintainers)

Most upvoted comments

@claytonlemons you are missing the point

@jreback I understood your point, but I was referring to reporting the issue to pyarrow, not the fact that pyarrow is causing the traceback.

That said, it’s still an assumption that pyarrow caused the regression. That’s why I’m reluctant to report anything to pyarrow in the first place.

Let’s dig into the issue some more:

  1. As shown by the stack trace, the first entry point into pyarrow 0.17.1 is pyarrow/parquet.py:1162. This is the ParquetDataset class, which pandas now uses in the new implementation of pandas.read_parquet.
  2. Running git-blame on parquet.py:1162, I see no recent changes to the ParquetDataset class that would have caused this regression. In fact, neither the current documentation nor the documentation for previous versions says anything about supporting file-like objects; only paths are mentioned.
  3. Even the _ParquetDatasetV2 class, which uses pyarrow’s dataset implementation, does not support file-like objects, and has not for at least the last several months (a quick check against pyarrow directly is sketched after the list below).

So there are three possibilities:

  1. ParquetDataset supported file-like objects in the past but did not document it.
  2. ParquetDataset supported file-like objects in the past but recently removed support.
  3. ParquetDataset never supported file-like objects, but pandas made an incorrect assumption about the interface and did not properly test against it.
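For reference, a minimal check of the current behavior directly against pyarrow (a sketch, assuming pyarrow 0.17.1 as reported in show_versions above): read_table accepts the buffer, while ParquetDataset does not.

from io import BytesIO

import pandas as pd
import pyarrow.parquet as pq

buffer = BytesIO()
pd.DataFrame({"a": [1, 2, 3]}).to_parquet(buffer)

# read_table accepts file-like objects; this is what pandas <= 1.0.3 relied on
print(pq.read_table(buffer).to_pandas())

# ParquetDataset expects a path (or list of paths) and raises the same
# "TypeError: not a path-like object" shown in the traceback above
pq.ParquetDataset(buffer)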

@claytonlemons I am encountering the same issue.

If I downgrade from 1.0.4 --> 1.0.3 (while keeping the pyarrow version the same), I can again read from BytesIO buffers without issue. Since upgrading the pandas version from 1.0.3 --> 1.0.4 seems both necessary and sufficient to cause the file-like object reading issues, it seems like it may indeed be correct to consider this as an issue with pandas, not pyarrow.

@jreback Would you consider reopening this issue?

@kepler I wonder if explicitly separating the kwargs into two parameters might be a solution.
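A minimal sketch of what that separation could look like at the call site (the dataset_kwargs and read_kwargs names are hypothetical, not an existing pandas API):

import pandas as pd

# Hypothetical signature: arguments meant for ParquetDataset(...) and for
# ParquetDataset.read_pandas(...) travel in separate dictionaries instead of
# one shared **kwargs.
df = pd.read_parquet(
    "data.parquet",
    dataset_kwargs={"memory_map": True},  # would go to the ParquetDataset constructor
    read_kwargs={"use_threads": False},   # would go to read_pandas()
)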

@austospumanto

The fix for master pandas 1.1 is https://github.com/pandas-dev/pandas/pull/34500/files#diff-cbd427661c53f1dcde6ec5fb9ab0effaR134

We can potentially add tests that cover a few more of the kwargs, since we clearly don’t have test coverage here at the moment.
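A hedged sketch of the kind of regression test meant here (not the actual test added in the linked pull request):

from io import BytesIO

import pandas as pd


def test_read_parquet_file_like_buffer():
    # Round-tripping through an in-memory buffer must keep working.
    df = pd.DataFrame({"a": [1, 2, 3]})
    buffer = BytesIO()
    df.to_parquet(buffer)
    result = pd.read_parquet(buffer)
    pd.testing.assert_frame_equal(result, df)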

Another consequence of using ParquetDataset instead of read_table is that additional keyword arguments are passed both to the constructor and the read method:

parquet_ds = self.api.parquet.ParquetDataset(
    path, filesystem=get_fs_for_path(path), **kwargs
)
kwargs["columns"] = columns
result = parquet_ds.read_pandas(**kwargs).to_pandas()

But since ParquetDataset.read doesn’t accept all of the arguments that ParquetDataset.__init__ does, this leads to TypeErrors:

    df = pd.read_parquet(file_path, memory_map=True)
  File ".venv/lib/python3.7/site-packages/pandas/io/parquet.py", line 315, in read_parquet
    return impl.read(path, columns=columns, **kwargs)
  File ".venv/lib/python3.7/site-packages/pandas/io/parquet.py", line 134, in read
    result = parquet_ds.read_pandas(**kwargs).to_pandas()
  File ".venv/lib/python3.7/site-packages/pyarrow/parquet.py", line 1304, in read_pandas
    return self.read(use_pandas_metadata=True, **kwargs)
TypeError: read() got an unexpected keyword argument 'memory_map'
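Until that is fixed, one possible workaround (a sketch, not the official fix) is to call pyarrow directly, since read_table itself accepts memory_map:

import pyarrow.parquet as pq

file_path = "data.parquet"  # the same path passed to pd.read_parquet above

# read_table handles memory_map itself, so no keyword ends up being forwarded
# to ParquetDataset.read(), which is what raises the TypeError above.
df = pq.read_table(file_path, memory_map=True).to_pandas()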

I also have a similar issue since version 1.0.4 and had to downgrade to 1.0.3. I was able to read files from Azure Blob Storage by providing the HTTPS URL with the SAS token as a query parameter directly; since 1.0.4 this is completely broken.

Example on 1.0.4

pd.read_parquet("https://*REDACTED*.blob.core.windows.net/raw/*REDACTED*/12.parquet?sv=*REDACTED*&ss=*REDACTED*&srt=*REDACTED*&sp=*REDACTED*&se=*REDACTED*&st=*REDACTED*&spr=https&sig=*REDACTED*")

Raises OSError: Passed non-file path: https://*REDACTED*.blob.core.windows.net/raw/*REDACTED*/12.parquet?sv=*REDACTED*&ss=*REDACTED*&srt=*REDACTED*&sp=*REDACTED*&se=*REDACTED*&st=*REDACTED*&spr=https&sig=*REDACTED*

This works perfectly on 1.0.3, so we were forced to roll back to pandas 1.0.3.
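A possible workaround on 1.0.4 (a sketch assuming the requests package; the placeholder URL stands in for the redacted SAS URL above) is to download the blob over HTTPS and hand the bytes to pyarrow directly, bypassing pandas.read_parquet's path handling:

from io import BytesIO

import pyarrow.parquet as pq
import requests

# Full blob URL including the SAS token query parameters (redacted above).
sas_url = "https://<account>.blob.core.windows.net/raw/<container>/12.parquet?sv=...&sig=..."

response = requests.get(sas_url)
response.raise_for_status()

# read_table accepts file-like objects, so the downloaded bytes can be read
# without going through pandas.read_parquet.
df = pq.read_table(BytesIO(response.content)).to_pandas()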

would be helpful to know exactly where

Please see the referenced merge request above.

ok sure something must have gone wrong in the backport.

would be helpful to know exactly where