pandas: BUG: read_parquet no longer supports file-like objects
Code Sample, a copy-pastable example
from io import BytesIO
import pandas as pd
buffer = BytesIO()
df = pd.DataFrame([1,2,3], columns=["a"])
df.to_parquet(buffer)
df2 = pd.read_parquet(buffer)
Problem description
The current behavior of read_parquet(buffer) is that it raises the following exception:
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "./working_dir/tvenv/lib/python3.7/site-packages/pandas/io/parquet.py", line 315, in read_parquet
    return impl.read(path, columns=columns, **kwargs)
  File "./working_dir/tvenv/lib/python3.7/site-packages/pandas/io/parquet.py", line 131, in read
    path, filesystem=get_fs_for_path(path), **kwargs
  File "./working_dir/tvenv/lib/python3.7/site-packages/pyarrow/parquet.py", line 1162, in __init__
    self.paths = _parse_uri(path_or_paths)
  File "./working_dir/tvenv/lib/python3.7/site-packages/pyarrow/parquet.py", line 47, in _parse_uri
    path = _stringify_path(path)
  File "./working_dir/tvenv/lib/python3.7/site-packages/pyarrow/util.py", line 67, in _stringify_path
    raise TypeError("not a path-like object")
TypeError: not a path-like object
Expected Output
Instead, read_parquet(buffer) should return a new DataFrame with the same contents as the serialized DataFrame stored in buffer.
Output of pd.show_versions()
INSTALLED VERSIONS
commit : None
python : 3.7.5.final.0
python-bits : 64
OS : Linux
OS-release : 4.15.0-99-generic
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.0.4
numpy : 1.18.4
pytz : 2020.1
dateutil : 2.8.1
pip : 9.0.1
setuptools : 39.0.1
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : 0.999999999
pymysql : None
psycopg2 : None
jinja2 : None
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : None
fastparquet : None
gcsfs : None
lxml.etree : None
matplotlib : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 0.17.1
pytables : None
pytest : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
xlsxwriter : None
numba : None
About this issue
- State: closed
- Created 4 years ago
- Reactions: 8
- Comments: 26 (14 by maintainers)
Commits related to this issue
- ARROW-9021: [Python] Add the filesystem explanation to parquet.read_table docstring Use same doc string as ParquetDataset. https://arrow.apache.org/docs/python/generated/pyarrow.parquet.ParquetDatase... — committed to apache/arrow by alimcmaster1 4 years ago
- forcing pandas != 1.0.4 There was a bug introduced in pandas 1.0.4 that caused pd.read_parquet to no longer be able to handle file-like objects. They're fixing it in 1.0.5. This change will skip 1.0.... — committed to NREL/buildstockbatch by nmerket 4 years ago
@jreback I understood your point, but I was referring to reporting the issue to pyarrow, not the fact that pyarrow is causing the traceback.
That said, it’s still an assumption that pyarrow caused the regression. That’s why I’m reluctant to report anything to pyarrow in the first place.
Let’s dig into the issue some more:
- [...] ParquetDataset class, which pandas now uses in the new implementation for pandas.read_parquet.
- [...] ParquetDataset class that would have caused this regression. In fact, neither the current documentation nor the documentation for previous versions states anything about supporting file-like objects, only paths.
- The _ParquetDatasetV2 class, which uses pyarrow’s dataset implementation, does not support file-like objects and has not for at least several months.

So there are three possibilities:
@claytonlemons I am encountering the same issue.
If I downgrade from 1.0.4 --> 1.0.3 (while keeping the pyarrow version the same), I can again read from BytesIO buffers without issue. Since upgrading the pandas version from 1.0.3 --> 1.0.4 seems both necessary and sufficient to cause the file-like object reading issues, it seems like it may indeed be correct to consider this as an issue with pandas, not pyarrow.
@jreback Would you consider reopening this issue?
@austospumanto
The fix for master pandas 1.1 is https://github.com/pandas-dev/pandas/pull/34500/files#diff-cbd427661c53f1dcde6ec5fb9ab0effaR134
We can potentially add tests that cover a few more of the kwargs, since we clearly don’t have coverage here at the moment.
Another consequence of using ParquetDataset instead of read_table is that additional keyword arguments are passed both to the constructor and to the read method. But since ParquetDataset.read doesn’t support all the arguments supported by ParquetDataset.__init__, this leads to TypeErrors.

I also have a similar issue since version 1.0.4 and had to downgrade to 1.0.3. I was able to read files from Azure Blob Storage by providing the HTTPS URL with the SAS token passed directly as a query parameter. Since 1.0.4 this is completely broken.
Example on 1.0.4
pd.read_parquet("https://*REDACTED*.blob.core.windows.net/raw/*REDACTED*/12.parquet?sv=*REDACTED*&ss=*REDACTED*&srt=*REDACTED*&sp=*REDACTED*&se=*REDACTED*&st=*REDACTED*&spr=https&sig=*REDACTED*")
Raises
OSError: Passed non-file path: https://*REDACTED*.blob.core.windows.net/raw/*REDACTED*/12.parquet?sv=*REDACTED*&ss=*REDACTED*&srt=*REDACTED*&sp=*REDACTED*&se=*REDACTED*&st=*REDACTED*&spr=https&sig=*REDACTED*
This works perfectly on 1.0.3, so we were forced to roll back to pandas 1.0.3.
Please see the referenced merge request above.
OK, sure, something must have gone wrong in the backport.
It would be helpful to know exactly where.