dask: read_parquet too slow between versions 1.* and 2.*
Hi all,
I’d like to report unexpectedly slow read_parquet behavior between dask versions.
On all examples I’ll be using:
- CentOS 7.7
- Python 3.6
- Fastparquet 0.4.0 (Installed with pip)
Dask versions were also installed with pip.
pip3 install dask=={version} dask[dataframe]=={version}
The code I’m using is:
```python
import datetime
import dask.dataframe as dd

# time_from / time_to are the query window bounds (datetime objects)
filters = [('x', '>=', time_from), ('x', '<', time_to)]
ddf = dd.read_parquet('path', columns=['x', 'y'], filters=filters, engine='fastparquet')

s = datetime.datetime.now().timestamp()
df = ddf.compute()
e = datetime.datetime.now().timestamp()
print('took', e - s, 'secs')
```
Test 1 <— slow
dask==2.20.0
fastparquet==0.4.0
the output is:
took 62.25287580490112 secs
Test 2 <— slow
dask==2.18.1
fastparquet==0.4.0
output:
took 62.15111708641052 secs
Test 3 <— fast
dask==1.2.2
fastparquet==0.4.0
output:
took 1.4967741966247559 secs
The data schema is as shown below:
df.head()
|   | x                   | y    |
|---|---------------------|------|
| 0 | 2020-06-24 13:00:05 | 1003 |
| 1 | 2020-06-24 13:00:10 | 1083 |
| 2 | 2020-06-24 13:00:15 | 1247 |
| 3 | 2020-06-24 13:00:20 | 1173 |
| 4 | 2020-06-24 13:00:25 | 1260 |
df.dtypes
x datetime64[ns]
y float64
dtype: object
What happened: read_parquet is slow on recent dask versions (over a minute for this query).
What you expected to happen: a faster read_parquet; dask==1.2.2 took about 1.5 seconds.
Regards, C.
About this issue
- State: closed
- Created 4 years ago
- Comments: 44 (30 by maintainers)
Sometimes fastparquet still wins 😃
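For anyone who wants to compare the two engines on their own data, a minimal timing sketch follows; the 'path' string, the column names, and the availability of pyarrow are assumptions here, not details from the original report.

```python
import time
import dask.dataframe as dd

# Hypothetical comparison: time a full read with each parquet engine.
# 'path' and the column names echo the snippet in the issue body.
for engine in ('fastparquet', 'pyarrow'):
    ddf = dd.read_parquet('path', columns=['x', 'y'], engine=engine)
    start = time.perf_counter()
    ddf.compute()
    print(engine, 'took', time.perf_counter() - start, 'secs')
```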
Agreed, if you can produce code I can run, I’m happy to have a look. You could do this by writing many small files yourself (with random data), or perhaps by writing from a dask dataframe after rechunking to a small number of rows per partition.
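A sketch along those lines, with made-up sizes and output path (random data, then a large partition count so each file holds only a handful of rows):

```python
import numpy as np
import pandas as pd
import dask.dataframe as dd

# Hypothetical reproducer: many tiny partitions so that per-file and
# metadata overhead dominate the read, as suggested above.
n_rows, n_partitions = 100_000, 1_000  # roughly 100 rows per partition
df = pd.DataFrame({
    'x': pd.date_range('2020-06-24', periods=n_rows, freq='5S'),
    'y': np.random.random(n_rows),
})
ddf = dd.from_pandas(df, npartitions=n_partitions)
ddf.to_parquet('many_small_files', engine='fastparquet')
```

Reading 'many_small_files' back with the timing snippet from the issue body should then expose any per-file overhead differences between dask versions.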
@rjzamora , you may find this discussion interesting in the light of https://github.com/dask/dask/pull/6346
Thanks for reporting @CrashLaker – can you provide a reproducible example? https://matthewrocklin.com/blog/work/2018/02/28/minimal-bug-reports
It will make it much easier for us to help debug this.