dask: read_parquet too slow between versions 1.* and 2.*
Hi all,
I’d like to report unexpectedly slow read_parquet behavior between dask versions.
On all examples I’ll be using:
- CentOS 7.7
- Python 3.6
- Fastparquet 0.4.0 (Installed with pip)
Dask versions were also installed with pip.
pip3 install dask=={version} dask[dataframe]=={version}
The code I’m using is:
```python
import datetime
import dask.dataframe as dd

# time_from / time_to are the query window bounds (datetime objects)
filters = [('x', '>=', time_from), ('x', '<', time_to)]
ddf = dd.read_parquet('path', columns=['x', 'y'], filters=filters, engine='fastparquet')

s = datetime.datetime.now().timestamp()
df = ddf.compute()
e = datetime.datetime.now().timestamp()
print('took', e - s, 'secs')
```
Test 1 <— slow
dask==2.20.0
fastparquet==0.4.0
the output is:
took 62.25287580490112 secs
Test 2 <— slow
dask==2.18.1
fastparquet==0.4.0
output:
took 62.15111708641052 secs
Test 3 <— fast
dask==1.2.2
fastparquet==0.4.0
output:
took 1.4967741966247559 secs
The data schema is as shown below:
df.head()
|   | x                   | y    |
|---|---------------------|------|
| 0 | 2020-06-24 13:00:05 | 1003 |
| 1 | 2020-06-24 13:00:10 | 1083 |
| 2 | 2020-06-24 13:00:15 | 1247 |
| 3 | 2020-06-24 13:00:20 | 1173 |
| 4 | 2020-06-24 13:00:25 | 1260 |
df.dtypes
x datetime64[ns]
y float64
dtype: object
What happened: read_parquet is slow on recent dask versions (over a minute for this query).
What you expected to happen: a faster read_parquet; dask==1.2.2 took about 1.5 seconds.
Regards, C.
About this issue
- State: closed
- Created 4 years ago
- Comments: 44 (30 by maintainers)
Sometimes fastparquet still wins 😃
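For anyone who wants to compare the two engines on their own data, a minimal timing sketch follows; the 'path' string, the column names, and the availability of pyarrow are assumptions here, not details from the original report.

```python
import time
import dask.dataframe as dd

# Hypothetical comparison: time a full read with each parquet engine.
# 'path' and the column names echo the snippet in the issue body.
for engine in ('fastparquet', 'pyarrow'):
    ddf = dd.read_parquet('path', columns=['x', 'y'], engine=engine)
    start = time.perf_counter()
    ddf.compute()
    print(engine, 'took', time.perf_counter() - start, 'secs')
```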
Agreed, if you can produce code I can run, I’m happy to have a look. You could do this by writing many small files yourself (with random data), or perhaps by writing from a dask dataframe after rechunking to a small number of rows per partition.
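A sketch along those lines, with made-up sizes and output path (random data, then a large partition count so each file holds only a handful of rows):

```python
import numpy as np
import pandas as pd
import dask.dataframe as dd

# Hypothetical reproducer: many tiny partitions so that per-file and
# metadata overhead dominate the read, as suggested above.
n_rows, n_partitions = 100_000, 1_000  # roughly 100 rows per partition
df = pd.DataFrame({
    'x': pd.date_range('2020-06-24', periods=n_rows, freq='5S'),
    'y': np.random.random(n_rows),
})
ddf = dd.from_pandas(df, npartitions=n_partitions)
ddf.to_parquet('many_small_files', engine='fastparquet')
```

Reading 'many_small_files' back with the timing snippet from the issue body should then expose any per-file overhead differences between dask versions.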
@rjzamora , you may find this discussion interesting in the light of https://github.com/dask/dask/pull/6346
Thanks for reporting @CrashLaker – can you provide a reproducible example? https://matthewrocklin.com/blog/work/2018/02/28/minimal-bug-reports
It will make it much easier for us to help debug this.