aws-sdk-pandas: s3.read_csv slow with chunksize
Describe the bug
I’m not sure the s3.read_csv function really reads a CSV in chunks. I noticed that for relatively big dataframes, running the following instruction takes an abnormally long time:
it = wr.s3.read_csv(uri, chunksize=chunksize)
I think the chunksize parameter is ignored.
To Reproduce
I’m running awswrangler==1.1.2 (installed with poetry), but I quickly tested 1.6.3 and the issue seems to be present there too.
from io import StringIO
from itertools import islice

import awswrangler as wr
import pandas as pd
from smart_open import open as sopen

uri = ""
CHUNKSIZE = 100


def manual_chunking(uri: str, chunksize: int = CHUNKSIZE) -> pd.DataFrame:
    # Read only the first `chunksize` lines of the object, then parse them as CSV.
    with sopen(uri, "r") as f:
        chunk = "".join(islice(f, chunksize))
    df = pd.read_csv(StringIO(chunk))
    return df


def s3_chunking(uri: str, chunksize: int = CHUNKSIZE) -> pd.DataFrame:
    # Ask awswrangler for a chunked reader and take only the first chunk.
    it = wr.s3.read_csv(uri, chunksize=chunksize)
    df = next(it)
    return df
I compared two different ways to load the first 100 lines of a “big” (1.2 GB) dataframe from S3:
- with the equivalent of open(file, "r") and then lazily parsing the lines as a CSV string
- using s3.read_csv with chunksize=100.
Results:
In [3]: %time manual_chunking(uri)
CPU times: user 173 ms, sys: 22.9 ms, total: 196 ms
Wall time: 581 ms
In [8]: %time s3_chunking(uri)
CPU times: user 8.73 s, sys: 7.82 s, total: 16.5 s
Wall time: 3min 59s
In [9]: %time wr.s3.read_csv(uri)
CPU times: user 27.3 s, sys: 9.48 s, total: 36.7 s
Wall time: 3min 38s
The timings are more or less reproducible. After comparing the last two timings, I suspect that the chunksize parameter is ignored: it takes more or less the same amount of time to load 100 lines of the file as to read the full file.
Is this expected?
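Not part of the original report, but one way to narrow this down is to time pandas’ own chunked reader over an s3fs file handle opened with a smaller block size; if that is fast, the slowdown likely comes from the large read-ahead buffer rather than from pandas. The helper name, URI, and 8 MB value below are illustrative only.

# Hypothetical check (not from the original report): read the same object
# through s3fs directly with a small block size and let pandas do the chunking.
import pandas as pd
import s3fs

fs = s3fs.S3FileSystem(default_block_size=8 * 1024 * 1024)  # 8 MB read-ahead blocks

def s3fs_chunking(uri: str, chunksize: int = 100) -> pd.DataFrame:
    # Open the S3 object lazily and return only the first chunk.
    with fs.open(uri, "rb") as f:
        reader = pd.read_csv(f, chunksize=chunksize)
        return next(reader)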
About this issue
- Original URL
- State: closed
- Created 4 years ago
- Comments: 15 (15 by maintainers)
Commits related to this issue
- Decrease the s3fs buffer to 8MB for chunked reads and more. #324 — committed to aws/aws-sdk-pandas by igorborgest 4 years ago
- Add s3fs_block_size config and more. #324 — committed to aws/aws-sdk-pandas by igorborgest 4 years ago
Released in 1.7.0!
Yes, that was the only original behavior, but now I will update the documentation and mention that it can also set internal/not-exposed configurations. Does that make sense?
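Going by the s3fs_block_size config referenced in the commits above, setting the block size once for an entire script would presumably look something like the sketch below; the exact attribute name, units, and default may differ between versions, and the bucket/key is illustrative.

# Sketch based on the s3fs_block_size configuration mentioned in the linked commits;
# attribute name and units may vary between awswrangler versions.
import awswrangler as wr

# Apply a smaller S3 read buffer for every subsequent call in this script.
wr.config.s3fs_block_size = 8 * 1024 * 1024  # 8 MB

chunks = wr.s3.read_csv("s3://my-bucket/big-file.csv", chunksize=100)  # illustrative path
first_chunk = next(chunks)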
Sounds great, thanks again.
Out of curiosity, how would I change the block size of s3fs? It’s probably not so great to pass it as a parameter to read_csv, but is there some sort of setting that I could set for an entire script?
Thanks for testing!
Now I’ve decreased the buffer size from 32 to 8 MB. I think it will result in a better experience.
Btw, the smart_open default buffer size is 128 KB. That’s why manual_chunking is so fast for this tiny chunk. But it’s always a matter of trade-offs, as we discussed above, so I’m happy with the current implementation too.
That makes sense indeed, I didn’t know s3fs was “that smart”.
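For reference, smart_open’s S3 buffer can be tuned through transport parameters; a minimal sketch follows, assuming the S3 transport accepts a buffer_size parameter (the 128 KB value mirrors the default mentioned above, and the parameter name may differ between smart_open versions; the path is illustrative).

# Sketch: tune smart_open's S3 read buffer via transport_params
# (assumed parameter name; check the smart_open version in use).
from itertools import islice
from smart_open import open as sopen

uri = "s3://my-bucket/big-file.csv"  # illustrative path
with sopen(uri, "r", transport_params={"buffer_size": 128 * 1024}) as f:
    first_lines = list(islice(f, 100))  # read only the first 100 lines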
Yep, it could happen, but s3fs already handles this “pagination” mechanism. So in the worst case Wrangler will need to make more than one request to fetch a chunk, which is acceptable and better than what we have today. For now I prefer to keep it simple and accept the trade-off instead of implementing some kind of “block size predictor”. Don’t you agree?