aws-sdk-pandas: s3.read_csv slow with chunksize

Describe the bug

I’m not sure the s3.read_csv function really reads the CSV in chunks. I noticed that for relatively large files, running the following instruction takes an abnormally long time:

it = wr.s3.read_csv(uri, chunksize=chunksize)

I think the chunksize parameter is ignored.

To Reproduce

I’m running awswrangler==1.1.2 (installed with poetry), but I quickly tested 1.6.3 and the issue seems to be present there too.

from itertools import islice
from smart_open import open as sopen
import awswrangler as wr
import pandas as pd
from io import StringIO

uri = ""

CHUNKSIZE = 100

def manual_chunking(uri: str, chunksize: int = CHUNKSIZE) -> pd.DataFrame:
    # Stream the object with smart_open and parse only the first `chunksize` lines.
    with sopen(uri, "r") as f:
        chunk = "".join(islice(f, chunksize))
        df = pd.read_csv(StringIO(chunk))

    return df


def s3_chunking(uri: str, chunksize: int = CHUNKSIZE) -> pd.DataFrame:
    # Ask awswrangler for a chunked reader and take only the first chunk.
    it = wr.s3.read_csv(uri, chunksize=chunksize)

    df = next(it)

    return df

I compared two different ways to load the first 100 lines of a “big” (1.2 GB) CSV file from S3:

  • with the equivalent of open(file, "r") and then lazily parsing the first lines as a CSV string (manual_chunking above)
  • using s3.read_csv with chunksize=100 (s3_chunking above).

Results:

In [3]: %time manual_chunking(uri)
CPU times: user 173 ms, sys: 22.9 ms, total: 196 ms
Wall time: 581 ms

In [8]: %time s3_chunking(uri)
CPU times: user 8.73 s, sys: 7.82 s, total: 16.5 s
Wall time: 3min 59s

In [9]: %time wr.s3.read_csv(uri)
CPU times: user 27.3 s, sys: 9.48 s, total: 36.7 s
Wall time: 3min 38s

The timings are more or less reproducible. Comparing the last two runs, I suspect that the chunksize parameter is ignored: it takes roughly the same amount of time to load 100 lines of the file as it does to read the full file.

Is this expected?
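
For reference, this is roughly the behavior I would expect from a chunked reader (a sketch, assuming s3fs is installed so pandas can open s3:// paths directly; the bucket/key below is made up):

import pandas as pd

# With a lazily read S3 file handle, fetching the first chunk should only
# download the first few blocks of the object, not the whole 1.2 GB file.
reader = pd.read_csv("s3://my-bucket/big.csv", chunksize=100)
first_chunk = next(reader)  # should return quickly, independently of file size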

About this issue

  • Original URL
  • State: closed
  • Created 4 years ago
  • Comments: 15 (15 by maintainers)

Most upvoted comments

Released in 1.7.0!

I read that it “will override the regular default arguments configured in the function signature”, though?

Yes, that was the only behavior originally, but I will now update the docs and mention that it can also set internal/not-exposed configurations. Does that make sense?

Sounds great, thanks again.

Out of curiosity, how would I change the block size of s3fs? It’s probably not so great to pass it as a parameter to read_csv, but is there some sort of setting that I could set for an entire script?
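
Something like the following is what I had in mind (just a sketch on my side; I don't know the exact knob, and the s3fs_block_size attribute name below is a guess):

import awswrangler as wr

# Guess: set the s3fs block size once, globally, through the wrangler
# configuration object instead of passing it to every read_csv call.
# The attribute name is assumed, not taken from the docs.
wr.config.s3fs_block_size = 8 * 1024 * 1024  # 8 MB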

Thanks for testing!

I’ve now decreased the buffer size from 32 MB to 8 MB. I think it will result in a better experience.

Btw, the smart_open default buffer size is 128 KB. That’s why manual_chunking is so fast for this tiny chunk. But it’s always a matter of trade-offs, as we discussed above, so I’m happy with the current implementation too.
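
For reference, that buffer can also be tuned per call in smart_open (a sketch; I'm assuming the S3 transport accepts a buffer_size transport parameter in the version you're running):

from smart_open import open as sopen

# Sketch: raise smart_open's S3 read buffer from the ~128 KB default to 1 MB.
# The exact keyword accepted through transport_params is an assumption here.
with sopen("s3://my-bucket/big.csv", "r",
           transport_params={"buffer_size": 1024 * 1024}) as f:
    header = f.readline()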

That makes sense indeed, I didn’t know s3fs was “that smart”.

Yes, what you suggest makes sense. However, it might cause other problems, for example: how can you make sure that the smaller block size used for read_csv will be enough to accommodate the chunksize? But this can probably be solved.

Yep, it could happen, but s3fs already handles this “pagination” mechanism. So in the worst case Wrangler will need to make more than one request to fetch a chunk, which is acceptable and still better than what we have today. For now I prefer to keep it simple and accept the trade-off instead of implementing some kind of “block size predictor”. Don’t you agree?
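
To illustrate with s3fs directly (a sketch; the bucket/key is made up): a read that spans more than one block just triggers additional ranged GET requests, so nothing has to be predicted up front.

import s3fs

# Sketch: with 8 MB blocks, a 32 MB read simply pages through 4 blocks;
# s3fs issues the extra ranged GETs transparently as the read crosses
# block boundaries.
fs = s3fs.S3FileSystem(default_block_size=8 * 1024 * 1024)
with fs.open("my-bucket/big.csv", "rb") as f:
    data = f.read(32 * 1024 * 1024)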