smart_open: Reading S3 files has been slow since 1.5.4

As mentioned earlier in #74, reading from S3 appears to have become very slow as of 1.5.4.

$ pyvenv-3.4 env
$ source env/bin/activate
$ pip install smart_open==1.5.3 tqdm ipython
$ ipython
from tqdm import tqdm
from smart_open import smart_open
for _ in tqdm(smart_open('s3://xxxxx', 'rb')):
    pass

2868923it [00:53, 53888.94it/s]

$ pyvenv-3.4 env
$ source env/bin/activate
$ pip install smart_open==1.5.4 tqdm ipython
$ ipython
from tqdm import tqdm
from smart_open import smart_open
for _ in tqdm(smart_open('s3://xxxxx', 'rb')):
    pass

8401it [00:18, 442.64it/s] (too slow, so I did not wait for it to finish)
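
For a reproduction that does not depend on tqdm, the rate can also be measured directly. This is only a minimal sketch: `lines_per_second` is a hypothetical helper, not part of smart_open, and the S3 call in the trailing comment assumes the same placeholder URL as above.

```python
import time


def lines_per_second(lines, limit=None):
    """Iterate over `lines` and return (count, rate in lines per second).

    `limit` stops early, which is handy when the slow version would
    otherwise take too long to finish.
    """
    start = time.perf_counter()
    count = 0
    for _ in lines:
        count += 1
        if limit is not None and count >= limit:
            break
    elapsed = time.perf_counter() - start
    # Guard against a zero elapsed time on very small inputs.
    rate = count / elapsed if elapsed > 0 else float("inf")
    return count, rate


# Hypothetical usage against S3 (same placeholder URL as in the report):
#   from smart_open import smart_open
#   print(lines_per_second(smart_open('s3://xxxxx', 'rb'), limit=10000))
```

Running it with the same `limit` under both versions should give directly comparable numbers.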

About this issue

  • State: closed
  • Created 7 years ago
  • Comments: 26 (4 by maintainers)

Most upvoted comments

@otamachan Thank you for your suggestion! I had a closer look at your implementation and finally realized what the remaining problem was. It isn’t necessary to go back to boto to achieve the same performance: we can do the same thing with the newer boto3.

https://github.com/RaRe-Technologies/smart_open/pull/157

Thank you for pointing me in the right direction. Thank you very much!

@otamachan I’ll release 1.5.6, but slightly later (after I set up the integration testing environment).

@appierys This was fixed by fbc82cc04bf92e8c588c49c43ff5aff8234ea87d, and that commit is only in the master branch. Please try the master branch. @menshikh-iv I would appreciate it if you could release 1.5.6, because the performance issue is not fully resolved by 1.5.5. Thanks in advance.

@mpenkov I’m happy that the performance has been restored by your correct implementation and that I could be of help. I really appreciate your great effort to improve this wonderful library!

Thank you as well!

@mpenkov Thanks for your comment. I agree with you. I just wanted to confirm whether threading could improve the performance. In the reproduce.py test, performance did not improve much even with threading. As you said, threading should be avoided in multiprocessing applications.

I assume that 1.5.3 uses https://github.com/boto/boto/blob/develop/boto/s3/key.py#L308 . This issues a single GET request and reads the body asynchronously, but the result is not seekable.

On the other hand, 1.5.4 uses https://github.com/RaRe-Technologies/smart_open/blob/master/smart_open/s3.py#L89 . This splits the download into several GET requests and waits for each one to finish. That decreases the throughput, but the result is seekable.
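
The trade-off described above can be illustrated without touching S3. The following is a rough sketch, not the actual boto or smart_open code: `FakeBucketObject`, `read_streaming`, and `read_in_ranges` are hypothetical names, and the fake object only counts how many GET requests each strategy would issue.

```python
import io


class FakeBucketObject:
    """Stand-in for an S3 object; counts simulated GET requests."""

    def __init__(self, data):
        self.data = data
        self.requests = 0

    def get(self, start=0, end=None):
        # Each call models one HTTP GET (optionally with a byte range).
        self.requests += 1
        return io.BytesIO(self.data[start:end])


def read_streaming(obj, chunk_size):
    """One GET, then read the body chunk by chunk (the assumed 1.5.3 behaviour)."""
    body = obj.get()
    chunks = []
    while True:
        chunk = body.read(chunk_size)
        if not chunk:
            break
        chunks.append(chunk)
    return b"".join(chunks)


def read_in_ranges(obj, chunk_size):
    """One GET per chunk (the assumed 1.5.4 behaviour): every read pays a round trip."""
    chunks = []
    pos = 0
    size = len(obj.data)
    while pos < size:
        chunks.append(obj.get(pos, pos + chunk_size).read())
        pos += chunk_size
    return b"".join(chunks)


data = b"x" * 1000

streamed = FakeBucketObject(data)
assert read_streaming(streamed, 100) == data

ranged = FakeBucketObject(data)
assert read_in_ranges(ranged, 100) == data

print(streamed.requests, ranged.requests)  # 1 10
```

With real network latency, each extra request adds a round trip, which would be consistent with the drop in throughput reported at the top of this issue.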