pmaw: slower than psaw for me

Hello,

Thanks for v0.1.0!

I just tried it, and for me it was slower than psaw. Below is a standalone script that compares the completion times of the two. Result: pmaw took 5m16s and psaw took 2m42s.

Am I doing something wrong?

import os
import time

from pmaw import PushshiftAPI as pmawAPI
from psaw import PushshiftAPI as psawAPI


# pmaw with 5 worker threads per CPU core; psaw with its defaults.
pmaw_api = pmawAPI(num_workers=os.cpu_count() * 5)
psaw_api = psawAPI()


# Time pmaw over the window.
start = time.time()
test = pmaw_api.search_submissions(after=1612372114,
                                   before=1612501714,
                                   subreddit='wallstreetbets',
                                   filter=['title', 'link_flair_text', 'selftext', 'score'])
end = time.time()
print("pmaw took " + str(end - start) + " seconds.")

# Time psaw over the same window; it returns a generator, so consume it fully.
start = time.time()
test_gen = psaw_api.search_submissions(after=1612372114,
                                       before=1612501714,
                                       subreddit='wallstreetbets',
                                       filter=['title', 'link_flair_text', 'selftext', 'score'])
test = list(test_gen)
end = time.time()
print("psaw took " + str(end - start) + " seconds.")

About this issue

  • State: closed
  • Created 3 years ago
  • Comments: 19 (10 by maintainers)

Most upvoted comments

Will have a look at your comment tomorrow! I’m taking the rest of the evening off.

Can you share the checkpoint results which are printed when you run PMAW? I ran your code and this is how it performed for the time window specified (returning around 20,000 submissions):

PMAW (10 threads): 463s
PSAW: 547s

Note that I’m currently pulling a massive dataset from Pushshift in the background, which has caused more requests to fail. Experimenting with 20+ threads appears to result in many more rejected requests because of rate-limiting from Pushshift.
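
To illustrate why extra threads can mean more rejections rather than more throughput, here is a toy model only. It assumes each worker sends one request per second and that the server accepts a fixed number of requests in each one-second window. The function name and the limit of 15 are made up for the example and are not Pushshift's actual rate-limit behaviour or numbers.

```python
def simulate_fixed_window(num_workers, limit_per_second, seconds):
    """Toy model of a fixed-window rate limiter (hypothetical, not Pushshift's).

    Each worker sends one request per second. The server accepts at most
    `limit_per_second` requests in each one-second window and rejects the rest.
    Returns a tuple (accepted, rejected).
    """
    accepted = 0
    rejected = 0
    for _ in range(seconds):
        sent = num_workers
        ok = min(sent, limit_per_second)
        accepted += ok
        rejected += sent - ok
    return accepted, rejected


# Compare a few worker counts against the same made-up limit.
for workers in (10, 20, 40):
    accepted, rejected = simulate_fixed_window(workers, limit_per_second=15, seconds=60)
    print(f"{workers:>2} workers: {accepted} accepted, {rejected} rejected")
```

In this model the accepted count stops growing once the worker count reaches the limit, and every additional worker only adds rejected requests that would have to be retried.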

I’m running your code right now and will post an update shortly. My first guess is that with too many threads, performance degrades because the threads compete with each other.
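
One way to check the thread-count guess is to time the same query at several worker counts. This is a minimal sketch: `time_call` and `sweep_workers` are hypothetical helpers, not part of pmaw, and `make_api` is an assumed factory that builds an API object for a given worker count. Only `PushshiftAPI(num_workers=...)` and `search_submissions(...)` are taken from the script in the original post.

```python
import time


def time_call(fn, *args, **kwargs):
    """Call fn once and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


def sweep_workers(make_api, worker_counts, **search_kwargs):
    """Run the same search once per worker count.

    make_api: hypothetical factory, e.g. lambda n: PushshiftAPI(num_workers=n)
    Returns {worker_count: (number_of_results, elapsed_seconds)}.
    """
    timings = {}
    for n in worker_counts:
        api = make_api(n)
        result, elapsed = time_call(api.search_submissions, **search_kwargs)
        # list() also handles the case where the result is a generator.
        timings[n] = (len(list(result)), elapsed)
    return timings


# Example call (needs network access and pmaw installed, so left commented out):
# from pmaw import PushshiftAPI
# timings = sweep_workers(lambda n: PushshiftAPI(num_workers=n), [5, 10, 20],
#                         after=1612372114, before=1612501714,
#                         subreddit='wallstreetbets',
#                         filter=['title', 'link_flair_text', 'selftext', 'score'])
```

A single run per worker count is noisy, since failed requests and server load vary between runs, so it is worth repeating the sweep a few times before drawing conclusions. If the search returns a generator, the time spent consuming it is not included in the elapsed value.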

On a side note, if you’re using v0.1.0, can you update to v0.1.1? I released a fix today for an error in the time slicing. It only affects data integrity, though, not performance.