pmaw: slower than psaw for me

Hello,

Thanks for v0.1.0!

I just tried it, and for me it was slower than psaw. Below is a standalone script that compares the completion times of both. Result: pmaw took 5m16s and psaw took 2m42s.

Am I doing something wrong?

import os
import time

from pmaw import PushshiftAPI as pmawAPI
from psaw import PushshiftAPI as psawAPI


pmaw_api = pmawAPI(num_workers=os.cpu_count()*5)
psaw_api = psawAPI()


# Same query for both libraries: r/wallstreetbets submissions in a fixed time window.
start = time.time()
test = pmaw_api.search_submissions(after=1612372114,
                                   before=1612501714,
                                   subreddit='wallstreetbets',
                                   filter=['title', 'link_flair_text', 'selftext', 'score'])
end = time.time()
print("pmaw took " + str(end - start) + " seconds.")

start = time.time()
test_gen = psaw_api.search_submissions(after=1612372114,
                                       before=1612501714,
                                       subreddit='wallstreetbets',
                                       filter=['title', 'link_flair_text', 'selftext', 'score'])
# psaw returns a generator, so consume it inside the timed region.
test = list(test_gen)
end = time.time()
print("psaw took " + str(end - start) + " seconds.")
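For a comparison like the one above, a small timing helper can keep the measurement consistent between a client that returns a finished list and one that returns a lazy generator. This is only a sketch: `timed`, `eager_client`, and `lazy_client` are hypothetical names, and the two stand-in clients replace the real pmaw and psaw calls so the example runs without network access.

```python
import time
from typing import Any, Callable, Tuple


def timed(fetch: Callable[[], Any]) -> Tuple[list, float]:
    """Run fetch(), fully consume its result, and return (items, seconds)."""
    start = time.perf_counter()
    # list() forces lazy results (such as a generator) to be produced
    # inside the timed region, so eager and lazy clients are treated alike.
    items = list(fetch())
    return items, time.perf_counter() - start


# Hypothetical stand-ins for the real clients (no network access).
def eager_client():
    return [1, 2, 3]


def lazy_client():
    yield from [1, 2, 3]


eager_items, eager_seconds = timed(eager_client)
lazy_items, lazy_seconds = timed(lazy_client)
print(f"eager: {len(eager_items)} items in {eager_seconds:.6f}s")
print(f"lazy:  {len(lazy_items)} items in {lazy_seconds:.6f}s")
```

With the real libraries, each search call could be wrapped in a `lambda` and passed to `timed` in the same way.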

About this issue

State: closed
Created 3 years ago
Comments: 19 (10 by maintainers)

Commits related to this issue

Pushshift updates (#2) * Update parameter names to match new API params * Update reading of total_available after changes to response json structure * Update checks on meta_data returned for ba... — committed to mattpodolak/pmaw by eddvrs 2 years ago

Most upvoted comments

Will have a look at your comment tomorrow! I’m taking the rest of the evening off.

gobbedy on Feb 7, 2021

Can you share the checkpoint results which are printed when you run PMAW? I ran your code and this is how it performed for the time window specified (returning around 20,000 submissions):

PMAW (10 threads) - 463s
PSAW - 547s

Note that I’m currently pulling a massive dataset from Pushshift in the background, which has caused more requests to fail. In my experiments, 20+ threads result in a much higher number of rejected requests due to Pushshift’s rate limiting.
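To illustrate why more threads may not help once the server starts rejecting requests, here is a back-of-the-envelope model. Every number in it is made up (the latency, the rate limit, and the request count are assumptions, not Pushshift's real figures), and it ignores the cost of retries and backoff, which would make over-threading look even worse in practice.

```python
def rejected_fraction(workers: int, latency_s: float, rate_limit_per_s: float) -> float:
    """Fraction of attempted requests that exceed the rate limit (toy model)."""
    attempted_per_s = workers / latency_s  # what the workers try to send
    return max(0.0, 1 - rate_limit_per_s / attempted_per_s)


def estimate_seconds(total_requests: int, workers: int,
                     latency_s: float, rate_limit_per_s: float) -> float:
    """Estimated wall-clock time to get total_requests accepted (toy model)."""
    attempted_per_s = workers / latency_s
    # The server never accepts more than its rate limit, however many workers ask.
    accepted_per_s = min(attempted_per_s, rate_limit_per_s)
    return total_requests / accepted_per_s


# Hypothetical values: 2 s per request, server accepts 5 requests per second.
LATENCY_S = 2.0
RATE_LIMIT_PER_S = 5.0

for workers in (5, 10, 20):
    seconds = estimate_seconds(200, workers, LATENCY_S, RATE_LIMIT_PER_S)
    rejected = rejected_fraction(workers, LATENCY_S, RATE_LIMIT_PER_S)
    print(f"{workers} workers: {seconds:.0f}s, {rejected:.0%} rejected")
```

In this model, doubling the workers from 10 to 20 leaves the completion time unchanged while half of the attempted requests are rejected.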

mattpodolak on Feb 6, 2021

I’m running your code right now and will post an update shortly. My first guess is that with too many threads, performance degrades because the threads compete with each other.

On a side note, if you’re using v0.1.0, can you update to v0.1.1? I released a fix today for an error in the time slicing. It only affects data integrity, not performance.
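For context on why a time-slicing error affects data integrity, here is a minimal sketch of splitting a search window into contiguous slices. This is not pmaw's actual implementation: `slice_window` is a hypothetical helper, and the half-open convention is my own choice, not a statement about how Pushshift treats `after` and `before`.

```python
from typing import List, Tuple


def slice_window(after: int, before: int, num_slices: int) -> List[Tuple[int, int]]:
    """Split [after, before) into num_slices contiguous half-open intervals."""
    if num_slices < 1 or before <= after:
        raise ValueError("need num_slices >= 1 and before > after")
    span = before - after
    # Integer arithmetic keeps the boundaries exact, so adjacent slices
    # share an endpoint and no timestamp is skipped or covered twice.
    bounds = [after + (span * i) // num_slices for i in range(num_slices + 1)]
    return list(zip(bounds[:-1], bounds[1:]))


# The window from the benchmark above, split into 10 slices.
slices = slice_window(1612372114, 1612501714, 10)
for start, end in slices:
    print(start, end)
```

If the boundaries overlapped or left gaps, results would be duplicated or missed, without necessarily changing how long the run takes.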

mattpodolak on Feb 6, 2021