pmaw: slower than psaw for me
Hello,
Thanks for v0.1.0!
I just tried it, and for me it was slower than psaw. Below is a standalone script that compares the completion times of both. Result: pmaw took 5m16s and psaw took 2m42s.
Am I doing something wrong?
from psaw import PushshiftAPI as psawAPI
from pmaw import PushshiftAPI as pmawAPI
import os
import time
pmaw_api = pmawAPI(num_workers=os.cpu_count()*5)
psaw_api = psawAPI()
start = time.time()
test = pmaw_api.search_submissions(after=1612372114,
                                   before=1612501714,
                                   subreddit='wallstreetbets',
                                   filter=['title', 'link_flair_text', 'selftext', 'score'])
end = time.time()
print("pmaw took " + str(end - start) + " seconds.")
start = time.time()
test_gen = psaw_api.search_submissions(after=1612372114,
                                       before=1612501714,
                                       subreddit='wallstreetbets',
                                       filter=['title', 'link_flair_text', 'selftext', 'score'])
test = list(test_gen)
end = time.time()
print("psaw took " + str(end - start) + " seconds.")
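A slightly more robust way to time the two calls might look like the sketch below. It uses `time.perf_counter()`, which is monotonic and generally better suited to benchmarking than `time.time()`. The helper name `timed` is made up for illustration and is not part of pmaw or psaw.

```python
import time

def timed(label, fn, *args, **kwargs):
    """Call fn(*args, **kwargs), print the elapsed time, and return (result, seconds).

    `timed` is a hypothetical helper for this comparison, not a pmaw/psaw API.
    """
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    print(f"{label} took {elapsed:.1f} seconds.")
    return result, elapsed

# Usage with the two clients from the script above (not run here, needs network):
# params = dict(after=1612372114, before=1612501714, subreddit='wallstreetbets',
#               filter=['title', 'link_flair_text', 'selftext', 'score'])
# pmaw_posts, pmaw_s = timed("pmaw", lambda: pmaw_api.search_submissions(**params))
# psaw_posts, psaw_s = timed("psaw", lambda: list(psaw_api.search_submissions(**params)))
```

Running each client a few times and comparing the spread, rather than a single run, would probably give a fairer picture, since Pushshift's response times can vary between runs.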
About this issue
- State: closed
- Created 3 years ago
- Comments: 19 (10 by maintainers)
Will have a look at your comment tomorrow! I’m taking the rest of the evening off.
Can you share the checkpoint results which are printed when you run PMAW? I ran your code and this is how it performed for the time window specified (returning around 20,000 submissions):
PMAW (10 threads): 463s
PSAW: 547s
Note, I’m currently pulling a massive dataset from Pushshift in the background which has caused more requests to fail. Experimenting with 20+ threads appears to result in a much higher number of rejected requests due to rate-limiting from Pushshift.
I’m running your code right now and will post an update shortly. My first guess is that with too many threads, performance degrades because the threads contend with each other.
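One way to check that guess would be to time the same query at several worker counts. The sketch below is only an illustration: `sweep_workers` and `fastest` are hypothetical helpers, and the only pmaw-specific pieces it assumes are the `num_workers` argument and `search_submissions` call already used in the script above.

```python
import time

def sweep_workers(make_api, run_query, worker_counts):
    """Time run_query(api) once per worker count and return {count: seconds}.

    make_api(n) should build a client with n workers; run_query(api) should
    execute the query to completion. Both are supplied by the caller.
    """
    timings = {}
    for n in worker_counts:
        api = make_api(n)
        start = time.perf_counter()
        run_query(api)
        timings[n] = time.perf_counter() - start
    return timings

def fastest(timings):
    """Return the worker count with the smallest measured time."""
    return min(timings, key=timings.get)

# Usage against pmaw (not run here, needs network access):
# from pmaw import PushshiftAPI as pmawAPI
# timings = sweep_workers(
#     make_api=lambda n: pmawAPI(num_workers=n),
#     run_query=lambda api: api.search_submissions(
#         after=1612372114, before=1612501714, subreddit='wallstreetbets',
#         filter=['title', 'link_flair_text', 'selftext', 'score']),
#     worker_counts=[5, 10, 20, 40],
# )
# print(timings, "fastest:", fastest(timings))
```

A single run per setting is noisy, so this would only hint at where the slowdown starts, not pin it down.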
On a side note, if you’re using v0.1.0, can you update to v0.1.1? I released a fix today for an error in the time slicing. However, this only affects data integrity, not performance.