scholarly: Can't scrape more than ~1000 `citedby` papers
Describe the bug
I’m trying to scrape all papers that cite this paper, which has ~4.5k citations, but scholarly only lets me get the first ~1000 papers and truncates the rest.
To Reproduce
from scholarly import ProxyGenerator, scholarly

# iterate over all papers that cite this paper
original_paper = next(scholarly.search_pubs("Openai Gym"))
for i, paper in enumerate(scholarly.citedby(original_paper)):
    print(i)  # stops around ~1000 instead of reaching ~4.5k
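Since the report lists ScraperAPI as the proxy service below, the ProxyGenerator import presumably sets up a proxy before the search along these lines; a minimal sketch, assuming a valid ScraperAPI key (the key string here is only a placeholder):

from scholarly import ProxyGenerator, scholarly

# Sketch of the proxy setup implied by the report; "YOUR_SCRAPERAPI_KEY" is a placeholder.
pg = ProxyGenerator()
pg.ScraperAPI("YOUR_SCRAPERAPI_KEY")
scholarly.use_proxy(pg)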
Expected behavior
`i` reaches values of >4k.
Desktop (please complete the following information):
- Proxy service: ScraperAPI
- python version: 3.10
- OS: Ubuntu 22.04
- scholarly version: 1.7.2
Do you plan on contributing? Your response below will clarify whether the maintainers can expect you to fix the bug you reported.
- Yes, I will create a Pull Request with the bugfix if I get pointed in the right direction on how this can be fixed. I’ve looked at the source code and can’t figure out where the issue is.
About this issue
- Original URL
- State: closed
- Created 2 years ago
- Reactions: 1
- Comments: 16 (6 by maintainers)
Hi, I’m using ScraperAPI. I’ve actually just scraped all 4k+ citations using the `search_citedby` technique you’ve mentioned, which is how I know that the solution works. I’ll give it a shot scraping all 4k citations in one go using the new version, probably in a couple of weeks when the credits are expiring. 😃 Thanks a ton for the help!

The `original_paper` output you posted seems to be for the paper titled ‘The arcade learning environment: An evaluation platform for general agents’ and not ‘Openai Gym’.

Sure, that would be great! But I still think that cannot handle the case where there are more than 1000 citations within one year. That’s going to be the case for the ‘Openai Gym’ paper.
Gotcha, but my question was where did you find the number in the first place? Apologies if this wasn’t clear.
Edit: found it. It’s under `citedby_url`.

The documentation for `search_citedby` is similar to this one: https://github.com/scholarly-python-package/scholarly/blob/main/scholarly/_scholarly.py#L93-L112

That sounds like a decent solution, I’ll give it a shot, thank you!
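As a rough illustration of where that number lives (a sketch, not code from the thread): assuming the publication dict returned by `search_pubs` carries a `citedby_url` of the form `/scholar?cites=<id>&...`, the numeric id could be pulled out like this:

import re

from scholarly import scholarly

# Sketch: extract the numeric "cites" id from citedby_url.
# Assumes citedby_url looks like "/scholar?cites=1234567890&as_sdt=...";
# for papers with merged versions the id may contain commas.
original_paper = next(scholarly.search_pubs("Openai Gym"))
match = re.search(r"cites=([\d,]+)", original_paper.get("citedby_url", ""))
publication_id = match.group(1) if match else None
print(publication_id)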
A solution, however, is to do multiple queries by restricting the year ranges and sorting them by relevance and by date (because this paper has more than 1000 citations in a given year), and hopefully with those you can recover all citations. You can use the `search_citedby` function and pass in the `year_low` and `year_high` keywords to do this. You’ll need to get the `publication_id` for the paper, which should be in `original_paper`.
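A minimal sketch of that year-range workaround, assuming `search_citedby` accepts `year_low` and `year_high` keywords mirroring `search_pubs`, and that `publication_id` is the numeric cites id taken from `citedby_url` (the id and the year range below are placeholders):

from scholarly import scholarly

publication_id = "1234567890"  # placeholder: the "cites" id from citedby_url

seen_titles = set()
for year in range(2016, 2024):  # placeholder range of publication years to cover
    # One year per query keeps each result set under Google Scholar's ~1000-result cap;
    # a second pass sorted by date may recover extra hits in crowded years.
    for paper in scholarly.search_citedby(publication_id, year_low=year, year_high=year):
        # result dicts are assumed to follow the usual search_pubs layout
        seen_titles.add(paper["bib"]["title"])

print(f"Collected {len(seen_titles)} unique citing papers")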