scholarly: Can't scrape more than ~1000 `citedby` papers

Describe the bug

I’m trying to scrape all papers that cite this paper, which has ~4.5k citations, but scholarly only returns the first ~1000 papers and truncates the rest.

To Reproduce

from scholarly import ProxyGenerator, scholarly
# parse through all papers that cite this paper
original_paper = next(scholarly.search_pubs("Openai Gym"))
for i, paper in enumerate(scholarly.citedby(original_paper)):
    print(i)

Expected behavior

`i` reaches values above 4k.

Desktop (please complete the following information):

  • Proxy service: ScraperAPI
  • Python version: 3.10
  • OS: Ubuntu 22.04
  • scholarly version: 1.7.2

Do you plan on contributing? Your response below will clarify whether the maintainers can expect you to fix the bug you reported.

  • Yes, I will create a Pull Request with the bugfix if I get pointed in the right direction on how this can be fixed. I’ve looked at the source code and can’t figure out where the issue is.

About this issue

  • State: closed
  • Created 2 years ago
  • Reactions: 1
  • Comments: 16 (6 by maintainers)

Most upvoted comments

Hi, I’m using ScraperAPI. I’ve actually just scraped all 4k+ citations using the `search_citedby` technique you mentioned, which is how I know that the solution works. I’ll try scraping all 4k citations in one go with the new version, probably in a couple of weeks when the credits are expiring. 😃 Thanks a ton for the help!

Is it not 10877906446132695264 instead?

The original_paper output you posted seems to be for the paper titled ‘The arcade learning environment: An evaluation platform for general agents’ and not ‘Openai Gym’.

In any case, would you like me to make a PR fixing citedby to be able to handle papers with more than 1k citations?

Sure, that would be great! But I still think it cannot handle the case where there are more than 1000 citations within one year. That’s going to be the case for the ‘Openai gym’ paper.

Gotcha, but my question was where did you find the number in the first place? Apologies if this wasn’t clear.

Edit: found it. It’s under citedby_url.
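For anyone else looking for the ID: a minimal sketch of pulling it out of `citedby_url`, assuming the URL carries the ID in a `cites` query parameter. The helper name and the example URL (including its extra parameters) are hypothetical and only illustrate the shape; check what your own `original_paper` actually contains.

```python
from urllib.parse import parse_qs, urlparse


def publication_id_from_citedby_url(citedby_url: str) -> str:
    """Return the value of the `cites` query parameter (hypothetical helper)."""
    query = parse_qs(urlparse(citedby_url).query)
    ids = query.get("cites")
    if not ids:
        raise ValueError(f"no 'cites' parameter in {citedby_url!r}")
    return ids[0]


# Assumed URL shape; the parameters other than `cites` are illustrative only.
example_url = "/scholar?cites=10877906446132695264&as_sdt=5,33&sciodt=0,33&hl=en"
print(publication_id_from_citedby_url(example_url))
```

The ID is kept as a string here so that no precision is lost if it is later passed through code that does not handle large integers well.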

That sounds like a decent solution, I’ll give it a shot, thank you!

One solution, however, is to run multiple queries, each restricted to a year range, and to sort each by both relevance and date (because this paper has more than 1000 citations in a given year). Hopefully, combining those results recovers all citations. You can use the `search_citedby` function and pass in the `year_low` and `year_high` keywords to do this. You’ll need the publication_id for the paper, which should be in original_paper.
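A rough sketch of that year-by-year approach, not a tested fix. `collect_citations` is a hypothetical helper that takes the search function as an argument, so it can be tried without network access. It assumes results are dicts with a `bib` → `title` entry, that `search_citedby` accepts a `sort_by` keyword with the values `"relevance"` and `"date"` as `search_pubs` does, and that a lower-cased title is a good enough key for removing duplicates.

```python
from typing import Callable, Dict, Iterable, List, Sequence


def collect_citations(
    search_citedby: Callable[..., Iterable[dict]],
    publication_id: str,
    years: Iterable[int],
    sort_orders: Sequence[str] = ("relevance", "date"),
) -> List[dict]:
    """Query one year at a time, in each sort order, and merge the results."""
    seen: Dict[str, dict] = {}
    for year in years:
        for sort_by in sort_orders:
            # Each query is still capped at ~1000 results, so the two sort
            # orders are used to reach different slices of a crowded year.
            results = search_citedby(
                publication_id, year_low=year, year_high=year, sort_by=sort_by
            )
            for pub in results:
                # Assumption: lower-cased title is an adequate dedup key.
                title = pub.get("bib", {}).get("title", "")
                key = title.strip().lower()
                if key and key not in seen:
                    seen[key] = pub
    return list(seen.values())
```

With scholarly installed and a proxy configured, this would be called as something like `collect_citations(scholarly.search_citedby, publication_id, range(first_year, last_year + 1))`, where `first_year` and `last_year` are placeholders for the span you want to cover. A year with far more than 2000 citations could still be incomplete, since the two sort orders may overlap or miss the middle of the list.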