getpapers: HTTP error 429 - Too Many Requests
Although I pledged in https://github.com/ContentMine/getpapers/issues/156 to resist the temptation of opening a new bug for each and every HTTP error I encounter, this one happens so often that it deserves special attention.
What happened so far
In my attempt to avoid the showstopper ECONNRESET error (see https://github.com/ContentMine/getpapers/issues/155), I applied my workaround described in https://github.com/ContentMine/getpapers/issues/152 to let my own curl wrapper do the work:
I commented out the original code in /usr/lib/node_modules/getpapers/lib/download.js:
// // rq = requestretry.get({url: url,
// // fullResponse: false,
// // headers: {'User-Agent': config.userAgent},
// // encoding: null
// // });
// rq = requestretry.get(Object.assign({url: url, fullResponse: false}, options));
// rq.then(handleDownload)
// rq.catch(throwErr)
and appended this:
// Alternative method: use 'exec' to run 'mycurl -o ...'
// Compose the mycurl command
var mycurl = 'mycurl -o \'' + base + rename + '\' \'' + url + '\'';
log.debug('Executing: ' + mycurl);
// execute mycurl using child_process' exec function
var child = exec(mycurl, function(err, stdout, stderr) {
  // if (err) throw err;
  if (err) {
    log.error(err);
  }
  // else console.log(rename + ' downloaded to ' + base);
  else {
    // log.info(stdout);
    console.log(stdout);
    log.debug(rename + ' downloaded to ' + base);
  }
});
nextUrlTask(urlQueue);
Here, mycurl is just my own curl wrapper - it catches curl errors and implements various strategies depending on the error, the server, my daily mood and other obscure factors.
NOTE: You will also need to add something like
// Commented. Has issues with unhandled ECONNRESET errors.
// var requestretry = require('requestretry')
var exec = require('child_process').exec
at the top of download.js.
The problem now
My above "hack around" (as @tarrow calls it in https://github.com/ContentMine/getpapers/issues/152) works smoothly - but every now and then (like every 10 downloads or so), it catches a 429 Too Many Requests error:
curl: (22) The requested URL returned error: 429 Too Many Requests
(curl --location --fail --progress-bar --connect-timeout 100 --max-time 300 -C - -o PMC3747277/fulltext.pdf http://europepmc.org/articles/PMC3747277?pdf=render)
at ChildProcess.exithandler (child_process.js:206:12)
at emitTwo (events.js:106:13)
at ChildProcess.emit (events.js:191:7)
at maybeClose (internal/child_process.js:877:16)
at Socket.<anonymous> (internal/child_process.js:334:11)
at emitOne (events.js:96:13)
at Socket.emit (events.js:188:7)
at Pipe._handle.close [as _onclose] (net.js:498:12)
My curl wrapper catches this and indeed retries a few times - but it seems that a more elaborate strategy is needed (most notably: a longer sleep interval between retries). The frequency of this error indicates that getpapers is hammering the server too fast.
I have not seen any way to throttle requests from getpapers (the keyword phrase associated with error 429 is "rate limit"). I therefore strongly suggest introducing such an option - otherwise, the user has to run the script multiple times, not knowing for sure whether subsequent runs will correct failed downloads of previous runs (see https://github.com/ContentMine/getpapers/issues/156).
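To make the suggestion concrete, here is a minimal sketch of a throttled, serialized download loop in Node (my own illustration, not getpapers code; downloadQueue and the plain-curl command are placeholders I made up):

// Sketch only: serialize downloads and pause `delayMs` between them.
var exec = require('child_process').exec;

function downloadQueue(urls, delayMs) {
  var queue = urls.slice(); // work on a copy

  function next() {
    if (queue.length === 0) return;
    var url = queue.shift();
    // Any download command would do here; plain curl is used as an example.
    exec("curl -sSfL -O '" + url + "'", function (err) {
      if (err) console.error('download failed: ' + url + ' (' + err.message + ')');
      // Start the next request only after this one has finished,
      // and only after the configured pause - this is the throttle.
      setTimeout(next, delayMs);
    });
  }

  next();
}

// Usage: at most one request roughly every 2 seconds
// downloadQueue(['http://europepmc.org/articles/PMC3747277?pdf=render'], 2000);

A real rate-limit option in getpapers would presumably hook into its existing download queue instead, but the principle is the same: never fire the next request until the previous one is done plus the configured pause.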
The synchronous curl-wrapper workaround above works like a charm - it's been running for 1.5 days, has handled all kinds of HTTP errors gracefully, is at 70% and still going! I recommend it to anybody as a temporary (or even permanent) workaround for HTTP errors that getpapers cannot (yet) handle, and as a hack that can give the developers more information about the inner workings of the HTTP connection.
Thank you all for your tips and great help!
Thanks both, I think that addressing some of this energy and technology to quickscrape would be really valuable. getpapers is a tool to maximize the efficiency of extracting content from willing organizations. There are an increasing number of good players who expose APIs and want people to use them responsibly. (I've been on the Project Advisory Board of EuropePMC for 10 years and seen this from the other side - they aim to support high volumes of downloads and we work with them. @tarrow frequently contacts them with problems and they respect this and respond.) Note that I and others in the ContentMine community have frequent contact with many repositories (arXiv, HAL, CORE, etc.) and work with them to resolve problems. But as @blahah says, it's underfunded compared with the investment that rich publishers make in non-open systems.

By contrast, quickscrape aims to scrape web pages to which the user has legal access (I stress this). Many publishers do not provide an API, and some that do have unacceptable terms and conditions. quickscrape has been designed to take a list of URLs (or resolved DOIs) and download the content from a web site. This should only be done when you believe this is legal. The problem is that the sites often use dynamic HTML / JavaScript, contain lots of "Publisher Junk" and change frequently. If you have a list of (say) 1000 URLs then it may well contain 50 different publishers. There is a generic scraper which works well for many, but for some it's necessary to write bespoke scrapers.

A typical (and valuable) use of quickscrape is in conjunction with Crossref (who we are friends with). Crossref contains metadata from publishers (often messy) and the ability to query, but does not itself have the full text. So a typical workflow (which I spent a lot of time running last year) is: query with getpapers, which returns a list of URLs, then feed those URLs to quickscrape and download the papers you are legally entitled to. This is really valuable for papers which are not in a repository. It's a very messy business as there are frequent "hangs" and unexpected output or none. @tarrow worked hard to improve it but there is still a lot of work to be done.
If you are interested in this PLEASE liaise with @blahah - he wrote it and knows many of the issues.
@sedimentation-fault as already mentioned, getpapers will not support subverting reasonable limits put in place by the organisations running the APIs we wrap. ArXiv is a free service for researchers, funded through philanthropic grants, and is more efficient with their spending than almost any other publishing platform. Please do not change your IP to avoid their rate-limiting - that's offloading the cost of your download onto them, which they have explicitly said they cannot afford (whereas you have pointed out that you think the cost is reasonable - so why not pay it?). They are not some multinational corporation that makes insane profits year after year at the cost of the public purse (like Elsevier). They push the whole of society forward through their work.
Note also that your ArXiv accounting assumes Glacier storage, which they will almost certainly not be using - they are most likely on the standard plan, and will have hundreds or thousands of well-meaning academics (plus many less well-meaning entities) trying to scrape, crawl, or otherwise mass-download their content every day.
There are many bad actors in the publishing system, and we make a point of knowing who they are and never working with them. The organisations we do work with are the ones that (a) deserve all of our support and (b) cannot afford to be exploited.
If, as seems likely on the basis of your helpful and detailed engagement, you're interested in driving this technology forward, your insights into how to bypass unreasonable limits put in place by bad actors would be welcome over at quickscrape. Figuring out how to enable research while minimising the harm of those organisations is a huge challenge, and one we'd really welcome your help with.
Any technical details about how to bypass limits put in place by the providers supported in this repo will be removed.
The synchronous curl wrapper version I gave above was not... ehm, let's say it was not the best. For what it was, it worked impressively well!
After reading some docs, including the link given by @tarrow above, and after some experimentation and inquiry, I settled for this new version, which I am currently trying with the arxiv API:
File: /usr/lib/node_modules/getpapers/lib/download.js Start:
Function downloadURL: Comment standard (master) code:
and add:
This last part is a small gem! It shows synchronous execution of a curl wrapper (mycurl) with full, direct, immediate output to the terminal (including progress bars, colors, ...), and catching of the error in case of a non-zero exit code of the child process (something that some people might think is impossible in synchronous mode).
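Since the actual code blocks did not survive into this copy of the thread, here is a minimal sketch of the idea, assuming execSync from child_process with stdio: 'inherit' (mycurl, base and rename are the same variables as in the earlier snippet; this is not the exact code from the comment above):

// Sketch only - illustrates the synchronous variant described above.
var execSync = require('child_process').execSync;

try {
  // stdio: 'inherit' passes the child's stdout/stderr straight through to
  // the terminal, so progress bars and colors are visible immediately.
  execSync(mycurl, { stdio: 'inherit' });
  log.debug(rename + ' downloaded to ' + base);
} catch (err) {
  // execSync throws when the child exits with a non-zero status, so curl
  // failures (e.g. --fail turning a 429 into exit code 22) land here.
  log.error('Download failed: ' + err.message);
}

The blocking call is what serializes the downloads: the next URL is only picked up after the current curl has exited.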
Other than that I now have to wait for 1000+ files to throw a 416 Range Not Satisfiable as described in https://github.com/ContentMine/getpapers/issues/158, all seems to work fine.
Conclusion
This "too many requests" error was the result of my using a curl wrapper asynchronously and without proper serialization of HTTP requests in the callback.
(NOTE: I was forced to use my own curl wrapper due to ECONNRESET errors, see https://github.com/ContentMine/getpapers/issues/155)
Now that I do it synchronously as above, the number of child download processes has gone down from many hundreds to just a few - and so has the number of connections. Accordingly, this error has gone.
Therefore, if you don't hear from me, it means you may close this issue.
Yep, that is definitely a problem; one that we also had around a year ago. I'm not sure how you altered your code exactly (or what is in your wrapper), but you probably want to use something like the handleDl callback we use. You'll see around line 104 in download.js that we don't start the next item in the queue until we've got the previous one.
I.e. getpapers as we have it in master doesn't fire-and-forget; I think that is an alteration introduced by your adaptation to use curl.
You probably want to implement something similar in your fork. Again, I'm sorry I can't work on this in detail today, but you should look at: https://nodejs.org/api/child_process.html#child_process_child_process_exec_command_options_callback
You want to have the next curl called by the callback you pass to exec for the current curl. It is a bit of a learning slope to go from a very procedural language to JS. Typically you don't want to use Sync stuff if you can avoid it, since it blocks the whole application. It's nicer to use callbacks or promises to assert what order things happen in.

Let me come back to the programming details: I have found the reason for the 429 errors! Look at this:
There are 712 curl wrapper instances from my box trying to download from a single provider (EUPMC in this case) right now! That's horrible! What an embarrassment!
It seems that getpapers fires the downloader and forgets it (fire-and-forget, a.k.a. asynchronous, or non-blocking, execution)! I've read somewhere about the differences between exec and execSync in the child_process module. Will try that and report back...
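For what it's worth, a minimal illustration of the difference (my own sketch, not code from this repository): exec returns immediately and runs the command in the background, while execSync blocks until the child has exited.

var cp = require('child_process');

// Asynchronous: exec() returns immediately, so a loop like this launches
// all commands at once - exactly the "fire-and-forget" behaviour above.
['a', 'b', 'c'].forEach(function (name) {
  cp.exec('echo ' + name, function (err, stdout) {
    if (!err) process.stdout.write('async: ' + stdout);
  });
});

// Synchronous: execSync() blocks until the child exits, so the commands
// run strictly one after another (at the cost of blocking the event loop).
['a', 'b', 'c'].forEach(function (name) {
  var out = cp.execSync('echo ' + name);
  process.stdout.write('sync: ' + out);
});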
But what if "good-faith" content providers do not impose "reasonable limitations"? What if their limitations are artificial, subjective and hostile?
Personally, I consider it a casus belli if a web server (especially one that supposedly operates in the public interest) sends me a 403 at the slightest hint of automatic downloading. That really pisses me off!
I never, ever thought anything negative of arxiv.org - until yesterday. You know, in this case it's really "Either you are with us, or against us!"...
Sorry.
@sedimentation-fault for getpapers, we specifically want to avoid that kind of behaviour. It is intended to provide access to services provided by good-faith providers. If people want to bypass reasonable limitations in those services, they will have to do it without our help.
https://github.com/ContentMine/quickscrape on the other hand is designed to scrape where there is no reasonable service provided by the publisher. That would be a better place to provide user agent/referer spoofing, delay randomisation, and other tactics to avoid triggering blocks by bad-faith providers.
And I should add, all those things can easily be done in nodeJS - still no need for curl.
I closed my above post with:
"What do you mean by 'some more'? You are just showing off!" - you might think.
No I am not. Here are a few hints:
That's my notion of "don't wake up the watchdogs!".
While I would like to work on this today, realistically it is going to be a few days before I have time.
I think you should remember that you're not starting from a "clean" point each time, because arXiv have probably now temporarily marked you as a little bit non-compliant.
We should try and work out what is an acceptable rate limit for arXiv. I have looked and can't see one published. They actually recommend getting papers in bulk from a "downloader pays" S3 bucket. I'm not sure what they consider bulk. See: https://arxiv.org/help/bulk_data
Might be worth contacting them via their contact us page and finding out what they think is acceptable.
Adding rate-limit capability implies being able to sleep for a few seconds to slow things down. Thus, a starting point is to program a sleep function - something I was appalled to learn is far from trivial in JavaScript!
The following might be of help in this direction: JavaScript version of sleep().
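For reference, a non-blocking sleep can be built on setTimeout, for example wrapped in a Promise (my own sketch, not taken from the linked page):

// Promise-based sleep - resolves after `ms` milliseconds without blocking
// the event loop (unlike a busy-wait or execSync('sleep 5')).
function sleep(ms) {
  return new Promise(function (resolve) {
    setTimeout(resolve, ms);
  });
}

// Example: pause roughly 5 seconds before firing the next request.
sleep(5000).then(function () {
  console.log('5 seconds have passed - safe to start the next download');
});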