tornado: getting many HTTP 599 errors for valid urls
I’m using Tornado’s AsyncHTTPClient with the code below. I call the scrape function with a generator of about 10K URLs and expect at most 50 concurrent requests at any time. That limit doesn’t seem to hold, since the entire run finishes in about 2 minutes.
I got ~200 valid responses and ~9,000 HTTP 599 errors. I checked many of the URLs that returned this error and they load in under 10 seconds; I can reach most of them with urllib2/requests using an even smaller timeout (5 seconds).
All requests go to different servers. I’m running on Ubuntu with Python 2.7.3 and Tornado 4.1.
I suspect something is wrong on my side, since I can fetch most of these URLs with other (blocking) libraries.
```python
import tornado.ioloop
import tornado.httpclient


class Fetcher(object):
    def __init__(self, ioloop):
        self.ioloop = ioloop
        self.client = tornado.httpclient.AsyncHTTPClient(io_loop=ioloop, max_clients=50)
        self.client.configure(None, defaults=dict(
            user_agent="Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.101 Safari/537.36",
            connect_timeout=20, request_timeout=20, validate_cert=False))

    def fetch(self, url):
        self.client.fetch(url, self.handle_response)

    @property
    def active(self):
        """True if there are active fetches happening"""
        return len(self.client.active) != 0

    def handle_response(self, response):
        if response.error:
            print "Error: %s, time: %s, url: %s" % (response.error, response.time_info, response.effective_url)
        else:
            # print "clients %s" % self.client.active
            print "Got %d bytes" % (len(response.body))
        if not self.active:
            self.ioloop.stop()


def scrape(urls):
    ioloop = tornado.ioloop.IOLoop.instance()
    ioloop.add_callback(scrapeEverything, *urls)
    ioloop.start()


def scrapeEverything(*urls):
    fetcher = Fetcher(tornado.ioloop.IOLoop.instance())
    for url in urls:
        fetcher.fetch(url)


if __name__ == '__main__':
    scrape()
```
About this issue
- State: closed
- Created 9 years ago
- Comments: 20 (3 by maintainers)
I just tested it and it works great. Here’s @Dalloriam’s example, within my full working example, for posterity. I’ve moved over to this variation because I like self-contained classes. 😉
@Dalloriam, I like the idea, though I implemented the flush version. Care to finish yours off and get `run_request()` in there to remove the duplication?
For those interested in the flush alternative, here’s a full working example:
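A minimal sketch of such a flush-based setup, in the Python 2 / Tornado 4.x style of the question; the `BacklogClient` and `Scraper` names and details here are illustrative assumptions, not necessarily the poster’s exact code:

```python
import collections

import tornado.httpclient
import tornado.ioloop


class BacklogClient(object):
    """Queue requests ourselves and keep at most max_concurrent_requests in flight."""

    def __init__(self, max_concurrent_requests=50):
        self.max_concurrent_requests = max_concurrent_requests
        self.backlog = collections.deque()
        self.concurrent_requests = 0
        # Match the underlying client's capacity to ours so nothing waits in
        # AsyncHTTPClient's internal queue, where request_timeout would apply.
        self.client = tornado.httpclient.AsyncHTTPClient(
            max_clients=max_concurrent_requests)

    def fetch(self, request, callback):
        # Park the request in our own backlog instead of handing it to
        # AsyncHTTPClient right away, so waiting time never counts
        # against request_timeout.
        self.backlog.append((request, callback))
        self.flush()

    def flush(self):
        # Hand queued requests to AsyncHTTPClient while there is spare capacity.
        while self.backlog and self.concurrent_requests < self.max_concurrent_requests:
            request, callback = self.backlog.popleft()
            self.concurrent_requests += 1
            self.client.fetch(request, callback=callback)


class Scraper(object):
    def __init__(self, urls, ioloop, max_concurrent_requests=50):
        self.ioloop = ioloop
        self.backlog = BacklogClient(max_concurrent_requests)
        for url in urls:
            self.backlog.fetch(url, self.handle_response)

    def handle_response(self, response):
        # Bookkeeping lives in the handler: free a slot, refill up to capacity.
        self.backlog.concurrent_requests -= 1
        if response.error:
            print "Error: %s, url: %s" % (response.error, response.effective_url)
        else:
            print "Got %d bytes from %s" % (len(response.body), response.effective_url)
        self.backlog.flush()
        if not self.backlog.backlog and self.backlog.concurrent_requests == 0:
            self.ioloop.stop()


def scrape(urls):
    ioloop = tornado.ioloop.IOLoop.instance()
    Scraper(urls, ioloop)
    ioloop.start()


if __name__ == '__main__':
    scrape(["http://example.com/page/%d" % i for i in range(1000)])
```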
As @akellehe pointed out to me, the flush method lets the async callbacks keep the queue filled to maximum capacity, so you’re never bound by a single response. That said, I’m very interested in @Dalloriam’s solution.
@akellehe I just stumbled on your BacklogClient implementation (great idea, by the way) and, for the sake of completeness, I suggest daisy-chaining requests when the callback is executed instead of implementing a `flush()` method, as this guarantees the queue will be emptied. Something like this:
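A minimal sketch of that daisy-chaining idea, with the same assumed `BacklogClient` shape as above; the wrapped callback starts the next queued request itself, so no `flush()` is needed:

```python
import collections

import tornado.httpclient


class BacklogClient(object):
    def __init__(self, max_concurrent_requests=50):
        self.max_concurrent_requests = max_concurrent_requests
        self.backlog = collections.deque()
        self.concurrent_requests = 0
        self.client = tornado.httpclient.AsyncHTTPClient(
            max_clients=max_concurrent_requests)

    def __get_callback(self, function):
        def wrapped(response):
            self.concurrent_requests -= 1
            function(response)
            # Daisy-chain: a finished request immediately starts the next
            # queued one, so the backlog drains without any flush() calls.
            if self.backlog:
                request, callback = self.backlog.popleft()
                self.concurrent_requests += 1
                self.client.fetch(request, callback=self.__get_callback(callback))
        return wrapped

    def fetch(self, request, callback):
        if self.concurrent_requests < self.max_concurrent_requests:
            self.concurrent_requests += 1
            self.client.fetch(request, callback=self.__get_callback(callback))
        else:
            self.backlog.append((request, callback))
```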
You could, of course, add an additional `run_request()` method to get rid of the duplication between the `fetch()` and `__get_callback()` methods.

I realize this might be useful for others experiencing the same problem. With this solution you only see 599s when there is a genuine timeout on the server or network, not, for example, when the client becomes CPU bound. Here’s an example:
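A hedged sketch of a `BacklogClient` in that spirit, not the poster’s exact code: `fetch()` only queues work, `flush()` hands requests to AsyncHTTPClient while there is spare capacity, and the caller’s callback is wrapped so the bookkeeping and re-flush happen automatically:

```python
import collections
import functools

import tornado.httpclient


class BacklogClient(object):
    def __init__(self, max_concurrent_requests=50):
        self.max_concurrent_requests = max_concurrent_requests
        self.backlog = collections.deque()
        self.concurrent_requests = 0
        self.client = tornado.httpclient.AsyncHTTPClient(
            max_clients=max_concurrent_requests)

    def fetch(self, request, callback):
        self.backlog.append((request, functools.partial(self._finish, callback)))
        self.flush()

    def flush(self):
        # Only the requests handed to AsyncHTTPClient here are subject to
        # request_timeout; everything else waits in our own backlog.
        while self.backlog and self.concurrent_requests < self.max_concurrent_requests:
            request, callback = self.backlog.popleft()
            self.concurrent_requests += 1
            self.client.fetch(request, callback=callback)

    def _finish(self, callback, response):
        # Free the slot, invoke the caller's handler, then top the queue back up.
        self.concurrent_requests -= 1
        callback(response)
        self.flush()
```

Unlike the full example above, the response handler here never has to decrement `concurrent_requests` or call `flush()` itself.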
Thanks, guys, this is a helpful discussion. I didn’t realize that time spent on the client’s queue counted against the `request_timeout`. With that in mind I created a separate queue to manage a backlog of requests, as you suggested, and the problem is solved 🚀 🚀 I updated the examples with `if not self.backlog.backlog and self.backlog.concurrent_requests == 0:` to ensure the last request has completed before stopping the IOLoop.

@dovy Here is my implementation without `flush()`, in the sketch below. One could adapt your example by replacing the explicit `flush()` call in the response handler with this chained-callback scheduling (although I haven’t tested it). Cheers!
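A sketch of what such a flush-free client might look like once the duplicated dispatch is factored into a `run_request()` helper (names assumed; like the commenter’s suggestion, this is untested):

```python
import collections

import tornado.httpclient


class BacklogClient(object):
    def __init__(self, max_concurrent_requests=50):
        self.max_concurrent_requests = max_concurrent_requests
        self.backlog = collections.deque()
        self.concurrent_requests = 0
        self.client = tornado.httpclient.AsyncHTTPClient(
            max_clients=max_concurrent_requests)

    def run_request(self, request, callback):
        # The single place where a request is handed to AsyncHTTPClient.
        self.concurrent_requests += 1
        self.client.fetch(request, callback=self.__get_callback(callback))

    def __get_callback(self, function):
        def wrapped(response):
            self.concurrent_requests -= 1
            function(response)
            # A finished request starts the next queued one; no flush() needed.
            if self.backlog:
                self.run_request(*self.backlog.popleft())
        return wrapped

    def fetch(self, request, callback):
        if self.concurrent_requests < self.max_concurrent_requests:
            self.run_request(request, callback)
        else:
            self.backlog.append((request, callback))
```

With this client, the response handler from the flush example only needs to do its own work and decide when to stop the IOLoop; the refill happens inside the wrapped callback.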
Experimented with and used the `BacklogClient` with success. 👍

You’re starting all the fetches at once but telling AsyncHTTPClient to give up and return a 599 Timeout if it can’t complete the request in 20 seconds (the `request_timeout` option). You need to either increase `request_timeout` to the amount of time you’re willing to wait for a response (including time spent waiting in the queue), or maintain your own queue and feed URLs into AsyncHTTPClient gradually. The queue and semaphore classes being introduced in the upcoming Tornado 4.2 can help here; until then you can use Toro: http://toro.readthedocs.org/en/stable/examples/web_spider_example.html
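For reference, here is a sketch of the queue-based approach the maintainer describes, using `tornado.queues` (available from Tornado 4.2) with a fixed pool of coroutine workers; the worker count and URL list are placeholders:

```python
from tornado import gen, httpclient, ioloop, queues

CONCURRENCY = 50


@gen.coroutine
def main(urls):
    q = queues.Queue()
    # Let the underlying client match the worker count so fetches never wait
    # in AsyncHTTPClient's internal queue, where request_timeout would apply.
    client = httpclient.AsyncHTTPClient(max_clients=CONCURRENCY)

    @gen.coroutine
    def worker():
        while True:
            url = yield q.get()
            try:
                # The 20s request_timeout now only covers the request itself,
                # not time spent waiting in our queue.
                response = yield client.fetch(url, request_timeout=20,
                                              raise_error=False)
                print("%s -> %s" % (url, response.code))
            finally:
                q.task_done()

    for url in urls:
        q.put(url)

    # Start a bounded pool of workers, then wait for the queue to drain.
    for _ in range(CONCURRENCY):
        worker()
    yield q.join()


if __name__ == '__main__':
    urls = ["http://example.com/page/%d" % i for i in range(1000)]
    ioloop.IOLoop.current().run_sync(lambda: main(urls))
```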