requests: Requests seems to get stuck when used with futures.ThreadPoolExecutor
Hello,
I'm using Python 2.7.9 with futures (3.0.3) and requests (2.7.0) on Debian (also tested on Win8 with the same results).
The problem is that Requests doesn't time out and gets stuck, so my threads never finish their jobs and stop processing the queue.
I'm building a multi-threaded web crawler: I fetch to-be-crawled URLs from the frontier (which returns a JSON list of domains) and populate a queue with them.
After that I populate the thread pool with the code below:
```python
while not url_queue.empty():
    queue_data = url_queue.get()
    task_pool.submit(processItem, queue_data)
```
In the processItem() function I fetch the URL with get_data() and mark the queue item as done with task_done().
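Roughly, processItem() does the following (simplified):

```python
def processItem(queue_data):
    # fetch the page for this domain (result handling omitted here)
    result = get_data(queue_data)
    # mark the queue item as processed so queue.join() can eventually return
    url_queue.task_done()
```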
My get_data() function is as follows:

```python
def get_data(fqdn):
    try:
        # timeout=3 should give up if the connection or a read takes longer than 3 seconds
        response = requests.get("http://" + fqdn, headers=headers, allow_redirects=True, timeout=3)
        if response.status_code == requests.codes.ok:
            result = response.text
        else:
            result = ""
    except requests.exceptions.RequestException as e:
        print "ERROR OCCURRED:"
        print fqdn
        print e.message
        result = ""
    return result
```
If I comment out the get_data() call in processItem(), all the threads and the queue work fine. If I put it back, it works for most requests but gets stuck for some, and that stalls the whole queue and script, because queue.join() waits for the threads to complete their requests. I suspect it's a bug in the requests module, since everything works fine without calling get_data() and requests doesn't time out the GET request.
Any help will be greatly appreciated… Thank you very much…
About this issue
- State: closed
- Created 9 years ago
- Reactions: 2
- Comments: 24 (11 by maintainers)
@metrue to maintain a thread-safe/multiprocess-safe queue, you can use the standard library's Queue implementation: the Queue module if you're on Python 2, or the queue module if you're on Python 3.
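For example (a minimal sketch; the variable names are just illustrative):

```python
try:
    from queue import Queue   # Python 3
except ImportError:
    from Queue import Queue   # Python 2

url_queue = Queue()
url_queue.put("example.org")
```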
If you are using a process pool executor you must not use a Session that is shared across those processes.
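A minimal sketch of the per-worker pattern (the fetch function and URLs here are illustrative, not from this thread):

```python
import requests
from concurrent import futures

def fetch(fqdn):
    # each call builds its own Session, so nothing is shared between workers
    with requests.Session() as session:
        try:
            response = session.get("http://" + fqdn, timeout=3)
            return response.text if response.status_code == requests.codes.ok else ""
        except requests.exceptions.RequestException:
            return ""

if __name__ == "__main__":
    with futures.ProcessPoolExecutor(max_workers=4) as pool:
        results = list(pool.map(fetch, ["example.org", "example.com"]))
```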
Same issue here.
Also developing a web crawler intended to process a continuous stream of URLs.
My code behaviour is something like the following:
I wasn't able to reproduce this exact error with this small list of URLs, but in my production environment (after some time running, let's say 2~3 hours), the log messages start looking like this:
I checked ThreadPoolExecutor’s implementation and I’m pretty convinced the problem is NOT related to it. The code just seems to get stuck on line 55:
edit: by "the issue is not related to ThreadPoolExecutor", I mean: it doesn't matter whether callback() raises an exception or not; it's supposed to work just fine either way. The thing is that the _WorkItem.run() method never returns.
edit 2: Python 2.7