nltk: multiprocessing and nltk don't play nicely together
Honestly, this issue is not so much serious as it is curious. I’ve discovered that when NLTK is imported, any Python subprocess terminates prematurely on a network call. Example code:
from multiprocessing import Process
import nltk
import time

def child_fn():
    print "Fetch URL"
    import urllib2
    print urllib2.urlopen("https://www.google.com").read()[:100]
    print "Done"

while True:
    child_process = Process(target=child_fn)
    child_process.start()
    child_process.join()
    print "Child process returned"
    time.sleep(1)
Run it with NLTK imported, and you’ll see that the urlopen() call never gets executed. Comment out the import nltk line, and it executes fine.
Why?
*edit: this is for Python 2. I haven’t tested it on 3 yet.
About this issue
- State: closed
- Created 9 years ago
- Reactions: 1
- Comments: 22 (8 by maintainers)
I’m not incredibly familiar with nltk, but I did a little blind poking around to see what caused the test to pass or fail. To make the test pass, I had to disable a number of imports in the package __init__.py.
Interestingly, all of the disabled imports ultimately lead back to importing tkinter, which I think is the root cause. If I replace import nltk with import tkinter in the test script, I get a very similar crash report, both referencing tkinter.
From what I can tell, these packages directly import tkinter:
- nltk.app
- nltk.draw
- nltk.sem
From the above changes to the main package __init__, these are the problematic imports, and how they trace back to importing tkinter:
- nltk.featstruct (sem)
- nltk.grammar (featstruct)
- nltk.tree (grammar)
- nltk.chunk (chunk.named_entity > tree)
- nltk.parse (parse.bllip > tree)
- nltk.tag (tag.stanford > parse)
- nltk.classify (classify.senna > tag)
- nltk.inference (inference.discourse > sem, tag)
- nltk.stem (stem.snowball > corpus > corpus.reader.timit > tree)
I agree. A shorter-term solution would be to bury the tkinter imports inside the classes and methods that need tkinter, so that programs that don’t need it never import it. We’ve already done something similar for numpy.
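To illustrate what burying the import could look like, here is a minimal sketch. It is not nltk's actual code: TreeSketch is a hypothetical stand-in for any class that has both pure-Python functionality and an optional GUI method.

```python
class TreeSketch:
    """Hypothetical stand-in for a class with an optional GUI method."""

    def __init__(self, label, children=()):
        self.label = label
        self.children = list(children)

    def pformat(self):
        # Pure-Python path: usable without any GUI toolkit being imported.
        if not self.children:
            return self.label
        inner = " ".join(child.pformat() for child in self.children)
        return "(%s %s)" % (self.label, inner)

    def draw(self):
        # Deferred import: tkinter is loaded only when a window is requested.
        import tkinter

        root = tkinter.Tk()
        tkinter.Label(root, text=self.pformat()).pack()
        root.mainloop()
```

With this layout, importing the module that defines the class does not import tkinter; only a call to draw() does.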
I’ve found that performing the import at function level avoids the issue.
In other words, this works:
and this doesn’t:
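Roughly, the contrast is between the two import placements sketched here. This is an illustration, not the original snippets; the function name tokenize is hypothetical, and wordpunct_tokenize is used only because it needs no downloaded data.

```python
# Variant reported to fail: a module-level import, which runs in the parent
# process before any child process is created.
# import nltk


def tokenize(text):
    # Variant reported to work: the import happens inside the function,
    # so nltk is loaded only by the process that actually calls it.
    from nltk.tokenize import wordpunct_tokenize

    return wordpunct_tokenize(text)
```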
I think this is a serious problem if you are doing production-level NLP. We are using RQ (http://python-rq.org/) workers to run multiple NLP pipelines, which get silently killed when making network calls. Hope there will be a fix soon. Thanks!
We are also experiencing this issue with the combination of: nltk, gunicorn (with nltk loaded via prefork), and flask.
Remove the nltk import, and everything works. Except nltk.
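If the crash really does come from forking after nltk (and with it tkinter) has been imported in the master process, one thing that might be worth trying is to leave gunicorn's preload_app setting off, so the application is imported in each worker after the fork. This is only a sketch under that assumption, and nothing in this thread confirms it helps; the worker count is arbitrary.

```python
# gunicorn.conf.py (values here are illustrative assumptions)

# Number of worker processes; arbitrary for this sketch.
workers = 4

# With preloading disabled, the application (and therefore nltk) is imported
# inside each worker after it has been forked, not once in the master.
preload_app = False
```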
/cc @escherba
Thanks @rpkilby, that’s very helpful!
It looks like this problem: https://stackoverflow.com/questions/16745507/tkinter-how-to-use-threads-to-preventing-main-event-loop-from-freezing
I think tkinter has been a pain point for us for quite some time. Perhaps it would be good if we could find an alternative to it.
I think this is quite mind-boggling. It might have something to do with thread handling on macOS.
As far as I can tell, this issue seems to affect macOS. I’ve been using Python 3.6 so far.
Modified OP’s script for Python 3:
Output:
The subprocess quits unexpectedly, receiving similar output to what’s seen in this Stack Overflow post.
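For anyone who wants to try this on Python 3, a port of the original script might look like the sketch below. It is not the exact script from this comment. The describe_exit helper, the timeout, and the fixed number of rounds are additions for illustration; reporting the child's exit code makes an otherwise silent crash visible, since a negative value means the child was killed by a signal.

```python
import signal
import time
from multiprocessing import Process
from urllib.request import urlopen

try:
    import nltk  # noqa: F401  (the import under suspicion; comment out to compare)
except ImportError:
    nltk = None  # lets the sketch load on machines without nltk


def child_fn(url="https://www.google.com"):
    print("Fetch URL")
    print(urlopen(url, timeout=10).read()[:100])
    print("Done")


def describe_exit(exitcode):
    # Hypothetical helper: turn Process.exitcode into a readable message.
    if exitcode is None:
        return "still running"
    if exitcode == 0:
        return "exited normally"
    if exitcode < 0:
        try:
            name = signal.Signals(-exitcode).name
        except ValueError:
            name = "signal %d" % -exitcode
        return "killed by " + name
    return "exited with status %d" % exitcode


def main(rounds=3):
    for _ in range(rounds):
        child_process = Process(target=child_fn)
        child_process.start()
        child_process.join()
        print("Child process returned:", describe_exit(child_process.exitcode))
        time.sleep(1)
```

Call main() under an if __name__ == "__main__": guard to run it.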
@alvations I don’t remember either which of my projects suffered from this specific issue.
I ran your code on my machine and couldn’t replicate the problem.
Python 2.7.12, nltk 3.2.1, macOS 10.12.6
@alvations It has been a long time since I found this issue. I even forgot which project base was having this issue, so I couldn’t tell you whether I still have the problem or not. Sorry!
@stevenbird I don’t think so. It’s a workaround, but it isn’t a fix.
IMHO, if importing a third-party library breaks a Python standard library component, something unholy is happening somewhere, and needs to be fixed.
I’m running into the exact same problem. I just opened a SO question that may be useful to be linked here: http://stackoverflow.com/questions/30766419/python-child-process-silently-crashes-when-issuing-an-http-request
The child process is indeed crashing silently without further notice.
I disagree with you @oxymor0n, this seems quite a serious issue to me. It basically means that whenever nltk is imported, there is no way to issue a request from a child process, which can be really annoying when working with APIs.
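On Python 3.4 and later, one possible way around this (untested against this particular bug, so treat it as an assumption) is to start the child with the "spawn" start method, which launches a fresh interpreter and does not fork a parent that has already loaded tkinter. The helper names below are hypothetical.

```python
import multiprocessing as mp
from urllib.request import urlopen


def fetch_prefix(url, n=100):
    # Runs in the child: fetch the page and print its first n bytes.
    with urlopen(url, timeout=10) as response:
        print(response.read()[:n])


def run_spawned(target, *args):
    # "spawn" starts a new interpreter for the child; the child does not
    # inherit the parent's memory the way a forked child would.
    ctx = mp.get_context("spawn")
    child = ctx.Process(target=target, args=args)
    child.start()
    child.join()
    return child.exitcode
```

Something like run_spawned(fetch_prefix, "https://www.google.com") would have to be called under an if __name__ == "__main__": guard, because spawned children re-import the main module.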