nltk: multiprocessing and nltk don't play nicely together

Honestly, this issue is not so much serious as it is curious. I’ve discovered that when NLTK is imported, it causes any Python subprocess to terminate prematurely on a network call. Example code:

from multiprocessing import Process
import nltk
import time


def child_fn():
    print "Fetch URL"
    import urllib2
    print urllib2.urlopen("https://www.google.com").read()[:100]
    print "Done"


while True:
    child_process = Process(target=child_fn)
    child_process.start()
    child_process.join()
    print "Child process returned"
    time.sleep(1)

Run it with NLTK imported, and you’ll see that the urlopen() call never gets executed. Comment out the import nltk line, and it executes fine.

Why?

*edit: this is for Python 2. I haven’t tested it on 3 yet.

About this issue

  • State: closed
  • Created 9 years ago
  • Reactions: 1
  • Comments: 22 (8 by maintainers)

Most upvoted comments

I’m not incredibly familiar with nltk, but I did a little blind poking around to see what caused the test to pass/fail. Here’s what I had to do to the package __init__.py in order to make the test pass:

###########################################################
# TOP-LEVEL MODULES
###########################################################

# Import top-level functionality into top-level namespace

from nltk.collocations import *
from nltk.decorators import decorator, memoize
# from nltk.featstruct import *
# from nltk.grammar import *
from nltk.probability import *
from nltk.text import *
# from nltk.tree import *
from nltk.util import *
from nltk.jsontags import *

# ###########################################################
# # PACKAGES
# ###########################################################

# from nltk.chunk import *
# from nltk.classify import *
# from nltk.inference import *
from nltk.metrics import *
# from nltk.parse import *
# from nltk.tag import *
from nltk.tokenize import *
from nltk.translate import *
# from nltk.sem import *
# from nltk.stem import *

# Packages which can be lazily imported
# (a) we don't import *
# (b) they're slow to import or have run-time dependencies
#     that can safely fail at run time

from nltk import lazyimport
app = lazyimport.LazyModule('nltk.app', locals(), globals())
chat = lazyimport.LazyModule('nltk.chat', locals(), globals())
corpus = lazyimport.LazyModule('nltk.corpus', locals(), globals())
draw = lazyimport.LazyModule('nltk.draw', locals(), globals())
toolbox = lazyimport.LazyModule('nltk.toolbox', locals(), globals())

# Optional loading

try:
    import numpy
except ImportError:
    pass
else:
    from nltk import cluster

# from nltk.downloader import download, download_shell
# try:
#     from six.moves import tkinter
# except ImportError:
#     pass
# else:
#     try:
#         from nltk.downloader import download_gui
#     except RuntimeError as e:
#         import warnings
#         warnings.warn("Corpus downloader GUI not loaded "
#                       "(RuntimeError during import: %s)" % str(e))

# explicitly import all top-level modules (ensuring
# they override the same names inadvertently imported
# from a subpackage)

# from nltk import ccg, chunk, classify, collocations
# from nltk import data, featstruct, grammar, help, inference, metrics
# from nltk import misc, parse, probability, sem, stem, wsd
# from nltk import tag, tbl, text, tokenize, translate, tree, treetransforms, util

Interestingly, all of the disabled imports ultimately lead back to importing tkinter, which I think is the root cause. If I replace import nltk with import tkinter in the test script, I get a very similar crash report, also referencing tkinter.

From what I can tell, these packages directly import tkinter:

  • nltk.app
  • nltk.draw
  • nltk.sem

From the above changes to the main package __init__, these are the problematic imports, and how each traces back to importing tkinter:

  • nltk.featstruct (sem)
  • nltk.grammar (featstruct)
  • nltk.tree (grammar)
  • nltk.chunk (chunk.named_entity > tree)
  • nltk.parse (parse.bllip > tree)
  • nltk.tag (tag.stanford > parse)
  • nltk.classify (classify.senna > tag)
  • nltk.inference (inference.discourse > sem, tag)
  • nltk.stem (stem.snowball > corpus > corpus.reader.timit > tree)

I agree. A shorter-term solution would be to bury the tkinter imports inside the classes and methods that need tkinter, and avoid importing it by programs that don’t need it. We’ve already done something similar for numpy.

I’ve found that performing the import at function level avoids the issue.

In other words, this works:

def split(words):
    import nltk
    return nltk.word_tokenize(words)

and this doesn’t:

import nltk
def split(words):
    return nltk.word_tokenize(words)

I think this is a serious problem if you are doing production-level NLP. We are using RQ (http://python-rq.org/) workers to run multiple NLP pipelines, which get silently killed when making network calls. Hope there will be a fix soon. Thanks!

The child process is indeed crashing silently without further notice.

We are also experiencing this issue with the combination of: nltk, gunicorn (with nltk loaded via prefork), and flask.

Remove the nltk import, and everything works. Except nltk.

/cc @escherba

Thanks @rpkilby, that’s very helpful!

It looks like this problem: https://stackoverflow.com/questions/16745507/tkinter-how-to-use-threads-to-preventing-main-event-loop-from-freezing

I think tkinter has been a pain point for us for quite some time. Perhaps it would be good if we could find an alternative to it.

I think this is quite mind-boggling. It might have something to do with thread handling on macOS.

As far as I can tell, this issue seems to affect only macOS. Testing with Python 3.6 so far:

  • macOS 10.13 (fails)
  • CentOS 7.2 (succeeds)
  • Ubuntu 16.04 (succeeds)

OP’s script, modified for Python 3:

from multiprocessing import Process
import nltk
import time


def child_fn():
    from urllib.request import urlopen
    print("Fetch URL")
    print(urlopen("https://www.google.com").read()[:100])
    print("Done")


child_process = Process(target=child_fn)
child_process.start()
child_process.join()
print("Child process returned")
time.sleep(1)

Output:

Fetch URL
Child process returned

The subprocess quits unexpectedly; the output is similar to what’s seen in this Stack Overflow post.

@alvations I too don’t remember which of my projects suffered from this specific issue.

I ran your code on my machine and couldn’t replicate the problem.

Python 2.7.12, nltk 3.2.1, macOS 10.12.6

@alvations It has been a long time since I found this issue. I even forgot which project base was having this issue, so I couldn’t tell you whether I still have the problem or not. Sorry!

@stevenbird I don’t think so. It’s a workaround, but it isn’t a fix.

IMHO, if importing a third-party library breaks a Python standard library component, something unholy is happening somewhere, and needs to be fixed.

I’m running into the exact same problem. I just opened an SO question that may be worth linking here: http://stackoverflow.com/questions/30766419/python-child-process-silently-crashes-when-issuing-an-http-request

The child process is indeed crashing silently without further notice.

I disagree, @oxymor0n; this seems like quite a serious issue to me. It basically means that whenever nltk is imported, there is no way to issue a request from a child process, which can be really annoying when working with APIs.