dedupe: ChildProcessError when running deduper.partition in a Jupyter Notebook on OSX
I am running dedupe in a Jupyter notebook on Mac. When I run this line of code:
groups = deduper.partition(data, threshold=.7)
I get this error at the same place each time, 360000:
INFO:dedupe.blocking:340000, 10.0376252 seconds
INFO:dedupe.blocking:350000, 10.3528052 seconds
INFO:dedupe.blocking:360000, 10.6705842 seconds
---------------------------------------------------------------------------
ChildProcessError Traceback (most recent call last)
in
----> 1 groups = deduper.partition(data, threshold=.7)
~/opt/anaconda3/lib/python3.7/site-packages/dedupe/api.py in partition(self, data, threshold)
168 """
169 pairs = self.pairs(data)
--> 170 pair_scores = self.score(pairs)
171 clusters = self.cluster(pair_scores, threshold)
172
~/opt/anaconda3/lib/python3.7/site-packages/dedupe/api.py in score(self, pairs)
104 self.data_model,
105 self.classifier,
--> 106 self.num_cores)
107 except RuntimeError:
108 raise RuntimeError('''
~/opt/anaconda3/lib/python3.7/site-packages/dedupe/core.py in scoreDuplicates(record_pairs, data_model, classifier, num_cores)
247 result = result_queue.get()
248 if isinstance(result, Exception):
--> 249 raise ChildProcessError
250
251 if result:
ChildProcessError:
It looks like the num_cores setting has something to do with it. I’ve tried setting it to None, 1, and 2, and all give the same outcome.
I found this issue, which sounded somewhat familiar. So, in case it helps, here is the output of:
import numpy
print(numpy.__config__.__dict__)
{'__name__': 'numpy.__config__', '__doc__': None, '__package__': 'numpy', '__loader__': <_frozen_importlib_external.SourceFileLoader object at 0x7fa3571b59d0>, '__spec__': ModuleSpec(name='numpy.__config__', loader=<_frozen_importlib_external.SourceFileLoader object at 0x7fa3571b59d0>, origin='/Users/calebkeller/opt/anaconda3/lib/python3.7/site-packages/numpy/__config__.py'), '__file__': '/Users/calebkeller/opt/anaconda3/lib/python3.7/site-packages/numpy/__config__.py', '__cached__': '/Users/calebkeller/opt/anaconda3/lib/python3.7/site-packages/numpy/__pycache__/__config__.cpython-37.pyc', '__builtins__': {'__name__': 'builtins', '__doc__': "Built-in functions, exceptions, and other objects.\n\nNoteworthy: None is the `nil' object; Ellipsis represents `...' in slices.", '__package__': '', '__loader__': <class '_frozen_importlib.BuiltinImporter'>, '__spec__': ModuleSpec(name='builtins', loader=<class '_frozen_importlib.BuiltinImporter'>), '__build_class__': <built-in function __build_class__>, '__import__': <built-in function __import__>, 'abs': <built-in function abs>, 'all': <built-in function all>, 'any': <built-in function any>, 'ascii': <built-in function ascii>, 'bin': <built-in function bin>, 'breakpoint': <built-in function breakpoint>, 'callable': <built-in function callable>, 'chr': <built-in function chr>, 'compile': <built-in function compile>, 'delattr': <built-in function delattr>, 'dir': <built-in function dir>, 'divmod': <built-in function divmod>, 'eval': <built-in function eval>, 'exec': <built-in function exec>, 'format': <built-in function format>, 'getattr': <built-in function getattr>, 'globals': <built-in function globals>, 'hasattr': <built-in function hasattr>, 'hash': <built-in function hash>, 'hex': <built-in function hex>, 'id': <built-in function id>, 'input': <bound method Kernel.raw_input of <ipykernel.ipkernel.IPythonKernel object at 0x7fa356643ed0>>, 'isinstance': <built-in function isinstance>, 'issubclass': <built-in function 
issubclass>, 'iter': <built-in function iter>, 'len': <built-in function len>, 'locals': <built-in function locals>, 'max': <built-in function max>, 'min': <built-in function min>, 'next': <built-in function next>, 'oct': <built-in function oct>, 'ord': <built-in function ord>, 'pow': <built-in function pow>, 'print': <built-in function print>, 'repr': <built-in function repr>, 'round': <built-in function round>, 'setattr': <built-in function setattr>, 'sorted': <built-in function sorted>, 'sum': <built-in function sum>, 'vars': <built-in function vars>, 'None': None, 'Ellipsis': Ellipsis, 'NotImplemented': NotImplemented, 'False': False, 'True': True, 'bool': <class 'bool'>, 'memoryview': <class 'memoryview'>, 'bytearray': <class 'bytearray'>, 'bytes': <class 'bytes'>, 'classmethod': <class 'classmethod'>, 'complex': <class 'complex'>, 'dict': <class 'dict'>, 'enumerate': <class 'enumerate'>, 'filter': <class 'filter'>, 'float': <class 'float'>, 'frozenset': <class 'frozenset'>, 'property': <class 'property'>, 'int': <class 'int'>, 'list': <class 'list'>, 'map': <class 'map'>, 'object': <class 'object'>, 'range': <class 'range'>, 'reversed': <class 'reversed'>, 'set': <class 'set'>, 'slice': <class 'slice'>, 'staticmethod': <class 'staticmethod'>, 'str': <class 'str'>, 'super': <class 'super'>, 'tuple': <class 'tuple'>, 'type': <class 'type'>, 'zip': <class 'zip'>, '__debug__': True, 'BaseException': <class 'BaseException'>, 'Exception': <class 'Exception'>, 'TypeError': <class 'TypeError'>, 'StopAsyncIteration': <class 'StopAsyncIteration'>, 'StopIteration': <class 'StopIteration'>, 'GeneratorExit': <class 'GeneratorExit'>, 'SystemExit': <class 'SystemExit'>, 'KeyboardInterrupt': <class 'KeyboardInterrupt'>, 'ImportError': <class 'ImportError'>, 'ModuleNotFoundError': <class 'ModuleNotFoundError'>, 'OSError': <class 'OSError'>, 'EnvironmentError': <class 'OSError'>, 'IOError': <class 'OSError'>, 'EOFError': <class 'EOFError'>, 'RuntimeError': <class 
'RuntimeError'>, 'RecursionError': <class 'RecursionError'>, 'NotImplementedError': <class 'NotImplementedError'>, 'NameError': <class 'NameError'>, 'UnboundLocalError': <class 'UnboundLocalError'>, 'AttributeError': <class 'AttributeError'>, 'SyntaxError': <class 'SyntaxError'>, 'IndentationError': <class 'IndentationError'>, 'TabError': <class 'TabError'>, 'LookupError': <class 'LookupError'>, 'IndexError': <class 'IndexError'>, 'KeyError': <class 'KeyError'>, 'ValueError': <class 'ValueError'>, 'UnicodeError': <class 'UnicodeError'>, 'UnicodeEncodeError': <class 'UnicodeEncodeError'>, 'UnicodeDecodeError': <class 'UnicodeDecodeError'>, 'UnicodeTranslateError': <class 'UnicodeTranslateError'>, 'AssertionError': <class 'AssertionError'>, 'ArithmeticError': <class 'ArithmeticError'>, 'FloatingPointError': <class 'FloatingPointError'>, 'OverflowError': <class 'OverflowError'>, 'ZeroDivisionError': <class 'ZeroDivisionError'>, 'SystemError': <class 'SystemError'>, 'ReferenceError': <class 'ReferenceError'>, 'MemoryError': <class 'MemoryError'>, 'BufferError': <class 'BufferError'>, 'Warning': <class 'Warning'>, 'UserWarning': <class 'UserWarning'>, 'DeprecationWarning': <class 'DeprecationWarning'>, 'PendingDeprecationWarning': <class 'PendingDeprecationWarning'>, 'SyntaxWarning': <class 'SyntaxWarning'>, 'RuntimeWarning': <class 'RuntimeWarning'>, 'FutureWarning': <class 'FutureWarning'>, 'ImportWarning': <class 'ImportWarning'>, 'UnicodeWarning': <class 'UnicodeWarning'>, 'BytesWarning': <class 'BytesWarning'>, 'ResourceWarning': <class 'ResourceWarning'>, 'ConnectionError': <class 'ConnectionError'>, 'BlockingIOError': <class 'BlockingIOError'>, 'BrokenPipeError': <class 'BrokenPipeError'>, 'ChildProcessError': <class 'ChildProcessError'>, 'ConnectionAbortedError': <class 'ConnectionAbortedError'>, 'ConnectionRefusedError': <class 'ConnectionRefusedError'>, 'ConnectionResetError': <class 'ConnectionResetError'>, 'FileExistsError': <class 'FileExistsError'>, 
'FileNotFoundError': <class 'FileNotFoundError'>, 'IsADirectoryError': <class 'IsADirectoryError'>, 'NotADirectoryError': <class 'NotADirectoryError'>, 'InterruptedError': <class 'InterruptedError'>, 'PermissionError': <class 'PermissionError'>, 'ProcessLookupError': <class 'ProcessLookupError'>, 'TimeoutError': <class 'TimeoutError'>, 'open': <built-in function open>, 'copyright': Copyright (c) 2001-2019 Python Software Foundation.
All Rights Reserved.
Copyright (c) 2000 BeOpen.com.
All Rights Reserved.
Copyright (c) 1995-2001 Corporation for National Research Initiatives.
All Rights Reserved.
Copyright (c) 1991-1995 Stichting Mathematisch Centrum, Amsterdam.
All Rights Reserved., 'credits': Thanks to CWI, CNRI, BeOpen.com, Zope Corporation and a cast of thousands
for supporting Python development. See www.python.org for more information., 'license': Type license() to see the full license text, 'help': Type help() for interactive help, or help(object) for help about object., '__IPYTHON__': True, 'display': <function display at 0x7fa355065830>, '__pybind11_internals_v3_clang_libcpp_cxxabi1002__': <capsule object NULL at 0x7fa359712db0>, 'get_ipython': <bound method InteractiveShell.get_ipython of <ipykernel.zmqshell.ZMQInteractiveShell object at 0x7fa356643b90>>}, '__all__': ['get_info', 'show'], 'os': <module 'os' from '/Users/calebkeller/opt/anaconda3/lib/python3.7/os.py'>, 'sys': <module 'sys' (built-in)>, 'extra_dll_dir': '/Users/calebkeller/opt/anaconda3/lib/python3.7/site-packages/numpy/.libs', 'blas_mkl_info': {'libraries': ['mkl_rt', 'pthread'], 'library_dirs': ['/Users/calebkeller/opt/anaconda3/lib'], 'define_macros': [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)], 'include_dirs': ['/Users/calebkeller/opt/anaconda3/include']}, 'blas_opt_info': {'libraries': ['mkl_rt', 'pthread'], 'library_dirs': ['/Users/calebkeller/opt/anaconda3/lib'], 'define_macros': [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)], 'include_dirs': ['/Users/calebkeller/opt/anaconda3/include']}, 'lapack_mkl_info': {'libraries': ['mkl_rt', 'pthread'], 'library_dirs': ['/Users/calebkeller/opt/anaconda3/lib'], 'define_macros': [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)], 'include_dirs': ['/Users/calebkeller/opt/anaconda3/include']}, 'lapack_opt_info': {'libraries': ['mkl_rt', 'pthread'], 'library_dirs': ['/Users/calebkeller/opt/anaconda3/lib'], 'define_macros': [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)], 'include_dirs': ['/Users/calebkeller/opt/anaconda3/include']}, 'get_info': <function get_info at 0x7fa3571b07a0>, 'show': <function show at 0x7fa3571b0dd0>}
About this issue
- State: closed
- Created 4 years ago
- Comments: 15 (6 by maintainers)
Hi @fjsj, that worked. Thank you so much for all your help. Much appreciated! 😃
Folks, remember to try with a single core, otherwise your real error will be masked by ChildProcessError.
And if the empty strings in your dataset are not blocked together, they won’t be scored and you won’t see the ZeroDivisionError.
No, there were empty strings in the smaller dataset also.
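For anyone wondering how the real error gets masked, here is a minimal sketch of the pattern visible in the traceback in the question (result_queue.get() followed by a bare raise ChildProcessError). It is not dedupe's actual code: the worker, the toy scoring expression, and the use of a thread instead of a child process are all assumptions made to keep the example small.

```python
import queue
import threading


def worker(pairs, result_queue):
    """Hypothetical worker: score pairs, or send the exception back on failure."""
    try:
        # Toy stand-in for a string comparator; divides by zero on two empty strings.
        result_queue.put([1 / len(a + b) for a, b in pairs])
    except Exception as exc:
        result_queue.put(exc)


def collect(result_queue):
    """Mimic the check in the traceback: any exception becomes ChildProcessError."""
    result = result_queue.get()
    if isinstance(result, Exception):
        # The original exception (here a ZeroDivisionError) is discarded.
        raise ChildProcessError
    return result


result_queue = queue.Queue()
thread = threading.Thread(target=worker, args=([("", "")], result_queue))
thread.start()
thread.join()

try:
    collect(result_queue)
except ChildProcessError as err:
    masked = repr(err)
    print(masked)  # no trace of the ZeroDivisionError that caused it
```

The caller only ever sees ChildProcessError, which is why running the scoring step in the main process is what surfaces the underlying exception.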
@wrathagom Do you have empty strings in your textual data? If so, replace them with None and the affinegap error should be gone. Using num_cores != 0 is hiding the real error, which is the affinegap one.
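A minimal sketch of the replacement suggested above, assuming data is a dict of {record_id: {field: value}}. The helper name blank_to_none and the sample records are hypothetical, and treating whitespace-only strings as empty is an extra assumption that may or may not fit your data.

```python
def blank_to_none(records):
    """Return a copy of records with empty or whitespace-only strings set to None."""
    cleaned = {}
    for record_id, record in records.items():
        cleaned[record_id] = {
            # Only strings are checked; numbers and existing None values pass through.
            field: (None if isinstance(value, str) and not value.strip() else value)
            for field, value in record.items()
        }
    return cleaned


# Hypothetical sample records, not taken from the original dataset.
data = {
    1: {"name": "Acme Corp", "city": ""},
    2: {"name": "   ", "city": "Chicago"},
}
data = blank_to_none(data)
print(data)
```

After cleaning, the original call can be retried unchanged: groups = deduper.partition(data, threshold=.7).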