papermill: zmq.error.ZMQError: Address already in use, when running multiprocessing with multiple notebooks using papermill

I am using the papermill library to run multiple notebooks simultaneously via multiprocessing.

This is occurring on Python 3.6.6, Red Hat 4.8.2-15 within a Docker container.

However, when I run the Python script, about 5% of my notebooks fail immediately (no Jupyter Notebook cells run) with this error:

Traceback (most recent call last):
  File "/opt/conda/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/opt/conda/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/opt/conda/lib/python3.6/site-packages/ipykernel_launcher.py", line 16, in <module>
    app.launch_new_instance()
  File "/opt/conda/lib/python3.6/site-packages/traitlets/config/application.py", line 657, in launch_instance
    app.initialize(argv)
  File "<decorator-gen-124>", line 2, in initialize
  File "/opt/conda/lib/python3.6/site-packages/traitlets/config/application.py", line 87, in catch_config_error
    return method(app, *args, **kwargs)
  File "/opt/conda/lib/python3.6/site-packages/ipykernel/kernelapp.py", line 469, in initialize
    self.init_sockets()
  File "/opt/conda/lib/python3.6/site-packages/ipykernel/kernelapp.py", line 238, in init_sockets
    self.shell_port = self._bind_socket(self.shell_socket, self.shell_port)
  File "/opt/conda/lib/python3.6/site-packages/ipykernel/kernelapp.py", line 180, in _bind_socket
    s.bind("tcp://%s:%i" % (self.ip, port))
  File "zmq/backend/cython/socket.pyx", line 547, in zmq.backend.cython.socket.Socket.bind
  File "zmq/backend/cython/checkrc.pxd", line 25, in zmq.backend.cython.checkrc._check_rc
zmq.error.ZMQError: Address already in use

along with:

Traceback (most recent call last):
  File "/opt/conda/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/opt/conda/lib/python3.6/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "main.py", line 77, in run_papermill
    pm.execute_notebook(notebook, output_path, parameters=config)
  File "/opt/conda/lib/python3.6/site-packages/papermill/execute.py", line 104, in execute_notebook
    **engine_kwargs
  File "/opt/conda/lib/python3.6/site-packages/papermill/engines.py", line 49, in execute_notebook_with_engine
    return self.get_engine(engine_name).execute_notebook(nb, kernel_name, **kwargs)
  File "/opt/conda/lib/python3.6/site-packages/papermill/engines.py", line 304, in execute_notebook
    nb = cls.execute_managed_notebook(nb_man, kernel_name, log_output=log_output, **kwargs)
  File "/opt/conda/lib/python3.6/site-packages/papermill/engines.py", line 372, in execute_managed_notebook
    preprocessor.preprocess(nb_man, safe_kwargs)
  File "/opt/conda/lib/python3.6/site-packages/papermill/preprocess.py", line 20, in preprocess
    with self.setup_preprocessor(nb_man.nb, resources, km=km):
  File "/opt/conda/lib/python3.6/contextlib.py", line 81, in __enter__
    return next(self.gen)
  File "/opt/conda/lib/python3.6/site-packages/nbconvert/preprocessors/execute.py", line 345, in setup_preprocessor
    self.km, self.kc = self.start_new_kernel(**kwargs)
  File "/opt/conda/lib/python3.6/site-packages/nbconvert/preprocessors/execute.py", line 296, in start_new_kernel
    kc.wait_for_ready(timeout=self.startup_timeout)
  File "/opt/conda/lib/python3.6/site-packages/jupyter_client/blocking/client.py", line 104, in wait_for_ready
    raise RuntimeError('Kernel died before replying to kernel_info')
RuntimeError: Kernel died before replying to kernel_info

Please help me with this problem; I have scoured the web trying different solutions, but none have worked for my case so far.

This 5% error rate occurs regardless of the number of notebooks I run simultaneously or the number of cores on my machine, which makes it extra curious.

I have tried changing the multiprocessing start method and updating the libraries, but to no avail.

The versions of my libraries are:

papermill==1.2.1
ipython==7.14.0
jupyter-client==6.1.3
pyzmq==17.1.2

Thank you!

About this issue

  • State: open
  • Created 4 years ago
  • Reactions: 3
  • Comments: 18 (8 by maintainers)

Most upvoted comments

A small improvement on @kevin-bates' IPCKernelManager class, to make sure the runtime directory exists:

import os
import uuid

from jupyter_client.manager import KernelManager
from jupyter_core.paths import jupyter_runtime_dir


class IPCKernelManager(KernelManager):
    def __init__(self, *args, **kwargs):
        kernel_id = str(uuid.uuid4())
        os.makedirs(jupyter_runtime_dir(), exist_ok=True)
        connection_file = os.path.join(jupyter_runtime_dir(), f"kernel-{kernel_id}.json")
        super().__init__(*args, transport="ipc", kernel_id=kernel_id, connection_file=connection_file, **kwargs)
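A sketch of how a manager like this might be wired into a parallel driver similar to the asker's setup. Note that run_one is a hypothetical stand-in for the pm.execute_notebook call (stubbed out so the sketch runs without papermill installed); real use would pass kernel_manager_class pointing at the IPCKernelManager above:

```python
import multiprocessing as mp


def run_one(notebook: str) -> str:
    # Hypothetical stand-in for:
    #   pm.execute_notebook(notebook, output_path,
    #                       kernel_manager_class="mymod.IPCKernelManager")
    # Stubbed out here so this sketch is runnable without papermill.
    return f"ran {notebook}"


if __name__ == "__main__":
    notebooks = [f"nb{i}.ipynb" for i in range(4)]
    # Each pool worker starts its own kernel; UUID-based connection files
    # keep the kernels' IPC paths from colliding across processes.
    with mp.Pool(processes=4) as pool:
        results = pool.map(run_one, notebooks)
    print(results)
```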

@mlucool - thanks for the update.

> the runtime dir did not work great on NFS for the above test. I saw race conditions resulting in RuntimeError: Kernel didn’t respond in 60 seconds. Moving this to local disk solves any issues.

Each of the jupyter-based directories (runtime, config, data) can be “redirected” via envs. Namely, JUPYTER_RUNTIME_DIR, JUPYTER_CONFIG_DIR, and JUPYTER_DATA_DIR, respectively. In situations where the files require a high degree of access (like the IPC files), you’re probably better off pointing the corresponding env to a local directory, which would also benefit other jupyter-based applications in that particular configuration and allow the same code to run irrespective of the underlying env values.
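For illustration, a minimal sketch of redirecting the runtime directory before any Jupyter machinery is consulted (it assumes the system temp dir is on local disk rather than NFS on this host):

```python
import os
import tempfile

# Assumption: the system temp dir is on local disk (not NFS) on this host.
local_runtime = os.path.join(tempfile.gettempdir(), "jupyter-runtime")
os.makedirs(local_runtime, exist_ok=True)

# Must be set before jupyter_core.paths.jupyter_runtime_dir() is first
# consulted, i.e. before any kernels are launched.
os.environ["JUPYTER_RUNTIME_DIR"] = local_runtime
print(os.environ["JUPYTER_RUNTIME_DIR"])
```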

@kevin-bates I thought you may be interested in this from a protocol POV: the runtime dir did not work great on NFS for the above test. I saw race conditions resulting in RuntimeError: Kernel didn't respond in 60 seconds. Moving this to local disk solves any issues.

That worked great - thanks!! This seems to fully solve the issue if you can use IPC.

I’m not familiar with the IPC transport, but looking into this for a bit, it appears there needs to be better “randomization” happening when checking for the ipc “port” existence.

In the case of “transport == ipc”, the ip value is either “kernel-ipc” or “kernel-{kernel_id}-ipc”, where the latter is really the value of self.connection_file sans the .json suffix (and would provide sufficient randomization). However, because self.connection_file is not set (by default) at the time the ports are “checked”, simultaneous kernel starts all check against the same “kernel-ipc-N” names (where N is 1…5); whether each check succeeds then depends purely on the race, hence the collision.
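To illustrate the race with a toy sketch (this is not jupyter_client’s actual code): under the default naming, two simultaneous starts probe identical candidate names, while names derived from a per-kernel UUID cannot collide:

```python
import uuid

# Default scheme: every starting kernel probes the same candidate names,
# so two simultaneous starts race on the same IPC paths.
probes_a = [f"kernel-ipc-{n}" for n in range(1, 6)]
probes_b = [f"kernel-ipc-{n}" for n in range(1, 6)]
assert probes_a == probes_b  # identical candidates, so the binds race

# UUID scheme: each kernel derives names from its own connection file,
# so concurrent starts probe disjoint paths and cannot collide.
probes_c = [f"kernel-{uuid.uuid4()}-ipc-{n}" for n in range(1, 6)]
probes_d = [f"kernel-{uuid.uuid4()}-ipc-{n}" for n in range(1, 6)]
assert not set(probes_c) & set(probes_d)  # disjoint, so no collision
```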

As a workaround, you could simply bring your own kernel-id and connection-file values:

import os
import uuid

from jupyter_client.manager import KernelManager
from jupyter_core.paths import jupyter_runtime_dir


class IPCKernelManager(KernelManager):
    def __init__(self, *args, **kwargs):
        kernel_id = str(uuid.uuid4())
        connection_file = os.path.join(jupyter_runtime_dir(), f"kernel-{kernel_id}.json")
        super().__init__(*args, transport="ipc", kernel_id=kernel_id, connection_file=connection_file, **kwargs)

Note that the current code creates the IPC files in the current directory rather than colocated with the connection file, whereas this change colocates the two sets (which is better, IMHO).

Thanks for the great tip! This is very close (and even easier than you said), but still seeing some issues:

Given the following files:

./__init__.py

./test.py
#!/usr/local/bin/python
import papermill as pm
import tempfile

with tempfile.TemporaryDirectory() as tmpdirname:
    pm.execute_notebook(
        "/path/to/any/notebook.ipynb",
        tmpdirname + "/ignored-output.ipynb",
        progress_bar=False,
        timeout=1800,
        kernel_manager_class="IPCKernelManager.IPCKernelManager",
    )

./test_runner.sh
#!/bin/bash

for i in {1..2}
do
    ./test.py &
done

./IPCKernelManager.py
from jupyter_client.manager import KernelManager


class IPCKernelManager(KernelManager):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, transport="ipc", **kwargs)

Running test.py on its own seems to work, but test_runner.sh does not. I think this is because the IPC channels all overlap, and I get a ton of messages like:

Traceback (most recent call last):
  File "/usr/local/python/python-3.9/std/lib64/python3.9/site-packages/ipykernel/kernelbase.py", line 359, in dispatch_shell
    msg = self.session.deserialize(msg, content=True, copy=False)
  File "/usr/local/python/python-3.9/std/lib64/python3.9/site-packages/jupyter_client/session.py", line 1054, in deserialize
    raise ValueError("Invalid Signature: %r" % signature)
ValueError: Invalid Signature: b'4ccdd36edf8bc494aba12bae8f5d8de9f21887bde0bd05a36ba387a43780f7d6'

Feels like we are one path away from this working! Any further tips @kevin-bates?

Hi @davidbrochart - Formally speaking, that proposal is still in the JEP process - although it appears to have a majority of approvals. To the best of my knowledge, Jupyter Server still plans on adopting this approach because it believes multiple (and simultaneous) kernel manager implementations should be supported. I think we should view the current jupyter_client implementation essentially becoming the basis for the default kernel provider implementation. However, given the focus and work put into the jupyter_client package recently, I think we’re looking at a longer transition period - which is probably fine. This discussion is probably better suited in jupyter_client or the JEP itself.

Yup, #487 is a fundamental limitation of the parent selecting random kernel-allocated TCP ports in certain circumstances, relying on the kernel’s TIME_WAIT implementation to keep the socket available, which it doesn’t strictly need to do. We should have a standard way to eliminate this; it was one of the several things the kernel nanny addition was meant to solve, I think. The parent selecting from its own non-kernel-owned port range pool is another way to do it.

For the simplest reliable workaround today, if you are running on localhost and not Windows, you can select ipc transport instead of tcp:

c.KernelManager.transport = "ipc"
# or `--KernelManager.transport=ipc` on the command-line

The IPC transport doesn’t have any race conditions on port allocation, since we can use kernel UUIDs in paths to eliminate the possibility of collision.