papermill: zmq.error.ZMQError: Address already in use when running multiple notebooks simultaneously with multiprocessing
I am using the papermill library to run multiple notebooks simultaneously via multiprocessing.
This is running on Python 3.6.6 and Red Hat 4.8.2-15, inside a Docker container.
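In essence, the driver does something like this (a simplified sketch; the real main.py differs, and the notebook list and parameters below are purely illustrative):

```python
# Simplified sketch of the driver (the real main.py differs; the notebook
# list and parameter values here are purely illustrative).
import multiprocessing

import papermill as pm

# Illustrative (input notebook, output path, parameters) triples.
notebook_specs = [
    ("template.ipynb", "output_%d.ipynb" % i, {"run_id": i}) for i in range(20)
]


def run_papermill(notebook, output_path, config):
    pm.execute_notebook(notebook, output_path, parameters=config)


if __name__ == "__main__":
    procs = []
    for notebook, output_path, config in notebook_specs:
        p = multiprocessing.Process(target=run_papermill, args=(notebook, output_path, config))
        p.start()
        procs.append(p)
    for p in procs:
        p.join()
```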
However, when I run the Python script, about 5% of my notebooks fail immediately (no notebook cells run at all) because I receive this error:
Traceback (most recent call last):
File "/opt/conda/lib/python3.6/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/opt/conda/lib/python3.6/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/opt/conda/lib/python3.6/site-packages/ipykernel_launcher.py", line 16, in <module>
app.launch_new_instance()
File "/opt/conda/lib/python3.6/site-packages/traitlets/config/application.py", line 657, in launch_instance
app.initialize(argv)
File "<decorator-gen-124>", line 2, in initialize
File "/opt/conda/lib/python3.6/site-packages/traitlets/config/application.py", line 87, in catch_config_error
return method(app, *args, **kwargs)
File "/opt/conda/lib/python3.6/site-packages/ipykernel/kernelapp.py", line 469, in initialize
self.init_sockets()
File "/opt/conda/lib/python3.6/site-packages/ipykernel/kernelapp.py", line 238, in init_sockets
self.shell_port = self._bind_socket(self.shell_socket, self.shell_port)
File "/opt/conda/lib/python3.6/site-packages/ipykernel/kernelapp.py", line 180, in _bind_socket
s.bind("tcp://%s:%i" % (self.ip, port))
File "zmq/backend/cython/socket.pyx", line 547, in zmq.backend.cython.socket.Socket.bind
File "zmq/backend/cython/checkrc.pxd", line 25, in zmq.backend.cython.checkrc._check_rc
zmq.error.ZMQError: Address already in use
along with:
Traceback (most recent call last):
File "/opt/conda/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
self.run()
File "/opt/conda/lib/python3.6/multiprocessing/process.py", line 93, in run
self._target(*self._args, **self._kwargs)
File "main.py", line 77, in run_papermill
pm.execute_notebook(notebook, output_path, parameters=config)
File "/opt/conda/lib/python3.6/site-packages/papermill/execute.py", line 104, in execute_notebook
**engine_kwargs
File "/opt/conda/lib/python3.6/site-packages/papermill/engines.py", line 49, in execute_notebook_with_engine
return self.get_engine(engine_name).execute_notebook(nb, kernel_name, **kwargs)
File "/opt/conda/lib/python3.6/site-packages/papermill/engines.py", line 304, in execute_notebook
nb = cls.execute_managed_notebook(nb_man, kernel_name, log_output=log_output, **kwargs)
File "/opt/conda/lib/python3.6/site-packages/papermill/engines.py", line 372, in execute_managed_notebook
preprocessor.preprocess(nb_man, safe_kwargs)
File "/opt/conda/lib/python3.6/site-packages/papermill/preprocess.py", line 20, in preprocess
with self.setup_preprocessor(nb_man.nb, resources, km=km):
File "/opt/conda/lib/python3.6/contextlib.py", line 81, in __enter__
return next(self.gen)
File "/opt/conda/lib/python3.6/site-packages/nbconvert/preprocessors/execute.py", line 345, in setup_preprocessor
self.km, self.kc = self.start_new_kernel(**kwargs)
File "/opt/conda/lib/python3.6/site-packages/nbconvert/preprocessors/execute.py", line 296, in start_new_kernel
kc.wait_for_ready(timeout=self.startup_timeout)
File "/opt/conda/lib/python3.6/site-packages/jupyter_client/blocking/client.py", line 104, in wait_for_ready
raise RuntimeError('Kernel died before replying to kernel_info')
RuntimeError: Kernel died before replying to kernel_info
Please help me with this problem; I have scoured the web and tried different solutions, but none have worked for my case so far.
The roughly 5% error rate occurs regardless of the number of notebooks I run simultaneously or the number of cores on my machine, which makes it extra curious.
I have tried changing the multiprocessing start method and updating the libraries, but to no avail.
The versions of my libraries are:
papermill==1.2.1
ipython==7.14.0
jupyter-client==6.1.3
pyzmq==17.1.2
Thank you!
About this issue
- State: open
- Created 4 years ago
- Reactions: 3
- Comments: 18 (8 by maintainers)
A small improvement on @kevin-bates's IPCKernelManager class, to make sure the runtime directory exists (see the sketch below):
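The snippet itself isn't reproduced in this thread; a minimal sketch of the idea, assuming jupyter_client's KernelManager and the IPC transport, might look like this:

```python
# Not the exact class from the thread; a sketch of the idea, assuming
# jupyter_client's KernelManager and the IPC transport.
import os
import uuid

from jupyter_client.manager import KernelManager
from jupyter_core.paths import jupyter_runtime_dir


class IPCKernelManager(KernelManager):
    """Kernel manager that keeps its IPC sockets under the Jupyter runtime dir."""

    def __init__(self, **kwargs):
        runtime_dir = jupyter_runtime_dir()
        os.makedirs(runtime_dir, exist_ok=True)  # the "small improvement": create it up front
        kwargs.setdefault("transport", "ipc")
        # Unique per-kernel socket prefix so simultaneous starts don't collide.
        kwargs.setdefault("ip", os.path.join(runtime_dir, "kernel-%s" % uuid.uuid4()))
        super().__init__(**kwargs)
```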
@mlucool - thanks for the update.

Each of the jupyter-based directories (runtime, config, data) can be “redirected” via envs. Namely, JUPYTER_RUNTIME_DIR, JUPYTER_CONFIG_DIR, and JUPYTER_DATA_DIR, respectively. In situations where the files require a high degree of access (like the IPC files), you’re probably better off pointing the corresponding env to a local directory, which would also benefit other jupyter-based applications in that particular configuration and allow the same code to run irrespective of the underlying env values.
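For example, something along these lines before any kernel is launched (the path is just illustrative):

```python
# Illustrative only: point the Jupyter runtime dir at fast local disk before
# any kernel is started, so connection/IPC files are not created on NFS.
import os

os.environ["JUPYTER_RUNTIME_DIR"] = "/tmp/jupyter-runtime"  # any local path works
os.makedirs(os.environ["JUPYTER_RUNTIME_DIR"], exist_ok=True)
```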
@kevin-bates I thought you may be interested in this from a protocol POV: the runtime dir did not work great on NFS for the above test. I saw race conditions resulting in RuntimeError: Kernel didn't respond in 60 seconds. Moving this to local disk solves any issues.

That worked great - thanks!! This seems to fully solve the issue if you can use IPC.
I’m not familiar with the IPC transport, but looking into this for a bit, it appears there needs to be better “randomization” happening when checking for the ipc “port” existence.
In the case of “transport == ipc”, the ip value is either “kernel-ipc” or “kernel-<kernel_id>-ipc”, where the latter is really the value of self.connection_file sans the .json suffix (and would provide sufficient randomization). However, because self.connection_file is not set (by default) at the time the ports are “checked”, simultaneous kernel starts all check against “kernel-ipc-N” (where N is 1…5), and whether they succeed depends on the race, hence the collision.

As a workaround, you could simply bring your own kernel-id and connection-file values:
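The original snippet isn't shown here; a rough sketch of the idea using jupyter_client's KernelManager (how the manager is handed to papermill/nbconvert depends on your version, e.g. via a custom engine) could be:

```python
# Rough sketch (not the original snippet): pre-assign a kernel id and a
# connection file under the runtime dir so each kernel gets unique,
# colocated connection and IPC files.
import os
import uuid

from jupyter_client.manager import KernelManager
from jupyter_core.paths import jupyter_runtime_dir

kernel_id = str(uuid.uuid4())
runtime_dir = jupyter_runtime_dir()
os.makedirs(runtime_dir, exist_ok=True)

km = KernelManager(transport="ipc")
km.connection_file = os.path.join(runtime_dir, "kernel-%s.json" % kernel_id)
# With ipc transport, the socket path prefix ("ip") is derived from
# connection_file, so the IPC files land beside the .json rather than in the
# current working directory.
```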
Note that the current code creates the IPC files in the current directory rather than colocated with the connection file, whereas this approach colocates the two sets (which is better, IMHO).
Thanks for the great tip! This is very close (and even easier than you said), but I'm still seeing some issues.

With my test files, running test.py seems to work, but test_runner.sh does not. I think this is because the ipc channels are all overlapping, and I get a ton of error messages about it. Feels like we are one path away from making this work! Any further tips @kevin-bates?
Hi @davidbrochart - Formally speaking, that proposal is still in the JEP process - although it appears to have a majority of approvals. To the best of my knowledge, Jupyter Server still plans on adopting this approach because it believes multiple (and simultaneous) kernel manager implementations should be supported. I think we should view the current jupyter_client implementation as essentially becoming the basis for the default kernel provider implementation. However, given the focus and work put into the jupyter_client package recently, I think we’re looking at a longer transition period - which is probably fine. This discussion is probably better suited in jupyter_client or the JEP itself.

Yup, #487 is a fundamental limitation of the parent selecting random kernel-allocated TCP ports in certain circumstances, relying on the kernel’s TIME_WAIT implementation to keep the socket available, which it doesn’t strictly need to do. We should have a standard way to eliminate this; it was one of the several things the kernel nanny addition was meant to solve, I think. The parent selecting from its own non-kernel-owned port range pool is another way to do it.
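To picture the limitation (this is not jupyter_client's actual code, just the general probe-then-release pattern): the parent binds a throwaway socket to find a free port, closes it, and hands the number to the kernel, leaving a window in which any other process can claim that port:

```python
# Illustration only (not jupyter_client's real implementation): the classic
# probe-then-release pattern that creates the race window described above.
import socket

def probe_free_port() -> int:
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.bind(("127.0.0.1", 0))  # the OS picks an unused port
    port = s.getsockname()[1]
    s.close()                 # the port is released here...
    return port               # ...and may be grabbed by anyone before the kernel binds it
```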
For the simplest reliable workaround today, if you are running on localhost and not Windows, you can select ipc transport instead of tcp:
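The original snippet isn't included above; a minimal sketch at the jupyter_client level (wiring this into papermill depends on your setup) would be:

```python
# Minimal sketch: start a kernel over IPC (unix domain sockets) instead of TCP.
from jupyter_client.manager import KernelManager

km = KernelManager(transport="ipc")  # file-based sockets: no TCP port collisions
km.start_kernel()
kc = km.client()
kc.start_channels()
kc.wait_for_ready(timeout=60)
# ... execute code through kc, then tear down
kc.stop_channels()
km.shutdown_kernel()
```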
The IPC transport doesn’t have any race conditions on port allocation, since we can use kernel UUIDs in paths to eliminate the possibility of collision.