tensorflow: Training stopping because of BufferError: Existing exports of data: object cannot be re-sized or something wrong with tornado
Issue Type
Bug
Have you reproduced the bug with TF nightly?
No
Source
source
Tensorflow Version
2.12.0
Custom Code
Yes
OS Platform and Distribution
NAME="CentOS Linux" VERSION="7 (Core)"
Mobile device
NAME="CentOS Linux" VERSION="7 (Core)"
Python version
3.9.16
Bazel version
No response
GCC/Compiler version
No response
CUDA/cuDNN version
11.8.0
GPU model and memory
No response
Current Behaviour?
Model training stops abruptly.
https://colab.research.google.com/drive/1WiqyF7dCdnNBIANEY80Pxw_mVz4fyV-S?usp=sharing
Standalone code to reproduce the issue
VoxelMorph library training
Relevant log output
(tf) vr-lab@pop-os:~$ jupyter notebook
_ _ _ _
| | | |_ __ __| |__ _| |_ ___
| |_| | '_ \/ _` / _` | _/ -_)
\___/| .__/\__,_\__,_|\__\___|
|_|
Read the migration plan to Notebook 7 to learn about the new features and the actions to take if you are using extensions.
https://jupyter-notebook.readthedocs.io/en/latest/migrate_to_notebook7.html
Please note that updating to Notebook 7 might break some of your extensions.
[I 00:02:49.290 NotebookApp] Serving notebooks from local directory: /home/vr-lab
[I 00:02:49.290 NotebookApp] Jupyter Notebook 6.5.4 is running at:
[I 00:02:49.290 NotebookApp] http://localhost:8888/?token=697572ae046e4388d22c7be946cefcb261064994d2f99466
[I 00:02:49.290 NotebookApp] or http://127.0.0.1:8888/?token=697572ae046e4388d22c7be946cefcb261064994d2f99466
[I 00:02:49.290 NotebookApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
[C 00:02:49.334 NotebookApp]
To access the notebook, open this file in a browser:
file:///home/vr-lab/.local/share/jupyter/runtime/nbserver-405435-open.html
Or copy and paste one of these URLs:
http://localhost:8888/?token=697572ae046e4388d22c7be946cefcb261064994d2f99466
or http://127.0.0.1:8888/?token=697572ae046e4388d22c7be946cefcb261064994d2f99466
[I 00:03:15.170 NotebookApp] Kernel started: 4915aa8a-d4aa-4d50-885f-810d53eae7db, name: python3
[I 00:03:20.670 NotebookApp] Kernel restarted: 4915aa8a-d4aa-4d50-885f-810d53eae7db
[W 00:03:20.684 NotebookApp] Replacing stale connection: 4915aa8a-d4aa-4d50-885f-810d53eae7db:e6146c4b818f471185049a02ac632f6d
[W 00:03:21.180 NotebookApp] zmq message arrived on closed channel
[I 00:03:21.181 NotebookApp] Starting buffering for 4915aa8a-d4aa-4d50-885f-810d53eae7db:e6146c4b818f471185049a02ac632f6d
[I 00:03:21.183 NotebookApp] Restoring connection for 4915aa8a-d4aa-4d50-885f-810d53eae7db:e6146c4b818f471185049a02ac632f6d
[I 00:03:21.689 NotebookApp] Replaying 1 buffered messages
[E 00:03:21.761 NotebookApp] Uncaught exception, closing connection.
Traceback (most recent call last):
File "/home/vr-lab/anaconda3/envs/tf/lib/python3.9/site-packages/tornado/iostream.py", line 702, in _handle_events
self._handle_write()
File "/home/vr-lab/anaconda3/envs/tf/lib/python3.9/site-packages/tornado/iostream.py", line 976, in _handle_write
self._write_buffer.advance(num_bytes)
File "/home/vr-lab/anaconda3/envs/tf/lib/python3.9/site-packages/tornado/iostream.py", line 182, in advance
assert 0 < size <= self._size
AssertionError
[W 00:03:21.764 NotebookApp] Write error on <socket.socket [closed] fd=-1, family=AddressFamily.AF_INET, type=SocketKind.SOCK_STREAM, proto=6>: [Errno 9] Bad file descriptor
[W 00:03:21.766 NotebookApp] zmq message arrived on closed channel
[W 00:03:21.767 NotebookApp] zmq message arrived on closed channel
Exception in callback None()
handle: <Handle cancelled>
Traceback (most recent call last):
File "/home/vr-lab/anaconda3/envs/tf/lib/python3.9/asyncio/events.py", line 80, in _run
self._context.run(self._callback, *self._args)
File "/home/vr-lab/anaconda3/envs/tf/lib/python3.9/site-packages/tornado/platform/asyncio.py", line 206, in _handle_events
handler_func(fileobj, events)
File "/home/vr-lab/anaconda3/envs/tf/lib/python3.9/site-packages/tornado/iostream.py", line 702, in _handle_events
self._handle_write()
File "/home/vr-lab/anaconda3/envs/tf/lib/python3.9/site-packages/tornado/iostream.py", line 976, in _handle_write
self._write_buffer.advance(num_bytes)
File "/home/vr-lab/anaconda3/envs/tf/lib/python3.9/site-packages/tornado/iostream.py", line 182, in advance
assert 0 < size <= self._size
AssertionError
[I 00:03:21.768 NotebookApp] Starting buffering for 4915aa8a-d4aa-4d50-885f-810d53eae7db:e6146c4b818f471185049a02ac632f6d
2023-04-11 00:03:22.084618: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 AVX512F AVX512_VNNI FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-04-11 00:03:22.225493: I tensorflow/core/util/port.cc:104] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
[I 00:03:22.803 NotebookApp] Restoring connection for 4915aa8a-d4aa-4d50-885f-810d53eae7db:e6146c4b818f471185049a02ac632f6d
[I 00:03:22.803 NotebookApp] Replaying 1 buffered messages
2023-04-11 00:03:22.815590: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: :/home/vr-lab/anaconda3/envs/tf/lib/:/home/vr-lab/anaconda3/envs/tf/lib/python3.9/site-packages/nvidia/cudnn/lib
2023-04-11 00:03:22.815709: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: :/home/vr-lab/anaconda3/envs/tf/lib/:/home/vr-lab/anaconda3/envs/tf/lib/python3.9/site-packages/nvidia/cudnn/lib
2023-04-11 00:03:22.815716: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
2023-04-11 00:03:25.015062: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1613] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 12776 MB memory: -> device: 0, name: NVIDIA RTX A4000, pci bus id: 0000:af:00.0, compute capability: 8.6
2023-04-11 00:03:40.078576: I tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:428] Loaded cuDNN version 8600
[I 00:05:15.159 NotebookApp] Saving file at /Music/HybridMorph Please don't delete/HybridMorph_proof of concept.ipynb
Task exception was never retrieved
future: <Task finished name='Task-76' coro=<WebSocketProtocol13.write_message.<locals>.wrapper() done, defined at /home/vr-lab/anaconda3/envs/tf/lib/python3.9/site-packages/tornado/websocket.py:1090> exception=WebSocketClosedError()>
Traceback (most recent call last):
File "/home/vr-lab/anaconda3/envs/tf/lib/python3.9/site-packages/tornado/websocket.py", line 1092, in wrapper
await fut
tornado.iostream.StreamClosedError: Stream is closed
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/vr-lab/anaconda3/envs/tf/lib/python3.9/asyncio/tasks.py", line 256, in __step
result = coro.send(None)
File "/home/vr-lab/anaconda3/envs/tf/lib/python3.9/site-packages/tornado/websocket.py", line 1094, in wrapper
raise WebSocketClosedError()
tornado.websocket.WebSocketClosedError
[E 01:03:52.904 NotebookApp] Exception in callback <bound method WebSocketMixin.send_ping of ZMQChannelsHandler(4915aa8a-d4aa-4d50-885f-810d53eae7db)>
Traceback (most recent call last):
File "/home/vr-lab/anaconda3/envs/tf/lib/python3.9/site-packages/tornado/ioloop.py", line 921, in _run
val = self.callback()
File "/home/vr-lab/anaconda3/envs/tf/lib/python3.9/site-packages/notebook/base/zmqhandlers.py", line 188, in send_ping
self.ping(b'')
File "/home/vr-lab/anaconda3/envs/tf/lib/python3.9/site-packages/tornado/websocket.py", line 445, in ping
self.ws_connection.write_ping(data)
File "/home/vr-lab/anaconda3/envs/tf/lib/python3.9/site-packages/tornado/websocket.py", line 1101, in write_ping
self._write_frame(True, 0x9, data)
File "/home/vr-lab/anaconda3/envs/tf/lib/python3.9/site-packages/tornado/websocket.py", line 1061, in _write_frame
return self.stream.write(frame)
File "/home/vr-lab/anaconda3/envs/tf/lib/python3.9/site-packages/tornado/iostream.py", line 540, in write
self._write_buffer.append(data)
File "/home/vr-lab/anaconda3/envs/tf/lib/python3.9/site-packages/tornado/iostream.py", line 157, in append
b += data # type: ignore
BufferError: Existing exports of data: object cannot be re-sized
[E 01:13:22.812 NotebookApp] Uncaught exception in ZMQStream callback
Traceback (most recent call last):
File "/home/vr-lab/anaconda3/envs/tf/lib/python3.9/site-packages/zmq/eventloop/zmqstream.py", line 584, in _run_callback
f = callback(*args, **kwargs)
File "/home/vr-lab/anaconda3/envs/tf/lib/python3.9/site-packages/zmq/eventloop/zmqstream.py", line 308, in stream_callback
return callback(self, msg)
File "/home/vr-lab/anaconda3/envs/tf/lib/python3.9/site-packages/notebook/services/kernels/handlers.py", line 572, in _on_zmq_reply
super()._on_zmq_reply(stream, msg)
File "/home/vr-lab/anaconda3/envs/tf/lib/python3.9/site-packages/notebook/base/zmqhandlers.py", line 256, in _on_zmq_reply
self.write_message(msg, binary=isinstance(msg, bytes))
File "/home/vr-lab/anaconda3/envs/tf/lib/python3.9/site-packages/tornado/websocket.py", line 339, in write_message
return self.ws_connection.write_message(message, binary=binary)
File "/home/vr-lab/anaconda3/envs/tf/lib/python3.9/site-packages/tornado/websocket.py", line 1086, in write_message
fut = self._write_frame(True, opcode, message, flags=flags)
File "/home/vr-lab/anaconda3/envs/tf/lib/python3.9/site-packages/tornado/websocket.py", line 1061, in _write_frame
return self.stream.write(frame)
File "/home/vr-lab/anaconda3/envs/tf/lib/python3.9/site-packages/tornado/iostream.py", line 540, in write
self._write_buffer.append(data)
File "/home/vr-lab/anaconda3/envs/tf/lib/python3.9/site-packages/tornado/iostream.py", line 157, in append
b += data # type: ignore
BufferError: Existing exports of data: object cannot be re-sized
[E 01:13:22.815 NotebookApp] Uncaught exception in zmqstream callback
Traceback (most recent call last):
File "/home/vr-lab/anaconda3/envs/tf/lib/python3.9/site-packages/zmq/eventloop/zmqstream.py", line 634, in _handle_events
self._handle_recv()
File "/home/vr-lab/anaconda3/envs/tf/lib/python3.9/site-packages/zmq/eventloop/zmqstream.py", line 663, in _handle_recv
self._run_callback(callback, msg)
File "/home/vr-lab/anaconda3/envs/tf/lib/python3.9/site-packages/zmq/eventloop/zmqstream.py", line 584, in _run_callback
f = callback(*args, **kwargs)
File "/home/vr-lab/anaconda3/envs/tf/lib/python3.9/site-packages/zmq/eventloop/zmqstream.py", line 308, in stream_callback
return callback(self, msg)
File "/home/vr-lab/anaconda3/envs/tf/lib/python3.9/site-packages/notebook/services/kernels/handlers.py", line 572, in _on_zmq_reply
super()._on_zmq_reply(stream, msg)
File "/home/vr-lab/anaconda3/envs/tf/lib/python3.9/site-packages/notebook/base/zmqhandlers.py", line 256, in _on_zmq_reply
self.write_message(msg, binary=isinstance(msg, bytes))
File "/home/vr-lab/anaconda3/envs/tf/lib/python3.9/site-packages/tornado/websocket.py", line 339, in write_message
return self.ws_connection.write_message(message, binary=binary)
File "/home/vr-lab/anaconda3/envs/tf/lib/python3.9/site-packages/tornado/websocket.py", line 1086, in write_message
fut = self._write_frame(True, opcode, message, flags=flags)
File "/home/vr-lab/anaconda3/envs/tf/lib/python3.9/site-packages/tornado/websocket.py", line 1061, in _write_frame
return self.stream.write(frame)
File "/home/vr-lab/anaconda3/envs/tf/lib/python3.9/site-packages/tornado/iostream.py", line 540, in write
self._write_buffer.append(data)
File "/home/vr-lab/anaconda3/envs/tf/lib/python3.9/site-packages/tornado/iostream.py", line 157, in append
b += data # type: ignore
BufferError: Existing exports of data: object cannot be re-sized
[E 01:13:22.815 NotebookApp] Exception in callback functools.partial(<function ZMQStream._update_handler.<locals>.<lambda> at 0x7f1de4ff4b80>)
Traceback (most recent call last):
File "/home/vr-lab/anaconda3/envs/tf/lib/python3.9/site-packages/tornado/ioloop.py", line 740, in _run_callback
ret = callback()
File "/home/vr-lab/anaconda3/envs/tf/lib/python3.9/site-packages/zmq/eventloop/zmqstream.py", line 718, in <lambda>
self.io_loop.add_callback(lambda: self._handle_events(self.socket, 0))
File "/home/vr-lab/anaconda3/envs/tf/lib/python3.9/site-packages/zmq/eventloop/zmqstream.py", line 634, in _handle_events
self._handle_recv()
File "/home/vr-lab/anaconda3/envs/tf/lib/python3.9/site-packages/zmq/eventloop/zmqstream.py", line 663, in _handle_recv
self._run_callback(callback, msg)
File "/home/vr-lab/anaconda3/envs/tf/lib/python3.9/site-packages/zmq/eventloop/zmqstream.py", line 584, in _run_callback
f = callback(*args, **kwargs)
File "/home/vr-lab/anaconda3/envs/tf/lib/python3.9/site-packages/zmq/eventloop/zmqstream.py", line 308, in stream_callback
return callback(self, msg)
File "/home/vr-lab/anaconda3/envs/tf/lib/python3.9/site-packages/notebook/services/kernels/handlers.py", line 572, in _on_zmq_reply
super()._on_zmq_reply(stream, msg)
File "/home/vr-lab/anaconda3/envs/tf/lib/python3.9/site-packages/notebook/base/zmqhandlers.py", line 256, in _on_zmq_reply
self.write_message(msg, binary=isinstance(msg, bytes))
File "/home/vr-lab/anaconda3/envs/tf/lib/python3.9/site-packages/tornado/websocket.py", line 339, in write_message
return self.ws_connection.write_message(message, binary=binary)
File "/home/vr-lab/anaconda3/envs/tf/lib/python3.9/site-packages/tornado/websocket.py", line 1086, in write_message
fut = self._write_frame(True, opcode, message, flags=flags)
File "/home/vr-lab/anaconda3/envs/tf/lib/python3.9/site-packages/tornado/websocket.py", line 1061, in _write_frame
return self.stream.write(frame)
File "/home/vr-lab/anaconda3/envs/tf/lib/python3.9/site-packages/tornado/iostream.py", line 540, in write
self._write_buffer.append(data)
File "/home/vr-lab/anaconda3/envs/tf/lib/python3.9/site-packages/tornado/iostream.py", line 157, in append
b += data # type: ignore
BufferError: Existing exports of data: object cannot be re-sized
About this issue
- State: open
- Created a year ago
- Comments: 26 (6 by maintainers)
I am getting the same error and stack trace with a PyTorch model using the MPS backend, run from a Jupyter notebook. The model continues training, but output stops streaming to the notebook. I suspect the problem is actually in Jupyter, specifically in the websocket that streams data from the Python backend to the output cell.
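For context, the `BufferError` text in the traceback is the message Python itself raises when a `bytearray` is resized while a `memoryview` of it is still alive. The traceback shows it coming from `b += data` in tornado's `iostream.py`. The snippet below is only a minimal standalone sketch of that Python behaviour; it does not use tornado or Jupyter, and the names `buf`, `view`, and `err` are illustrative. Whether the notebook server reaches this state in the way shown here is an assumption.

```python
# A bytearray cannot change size while a memoryview "exports" its buffer.
buf = bytearray(b"abc")
view = memoryview(buf)  # an outstanding export of buf's memory

err = None
try:
    buf += b"def"  # in-place append needs a resize, which is refused
except BufferError as exc:
    err = exc

print(type(err).__name__, err)

# Once the export is released, the same append succeeds.
view.release()
buf += b"def"
print(bytes(buf))
```

This is consistent with the suspicion above that the failure sits in the notebook server's write path rather than in the training code.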
As @tgoMota mentioned, try setting `verbose=2`. This worked for me.
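A minimal sketch of that workaround, assuming the training call is Keras `model.fit` (the comment does not say where the setting goes). In Keras, `verbose=1` draws a progress bar that is updated continuously, while `verbose=2` prints one line per epoch, so far less output is streamed to the notebook. The helper name `fit_quietly` is hypothetical and not part of any library.

```python
def fit_quietly(model, x, y, **fit_kwargs):
    """Call model.fit with one log line per epoch instead of a progress bar.

    `model` is assumed to be a compiled Keras-style model exposing
    fit(x, y, **kwargs). Any other keyword arguments are passed through.
    """
    fit_kwargs["verbose"] = 2  # force per-epoch logging
    return model.fit(x, y, **fit_kwargs)
```

With a compiled Keras model this would be called as `fit_quietly(model, x_train, y_train, epochs=10)`, where `x_train` and `y_train` are placeholders for your own data; passing `verbose=2` directly to `model.fit` has the same effect.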
Thanks a lot for the reply, Good luck.
@CaffineAddic, apologies for the delay. We are working on the issue and will update the status here. Thank you!