tensorboard: Read record error

Environment TensorBoard 2.5.0; Tensorflow 2.5.0

Please run diagnose_tensorboard.py (link below) in the same environment from which you normally run TensorFlow/TensorBoard, and paste the output here:

Diagnostics

Diagnostics output
--- check: autoidentify
INFO: diagnose_tensorboard.py version e43767ef2b648d0d5d57c00f38ccbd38390e38da

--- check: general
INFO: sys.version_info: sys.version_info(major=3, minor=6, micro=9, releaselevel='final', serial=0)
INFO: os.name: posix
INFO: os.uname(): posix.uname_result(sysname='Linux', nodename='3e52c33c1851', release='5.4.0-74-generic', version='#83~18.04.1-Ubuntu SMP Tue May 11 16:01:00 UTC 2021', machine='x86_64')
INFO: sys.getwindowsversion(): N/A

--- check: package_management
INFO: has conda-meta: False
INFO: $VIRTUAL_ENV: None
WARNING: The directory '/home/.cache/pip' or its parent directory is not owned or is not writable by the current user. The cache has been disabled. Check the permissions and owner of that directory. If executing pip with sudo, you may want sudo's -H flag.

--- check: installed_packages
INFO: installed: tensorboard==2.5.0
INFO: installed: tensorflow==2.5.0
INFO: installed: tensorflow-estimator==2.5.0rc0
INFO: installed: tensorboard-data-server==0.6.1

--- check: tensorboard_python_version
INFO: tensorboard.version.VERSION: '2.5.0'
2021-07-08 20:57:09.014915: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0

--- check: tensorflow_python_version
INFO: tensorflow.__version__: '2.5.0'
INFO: tensorflow.__git_version__: 'v2.5.0-rc3-213-ga4dfb8d1a71'

--- check: tensorboard_data_server_version
INFO: data server binary: '/usr/local/lib/python3.6/dist-packages/tensorboard_data_server/bin/server'
Traceback (most recent call last):
  File "/workspace/diagnose_tensorboard.py", line 522, in main
    suggestions.extend(check())
  File "/workspace/diagnose_tensorboard.py", line 75, in wrapper
    result = fn()
  File "/workspace/diagnose_tensorboard.py", line 301, in tensorboard_data_server_version
    check=True,
  File "/usr/lib/python3.6/subprocess.py", line 423, in run
    with Popen(*popenargs, **kwargs) as process:
TypeError: __init__() got an unexpected keyword argument 'capture_output'

--- check: tensorboard_binary_path
INFO: which tensorboard: b'/usr/local/bin/tensorboard\n'

--- check: addrinfos
socket.has_ipv6 = True
socket.AF_UNSPEC = <AddressFamily.AF_UNSPEC: 0>
socket.SOCK_STREAM = <SocketKind.SOCK_STREAM: 1>
socket.AI_ADDRCONFIG = <AddressInfo.AI_ADDRCONFIG: 32>
socket.AI_PASSIVE = <AddressInfo.AI_PASSIVE: 1>
Loopback flags: <AddressInfo.AI_ADDRCONFIG: 32>
Loopback infos: [(<AddressFamily.AF_INET: 2>, <SocketKind.SOCK_STREAM: 1>, 6, '', ('127.0.0.1', 0))]
Wildcard flags: <AddressInfo.AI_PASSIVE: 1>
Wildcard infos: [(<AddressFamily.AF_INET: 2>, <SocketKind.SOCK_STREAM: 1>, 6, '', ('0.0.0.0', 0)), (<AddressFamily.AF_INET6: 10>, <SocketKind.SOCK_STREAM: 1>, 6, '', ('::', 0, 0, 0))]

--- check: readable_fqdn
INFO: socket.getfqdn(): '3e52c33c1851'

--- check: stat_tensorboardinfo
INFO: directory: /tmp/.tensorboard-info
INFO: os.stat(...): os.stat_result(st_mode=16895, st_ino=107217744, st_dev=53, st_nlink=2, st_uid=3003, st_gid=3003, st_size=4096, st_atime=1624626790, st_mtime=1625777517, st_ctime=1625777517)
INFO: mode: 0o40777

--- check: source_trees_without_genfiles
INFO: tensorboard_roots (1): ['/usr/local/lib/python3.6/dist-packages']; bad_roots (0): []
WARNING: The directory '/home/.cache/pip' or its parent directory is not owned or is not writable by the current user. The cache has been disabled. Check the permissions and owner of that directory. If executing pip with sudo, you may want sudo's -H flag.

--- check: full_pip_freeze
INFO: pip freeze --all:
absl-py==0.12.0
anyio==3.2.1
appdirs==1.4.4
argon2-cffi==20.1.0
asn1crypto==0.24.0
astunparse==1.6.3
async-generator==1.10
attrs==21.2.0
audioread==2.1.9
Babel==2.9.1
backcall==0.2.0
bleach==3.3.0
cached-property==1.5.2
cachetools==4.2.2
certifi==2020.12.5
cffi==1.14.5
chardet==4.0.0
cloudpickle==1.6.0
contextvars==2.4
cryptography==2.1.4
cycler==0.10.0
dataclasses==0.8
decorator==5.0.9
defusedxml==0.7.1
dill==0.3.3
dm-tree==0.1.6
entrypoints==0.3
flatbuffers==1.12
future==0.18.2
gast==0.4.0
google-auth==1.30.0
google-auth-oauthlib==0.4.4
google-pasta==0.2.0
googleapis-common-protos==1.53.0
graphviz==0.16
grpcio==1.34.1
h5py==3.1.0
horovod==0.22.0
idna==2.6
immutables==0.15
importlib-metadata==4.0.1
importlib-resources==5.1.3
ipykernel==5.5.5
ipython==7.16.1
ipython-genutils==0.2.0
jedi==0.18.0
Jinja2==3.0.1
joblib==1.0.1
json5==0.9.6
jsonschema==3.2.0
jupyter-client==6.1.12
jupyter-core==4.7.1
jupyter-server==1.9.0
jupyterlab==3.0.16
jupyterlab-pygments==0.1.2
jupyterlab-server==2.6.0
keras-nightly==2.5.0.dev2021032900
Keras-Preprocessing==1.1.2
keyring==10.6.0
keyrings.alt==3.0
kiwisolver==1.3.1
librosa==0.8.1
llvmlite==0.36.0
Markdown==3.3.4
MarkupSafe==2.0.1
matplotlib==3.3.4
mistune==0.8.4
nbclassic==0.3.1
nbclient==0.5.3
nbconvert==6.0.7
nbformat==5.1.3
nest-asyncio==1.5.1
notebook==6.4.0
numba==0.53.1
numpy==1.19.5
oauthlib==3.1.0
opt-einsum==3.3.0
packaging==20.9
pandocfilters==1.4.3
parso==0.8.2
pexpect==4.8.0
pickleshare==0.7.5
Pillow==8.2.0
pip==20.2.4
pooch==1.3.0
prometheus-client==0.11.0
promise==2.3
prompt-toolkit==3.0.19
protobuf==3.17.0
psutil==5.8.0
ptyprocess==0.7.0
pyasn1==0.4.8
pyasn1-modules==0.2.8
pycparser==2.20
pycrypto==2.6.1
pygit2==1.6.1
Pygments==2.9.0
pygobject==3.26.1
pyparsing==2.4.7
pyrsistent==0.17.3
python-apt==1.6.5+ubuntu0.5
python-dateutil==2.8.1
pytz==2021.1
pyxdg==0.25
PyYAML==5.4.1
pyzmq==22.1.0
requests==2.25.1
requests-oauthlib==1.3.0
requests-unixsocket==0.2.0
resampy==0.2.2
rsa==4.7.2
scikit-learn==0.24.2
scipy==1.5.4
SecretStorage==2.3.1
Send2Trash==1.7.1
setuptools==56.2.0
six==1.15.0
sniffio==1.2.0
SoundFile==0.10.3.post1
ssh-import-id==5.7
tensorboard==2.5.0
tensorboard-data-server==0.6.1
tensorboard-plugin-wit==1.8.0
tensorflow==2.5.0
tensorflow-addons==0.13.0
tensorflow-datasets==4.3.0
tensorflow-estimator==2.5.0rc0
tensorflow-metadata==0.30.0
tensorflow-probability==0.12.2
termcolor==1.1.0
terminado==0.10.1
testpath==0.5.0
threadpoolctl==2.1.0
tornado==6.1
tqdm==4.60.0
traitlets==4.3.3
typeguard==2.12.0
typing-extensions==3.7.4.3
urllib3==1.26.4
wcwidth==0.2.5
webencodings==0.5.1
websocket-client==1.1.0
Werkzeug==2.0.0
wheel==0.36.2
wrapt==1.12.1
zipp==3.4.1

Next steps

No action items identified. Please copy ALL of the above output, including the lines containing only backticks, into your GitHub issue or comment. Be sure to redact any sensitive information.

Issue I see the following error when trying to run Tensorboard to visualize logs:

[ WARN rustboard_core::run] Read error in /workspace/log/events.out.tfevents.1624569739.8a1c50246dae.30295.276717.v2: ReadRecordError(BadLengthCrc(ChecksumError { got: MaskedCrc(0x07980329), want: MaskedCrc(0x00000000) }))

For some context, I am using a shared file system between several machines and this error comes up when trying to run Tensorboard on machine A, pointing to a log generated by machine B.

About this issue

  • State: closed
  • Created 3 years ago
  • Comments: 21 (3 by maintainers)

Most upvoted comments

This is still an issue. When multiple machines write logs to shared Samba storage, the logs are not processed correctly. Sometimes TensorBoard reads several epochs before the failure. Perhaps the read happens while the log file is still being written, and at the first sign of an error TensorBoard blocks the offending file and never tries to re-read it.

[2022-05-23T09:08:56Z WARN rustboard_core::run] Read error in ./logs/20220520-110941/events.out.tfevents.1653045009.g12.3001.0.v2: ReadRecordError(BadLengthCrc(ChecksumError { got: MaskedCrc(0x07980329), want: MaskedCrc(0x00000000) }))

To get around the issue, I need to restart tensorboard every time I want to look at the results.

Using fast data loading (tensorboard --logdir /path/to/logs --load_fast true) solved this issue for me. See https://github.com/tensorflow/tensorboard/issues/4784

@svobora That kind of remark is uncalled for, please be polite if you’re going to participate in the issue thread. I agree it’s not ultimately the exact same issue, but it’s reasonable that @rkechols was confused.

If your concern is that this issue should be re-opened, I can re-open and re-title it accordingly, but there’s no guarantee that we’ll address it right away, especially since this issue only affects cases where the file transiently appears to be corrupted.

Thanks for the report. Summaries written by TensorFlow’s tf.summary.* APIs are supposed to produce event files with proper checksums. The error could indicate either that the event file in question has been modified, or that it was written in an unexpected way. Could you share how machine B is logging events?

If the machine in question is not using something like

import tensorflow as tf
writer = tf.summary.create_file_writer('test/logdir')
with writer.as_default():
    tf.summary.scalar('loss', 0.345, step=1)

would it be possible to share the summary writing code for us to investigate?

If you trust the source that logs your summary data, and do not care about this checksum warning, it is also possible to suppress the check by passing extra flags to tensorboard: tensorboard --logdir my_logdir --extra_data_server_flags=--no-checksum
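Before suppressing the check, it can help to see where in the file the checksum fails. The sketch below is not TensorBoard's own code. It assumes the event file uses the standard TFRecord framing (an 8-byte little-endian length, a 4-byte masked CRC-32C of the length, the payload, and a 4-byte masked CRC-32C of the payload); the function names are made up for illustration.

```python
import struct


def _make_table():
    # Lookup table for CRC-32C (Castagnoli), reflected polynomial 0x82F63B78.
    table = []
    for i in range(256):
        crc = i
        for _ in range(8):
            crc = (crc >> 1) ^ 0x82F63B78 if crc & 1 else crc >> 1
        table.append(crc)
    return table


_TABLE = _make_table()


def crc32c(data):
    crc = 0xFFFFFFFF
    for byte in data:
        crc = _TABLE[(crc ^ byte) & 0xFF] ^ (crc >> 8)
    return crc ^ 0xFFFFFFFF


def masked_crc32c(data):
    # TFRecord stores a "masked" CRC: rotate right by 15 bits, add a constant.
    crc = crc32c(data)
    rotated = ((crc >> 15) | (crc << 17)) & 0xFFFFFFFF
    return (rotated + 0xA282EAD8) & 0xFFFFFFFF


def first_bad_record(path):
    """Return (byte_offset, reason) for the first failing record, or None."""
    offset = 0
    with open(path, "rb") as f:
        while True:
            header = f.read(12)
            if not header:
                return None  # clean end of file
            if len(header) < 12:
                return offset, "truncated header"
            length, length_crc = struct.unpack("<QI", header)
            if masked_crc32c(header[:8]) != length_crc:
                if header == bytes(12):
                    return offset, "all-zero header"
                return offset, "bad length CRC"
            body = f.read(length + 4)
            if len(body) < length + 4:
                return offset, "truncated record"
            (data_crc,) = struct.unpack("<I", body[length:])
            if masked_crc32c(body[:length]) != data_crc:
                return offset, "bad data CRC"
            offset += 12 + length + 4
```

Calling first_bad_record("/path/to/events.out.tfevents...") on both machine A and machine B shows whether the two machines see the same bytes at the failing offset. An "all-zero header" result on only one side would point at the file system rather than at the writer.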

@nfelt it turns out you were right that the file was corrupted “in transit”. I re-copied the file a different way and had no issue.

@rkechols There shouldn’t be any constraint about TB logs only being readable on certain machines; the file format itself should be portable. Most likely the file itself is being corrupted during the copy somehow. I’d suggest taking a checksum of the file contents before and after copying it (e.g. the SHA-256 hash or something) to confirm if that’s happening.
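As a minimal sketch of that check, the following computes the SHA-256 hash of a file in chunks using Python's standard hashlib module. The helper names are illustrative and not part of TensorBoard.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of the file at `path`."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Read in chunks so large event files do not need to fit in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_contents(source_path, copy_path):
    """True if both files hash to the same SHA-256 digest."""
    return sha256_of(source_path) == sha256_of(copy_path)
```

When the source and the copy live on different machines, run sha256_of on each machine and compare the printed digests by hand. A mismatch means the copy step changed the bytes.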

I’m getting the same problem after copying the log files from a remote machine (Linux) to my local machine (MacOS). Does this mean that TB logs can only be read on the machine where they were created? Copying the file somehow corrupts it?

Still having this problem.

Had this exact problem when TB was reading logs exported from another machine via an NFS share.

I’ve instrumented RustBoard to print out the contents of the last read block on CRC errors, and sure enough, it was all zeros!
This is probably not RustBoard’s fault, though. I would be inclined to blame a faulty NFS driver, especially since similar bugs have happened in the past.

Interestingly enough, this problem does not occur when the log files are read from Python (which I tried in order to create a repro case). I'm not sure why; maybe it is just due to the difference in speed. But this allowed me to work around it by writing a Python script that replicates the logs to a local file system, from where they are read by TB.
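The script itself was not posted, so the following is only a sketch of what such a replication step could look like. It assumes event files are append-only and contain "tfevents" in their names, and it appends only the new bytes to the local copy so that files TensorBoard already has open keep growing in place. The function name and directory arguments are hypothetical.

```python
import os


def replicate_logs(src_root, dst_root, chunk_size=1 << 20):
    """Append new bytes of each event file under src_root to its copy under dst_root.

    Returns a dict mapping relative file paths to the number of bytes appended.
    """
    appended = {}
    for dirpath, _dirnames, filenames in os.walk(src_root):
        for name in filenames:
            if "tfevents" not in name:
                continue  # assumption: only event files need replicating
            src = os.path.join(dirpath, name)
            rel = os.path.relpath(src, src_root)
            dst = os.path.join(dst_root, rel)
            os.makedirs(os.path.dirname(dst), exist_ok=True)
            # Resume from however many bytes the local copy already holds.
            done = os.path.getsize(dst) if os.path.exists(dst) else 0
            total = 0
            with open(src, "rb") as fin, open(dst, "ab") as fout:
                fin.seek(done)
                for chunk in iter(lambda: fin.read(chunk_size), b""):
                    fout.write(chunk)
                    total += len(chunk)
            if total:
                appended[rel] = total
    return appended
```

Calling replicate_logs in a loop with a short time.sleep between passes, and pointing tensorboard --logdir at the local destination directory, would approximate the workaround described above.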