operating-system: Docker key.json has invalid contents and the system refuses to boot

Describe the issue you are experiencing

I’m 2 for 2 now on this. Each reboot after an upgrade to the OS somehow leads to a corrupt key.json file on my Home Assistant OS install. Each time I have to boot via a recovery USB drive, mount the right partition (in this case /dev/sda7) and delete the /etc/docker/key.json file to fix the system. I don’t intentionally manage this file in any way, nor do I care about the contents, so the Docker generated version is just fine with me.

As a test, I fixed the file, booted into Home Assistant and then restarted back into my recovery USB to inspect the contents. They were valid JSON instead of the invalid content.

What operating system image do you use?

generic-x86-64 (Generic UEFI capable x86-64 systems)

What version of Home Assistant Operating System is installed?

6.6

Did you upgrade the Operating System?

Yes

Steps to reproduce the issue

  1. Update Home Assistant OS via supervisor in the web interface
  2. Home Assistant then fails to reboot into a valid system

Anything in the Supervisor logs that might be useful for us?

Nothing in there from the update or prior to the successful boot of my system.

Anything in the Host logs that might be useful for us?

Nothing

System Health information

System Health

version core-2021.11.1
installation_type Home Assistant OS
dev false
hassio true
docker true
user root
virtualenv false
python_version 3.9.7
os_name Linux
os_version 5.10.88
arch x86_64
timezone America/Los_Angeles
Home Assistant Community Store
GitHub API ok
Github API Calls Remaining 4890
Installed Version 1.16.0
Stage running
Available Repositories 932
Installed Repositories 9
Home Assistant Cloud
logged_in true
subscription_expiration January 13, 2022, 4:00 PM
relayer_connected true
remote_enabled false
remote_connected false
alexa_enabled true
google_enabled false
remote_server us-west-2-1.ui.nabu.casa
can_reach_cert_server ok
can_reach_cloud_auth ok
can_reach_cloud ok
Home Assistant Supervisor
host_os Home Assistant OS 7.1
update_channel stable
supervisor_version supervisor-2021.12.2
docker_version 20.10.9
disk_total 219.4 GB
disk_used 7.4 GB
healthy true
supported true
board generic-x86-64
supervisor_api ok
version_api ok
installed_addons chrony (2.2.1), Samba share (9.5.1), Z-Wave JS to MQTT (0.27.0), Node-RED (10.1.1), Terminal & SSH (9.2.1), Network UPS Tools (0.9.0)
keymaster
zwave_integration zwave_js
network_status on
Lovelace
dashboards 4
resources 6
views 8
mode storage

Additional information

These are the contents of the file (screenshot): PXL_20220106_031713090

And these are the log entries from journalctl (screenshot): PXL_20220106_032525837

About this issue

  • State: closed
  • Created 2 years ago
  • Comments: 18 (7 by maintainers)

Most upvoted comments

I just hit this after an unclean shutdown. The debugging experience was really bad because of the boot loop.

On a hunch, based on the boot loop and not having found this issue yet, I grepped this repo for FailureAction=reboot and found that in docker.service, so I booted with systemd.mask=docker.service and tried to run dockerd manually, which exited with an error about key.json, which indeed was full of null bytes. I removed it and rebooted and all was back to normal.

There is a docker-failure script which seems to automate this, which I guess will be in the next release, but the messages it prints will never be seen on hardware that reboots quickly.

You could consider using StartLimitAction=reboot instead of FailureAction=reboot, with StartLimitIntervalSec and StartLimitBurst set accordingly, so systemd will try to start docker a few times, giving an operator a chance to see which service is failing and any output from the docker-failure script, before rebooting if it continues to fail to start. This also means that if the problem was a corrupted key.json, it will successfully start on the second try with no need for a reboot.

I don’t know much about docker, but does key.json actually need to be persisted? If it’s just a key for docker clients to communicate with dockerd, why not just generate a random one before docker starts on each boot?

One way to remove the corrupted key.json is to press e to edit the boot entry, go to the end of the 3rd line, add systemd.unit=rescue.target, and then press F10. This should boot into a rescue shell. Press Enter to enter it, then type rm /mnt/overlay/etc/docker/key.json.

I believe I just fell victim to this as well, with HAOS 9.3!

How do I fix the corrupted key.json?

You can just delete the file at /mnt/overlay/etc/docker/key.json

and using this still boot loops

Hm, that then sounds like a different problem. systemd.unit=rescue.target really should not attempt to start the Docker daemon, and therefore should not cause a boot loop. Can you share the boot log when using rescue.target?

With HAOS 9.0 an invalid key file will be detected, see https://github.com/home-assistant/operating-system/pull/1988.

I think it is a terrible idea to boot loop forever, so fast that you can’t read what is causing the issue, on a project made for low-skill users. IMHO this should be reverted to dropping into a rescue shell.

The boot loop is implemented so that the bootloader can switch to the other (presumably good) installation (HAOS has two OS installations, A and B; each upgrade updates the other, not currently running, system). However, if the old, presumably good installation does not boot either, simply rebooting is indeed not helpful. Currently we don’t detect that situation or behave accordingly.

This case is somewhat special as data corruption in the shared overlay partition causes both installations (A and B) to fail.

The file, which isn’t in the homeassistant_data partition but in one called “homeassistant_overlay” (I don’t know what it’s used for), was full of `\00\00\00\00\00\00…00`, one very long line. Removing the file fixed the issue.

We use the overlay partition to make certain parts of /etc writeable. It is a simple ext4 file system with some directories bind mounted to certain directories in real /etc, like /etc/docker.

Now your case looks like corruption happened on that partition. However, as written in https://github.com/home-assistant/operating-system/issues/1706#issuecomment-1006432086, I don’t really understand why that can happen in the first place: ext4 is a journaling file system, and Docker uses an atomic write for this particular file.
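For readers unfamiliar with the term, here is a minimal sketch of the usual "write to a temporary file, fsync, then rename" pattern that "atomic write" generally refers to. This is not Docker's code (Docker is written in Go), and `atomic_write_json` is a hypothetical helper name used only for illustration. Whether any particular step of this pattern is related to the corruption reported here is not established in this thread.

```python
import json
import os
import tempfile


def atomic_write_json(path, payload):
    """Hypothetical helper: replace `path` so readers see the old or the new file."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file lives in the same directory so the rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "w") as tmp:
            json.dump(payload, tmp)
            tmp.flush()
            os.fsync(tmp.fileno())  # push the file data to disk before renaming
        os.replace(tmp_path, path)  # atomic rename on POSIX filesystems
    except BaseException:
        # Remove the temp file if anything failed before the rename completed.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
    # Sync the directory so the rename itself is durable across a power loss.
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

The rename guarantees that a reader never sees a half-written file while the system is running. Durability across an unclean shutdown additionally depends on the fsync calls and on the storage device honoring them.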

If that is indeed a more common problem, maybe we should sanity check that file or something.
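As a rough illustration of what such a sanity check could look like, the sketch below treats the file as valid only if it parses as a JSON object, and deletes it otherwise so that Docker can generate a fresh one (as the original report describes happening after manual deletion). The function names are hypothetical, and this is not necessarily how the check in HAOS is implemented.

```python
import json
import os


def key_file_is_valid(path):
    """Return True if `path` is missing or holds a JSON object, False otherwise."""
    try:
        with open(path, "rb") as handle:
            raw = handle.read()
    except FileNotFoundError:
        return True  # nothing to repair; assumed to be regenerated on next start
    try:
        parsed = json.loads(raw.decode("utf-8"))
    except (UnicodeDecodeError, ValueError):
        return False  # covers NUL-filled, truncated, or empty files
    return isinstance(parsed, dict)


def remove_if_corrupt(path):
    """Delete `path` when it fails the check; return True if it was removed."""
    if key_file_is_valid(path):
        return False
    os.remove(path)
    return True
```

It would be called with the key file path before the Docker daemon starts, for example `remove_if_corrupt("/etc/docker/key.json")`. A file full of null bytes, like the ones reported above, fails JSON parsing and is removed.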