operating-system: Docker key.json has invalid contents and the system refuses to boot

Describe the issue you are experiencing

I’m 2 for 2 now on this. Each reboot after an OS upgrade somehow leaves a corrupt key.json file on my Home Assistant OS install. Each time, I have to boot from a recovery USB drive, mount the right partition (in this case /dev/sda7), and delete the /etc/docker/key.json file to fix the system. I don’t intentionally manage this file in any way, nor do I care about its contents, so the Docker-generated version is just fine with me.

As a test, I fixed the file, booted into Home Assistant, and then restarted into my recovery USB to inspect the contents. This time the file contained valid JSON rather than the invalid content.

What operating system image do you use?

generic-x86-64 (Generic UEFI capable x86-64 systems)

What version of Home Assistant Operating System is installed?

6.6

Did you upgrade the Operating System?

Yes

Steps to reproduce the issue

Update Home Assistant OS via supervisor in the web interface
Home Assistant then fails to reboot into a valid system

Anything in the Supervisor logs that might be useful for us?

Nothing in there from the update or prior to the successful boot of my system.

Anything in the Host logs that might be useful for us?

Nothing

System Health information

System Health

version	core-2021.11.1
installation_type	Home Assistant OS
dev	false
hassio	true
docker	true
user	root
virtualenv	false
python_version	3.9.7
os_name	Linux
os_version	5.10.88
arch	x86_64
timezone	America/Los_Angeles

Home Assistant Community Store

GitHub API	ok
Github API Calls Remaining	4890
Installed Version	1.16.0
Stage	running
Available Repositories	932
Installed Repositories	9

Home Assistant Cloud

logged_in	true
subscription_expiration	January 13, 2022, 4:00 PM
relayer_connected	true
remote_enabled	false
remote_connected	false
alexa_enabled	true
google_enabled	false
remote_server	us-west-2-1.ui.nabu.casa
can_reach_cert_server	ok
can_reach_cloud_auth	ok
can_reach_cloud	ok

Home Assistant Supervisor

host_os	Home Assistant OS 7.1
update_channel	stable
supervisor_version	supervisor-2021.12.2
docker_version	20.10.9
disk_total	219.4 GB
disk_used	7.4 GB
healthy	true
supported	true
board	generic-x86-64
supervisor_api	ok
version_api	ok
installed_addons	chrony (2.2.1), Samba share (9.5.1), Z-Wave JS to MQTT (0.27.0), Node-RED (10.1.1), Terminal & SSH (9.2.1), Network UPS Tools (0.9.0)

keymaster

zwave_integration	zwave_js
network_status	on

Lovelace

dashboards	4
resources	6
views	8
mode	storage

Additional information

This is the contents of the file: PXL_20220106_031713090

And these are the log entries from journalctl: PXL_20220106_032525837

About this issue

State: closed
Created 2 years ago
Comments: 18 (7 by maintainers)

Commits related to this issue

Remove key.json file if it appears to be corrupted (#1706), committed to home-assistant/operating-system by agners 2 years ago
Remove key.json file if it appears to be corrupted (#1706) (#1988), with the message body listing "Remove key.json file if it appears to be corrupted (#1706)" and "Check with jq if key.json is parsable"; committed to home-assistant/operating-system by agners 2 years ago

Most upvoted comments

I just hit this after an unclean shutdown. The debugging experience was really bad because of the boot loop.

On a hunch, based on the boot loop and not having found this issue yet, I grepped this repo for FailureAction=reboot and found that in docker.service, so I booted with systemd.mask=docker.service and tried to run dockerd manually, which exited with an error about key.json, which indeed was full of null bytes. I removed it and rebooted and all was back to normal.

There is a docker-failure script which seems to automate this, which I guess will be in the next release, but the messages it prints will never be seen on hardware that reboots quickly.

You could consider using StartLimitAction=reboot instead of FailureAction=reboot, with StartLimitIntervalSec and StartLimitBurst set accordingly, so systemd will try to start docker a few times, giving an operator a chance to see which service is failing and any output from the docker-failure script, before rebooting if it continues to fail to start. This also means that if the problem was a corrupted key.json, it will successfully start on the second try with no need for a reboot.

I don’t know much about docker, but does key.json actually need to be persisted? If it’s just a key for docker clients to communicate with dockerd, why not just generate a random one before docker starts on each boot?

jbg on Jul 22, 2022
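The retry idea in the comment above can be illustrated with a small Python simulation. This is only a sketch of the policy, not systemd itself and not HAOS code: `start_with_limit`, `fake_dockerd`, and `fake_failure_hook` are hypothetical names, and `burst` merely stands in for the role StartLimitBurst would play.

```python
from typing import Callable


def start_with_limit(
    start: Callable[[], bool],
    on_failure: Callable[[], None],
    burst: int = 3,
) -> str:
    """Try `start` up to `burst` times, running `on_failure` after each failure.

    Returns "running" as soon as a start succeeds, or "reboot" once the
    limit is exhausted (the role StartLimitAction=reboot would play).
    """
    for attempt in range(1, burst + 1):
        if start():
            return "running"
        # An operator watching the console gets a chance to read this.
        print(f"start attempt {attempt}/{burst} failed")
        on_failure()
    return "reboot"


# Toy state: the daemon refuses to start while the key file is corrupt.
state = {"key_corrupt": True}


def fake_dockerd() -> bool:
    """Hypothetical stand-in for starting dockerd."""
    return not state["key_corrupt"]


def fake_failure_hook() -> None:
    """Hypothetical stand-in for a cleanup step that deletes key.json."""
    state["key_corrupt"] = False


result = start_with_limit(fake_dockerd, fake_failure_hook)
```

In this toy run the first attempt fails, the hook clears the corrupt key, and the second attempt succeeds without a reboot. A start that never succeeds falls through to "reboot" after `burst` attempts.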

One way to remove the corrupted key.json is to press e to edit the boot entry, go to the end of the third line, add systemd.unit=rescue.target, and then press F10. This should boot into a rescue shell. Press Enter to enter it, then type rm /mnt/overlay/etc/docker/key.json.

agners on Jun 24, 2022

I believe I just fell victim to this as well, and with HAOS 9.3!

How do I correct the corrupted key.json?

You can just delete the file at /mnt/overlay/etc/docker/key.json

and using this still boot loops

Hm, that then sounds like a different problem. systemd.unit=rescue.target really should not attempt to start the Docker daemon, and therefore should not cause a boot loop. Can you share the boot log when using rescue.target?

agners on Nov 28, 2022

With HAOS 9.0 an invalid key file will be detected, see https://github.com/home-assistant/operating-system/pull/1988.

agners on Sep 12, 2022
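The related commits describe checking whether key.json is parsable with jq and removing it if it appears corrupted. The same idea can be sketched in Python for illustration. This is not the code from the pull request: the function names are hypothetical, and treating only a JSON object as valid is an assumption about the file's expected shape.

```python
import json
import tempfile
from pathlib import Path


def key_file_is_valid(path: Path) -> bool:
    """Return True if `path` can be read and parses as a JSON object."""
    try:
        data = json.loads(path.read_bytes())
    except (OSError, ValueError):
        # ValueError covers JSON syntax errors and undecodable bytes,
        # which is what a file full of null bytes produces.
        return False
    # Assumption: a usable key file is a JSON object, not a list or scalar.
    return isinstance(data, dict)


def remove_if_corrupt(path: Path) -> bool:
    """Delete `path` if it exists but is not valid. Return True if removed."""
    if path.exists() and not key_file_is_valid(path):
        path.unlink()
        return True
    return False


# Demonstration on throwaway files, never on a real /etc/docker/key.json.
with tempfile.TemporaryDirectory() as tmp:
    key = Path(tmp) / "key.json"

    key.write_bytes(b"\x00" * 4096)          # mimics the null-filled file
    removed_corrupt = remove_if_corrupt(key)

    key.write_text('{"kty": "EC"}')          # placeholder content, not a real key
    removed_valid = remove_if_corrupt(key)
```

A check like this would run before the daemon starts, so that a fresh key can be generated instead of the start failing on the corrupt file.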

I think it is a terrible idea to boot loop forever, so fast that you can’t read what is causing the issue, on a project made for low-skill users. IMHO this should be reverted to dropping into a rescue shell.

The boot loop is implemented so that the bootloader can switch to the other (presumably good) installation. (HAOS has two OS installations, A and B. Each upgrade updates the other system, not the currently running one.) However, if the old, presumably good installation is not booting either, simply rebooting is indeed not helpful. Currently we don’t detect that situation or react to it.

This case is somewhat special as data corruption in the shared overlay partition causes both installations (A and B) to fail.

The file, which isn’t in the homeassistant_data partition but in one called “homeassistant_overlay” (don’t know what it’s used for), was full of `\00\00\00\00\00\00…00`, one very long line. Removing the file fixed the issue.

We use the overlay partition to make certain parts of /etc writeable. It is a simple ext4 file system with some directories bind mounted to certain directories in real /etc, like /etc/docker.

Now your case looks like corruption happened to that partition. However, as written in https://github.com/home-assistant/operating-system/issues/1706#issuecomment-1006432086, I don’t really understand why that can happen in the first place: ext4 is a journaling file system, and Docker uses an atomic write for this particular file.

If that is indeed a more common problem, maybe we should sanity check that file or something.

agners on Jun 18, 2022
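For readers unfamiliar with the atomic write mentioned above, the usual pattern is to write a temporary file, flush it to disk, and rename it over the target. The Python sketch below shows that general pattern on a POSIX system. It is not Docker's implementation, `atomic_write` is a hypothetical helper, and whether a missing flush explains the null bytes reported here is not established in this thread.

```python
import contextlib
import os
import tempfile


def atomic_write(path: str, data: bytes) -> None:
    """Replace `path` with `data` so readers see the old or the new file."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temporary file must live on the same file system as the target,
    # otherwise the rename below would not be atomic.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())  # push the contents to disk before renaming
        os.replace(tmp_path, path)  # atomic rename on POSIX file systems
    except BaseException:
        # Clean up the temporary file if anything failed before the rename.
        with contextlib.suppress(FileNotFoundError):
            os.unlink(tmp_path)
        raise
    # Flush the directory entry so the rename itself survives a power loss.
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

The rename only guarantees that the name switches atomically. The `os.fsync` calls are what ask the kernel to make the contents and the directory entry durable, which is why the flush before the rename matters after an unclean shutdown.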