operating-system: Docker key.json has invalid contents and the system refuses to boot
Describe the issue you are experiencing
I’m 2 for 2 now on this. Each reboot after an upgrade to the OS somehow leads to a corrupt key.json file on my Home Assistant OS install. Each time I have to boot via a recovery USB drive, mount the right partition (in this case /dev/sda7) and delete the /etc/docker/key.json file to fix the system. I don’t intentionally manage this file in any way, nor do I care about the contents, so the Docker generated version is just fine with me.
As a test, I fixed the file, booted into Home Assistant and then restarted back into my recovery USB to inspect the contents. They were valid JSON instead of the invalid content.
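The manual recovery described above could be sketched roughly as follows. This is only an illustration: the helper name `remove_key_json` and the mount point `/mnt/overlay` are assumptions, and `/dev/sda7` is simply the partition on this particular system.

```shell
#!/bin/sh
# Hypothetical helper: delete Docker's key.json from an already-mounted
# overlay partition. $1 is the directory where the partition is mounted.
remove_key_json() {
    overlay_root=$1
    key_file="$overlay_root/etc/docker/key.json"

    if [ ! -e "$key_file" ]; then
        echo "nothing to do: $key_file does not exist" >&2
        return 0
    fi

    # Show the first bytes so the corruption can be confirmed before deleting.
    od -c "$key_file" | head -n 5 >&2

    rm -f -- "$key_file"
}
```

From the recovery environment this might be used as `mount /dev/sda7 /mnt/overlay`, then `remove_key_json /mnt/overlay`, then `umount /mnt/overlay`. As noted above, Docker writes a fresh, valid file on the next boot.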
What operating system image do you use?
generic-x86-64 (Generic UEFI capable x86-64 systems)
What version of Home Assistant Operating System is installed?
6.6
Did you upgrade the Operating System.
Yes
Steps to reproduce the issue
- Update Home Assistant OS via supervisor in the web interface
- Home Assistant then fails to reboot into a valid system
Anything in the Supervisor logs that might be useful for us?
Nothing in there from the update or prior to the successful boot of my system.
Anything in the Host logs that might be useful for us?
Nothing
System Health information
System Health
| version | core-2021.11.1 |
|---|---|
| installation_type | Home Assistant OS |
| dev | false |
| hassio | true |
| docker | true |
| user | root |
| virtualenv | false |
| python_version | 3.9.7 |
| os_name | Linux |
| os_version | 5.10.88 |
| arch | x86_64 |
| timezone | America/Los_Angeles |
Home Assistant Community Store
| GitHub API | ok |
|---|---|
| Github API Calls Remaining | 4890 |
| Installed Version | 1.16.0 |
| Stage | running |
| Available Repositories | 932 |
| Installed Repositories | 9 |
Home Assistant Cloud
| logged_in | true |
|---|---|
| subscription_expiration | January 13, 2022, 4:00 PM |
| relayer_connected | true |
| remote_enabled | false |
| remote_connected | false |
| alexa_enabled | true |
| google_enabled | false |
| remote_server | us-west-2-1.ui.nabu.casa |
| can_reach_cert_server | ok |
| can_reach_cloud_auth | ok |
| can_reach_cloud | ok |
Home Assistant Supervisor
| host_os | Home Assistant OS 7.1 |
|---|---|
| update_channel | stable |
| supervisor_version | supervisor-2021.12.2 |
| docker_version | 20.10.9 |
| disk_total | 219.4 GB |
| disk_used | 7.4 GB |
| healthy | true |
| supported | true |
| board | generic-x86-64 |
| supervisor_api | ok |
| version_api | ok |
| installed_addons | chrony (2.2.1), Samba share (9.5.1), Z-Wave JS to MQTT (0.27.0), Node-RED (10.1.1), Terminal & SSH (9.2.1), Network UPS Tools (0.9.0) |
keymaster
| zwave_integration | zwave_js |
|---|---|
| network_status | on |
Lovelace
| dashboards | 4 |
|---|---|
| resources | 6 |
| views | 8 |
| mode | storage |
Additional information
This is the contents of the file:

And this is the log entries from journalctl:

About this issue
- State: closed
- Created 2 years ago
- Comments: 18 (7 by maintainers)
Commits related to this issue
- Remove key.json file if it appears to be corrupted (#1706) (committed to home-assistant/operating-system by agners 2 years ago)
- Remove key.json file if it appears to be corrupted (#1706) (#1988) (committed to home-assistant/operating-system by agners 2 years ago)
  - Remove key.json file if it appears to be corrupted (#1706)
  - Check with jq if key.json is parsable
I just hit this after an unclean shutdown. The debugging experience was really bad because of the boot loop.
On a hunch, based on the boot loop and not having found this issue yet, I grepped this repo for `FailureAction=reboot` and found that in `docker.service`, so I booted with `systemd.mask=docker.service` and tried to run `dockerd` manually, which exited with an error about `key.json`, which indeed was full of null bytes. I removed it and rebooted and all was back to normal.
There is a `docker-failure` script which seems to automate this, which I guess will be in the next release, but the messages it prints will never be seen on hardware that reboots quickly.
You could consider using `StartLimitAction=reboot` instead of `FailureAction=reboot`, with `StartLimitIntervalSec` and `StartLimitBurst` set accordingly, so systemd will try to start docker a few times, giving an operator a chance to see which service is failing and any output from the `docker-failure` script, before rebooting if it continues to fail to start. This also means that if the problem was a corrupted `key.json`, it will successfully start on the second try with no need for a reboot.
I don’t know much about docker, but does `key.json` actually need to be persisted? If it’s just a key for docker clients to communicate with dockerd, why not just generate a random one before docker starts on each boot?
One way to remove the corrupted `key.json` is to use `e` to edit the boot entry, go to the end of the 3rd line and add `systemd.unit=rescue.target`, then press `F10`. This should boot into a rescue shell. Enter it using `Enter`, then type `rm /mnt/overlay/etc/docker/key.json`.
You can just delete the file at `/mnt/overlay/etc/docker/key.json`
Hm, that then sounds like a different problem. `systemd.unit=rescue.target` really should not attempt to start the docker daemon, and therefore should not cause a boot loop. Can you share the boot log when using `rescue.target`?
With HAOS 9.0 an invalid key file will be detected, see https://github.com/home-assistant/operating-system/pull/1988.
The boot loop is implemented so that the bootloader can switch to the other (presumably good) installation (HAOS has two OS installations, A and B; each upgrade updates the other, not currently running, system). However, if the old, presumably good installation does not boot either, simply rebooting is indeed not helpful. Currently we don’t detect that situation or behave accordingly.
This case is somewhat special as data corruption in the shared overlay partition causes both installations (A and B) to fail.
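The `StartLimitAction=reboot` idea suggested earlier in this thread might look something like the drop-in below, written here by a small shell helper. This is a sketch, not what HAOS ships: the helper name, the file name `10-start-limit.conf`, the numeric values, and the `Restart=on-failure` lines are all assumptions, and the real `docker.service` may already set some of these options.

```shell
#!/bin/sh
# Hypothetical helper: write a docker.service drop-in into the directory
# given as $1. All values below are illustrative assumptions.
write_docker_dropin() {
    dropin_dir=$1
    mkdir -p "$dropin_dir" || return 1

    cat > "$dropin_dir/10-start-limit.conf" <<'EOF'
[Unit]
# Do not reboot on the first failure.
FailureAction=none
# Reboot only once the start limit is hit: more than 3 starts in 5 minutes.
StartLimitIntervalSec=300
StartLimitBurst=3
StartLimitAction=reboot

[Service]
# Let systemd retry the start, leaving time to read any failure output.
Restart=on-failure
RestartSec=5
EOF
}
```

On a conventional systemd system the target directory would be `/etc/systemd/system/docker.service.d`, followed by `systemctl daemon-reload`. On HAOS a change like this would presumably have to be made in the OS image itself.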
We use the overlay partition to make certain parts of `/etc` writeable. It is a simple ext4 file system with some directories bind mounted to certain directories in the real `/etc`, like `/etc/docker`.
Now your case looks like corruption happened to that partition. However, as written in https://github.com/home-assistant/operating-system/issues/1706#issuecomment-1006432086, I don’t really understand why that can happen in the first place: `ext4` is a journaling file system, and Docker uses an atomic write for this particular file.
If that is indeed a more common problem, maybe we should sanity check that file or something.
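A sanity check of this kind could be sketched in shell as below. The related commit mentions checking with `jq` whether `key.json` is parsable. This sketch swaps in a cruder test (an empty file or one containing NUL bytes, matching the null-byte corruption reported in this thread) so that it runs without `jq`. The function names are hypothetical.

```shell
#!/bin/sh
# Hypothetical check: succeed (exit 0) if the file looks corrupted,
# i.e. it exists but is empty or contains NUL bytes.
key_json_is_corrupt() {
    key_file=$1

    # A missing file is not corruption; a new one gets generated.
    [ -e "$key_file" ] || return 1

    # An empty file cannot be valid JSON.
    [ -s "$key_file" ] || return 0

    # Compare the byte count with and without NUL bytes.
    total=$(($(wc -c < "$key_file")))
    without_nul=$(($(tr -d '\000' < "$key_file" | wc -c)))
    [ "$total" -ne "$without_nul" ]
}

# Hypothetical wrapper: remove the file only when the check above fires.
remove_key_json_if_corrupt() {
    if key_json_is_corrupt "$1"; then
        echo "removing corrupted $1" >&2
        rm -f -- "$1"
    fi
}
```

Running `remove_key_json_if_corrupt /etc/docker/key.json` before the Docker daemon starts would leave a healthy file untouched and delete one that is empty or contains NUL bytes. It does not catch other forms of invalid JSON, which is where a real parser such as `jq` would be the better tool.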