core: More frequent disconnects of HomeKit Thread devices in 2022.12 (CoAP POST returned unexpected code)

The problem

Since 2022.12, I seem to be experiencing more disconnects for my HomeKit Thread devices (2 Eve light switches, 1 WeMo Scene remote). There is a desync error in the logs that is much more frequent than before. There has been no big change to my network architecture.

Home Assistant 2022.12.1
Supervisor 2022.11.2
Operating System 9.3
Frontend 20221208.0 - latest

What version of Home Assistant Core has the issue?

2022.12.1

What was the last working version of Home Assistant Core?

2022.11

What type of installation are you running?

Home Assistant OS

Integration causing the issue

homekit_controller

Link to integration documentation on our website

https://www.home-assistant.io/integrations/homekit_controller

Diagnostics information

No response

Example YAML snippet

No response

Anything in the logs that might be useful for us?

Logger: aiohomekit.controller.coap.connection
Source: components/homekit_controller/connection.py:726
First occurred: 14:21:40 (23 occurrences)
Last logged: 14:40:48

Decryption failed, desynchronized? Counter=0/3
Pair verify timed out
Decryption failed, desynchronized? Counter=11/13
CoAP POST returned unexpected code <aiocoap.Message at 0x7f5a909390: ACK 4.04 Not Found (MID 36576, token 53b4) remote <UDP6EndpointAddress [fdbc:107:de33:0:2fbc:f566:6fe8:4bd3] (locally fd9e:af89:770f:40f8:dfd4:1c17:ed42:3138%eth0)>>
Failed flailing attempts to resynchronize, self-destructing in 3, 2, 1...

Additional information

No response

About this issue

  • State: closed
  • Created 2 years ago
  • Comments: 22 (13 by maintainers)

Most upvoted comments

The Thread diagnostic download is in the current stable release.

We now know that a multitude of bugs and config errors are at play, all of which disrupt Thread, not just in Home Assistant but on Linux in general.

This post assumes that the most basic config errors have been resolved: for example, that there are no VLANs and that WiFi repeaters aren’t disrupting mDNS or ICMPv6 (some are known to really break stuff).

We know that older versions of NetworkManager (which most desktop Linux distributions use, and HAOS 9.5 does too) had a bug I call ghost routes. Whenever a border router (BR) changes its link-local address, the Linux box remembers the old address and adds the new one to its route table. In time you can end up with 10 ghost routes. Depending on settings in other parts of the Linux stack, these routes continue to be used as if they were valid. With 10 ghost routes next to one live route, the odds can be roughly 10:1 that traffic goes out via a dead one.
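To make the ghost-route symptom easier to spot, here is a minimal sketch that groups the next hops in `ip -6 route show` output by destination prefix and flags any prefix that has more gateways than you expect. The function names, the `max_expected` threshold and the sample addresses are made up for illustration. The route table alone cannot tell you which next hops are ghosts, and several next hops are legitimate when you really do have several BRs, so treat the result as a list of candidates to check by hand.

```python
from collections import defaultdict


def next_hops_by_prefix(route_lines):
    """Group the `via` gateways in `ip -6 route show` output by destination."""
    hops = defaultdict(set)
    current = None
    for line in route_lines:
        fields = line.split()
        if not fields:
            continue
        # Multipath routes list extra gateways on "nexthop via ..." lines
        # that belong to the destination printed just above them.
        if fields[0] != "nexthop":
            current = fields[0]
        if "via" in fields and current is not None:
            position = fields.index("via") + 1
            if position < len(fields):
                hops[current].add(fields[position])
    return dict(hops)


def suspicious_prefixes(route_lines, max_expected=1):
    """Return prefixes reachable through more gateways than expected."""
    return {
        prefix: sorted(gateways)
        for prefix, gateways in next_hops_by_prefix(route_lines).items()
        if len(gateways) > max_expected
    }


# Hypothetical sample output; the addresses are invented for this example.
SAMPLE_ROUTES = """\
fd11:2233:4455::/64 via fe80::1 dev eth0 proto ra metric 100 pref medium
fd11:2233:4455::/64 via fe80::2 dev eth0 proto ra metric 100 pref medium
fd11:2233:4455::/64 via fe80::3 dev eth0 proto ra metric 100 pref medium
fd99:aaaa::/64 dev eth0 proto kernel metric 256 pref medium
""".splitlines()

print(suspicious_prefixes(SAMPLE_ROUTES))
```

On a real box you would feed it the captured output of `ip -6 route show` and set `max_expected` to the number of BRs you actually run.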

Newer versions of HAOS use newer versions of NetworkManager. These solve the problem… by only allowing a single border router. Every time a new border router announces itself, NetworkManager forgets the current one. With 3 BRs we expect to see the route table churn once a minute on average. If those changes were not atomic (for example, a remove followed by an add), there would be a tiny window once a minute where the mesh would be inaccessible. This may not matter in practice but is a concern. More practically, it does mean there is no hope of failover until the next announcement.

HAOS 10 final release (it’s not in rc2) will carry a patch to allow NM to track multiple BRs.

Of course, the next problem is that in some environments the kernel’s Neighbour Unreachability Detection (NUD) is not working. When the kernel considers a neighbour stale, it probes it with ICMPv6 packets. If 3 probes fail, the neighbour is marked as failed. Failed neighbours are scored lower when making routing decisions. At least that’s what’s supposed to happen.
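One rough way to see whether the kernel is marking dead routers as failed is to look at the neighbour table. The sketch below parses `ip -6 neigh show` output and reports the state of every entry carrying the `router` flag. The function names and sample lines are invented for illustration, and the parsing assumes the usual iproute2 layout where the state is the last word on the line.

```python
from collections import Counter

# Neighbour states as printed by iproute2.
NUD_STATES = {
    "PERMANENT", "NOARP", "REACHABLE", "STALE", "NONE",
    "INCOMPLETE", "DELAY", "PROBE", "FAILED",
}


def parse_neighbours(lines):
    """Turn `ip -6 neigh show` lines into small dicts."""
    entries = []
    for line in lines:
        fields = line.split()
        if not fields:
            continue
        state = fields[-1] if fields[-1] in NUD_STATES else "UNKNOWN"
        entries.append(
            {"address": fields[0], "router": "router" in fields, "state": state}
        )
    return entries


def router_state_counts(lines):
    """Count neighbour states, looking only at entries flagged as routers."""
    return Counter(e["state"] for e in parse_neighbours(lines) if e["router"])


def failed_routers(lines):
    """Addresses of routers the kernel has marked FAILED."""
    return [
        e["address"]
        for e in parse_neighbours(lines)
        if e["router"] and e["state"] == "FAILED"
    ]


# Hypothetical sample output; the addresses are invented for this example.
SAMPLE_NEIGHBOURS = """\
fe80::1 dev eth0 lladdr 00:11:22:33:44:55 router REACHABLE
fe80::2 dev eth0 router FAILED
fe80::3 dev eth0 lladdr 00:11:22:33:44:66 router STALE
fd11:2233:4455::10 dev eth0 lladdr 00:11:22:33:44:77 REACHABLE
""".splitlines()

print(router_state_counts(SAMPLE_NEIGHBOURS))
print(failed_routers(SAMPLE_NEIGHBOURS))
```

If a BR address you know is gone never shows up as FAILED, that is at least consistent with the detection not doing its job on that box.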

If you have IP forwarding turned on (e.g. you are running an OTBR, or some container setups), the kernel disables this feature. In this scenario, even with a working NetworkManager, your network could go down for 30 minutes every time an IP changes.

HAOS 10 final (again, not in rc2) will have a kernel patch to avoid this. The patch is not upstream yet, so it’s broken for anyone running Supervised on their own OS (if they have forwarding enabled), and potentially for people running the container directly.

If you are running HA Core on a system without systemd-networkd or NetworkManager and you don’t have forwarding enabled, you likely have a very reliable network for running Thread BRs. Oh wait, no. By default Linux actually drops route advertisements of the type Thread sends, so you need to configure sysctls manually.
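The comment does not name the sysctls involved. The sketch below assumes they are the kernel's per-interface `accept_ra`, `accept_ra_rt_info_max_plen` and `forwarding` settings, and it assumes a threshold of 64 because Thread mesh prefixes are /64. Both are my assumptions, not something stated in this thread. The function names and the `eth0` interface are placeholders.

```python
from pathlib import Path

# Assumption: these are the settings that matter; the comment above does not list them.
WATCHED = ("accept_ra", "accept_ra_rt_info_max_plen", "forwarding")


def read_ipv6_sysctls(interface, names=WATCHED, root="/proc/sys/net/ipv6/conf"):
    """Read per-interface IPv6 sysctls; unreadable values come back as None."""
    values = {}
    for name in names:
        try:
            values[name] = int((Path(root) / interface / name).read_text().strip())
        except (OSError, ValueError):
            values[name] = None
    return values


def check_ra_settings(values, wanted_plen=64):
    """Return human-readable warnings for settings likely to drop Thread routes."""
    problems = []
    accept_ra = values.get("accept_ra")
    forwarding = values.get("forwarding")
    max_plen = values.get("accept_ra_rt_info_max_plen")

    if accept_ra == 0:
        problems.append("accept_ra=0: router advertisements are ignored")
    elif accept_ra == 1 and forwarding == 1:
        # With forwarding on, accept_ra=1 stops accepting RAs; 2 overrides that.
        problems.append("accept_ra=1 with forwarding=1: set accept_ra=2 to keep accepting RAs")

    if max_plen is not None and max_plen < wanted_plen:
        problems.append(
            f"accept_ra_rt_info_max_plen={max_plen}: route information for "
            f"/{wanted_plen} prefixes is ignored"
        )
    return problems


# Example with hand-written values, so it runs anywhere.
example = {"accept_ra": 1, "accept_ra_rt_info_max_plen": 0, "forwarding": 1}
for warning in check_ra_settings(example):
    print(warning)
```

On a real Linux host you could call `check_ra_settings(read_ipv6_sysctls("eth0"))`, replacing `eth0` with the interface that faces your border routers.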

We have also seen a weak mesh manifest in the same way. Turning off a BR with a weak mesh connection (often an Apple TV in a closet) can spring a mesh back to life with no further intervention.

Then there are the BR bugs. We are still seeing BRs rotating their mesh prefixes fairly rapidly. When everything keeps changing its IP, it’s kinda hard to be stable.

I have had HA Core running like this since August and I still see a blip every one to two weeks. Restarting HA does help. That’s probably an HA bug. But then again, sometimes waking a device (pushing a physical button) seems to get it back in line too. So it might not be.

It’s not. Your device is sending data encrypted with an old key that Home Assistant isn’t using at the time it’s received. It could be related, but it’s distinct enough that it needs a separate ticket.