addons: Silicon Labs Multiprotocol addon failing intermittently causing HAP, Thread and Zigbee issues
Describe the issue you are experiencing
Every few days to a week, the Silicon Labs Multiprotocol addon stops communicating and reports ‘resource temporarily unavailable’, without saying which resource is affected. When the addon stops, it breaks my light integration (Nanoleaf via Thread using HAP) and my Zigbee sensors via ZHA.
Restarting the addon doesn’t fix the issue, nor does restarting Home Assistant. It usually requires a complete reboot of the host, and even then it sometimes repeatedly reports ‘resource temporarily unavailable’.
What type of installation are you running?
Home Assistant OS
Which operating system are you running on?
Home Assistant Operating System
Which add-on are you reporting an issue with?
Silicon Labs Multiprotocol
What is the version of the add-on?
2.3.2
Steps to reproduce the issue
Wish I knew; it fails at random times (sometimes 3am, sometimes 5pm).
System Health information
System Information
| version | core-2023.8.4 |
|---|---|
| installation_type | Home Assistant OS |
| dev | false |
| hassio | true |
| docker | true |
| user | root |
| virtualenv | false |
| python_version | 3.11.4 |
| os_name | Linux |
| os_version | 6.1.37 |
| arch | x86_64 |
| timezone | Australia/Adelaide |
| config_dir | /config |
Home Assistant Community Store
| GitHub API | ok |
|---|---|
| GitHub Content | ok |
| GitHub Web | ok |
| GitHub API Calls Remaining | 5000 |
| Installed Version | 1.32.1 |
| Stage | running |
| Available Repositories | 1335 |
| Downloaded Repositories | 37 |
AccuWeather
| can_reach_server | ok |
|---|---|
| remaining_requests | 14 |
Home Assistant Cloud
| logged_in | false |
|---|---|
| can_reach_cert_server | ok |
| can_reach_cloud_auth | ok |
| can_reach_cloud | failed to load: unreachable |
Home Assistant Supervisor
| host_os | Home Assistant OS 11.0.dev20230705 |
|---|---|
| update_channel | beta |
| supervisor_version | supervisor-2023.08.3 |
| agent_version | 1.5.1 |
| docker_version | 23.0.6 |
| disk_total | 30.8 GB |
| disk_used | 23.3 GB |
| healthy | true |
| supported | true |
| board | ova |
| supervisor_api | ok |
| version_api | ok |
| installed_addons | Samba share (10.0.2), Network UPS Tools (0.12.0), Matter Server (4.9.0), Silicon Labs Multiprotocol (2.3.2), Mosquitto broker (6.2.1), Zigbee2MQTT (1.32.2-1), Custom deps deployment (1.3.3), Home Assistant Google Drive Backup (0.111.1), ESPHome (2023.8.2), Terminal & SSH (9.7.1), Studio Code Server (5.10.1), PS5 MQTT (1.3.1), Whisper (1.0.0), Piper (1.3.2), Advanced SSH & Web Terminal (15.0.7), Silicon Labs Flasher (0.2.0) |
Dashboards
| dashboards | 3 |
|---|---|
| resources | 24 |
| views | 11 |
| mode | storage |
Recorder
| oldest_recorder_run | 16 August 2023 at 08:32 |
|---|---|
| current_recorder_run | 25 August 2023 at 20:34 |
| estimated_db_size | 933.92 MiB |
| database_engine | sqlite |
| database_version | 3.41.2 |
Anything in the Supervisor logs that might be useful for us?
Nothing relating to these issues.
Anything in the add-on logs that might be useful for us?
Logger: bellows.zigbee.application
Source: /usr/local/lib/python3.11/site-packages/bellows/zigbee/application.py:643
First occurred: 10:49:25 (4040 occurrences)
Last logged: 17:00:02
ControllerApplication reset unsuccessful: ConnectionRefusedError(111, "Connect call failed ('172.30.32.1', 9999)")
Traceback (most recent call last):
File "/usr/local/lib/python3.11/site-packages/bellows/zigbee/application.py", line 640, in _reset_controller_loop
await self._reset_controller()
File "/usr/local/lib/python3.11/site-packages/bellows/zigbee/application.py", line 661, in _reset_controller
await self.connect()
File "/usr/local/lib/python3.11/site-packages/bellows/zigbee/application.py", line 133, in connect
self._ezsp = await bellows.ezsp.EZSP.initialize(self.config)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/bellows/ezsp/__init__.py", line 164, in initialize
await ezsp.connect(use_thread=zigpy_config[conf.CONF_USE_THREAD])
File "/usr/local/lib/python3.11/site-packages/bellows/ezsp/__init__.py", line 181, in connect
self._gw = await bellows.uart.connect(self._config, self, use_thread=use_thread)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/bellows/uart.py", line 414, in connect
protocol, _ = await _connect(config, application)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/bellows/uart.py", line 385, in _connect
transport, protocol = await zigpy.serial.create_serial_connection(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/zigpy/serial.py", line 31, in create_serial_connection
transport, protocol = await loop.create_connection(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/asyncio/base_events.py", line 1085, in create_connection
raise exceptions[0]
File "/usr/local/lib/python3.11/asyncio/base_events.py", line 1069, in create_connection
sock = await self._connect_sock(
^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/asyncio/base_events.py", line 973, in _connect_sock
await self.sock_connect(sock, address)
File "/usr/local/lib/python3.11/asyncio/selector_events.py", line 628, in sock_connect
return await fut
^^^^^^^^^
File "/usr/local/lib/python3.11/asyncio/selector_events.py", line 668, in _sock_connect_cb
raise OSError(err, f'Connect call failed {address}')
ConnectionRefusedError: [Errno 111] Connect call failed ('172.30.32.1', 9999)
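The traceback shows bellows being refused on TCP 172.30.32.1:9999, which means nothing was accepting connections on that socket at the time. A minimal sketch for checking whether that endpoint is listening, using only the Python standard library. The host and port come from the log above; the function name `port_is_open` and the 3-second timeout are illustrative assumptions, not part of the addon or bellows:

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to (host, port) succeeds."""
    try:
        # create_connection raises an OSError subclass (for example
        # ConnectionRefusedError or a timeout) when the connect fails.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# Example call, with the address copied from the log above:
#   port_is_open("172.30.32.1", 9999)
```

A False result while the addon is supposedly running would suggest the failure is on the addon side rather than in ZHA.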
Additional information
HAOS is hosted in a VM on my Unraid server. My Unraid server is still able to see and interact with the USB device(s) when HAOS fails to. For context:
I have the SkyConnect USB as well as a “Sonoff Zigbee 3.0 USB Dongle Plus V2” (model “ZBDongle-E”). I mainly use the SkyConnect for everything, and the Sonoff is a recent purchase. Both are flashed with the latest version of the MultiPAN firmware.
I have tried both the stable and beta versions of HAOS, with no change to the outcome. Sometimes it works for several days, sometimes it fails more than 5 times a day.
About this issue
- Original URL
- State: open
- Created 10 months ago
- Reactions: 5
- Comments: 114 (4 by maintainers)
I have two new test firmwares based on the latest GSDK v4.3.2. Let me know how they work for you.
These recover automatically after a crash, as does the addon, which should eliminate the need to unplug your SkyConnect in the future if something breaks. They also should hopefully fix the underlying crashing issue with OpenThread.
You can flash the SkyConnect from the firmware update page (https://skyconnect.home-assistant.io/firmware-update/) with a compatible browser. For the Yellow, you will have to use a console flasher.
Make sure you disable automatic firmware installation in the multiprotocol addon and save! Otherwise, it will reinstall the old version automatically.
4.3.2 crashes so frequently (i.e. within a minute) that it is completely unusable, unfortunately.
The newly introduced bug seems unrelated so we will have to wait until 4.3.3 is released, hopefully with a fix for both.
Yes, I have auto flashing on, so I’m no longer on the custom firmware but just the add-on firmware. Things work great.
Indeed, you should re-enable firmware flashing when updating.
There is currently a packaging issue with the addon that should be fixed in two releases.
I was debugging cpcd ( https://github.com/SiliconLabs/cpc-daemon/tree/main ), according to docs/troubleshooting.md. The errors were happening in https://github.com/SiliconLabs/cpc-daemon/blob/main/driver/driver_uart.c : the packets were not being received correctly, with the packet length being incorrect.
I was using GDB, so I don’t have a good tracing output, but I can try to re-run it all with the debug tracing in cpcd.
To anyone having issues after flashing, make sure your baudrate is set to 460800.
@puddly: I flashed it yesterday and it just crashed, unfortunately no improvement for me. I don’t have any GP devices, mostly IKEA and some Hue, Aqara, Ledvance and Tuya devices. Crashes 1-2 times per day.
EDIT: Thread is used for 2 Eve Energy devices.
This is not related. Afaik, Nanoleaf talks its own IP-based protocol with the bulbs, and that works with both add-ons.
By default, any Thread border router that receives a frame routes it through the (RF-only) mesh. Imagine you have two border routers, one close to a particular bulb and one far away. Without TREL, the packet travels through the RF mesh even when the receiving border router is far from the bulb. With TREL, the frame is forwarded to the closer router via Ethernet and only then goes through the RF network. With a single border router, and for smaller mesh networks, it doesn’t really matter. But it can make a difference for a large network.
@agners
Ok I’ll try switching to pure OTBR in few days and see how it goes.
Hmm, interesting. Isn’t that how the Nanoleaf Desktop app can control my Thread+Matter bulbs, or does this have nothing to do with it? I’m not on pure OTBR and yet the application works (not all the time, but sometimes).
@MattWestb
I can see them. Could be a temporary connectivity/GitHub issue, or maybe an ad blocker?
ChannelAccessFailure is the radio firmware reporting that it is refusing to send a packet: the channel is congested. This means that your environment is too noisy. Make sure your SkyConnect is away from interference sources such as SSDs, 2.4GHz WiFi, USB 3.0 ports, and so on.
@puddly Thanks! That appears to have changed the USB ID…
Is this what I should be seeing?
@satmandu You have one of the very rare batch of SkyConnects that don’t identify as a SkyConnect.
You can fix it by installing the SkyConnect CP2102N Programmer addon from the development repo (https://github.com/home-assistant/addons-development/) and running it. The SkyConnect should then be identified properly after you unplug it and plug it back in.
All current versions are affected, unfortunately, so there’s no version that you can roll back to. You can try out the firmware I have here but it doesn’t include all of the changes in the addon so it would be easiest to wait for a few days.
Sure.
https://gist.github.com/theblackhole/0bf08addc9ad30bdc431e34503cc7a12/raw/9d7739e1b88fcc7e57ec05f1da3860db7287e3c3/yellow-multipan-432-watchdog-nofastchannelswitching.gbl as the firmware URL (thanks @theblackhole)
You can then restart the Multiprotocol addon.
I’ve tried to do a deep dive with a debugger into the code, and it looks like something is wrong with the firmware on the stick itself. It sends incorrectly-sized packets.
For now, I suggest disabling the multiprotocol support and using SkyConnect just for ZigBee (ZHA). Sonoff stick is just $30, so buy it instead for Thread.
@puddly Got it! (And yes I installed the addon a long time ago) I flashed your firmware with the Silicon Labs Flasher addon and it works so far, now we wait and see… 👀
Btw, to anyone who wants (like me) to flash directly in the HA interface via the Silabs Flasher addon, I uploaded the 2 firmware files in a gist. This way you can use the “view raw” URL of your gbl file in the flasher addon config.
Ah. We changed the baudrate from 115200 to 460800 a few months ago and it looks like the option to “override” what’s in your addon config gets disabled if you disable automatic flashing. That’s why it stops working if you disable the option. You probably installed the addon a long time ago and had 115200 selected.
We’ll probably deprecate baudrate selection entirely in the future, as 460800 works fine and we have no plans on increasing it for now.
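The mismatch described above (an old 115200 setting versus the current 460800) is the kind of problem a simple probe loop can detect. A rough sketch of trying several rates in order follows. `try_rate` is a hypothetical callable standing in for whatever actually opens the serial port and checks for a valid reply, the function name is made up for illustration, and none of this is the addon's or the flasher's real code:

```python
from typing import Callable, Iterable, Optional


def find_working_baudrate(
    try_rate: Callable[[int], bool],
    candidates: Iterable[int] = (460800, 115200),
) -> Optional[int]:
    """Return the first baud rate for which try_rate reports success."""
    for rate in candidates:
        try:
            if try_rate(rate):
                return rate
        except OSError:
            # Treat I/O errors while probing as "this rate did not work".
            continue
    # No candidate produced a valid reply.
    return None
```

The default candidate order tries 460800 first because that is the rate the maintainers say the current firmware uses.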
@MattWestb Zigbee works without issues for almost 4 days now. I’ll try “OTBR only” next weekend.
Here is a build of 4.3.1 for the SkyConnect with the watchdog and without Zigbee Green Power support. Try it out and see if it still crashes for you: skyconnect_wd_nogp_4.3.1.gbl.zip
@MattWestb I will try raising the baudrate to see if that causes it to become more unstable then!
This can take months, unfortunately, and the one I reported may not be the only one. Any observations are helpful and you can help everyone significantly if you can help us track your bug down!
I’m going to try to compile a firmware over the weekend that omits a feature (Zigbee Green Power) responsible for the bug I noted. If you want to test that out, that would also be helpful!
Can you describe the Zigbee and Thread devices you have on your networks? Any Zigbee Green Power?
I’ve been able to replicate a crash and will try to get a firmware out that possibly mitigates it but there may be multiple concurrent bugs here causing issues.
Any progress made on the fix?
Seems reproducible: same symptoms this morning (crash loop cpcd, restarting container results in firmware flasher probe failures, replug SkyConnect fixes it).
I’ve just re-flashed the SkyConnect with your build again, just to ensure I didn’t make an error last time. Will comment if anything changes.
Just to ensure clarity: I had to un-plug and re-plug the SkyConnect. The new firmware didn’t seem to have an effect.
Whatever initially breaks with the dongle triggers the cpcd crash loop but leaves the container up. Subsequently restarting the whole container results in the firmware flasher attempting to probe the device at multiple baudrates, failing, and crashing the container (and triggering watchdog restart).
After re-plugging the dongle, the container comes up on the next watchdog restart.
I suspect if you tweak the cpcd finish script to crash the container we’d see 1 cpcd crash followed by many probe failures.
Perfect, thank you for the feedback. The watchdog isn’t integrated tightly into CPC so it looks like just the CPC part is crashing, not the whole firmware. I’ll post an updated one later next week.
Thanks!
The metadata will be identical, especially when probing. This also makes sure the addon doesn’t re-flash the bundled firmware.
Update:
The only way to restore Home Assistant to a functional state was to unplug the SkyConnect from its USB port, replug it, then reboot Unraid. Neither turning the HAOS VM off and on nor rebooting the host without the replug was enough.
FWIW, I have a PCIe USB hub card in the server; however, the SkyConnect is plugged into the motherboard’s USB 2.0 port rather than the hub, to try to rule out the hub as a cause.