addons: Silicon Labs Multiprotocol addon failing intermittently causing HAP, Thread and Zigbee issues

Describe the issue you are experiencing

Every few days to a week, the Silicon Labs Multiprotocol addon stops communicating and reports ‘resource temporarily unavailable’, without saying which resource is unavailable. When the addon stops, it breaks my light integration (Nanoleaf via Thread using HAP) and my Zigbee sensors via ZHA.

Restarting the addon doesn’t fix the issue, nor does restarting Home Assistant. It usually requires a complete reboot of the host, and even then it will sometimes repeatedly report ‘resource temporarily unavailable’.

What type of installation are you running?

Home Assistant OS

Which operating system are you running on?

Home Assistant Operating System

Which add-on are you reporting an issue with?

Silicon Labs Multiprotocol

What is the version of the add-on?

2.3.2

Steps to reproduce the issue

Wish I knew; it fails at arbitrary times (sometimes at 3am, sometimes at 5pm).

System Health information

System Information

version core-2023.8.4
installation_type Home Assistant OS
dev false
hassio true
docker true
user root
virtualenv false
python_version 3.11.4
os_name Linux
os_version 6.1.37
arch x86_64
timezone Australia/Adelaide
config_dir /config
Home Assistant Community Store
GitHub API ok
GitHub Content ok
GitHub Web ok
GitHub API Calls Remaining 5000
Installed Version 1.32.1
Stage running
Available Repositories 1335
Downloaded Repositories 37
AccuWeather
can_reach_server ok
remaining_requests 14
Home Assistant Cloud
logged_in false
can_reach_cert_server ok
can_reach_cloud_auth ok
can_reach_cloud failed to load: unreachable
Home Assistant Supervisor
host_os Home Assistant OS 11.0.dev20230705
update_channel beta
supervisor_version supervisor-2023.08.3
agent_version 1.5.1
docker_version 23.0.6
disk_total 30.8 GB
disk_used 23.3 GB
healthy true
supported true
board ova
supervisor_api ok
version_api ok
installed_addons Samba share (10.0.2), Network UPS Tools (0.12.0), Matter Server (4.9.0), Silicon Labs Multiprotocol (2.3.2), Mosquitto broker (6.2.1), Zigbee2MQTT (1.32.2-1), Custom deps deployment (1.3.3), Home Assistant Google Drive Backup (0.111.1), ESPHome (2023.8.2), Terminal & SSH (9.7.1), Studio Code Server (5.10.1), PS5 MQTT (1.3.1), Whisper (1.0.0), Piper (1.3.2), Advanced SSH & Web Terminal (15.0.7), Silicon Labs Flasher (0.2.0)
Dashboards
dashboards 3
resources 24
views 11
mode storage
Recorder
oldest_recorder_run 16 August 2023 at 08:32
current_recorder_run 25 August 2023 at 20:34
estimated_db_size 933.92 MiB
database_engine sqlite
database_version 3.41.2

Anything in the Supervisor logs that might be useful for us?

Nothing relating to these issues.

Anything in the add-on logs that might be useful for us?

Logger: bellows.zigbee.application
Source: /usr/local/lib/python3.11/site-packages/bellows/zigbee/application.py:643
First occurred: 10:49:25 (4040 occurrences)
Last logged: 17:00:02

ControllerApplication reset unsuccessful: ConnectionRefusedError(111, "Connect call failed ('172.30.32.1', 9999)")
Traceback (most recent call last):
  File "/usr/local/lib/python3.11/site-packages/bellows/zigbee/application.py", line 640, in _reset_controller_loop
    await self._reset_controller()
  File "/usr/local/lib/python3.11/site-packages/bellows/zigbee/application.py", line 661, in _reset_controller
    await self.connect()
  File "/usr/local/lib/python3.11/site-packages/bellows/zigbee/application.py", line 133, in connect
    self._ezsp = await bellows.ezsp.EZSP.initialize(self.config)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/bellows/ezsp/__init__.py", line 164, in initialize
    await ezsp.connect(use_thread=zigpy_config[conf.CONF_USE_THREAD])
  File "/usr/local/lib/python3.11/site-packages/bellows/ezsp/__init__.py", line 181, in connect
    self._gw = await bellows.uart.connect(self._config, self, use_thread=use_thread)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/bellows/uart.py", line 414, in connect
    protocol, _ = await _connect(config, application)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/bellows/uart.py", line 385, in _connect
    transport, protocol = await zigpy.serial.create_serial_connection(
                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/zigpy/serial.py", line 31, in create_serial_connection
    transport, protocol = await loop.create_connection(
                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/asyncio/base_events.py", line 1085, in create_connection
    raise exceptions[0]
  File "/usr/local/lib/python3.11/asyncio/base_events.py", line 1069, in create_connection
    sock = await self._connect_sock(
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/asyncio/base_events.py", line 973, in _connect_sock
    await self.sock_connect(sock, address)
  File "/usr/local/lib/python3.11/asyncio/selector_events.py", line 628, in sock_connect
    return await fut
           ^^^^^^^^^
  File "/usr/local/lib/python3.11/asyncio/selector_events.py", line 668, in _sock_connect_cb
    raise OSError(err, f'Connect call failed {address}')
ConnectionRefusedError: [Errno 111] Connect call failed ('172.30.32.1', 9999)
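
For anyone wanting to check the same thing outside of ZHA: the traceback above boils down to a plain TCP connect to 172.30.32.1:9999 being refused. A minimal sketch of that check, using only the Python standard library, is below. The address and port are taken from the traceback; the function name `probe_tcp` and the three result strings are made up for illustration, and that this endpoint is the addon's Zigbee socket is an assumption.

```python
import socket


def probe_tcp(host: str, port: int, timeout: float = 3.0) -> str:
    """Classify a TCP endpoint as 'open', 'refused' or 'unreachable'.

    'refused' corresponds to the ConnectionRefusedError (errno 111) in the
    traceback: the host answered, but nothing is listening on that port.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        return "refused"
    except OSError:
        # Timeouts, "no route to host", DNS failures and similar errors.
        return "unreachable"


# Example call (address and port copied from the traceback):
#   probe_tcp("172.30.32.1", 9999)
```

A result of "refused" while the addon claims to be running would suggest the process behind the socket has died, rather than a network problem between the containers.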

Additional information

HAOS is hosted in a VM on my Unraid server. My Unraid server is still able to see and interact with the USB device(s) when HAOS fails to. For context:

image

I have the SkyConnect USB as well as a “Sonoff Zigbee 3.0 USB Dongle Plus V2” (model “ZBDongle-E”). I mainly use the SkyConnect for everything, and the Sonoff is a recent purchase. Both are flashed with the latest version of the MultiPAN firmware.

I have tried both the stable and beta version of HAOS, no change to the outcome. Sometimes it works for several days, sometimes it fails > 5 times a day.

About this issue

  • State: open
  • Created 10 months ago
  • Reactions: 5
  • Comments: 114 (4 by maintainers)

Most upvoted comments

I have two new test firmwares based on the latest GSDK v4.3.2. Let me know how they work for you.

These recover automatically after a crash, as does the addon, which should eliminate the need to unplug your SkyConnect in the future if something breaks. They also should hopefully fix the underlying crashing issue with OpenThread.

You can flash the SkyConnect from the firmware update page (https://skyconnect.home-assistant.io/firmware-update/) with a compatible browser. For the Yellow, you will have to use a console flasher.

Make sure you disable automatic firmware installation in the multiprotocol addon and save! Otherwise, it will reinstall the old version automatically.

4.3.2 crashes so frequently (i.e. within a minute) that it is completely unusable, unfortunately.

The newly introduced bug seems unrelated so we will have to wait until 4.3.3 is released, hopefully with a fix for both.

Yes, I have auto flashing on, so I’m no longer on the custom firmware but just the add-on firmware. Things work great.

Indeed, you should re-enable firmware flashing when updating.

There is currently a packaging issue with the addon that should be fixed in two releases.

Could you be more specific?

I was debugging cpcd ( https://github.com/SiliconLabs/cpc-daemon/tree/main ), according to docs/troubleshooting.md.

The errors were happening in https://github.com/SiliconLabs/cpc-daemon/blob/main/driver/driver_uart.c , the packets were not being received correctly, with the packet length being incorrect.

I was using GDB, so I don’t have a good tracing output, but I can try to re-run it all with the debug tracing in cpcd.

To anyone having issues after flashing, make sure your baudrate is set to 460800.

Here is a build of 4.3.1 for the SkyConnect with the watchdog and without Zigbee Green Power support.

@puddly: I flashed it yesterday and it just crashed, unfortunately no improvement for me. I don’t have any GP devices, mostly IKEA and some Hue, Aqara, Ledvance and Tuya devices. Crashes 1-2 times per day.

EDIT: Thread is used for 2 Eve Energy devices.

Hum, interesting. Isn’t that how the Nanoleaf Desktop app can control my Thread+Matter bulbs, or does this have nothing to do with it? Because I’m not on pure OTBR and yet the application works (not all the time, but sometimes).

This is not related. Afaik, the Nanoleaf app talks its own IP-based protocol with the bulbs. And that works with both add-ons.

By default, any Thread border router which gets a frame routes it through the (RF-only) mesh. Imagine you have two border routers, one close to a particular bulb and one far away from it. Without TREL, a packet arriving at the far router travels through the RF mesh the whole way. With TREL, the frame gets forwarded to the closer router via Ethernet, and only then goes through the RF network. If you have a single border router, or a smaller mesh network, it doesn’t really matter. But it can make a difference for large networks.
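
To make the two-border-router example concrete, here is a toy model in Python. It is only a sketch under loud assumptions: the node names, the mesh layout and the "RF link costs 1, Ethernet link costs 0" weighting are all invented for illustration, and real Thread routing does not work by running a shortest-path search like this. It just counts how many RF hops a frame needs with and without an Ethernet shortcut between the border routers.

```python
import heapq


def rf_hops(links, source, target):
    """Return the fewest RF hops from source to target, or None if unreachable.

    Each link is (node_a, node_b, cost): cost 1 for an RF hop,
    cost 0 for a hop over Ethernet (the TREL shortcut in this toy model).
    """
    graph = {}
    for a, b, cost in links:
        graph.setdefault(a, []).append((b, cost))
        graph.setdefault(b, []).append((a, cost))

    best = {source: 0}
    queue = [(0, source)]
    while queue:
        dist, node = heapq.heappop(queue)
        if node == target:
            return dist
        if dist > best.get(node, float("inf")):
            continue  # stale queue entry
        for neighbour, cost in graph.get(node, []):
            candidate = dist + cost
            if candidate < best.get(neighbour, float("inf")):
                best[neighbour] = candidate
                heapq.heappush(queue, (candidate, neighbour))
    return None


# Hypothetical layout: the bulb sits next to br_near, three RF hops from br_far.
RF_MESH = [
    ("br_far", "r1", 1),
    ("r1", "r2", 1),
    ("r2", "br_near", 1),
    ("br_near", "bulb", 1),
]
TREL_LINK = [("br_far", "br_near", 0)]  # border routers share an Ethernet network

without_trel = rf_hops(RF_MESH, "br_far", "bulb")
with_trel = rf_hops(RF_MESH + TREL_LINK, "br_far", "bulb")
print(without_trel, with_trel)
```

In this made-up layout the frame needs 4 RF hops without the shortcut and 1 with it, which is the "lower the network load on the mesh" effect in miniature.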

@agners

Especially since you don’t have any Zigbee devices, I can really recommend switching to the pure/dedicated firmware.

From a Multiprotocol standpoint, it would also be interesting to see if you have the same errors on the pure OTBR firmware (if so, then it is probably more related to your RF environment).

Ok, I’ll try switching to pure OTBR in a few days and see how it goes.

Btw, the pure OTBR has TREL enabled as well. TREL allows Thread border routers to pass Thread frames through WiFi/Ethernet, and hence lower the network load on the mesh. One of the main selling points of Thread IMHO 🤩

Hum, interesting. Isn’t that how the Nanoleaf Desktop app can control my Thread+Matter bulbs, or does this have nothing to do with it? Because I’m not on pure OTBR and yet the application works (not all the time, but sometimes).

@MattWestb

@agners Your screenshot is private so I can’t look at it 😦

I can see them. Could be a temporary connectivity/github issue or an ad blocker maybe ?

ChannelAccessFailure is the radio firmware reporting that it is refusing to send a packet: the channel is congested. This means that your environment is too noisy. Make sure your SkyConnect is away from interference sources such as SSDs, 2.4GHz WiFi, USB 3.0 ports, and so on.
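
As a rough way to judge the 2.4GHz WiFi part of that advice, the sketch below checks which WiFi channels overlap a given 802.15.4 channel (Zigbee and Thread both use channels 11 to 26). The centre-frequency formulas are the standard ones; the 22 MHz WiFi width and 2 MHz 802.15.4 width used for the overlap test are simplifying assumptions, and the function names are made up for illustration.

```python
def zigbee_center_mhz(channel: int) -> int:
    """Centre frequency of an 802.15.4 channel (11-26) in the 2.4 GHz band."""
    if not 11 <= channel <= 26:
        raise ValueError("802.15.4 2.4 GHz channels run from 11 to 26")
    return 2405 + 5 * (channel - 11)


def wifi_center_mhz(channel: int) -> int:
    """Centre frequency of a 2.4 GHz WiFi channel (1-13)."""
    if not 1 <= channel <= 13:
        raise ValueError("this sketch only covers WiFi channels 1 to 13")
    return 2407 + 5 * channel


def overlapping_wifi_channels(
    zigbee_channel: int,
    wifi_channels=(1, 6, 11),
    wifi_width_mhz: float = 22.0,   # assumption: classic 22 MHz channel mask
    zigbee_width_mhz: float = 2.0,  # assumption: ~2 MHz occupied bandwidth
) -> list:
    """List the WiFi channels whose band overlaps the given 802.15.4 channel."""
    limit = (wifi_width_mhz + zigbee_width_mhz) / 2
    z = zigbee_center_mhz(zigbee_channel)
    return [w for w in wifi_channels if abs(z - wifi_center_mhz(w)) < limit]


for ch in (11, 15, 20, 25):
    print(ch, overlapping_wifi_channels(ch))
```

Under these assumptions, channels such as 15, 20 and 25 fall in the gaps between WiFi channels 1, 6 and 11, while channel 11 sits inside WiFi channel 1. Real interference also depends on signal strength and distance, which this does not model.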

@puddly Thanks! That appears to have changed the USB ID…

s6-rc: info: service universal-silabs-flasher: starting
[20:20:18] INFO: Checking /dev/ttyUSB0 identifying SkyConnect v1.0 from Nabu Casa.
[20:20:18] INFO: Starting universal-silabs-flasher with /dev/ttyUSB0
2024-01-30 20:20:19.071 macmini71 universal_silabs_flasher.flash INFO Extracted GBL metadata: NabuCasaMetadata(metadata_version=1, sdk_version='4.3.1', ezsp_version='7.3.1.0', ot_rcp_version='SL-OPENTHREAD/2.3.1.0_GitHub-e6df00dd6' (2.3.1.0), cpc_version='4.3.1-4f7f9e99-dirty-de58d93e' (4.3.1), fw_type=<FirmwareImageType.RCP_UART_802154: 'rcp-uart-802154'>, baudrate=460800)
2024-01-30 20:20:19.071 macmini71 universal_silabs_flasher.flasher INFO Probing ApplicationType.GECKO_BOOTLOADER at 115200 baud
2024-01-30 20:20:21.077 macmini71 universal_silabs_flasher.flasher INFO Probing ApplicationType.CPC at 460800 baud
2024-01-30 20:20:21.092 macmini71 universal_silabs_flasher.flasher INFO Detected ApplicationType.CPC, version '4.3.2' at 460800 baudrate (bootloader baudrate None)
2024-01-30 20:20:21.092 macmini71 universal_silabs_flasher.flash INFO Firmware version '4.3.1-4f7f9e99-dirty-de58d93e' (4.3.1) does not match expected version '4.3.2'
2024-01-30 20:20:21.608 macmini71 universal_silabs_flasher.flasher INFO Probing ApplicationType.GECKO_BOOTLOADER at 115200 baud
2024-01-30 20:20:22.626 macmini71 universal_silabs_flasher.flasher INFO Detected bootloader version '2.1.1'
2024-01-30 20:20:22.626 macmini71 universal_silabs_flasher.flasher INFO Detected ApplicationType.GECKO_BOOTLOADER, version '2.1.1' at 115200 baudrate (bootloader baudrate 115200)
NabuCasa_SkyConnect_RCP_v4.3.1_rcp-uart-hw-802154_460800.gbl
s6-rc: info: service universal-silabs-flasher successfully started
s6-rc: info: service cpcd-config: starting
[20:20:43] INFO: Using known baudrate of 460800 for cpcd!

Is this what I should be seeing?
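
When comparing logs like the one above, it can help to pull out just the lines where the flasher reports what it detected. The following is a small sketch that does this with a regular expression written against the two "Detected ApplicationType..." lines in this log. It is not part of universal-silabs-flasher, the names `Detection` and `find_detections` are hypothetical, and the pattern may not match other versions of the tool.

```python
import re
from typing import NamedTuple

# Pattern derived from the log lines quoted in this thread (an assumption
# about the log format, not a documented interface).
DETECTED = re.compile(
    r"Detected ApplicationType\.(\w+), version '([^']+)' at (\d+) baudrate"
)


class Detection(NamedTuple):
    app_type: str
    version: str
    baudrate: int


def find_detections(log_text: str) -> list:
    """Return every (application type, version, baudrate) the flasher reported."""
    return [
        Detection(m.group(1), m.group(2), int(m.group(3)))
        for m in DETECTED.finditer(log_text)
    ]


SAMPLE_LOG = """\
2024-01-30 20:20:21.092 macmini71 universal_silabs_flasher.flasher INFO Detected ApplicationType.CPC, version '4.3.2' at 460800 baudrate (bootloader baudrate None)
2024-01-30 20:20:22.626 macmini71 universal_silabs_flasher.flasher INFO Detected bootloader version '2.1.1'
2024-01-30 20:20:22.626 macmini71 universal_silabs_flasher.flasher INFO Detected ApplicationType.GECKO_BOOTLOADER, version '2.1.1' at 115200 baudrate (bootloader baudrate 115200)
"""

for detection in find_detections(SAMPLE_LOG):
    print(detection)
```

For the sample lines this yields the CPC firmware at 460800 baud followed by the Gecko bootloader at 115200 baud, which matches the probing order shown in the log.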

@satmandu You have a SkyConnect from the very rare batch that doesn’t identify as a SkyConnect.

You can fix it by installing the SkyConnect CP2102N Programmer addon from the development repo (https://github.com/home-assistant/addons-development/) and running it. The SkyConnect should then be identified properly after you unplug it and plug it back in.

All current versions are affected, unfortunately, so there’s no version that you can roll back to. You can try out the firmware I have here but it doesn’t include all of the changes in the addon so it would be easiest to wait for a few days.

Sure.

  1. Stop the Silicon Labs Multiprotocol addon.
  2. In the addon configuration, disable automatic firmware flashing and click “Save”.
  3. Install the “Silicon Labs Flasher” addon.
  4. Use https://gist.github.com/theblackhole/0bf08addc9ad30bdc431e34503cc7a12/raw/9d7739e1b88fcc7e57ec05f1da3860db7287e3c3/yellow-multipan-432-watchdog-nofastchannelswitching.gbl as the firmware URL (thanks @theblackhole)
  5. Flash the new firmware.

You can then restart the Multiprotocol addon.

I’ve tried to do a deep dive with a debugger into the code, and it looks like something is wrong with the firmware on the stick itself. It sends incorrectly-sized packets.

For now, I suggest disabling the multiprotocol support and using the SkyConnect just for Zigbee (ZHA). A Sonoff stick is just $30, so buy one for Thread instead.

@puddly Got it! (And yes I installed the addon a long time ago) I flashed your firmware with the Silicon Labs Flasher addon and it works so far, now we wait and see… 👀

Btw, for anyone who wants (like me) to flash directly in the HA interface via the Silabs Flasher addon, I uploaded the 2 firmware files to a gist. This way you can use the “view raw” URL of your gbl file in the flasher addon config.


Ah. We changed the baudrate from 115200 to 460800 a few months ago and it looks like the option to “override” what’s in your addon config gets disabled if you disable automatic flashing. That’s why it stops working if you disable the option. You probably installed the addon a long time ago and had 115200 selected.

We’ll probably deprecate baudrate selection entirely in the future, as 460800 works fine and we have no plans on increasing it for now.

@MattWestb Zigbee works without issues for almost 4 days now. I’ll try “OTBR only” next weekend.

Here is a build of 4.3.1 for the SkyConnect with the watchdog and without Zigbee Green Power support. Try it out and see if it still crashes for you: skyconnect_wd_nogp_4.3.1.gbl.zip

@MattWestb I will try raising the baudrate to see if that causes it to become more unstable then!

I think I’ll wait a bit now to see how Silicon Labs reacts to the replicated bug, and try to find a trigger.

This can take months, unfortunately, and the one I reported may not be the only one. Any observations are helpful and you can help everyone significantly if you can help us track your bug down!

I’m going to try to compile a firmware over the weekend that omits a feature (Zigbee Green Power) responsible for the bug I noted. If you want to test that out, that would also be helpful!

Can you describe the Zigbee and Thread devices you have on your networks? Any Zigbee Green Power?

I’ve been able to replicate a crash and will try to get a firmware out that possibly mitigates it but there may be multiple concurrent bugs here causing issues.

Any progress made on the fix?

Seems reproducible: same symptoms this morning (crash loop cpcd, restarting container results in firmware flasher probe failures, replug SkyConnect fixes it).

I’ve just re-flashed the SkyConnect with your build again, just to ensure I didn’t make an error last time. Will comment if anything changes.

Just to ensure clarity: I had to un-plug and re-plug the SkyConnect. The new firmware didn’t seem to have an effect.

Whatever initially breaks with the dongle triggers the cpcd crash loop but leaves the container up. Subsequently restarting the whole container results in the firmware flasher attempting to probe the device at multiple baudrates, failing, and crashing the container (and triggering watchdog restart).

After re-plugging the dongle, the container comes up on the next watchdog restart.

I suspect if you tweak the cpcd finish script to crash the container we’d see 1 cpcd crash followed by many probe failures.

Perfect, thank you for the feedback. The watchdog isn’t integrated tightly into CPC so it looks like just the CPC part is crashing, not the whole firmware. I’ll post an updated one later next week.

Thanks!

The metadata will be identical, especially when probing. This also makes sure the addon doesn’t re-flash the bundled firmware.

Update:

The only way to restore Home Assistant to a functional state was to unplug the SkyConnect from its USB port, replug it, then reboot Unraid. Neither turning the HAOS VM off and on nor rebooting the host without the replug was enough.

FWIW, I have a PCIe USB hub card in the server; however, the SkyConnect is plugged into one of the motherboard’s USB 2.0 ports rather than the hub, to try to rule out the hub as a cause.