core: Coordinator crash when adding a device

The problem

Issue

The coordinator crashes when a new device joins or rejoins the network.

ERROR (MainThread) [bellows.ezsp] NCP entered failed state. Requesting APP controller restart

Steps to reproduce the issue

Add a new device to the network or reapply mains power to it.

The controller fails immediately. It can take a varying amount of time before Home Assistant is able to reset the controller. When adding a new device, it can stay stuck indefinitely until it is reset. Restarting Home Assistant entirely usually helps.

Effect

Philips Hue lights start flashing in a synchronized pattern while the controller is in the failed state. IKEA devices behave similarly, but with a stroboscopic effect. The coordinator is not reachable, so it is not possible to control devices.


Coordinator

Generic Aliexpress EFR32MG21 : https://www.aliexpress.com/item/1005003578599189.html?spm=a2g0o.order_list.0.0.46f01802MxU13p

Hue light

What version of Home Assistant Core has the issue?

core-2022.11.2

What type of installation are you running?

Home Assistant Container

Integration causing the issue

ZHA

Link to integration documentation on our website

https://www.home-assistant.io/integrations/zha/

Diagnostics information

home-assistant.log. Here’s my network backup as well, since I wonder whether something is wrong with it: ZHA backup 2022-11-12T16-58-07.216Z.txt

Example YAML snippet

zha:
  custom_quirks_path: /config/quirks/
  zigpy_config:
    #source_routing: true
    ezsp_config:
      #CONFIG_SOURCE_ROUTE_TABLE_SIZE: 150 // Used to have source routing but disabling it does not change the behaviour.
      CONFIG_APS_ACK_TIMEOUT: 8000
      CONFIG_ADDRESS_TABLE_SIZE: 8
      CONFIG_APS_UNICAST_MESSAGE_COUNT: 12
    ota:
      ikea_provider: true
      otau_directory: /config/OTAU/

About this issue

  • State: closed
  • Created 2 years ago
  • Comments: 73 (49 by maintainers)

Most upvoted comments

So I re-paired all the routers on the network. At first, I could not pair anything except directly through the coordinator, but I think that’s fixed now: to make it work, I had to pair directly with the coordinator or with a device that had the new trust center link key.

I then ran some tests:

  • Unplugging the bulb.
  • Disconnecting the coordinator.
  • Pairing new devices.

All of which used to make the thermostats think there is a network conflict and make all my lights flash. I think it’s quite an annoying feature from Philips to make lights flash that horrible pattern upon network conflicts but oh well.

Interestingly, it suddenly stopped when the problematic Philips Hue bulb got its proper configuration. So the issue was resolved before I reset the thermostats. I’m not sure why, since I thought they were the issue.

I’m gonna hope it’s the good one. I will close this issue now in the hope I will not need to open it again.

To conclude, it appears the missing trust center link key was the issue here. I’m not sure why mine disappeared since I always had one.

Thanks Puddly, Adminiuga, MattWestb, TheJulianJES and Dmulcahey for the help provided in this issue here and on other platforms. It really helped! I hope you all know how your time is valued here.

I hope you all have a nice day! ☀️

I flashed the 6.10.3 firmware from https://github.com/xsp1989/zigbeeFirmware/blob/master/firmware/Zigbee3.0_Dongle-NoSigned/EZSP/ncp-uart-sw_v6.10.3_115200.gbl, then https://github.com/xsp1989/zigbeeFirmware/blob/master/firmware/Zigbee3.0_Dongle-NoSigned/EZSP/nvm3_initfile.gbl, but the coordinator was not responding correctly on the serial connection. I flashed them in the reverse order and it worked.

I still can’t add new devices, which is very strange. I also confirmed the coordinator still crashes: home-assistant_after_erasing_NVM+crash.log

I did indeed come from an EM358X (HUSBZB-1), which I got rid of a while ago. My system is a full-fledged server, so I’d be surprised if performance issues came from there.

It could be that the Philips bulb joined while you were running the other coordinator and cached the wrong TC link key. Factory reset that Philips bulb and any other bulb/device that uses the wrong IEEE for the coordinator and causes address conflicts.

Possibly completely unrelated but bellows apparently had an issue with APS link key generation when restoring from a backup: https://github.com/zigpy/bellows/pull/518/commits/b7a3925c23b4db1c52b09fccccd6888efef8ed0c Not sure if the frame counter(s) would make a difference though(?)

Do Philips bulbs flash like that when there is an address conflict?

When there’s an address conflict, the conflicting device would leave and re-join the network -> this would cause the light to blink because of the identify command.

Then I can see an insane number of network conflicts coming from my 4 Sinope thermostats, and it goes very fast. I’m wondering if that’s what crashes my coordinator:

The “nwk conflict” is a network broadcast, and it is possible it is causing a crash, because you have reported quite a high setting for the broadcast table, unless the CONFIG_BROADCAST_TABLE_SIZE: 15 setting was accepted successfully.

Before you get the 1st address conflict, what is the command which triggers it? Usually it is one very close to the 1st address conflict messages in the batch.
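One way to look for the triggering command is to pull the few log lines that precede the first conflict message. The snippet below is only a minimal sketch: the helper name `context_before_first_match` and the sample lines are hypothetical, and the search string has to be adapted to whatever the real debug log actually prints.

```python
from collections import deque
from typing import Iterable, List, Tuple


def context_before_first_match(
    lines: Iterable[str], needle: str, context: int = 5
) -> Tuple[List[str], str]:
    """Return the `context` lines before the first line containing `needle`,
    plus the matching line itself. Returns ([], "") when nothing matches."""
    recent = deque(maxlen=context)  # rolling window of the latest lines
    for line in lines:
        if needle.lower() in line.lower():
            return list(recent), line
        recent.append(line)
    return [], ""


if __name__ == "__main__":
    # Hypothetical excerpt; real log entries will look different.
    sample = [
        "frame A sent",
        "frame B sent",
        "frame C sent",
        "nwk conflict reported",
        "nwk conflict reported",
    ]
    before, hit = context_before_first_match(sample, "conflict", context=2)
    print(before, hit)
```

With a real log, the same function can be fed an open file object instead of a list.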

What Address is reported to be conflicting? I see two scenarios for the address conflict:

  1. either the device(s) reporting the conflict have wrong information about IEEE<>NWK addr mapping when it hears a request
  2. the conflicting device is indeed using a conflicting NWK address. Check the packet capture if there was any other traffic from the same NWK
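To tell the two scenarios apart from a packet capture, it can help to list every NWK address that shows up with more than one IEEE address. This is a rough sketch that assumes the (NWK, IEEE) pairs have already been extracted from the capture by some other means; the function name and the addresses in the example are made up for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple


def find_nwk_conflicts(
    observations: Iterable[Tuple[int, str]]
) -> Dict[int, Set[str]]:
    """Map each NWK address seen with more than one IEEE to those IEEEs."""
    seen = defaultdict(set)
    for nwk, ieee in observations:
        seen[nwk].add(ieee.lower())  # compare IEEE strings case-insensitively
    return {nwk: ieees for nwk, ieees in seen.items() if len(ieees) > 1}


if __name__ == "__main__":
    # Hypothetical observations, not taken from the real capture.
    pairs = [
        (0x1A2B, "00:11:22:33:44:55:66:77"),
        (0x1A2B, "00:11:22:33:44:55:66:88"),
        (0x3C4D, "00:11:22:33:44:55:66:99"),
    ]
    for nwk, ieees in find_nwk_conflicts(pairs).items():
        print(f"0x{nwk:04X} is used by {sorted(ieees)}")
```

A non-empty result points towards scenario 2. An empty result makes scenario 1, stale mapping information on the reporting device, more likely, although it does not prove it.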

So the problem is not the coordinator; it is the network, which can’t route traffic. Pairing through routers then fails because unicast is not working, so the coordinator can’t talk to the new device.

It does. 😞 I removed the power from the newer bulb and re-applied it and the coordinator crashed : NCP entered failed state. Requesting APP controller restart

ControllerApplication reset unsuccessful: TimeoutError()
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/asyncio/tasks.py", line 456, in wait_for
    return fut.result()
asyncio.exceptions.CancelledError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/bellows/zigbee/application.py", line 643, in _reset_controller_loop
    await self._reset_controller()
  File "/usr/local/lib/python3.10/site-packages/bellows/zigbee/application.py", line 665, in _reset_controller
    await self.initialize()
  File "/usr/local/lib/python3.10/site-packages/zigpy/application.py", line 76, in initialize
    await self.load_network_info(load_devices=False)
  File "/usr/local/lib/python3.10/site-packages/bellows/zigbee/application.py", line 257, in load_network_info
    brd_manuf, brd_name, version = await self._get_board_info()
  File "/usr/local/lib/python3.10/site-packages/bellows/zigbee/application.py", line 117, in _get_board_info
    return await self._ezsp.get_board_info()
  File "/usr/local/lib/python3.10/site-packages/bellows/ezsp/__init__.py", line 299, in get_board_info
    (value,) = await self.getMfgToken(token)
  File "/usr/local/lib/python3.10/asyncio/tasks.py", line 458, in wait_for
    raise exceptions.TimeoutError() from exc
asyncio.exceptions.TimeoutError
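For anyone reading the traceback: the two `wait_for` frames show a request that never got a reply being cancelled and turned into a `TimeoutError`. The snippet below is not bellows code, only a minimal sketch of that asyncio pattern, with a hypothetical `never_answers` coroutine standing in for the unanswered request.

```python
import asyncio


async def never_answers() -> None:
    """Hypothetical stand-in for a request whose reply never arrives."""
    await asyncio.Event().wait()  # this event is never set


async def main() -> str:
    try:
        # wait_for cancels the inner awaitable once the timeout expires.
        await asyncio.wait_for(never_answers(), timeout=0.05)
    except asyncio.TimeoutError:
        return "timed out"
    return "answered"


if __name__ == "__main__":
    print(asyncio.run(main()))
```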

home-assistant_crash_2022_11_20.log

There is no crash when I do the same for older devices of the network.

I also can’t add new devices to the coordinator since the restore with zigpy-cli.

@MattWestb : I’ll try erasing the NVM portion tomorrow as you recommended. You say it is used to store keys?

Thanks. According to the restore, one was written:

... stack_specific={'ezsp': {'hashed_tclk': 'a2473867c61c6c4e43e764b18dc95164'}} ...

Can you do another backup to confirm?
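A quick way to check a fresh backup is to look for the same nested key. This is a small sketch that assumes the backup file is JSON and keeps the `stack_specific` / `ezsp` / `hashed_tclk` nesting shown in the restore output above; the helper name is hypothetical.

```python
import json
from typing import Optional


def hashed_tclk_from_backup(backup_text: str) -> Optional[str]:
    """Return the hashed TC link key from a backup, or None if it is absent."""
    backup = json.loads(backup_text)
    # Assumed layout: {"stack_specific": {"ezsp": {"hashed_tclk": "..."}}}
    stack_specific = backup.get("stack_specific") or {}
    ezsp = stack_specific.get("ezsp") or {}
    return ezsp.get("hashed_tclk")


if __name__ == "__main__":
    example = json.dumps(
        {"stack_specific": {"ezsp": {"hashed_tclk": "a2473867c61c6c4e43e764b18dc95164"}}}
    )
    print(hashed_tclk_from_backup(example))
    print(hashed_tclk_from_backup("{}"))  # key missing
```

If the function returns None for the new backup, the key was not stored on the coordinator.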

You can retroactively apply hashed link key settings. I’ve done it. I’ll dig the commands up later

Does the issue only happen with that particular bulb? Although it very likely isn’t the issue, were you ever able to pair it via Bluetooth to your phone? Firmware version 1.76.11 is almost two years old and there should be newer firmware available. I wasn’t able to find the firmware for your bulb, as the image_type_id : 65535 you mentioned doesn’t exist. But you should be able to update it through the Philips Hue app using your phone.