connectedhomeip: [BUG] DNS-SD operational discovery timeouts in network with many services?

Reproduction steps

Matter Setup

  • AppleTV as controller, Ethernet connection to LAN
  • matter bridge app on Linux as accessory, Ethernet-only connection to LAN

Summary

The bug: this setup works reliably in a LAN with ~50 published DNS-SD services in total (all types, not just Matter), but fails completely in a LAN with ~120 published DNS-SD services.

Step 1: Commissioning in the LAN with ~120 DNS-SD services:

  • AppleTV and bridge app connected to LAN

  • Commission the bridge via an iPhone/iPad in the same LAN, everything works

  • Bridge/Devices appear in Home.app

  • HomeKit passes operation to the AppleTV; the Matter stack on the AppleTV takes control and tries to re-discover the bridge via DNS-SD, which fails 100% of the time:

    Discovery	error	19:47:45.090502+0200	homed	Mdns: Resolve failure (kDNSServiceErr_Timeout)
    Discovery	error	19:47:45.090661+0200	homed	OperationalSessionSetup[1:0000000042169F10]: operational discovery failed: ../../../../../../../../Sources/CHIPFramework/connectedhomeip/src/lib/address_resolve/AddressResolve_DefaultImpl.cpp:96: CHIP Error 0x00000032: Timeout
    DataManagement	error	19:47:45.090863+0200	homed	Failed to establish CASE for re-subscription with error '../../../../../../../../Sources/CHIPFramework/connectedhomeip/src/lib/address_resolve/AddressResolve_DefaultImpl.cpp:96: CHIP Error 0x00000032: Timeout'
    

    which then leads to

    hmmtr.accessory.server	error	19:47:45.431924+0200	homed	[975063280/1] Bridged accessory for endpoint 3 is unreachable
    hmmtr.accessory.server	error	19:47:45.432064+0200	homed	[975063280/1] Read cache operation aborted for characteristic since the bridged accessory is unreachable: <HAPCharacteristic>
    

    which means all bridged devices immediately show the status “not responding” and never become operational.
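To compare failing and working runs, the homed log lines quoted in Step 1 can be split into fields and scanned for CHIP error codes. The following is a minimal sketch, assuming the five-column layout seen in the excerpts (category, level, timestamp, process, message); the function name `parse_log_line` is hypothetical and not part of any SDK or Apple tool.

```python
import re

# Matches error codes as they appear in the excerpts, e.g. "CHIP Error 0x00000032: Timeout".
CHIP_ERROR = re.compile(r"CHIP Error (0x[0-9A-Fa-f]{8}): (\w+)")


def parse_log_line(line):
    """Split one log line into its fields; return None if it does not fit the layout.

    Assumed layout (taken from the excerpts above, not from a documented format):
    <category> <level> <timestamp> <process> <message>
    """
    parts = re.split(r"\s+", line.strip(), maxsplit=4)
    if len(parts) != 5:
        return None
    category, level, timestamp, process, message = parts
    match = CHIP_ERROR.search(message)
    return {
        "category": category,
        "level": level,
        "timestamp": timestamp,
        "process": process,
        "message": message,
        # Both fields are None when the message carries no CHIP error code.
        "chip_error": match.group(1) if match else None,
        "chip_error_name": match.group(2) if match else None,
    }
```

Running this over a saved log from each LAN and filtering on `chip_error == "0x00000032"` would show whether the timeouts appear only in the larger LAN.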

Step 2: Move the setup to LAN with fewer DNS-SD services (~50)

  • Change nothing in the setup, not even a power cycle; just move the Ethernet connections to a smaller test LAN with ~50 DNS-SD services.
  • Wait a little.
  • The bridged devices become fully operational and remain so.

Step 3…n: Switch back and forth

  • Just by re-plugging the Ethernet cables to switch LANs, the setup can be made to work and fail, back and forth.

Bug prevalence

100% reproducible

GitHub hash of the SDK that was being used

9e371e6f191c15ac446e0cc2270a420458018cda

Platform

darwin

Platform Version(s)

16.3, 16.4, 16.5 beta

Anything else?

Notes:

  • The relevant _matter._tcp services are reliably visible in both the ~120 and the ~50 service LANs when observed with avahi-browse -a -t, both from the Linux-based bridge itself and from another Linux-based system in the same LAN.
  • The relevant _matter._tcp service is also reliably visible in both LANs when observed with Lily Ballard’s Discovery app, a native macOS app (that is, using the macOS native DNS-SD).
  • Accessory/bridge side tested with 9e371e6f191c15ac446e0cc2270a420458018cda and v1.0.0
  • Controller (AppleTV) side tested with tvOS 16.3, 16.4 and 16.5 beta 1.
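The service counts (~50 vs. ~120) can be checked from the Linux side by counting the lines of `avahi-browse -a -t -p -k` output (`-p` for parsable output, `-k` to keep raw service types). The sketch below assumes the semicolon-separated parsable format `+;interface;protocol;name;type;domain` and that service names contain no unescaped semicolons; `count_services` and the sample data are hypothetical and not taken from the LANs described here.

```python
from collections import Counter


def count_services(parsable_output):
    """Count distinct service instances per type from `avahi-browse -p` output."""
    seen = set()
    for line in parsable_output.splitlines():
        fields = line.split(";")
        # Only "+" lines announce a service: +;interface;protocol;name;type;domain
        if len(fields) < 6 or fields[0] != "+":
            continue
        name, service_type = fields[3], fields[4]
        # The same instance is listed once per interface and per IPv4/IPv6,
        # so de-duplicate on (name, type).
        seen.add((name, service_type))
    return Counter(service_type for _, service_type in seen)


# Hypothetical sample, for illustration only.
sample = """+;eth0;IPv4;bridge-A;_matter._tcp;local
+;eth0;IPv6;bridge-A;_matter._tcp;local
+;eth0;IPv4;printer;_ipp._tcp;local
"""
counts = count_services(sample)
print(sum(counts.values()), counts["_matter._tcp"])  # prints "2 1"
```

De-duplicating on name and type keeps the total comparable between LANs even if one of them announces services on more interfaces or address families.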

Conclusion:

  • Since avahi and the Discovery app always see and resolve all services in both the ~50 and the ~120 service cases, the publishing side (the bridge accessory) cannot be at fault. It must be the DNS-SD browsing side in the controller that does not see the published services.
  • Assuming tvOS uses the same DNS-SD stack as macOS (@bzbarsky-apple: true?), this looks more like a problem in the Matter layers on top of the Darwin platform DNS-SD, which is why I am posting the issue here.
  • As the error is a timeout, there might be a hard-coded assumption somewhere about how long service discovery or address resolution may take at most. If the actual time somehow depends on the total number of services, that would lead to giving up too early.
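One rough way to probe the timeout hypothesis is to measure how long a browse-and-resolve of `_matter._tcp` takes in each LAN. The sketch below only times an external command against a deadline. It says nothing about the timeouts used inside the Matter SDK or tvOS, and it measures the Linux/avahi side rather than the controller. The helper name `time_command` and the 30-second deadline are arbitrary assumptions.

```python
import subprocess
import time


def time_command(argv, timeout_s):
    """Run argv and return the elapsed seconds, or None if it exceeded timeout_s."""
    start = time.monotonic()
    try:
        subprocess.run(
            argv,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,
            check=False,  # only the duration matters here, not the exit code
        )
    except subprocess.TimeoutExpired:
        return None
    return time.monotonic() - start
```

For example, `time_command(["avahi-browse", "-r", "-t", "_matter._tcp"], timeout_s=30.0)` could be run several times in each LAN. A clearly longer duration in the ~120 service LAN would support the idea that a fixed deadline is hit there, while similar durations would point elsewhere.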

About this issue

  • State: closed
  • Created a year ago
  • Reactions: 1
  • Comments: 32 (19 by maintainers)

Most upvoted comments

No worries, I’m happy to help and appreciate both the update and the information so far!

@Abhayakara JFYI: after a day of debugging, this could be a red herring. I still do not understand the root cause, but it now seems likely that it is only indirectly related to DNS-SD activity (an oddity in the avahi setup on the test platform that I haven’t figured out yet). Also, I’m not back at the clean test setup I had on Friday, because I cannot yet re-add that particular accessory in Home.app due to the unrelated ECDSA hash check problem. As soon as I have more insight, I’ll update this issue.

Anyway, thanks a lot for helping, and sorry for having caused you worries…