connectedhomeip: [BUG] DNS-SD operational discovery timeouts in network with many services?
Reproduction steps
Matter Setup
- AppleTV as controller, Ethernet connection to LAN
- matter bridge app on Linux as accessory, Ethernet-only connection to LAN
Summary
The bug is that this setup works 100% of the time in a LAN with ~50 published DNS-SD services in total (all types, not just Matter), but fails completely in a LAN with ~120 published DNS-SD services.
Step 1: Commissioning in the LAN with ~120 DNS-SD services:
- AppleTV and bridge app connected to LAN
- Commission the bridge via an iPhone/iPad in the same LAN, everything works
- Bridge/Devices appear in Home.app
- Homekit passes operation to AppleTV, matter stack in AppleTV takes control, tries to re-discover bridge via DNS-SD, 100% fails like:
Discovery error 19:47:45.090502+0200 homed Mdns: Resolve failure (kDNSServiceErr_Timeout)
Discovery error 19:47:45.090661+0200 homed OperationalSessionSetup[1:0000000042169F10]: operational discovery failed: ../../../../../../../../Sources/CHIPFramework/connectedhomeip/src/lib/address_resolve/AddressResolve_DefaultImpl.cpp:96: CHIP Error 0x00000032: Timeout
DataManagement error 19:47:45.090863+0200 homed Failed to establish CASE for re-subscription with error '../../../../../../../../Sources/CHIPFramework/connectedhomeip/src/lib/address_resolve/AddressResolve_DefaultImpl.cpp:96: CHIP Error 0x00000032: Timeout'
which then leads to
hmmtr.accessory.server error 19:47:45.431924+0200 homed [975063280/1] Bridged accessory for endpoint 3 is unreachable
hmmtr.accessory.server error 19:47:45.432064+0200 homed [975063280/1] Read cache operation aborted for characteristic since the bridged accessory is unreachable: <HAPCharacteristic>
which means all bridged devices immediately get the status “not responding” and never become operational.
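When comparing captures from the two LANs, it can help to pull just the CHIP error entries out of a longer homed log. The following is a minimal sketch, assuming the log has been saved as text with one entry per line in the same shape as the excerpt above; the function name `extract_chip_errors` is made up for illustration and is not part of any SDK.

```python
import re

# Matches the error suffix seen in the excerpt, e.g. "CHIP Error 0x00000032: Timeout"
CHIP_ERROR_RE = re.compile(r"CHIP Error (0x[0-9A-Fa-f]{8}): (\w+)")
# Matches the timestamp shape seen in the excerpt, e.g. "19:47:45.090502+0200"
TIMESTAMP_RE = re.compile(r"\d{2}:\d{2}:\d{2}\.\d{6}[+-]\d{4}")


def extract_chip_errors(log_text):
    """Return (timestamp, numeric_code, name) for each line containing a CHIP error.

    Hypothetical helper: assumes one log entry per line.
    """
    results = []
    for line in log_text.splitlines():
        err = CHIP_ERROR_RE.search(line)
        if err is None:
            continue  # not a CHIP error line
        ts = TIMESTAMP_RE.search(line)
        results.append((ts.group(0) if ts else None,
                        int(err.group(1), 16),
                        err.group(2)))
    return results
```

Lines such as the `Mdns: Resolve failure (kDNSServiceErr_Timeout)` entry carry no CHIP error code and are skipped by this sketch; they would need a separate pattern.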
Step 2: Move the setup to a LAN with fewer DNS-SD services (~50)
- Change nothing in the setup, not even a power-down; just move the Ethernet connection to a smaller test LAN with ~50 DNS-SD services.
- Wait a little.
- The bridged devices become fully operational and remain so.
Step 3…n: Switch back and forth
- Without doing anything but replugging Ethernet cables to switch LANs, the setup can be made to work and fail, back and forth.
Bug prevalence
100% reproducible
GitHub hash of the SDK that was being used
9e371e6f191c15ac446e0cc2270a420458018cda
Platform
darwin
Platform Version(s)
16.3, 16.4, 16.5 beta
Anything else?
Notes:
- the relevant _matter._tcp services are reliably visible in both the ~120 and ~50 service LANs when observed with avahi-browse -a -t, from the Linux-based bridge itself as well as from another Linux-based system in the same LAN.
- the relevant _matter._tcp service is also reliably visible in both the ~120 and ~50 service LANs when observed with Lily Ballard’s Discovery native macOS app (that is, using macOS native DNS-SD).
- Accessory/bridge side tested with 9e371e6f191c15ac446e0cc2270a420458018cda and v1.0.0
- Controller (AppleTV) side tested with tvOS 16.3, 16.4 and 16.5 beta 1.
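The "~50" and "~120" service counts can be reproduced from the same avahi-browse run by adding the `-p` (parsable) flag, which prints each newly seen service as `+;interface;protocol;name;type;domain`. Below is a minimal sketch that counts distinct services per type from that output; `count_services` is a made-up helper name, and counting each service once across interfaces and IP protocols is an assumption about how the totals were obtained.

```python
from collections import Counter


def count_services(parsable_output):
    """Count distinct services per service type.

    Expects the text output of `avahi-browse -a -t -p`.
    Hypothetical helper for illustration only.
    """
    seen = set()
    for line in parsable_output.splitlines():
        fields = line.split(";")
        # Only "+" lines announce a newly seen service
        if len(fields) < 6 or fields[0] != "+":
            continue
        name, service_type, domain = fields[3], fields[4], fields[5]
        # The same service may be listed once per interface and IP protocol;
        # the set keeps it from being counted more than once.
        seen.add((name, service_type, domain))
    return Counter(service_type for _, service_type, _ in seen)
```

To use it, feed it the captured output, for example `subprocess.run(["avahi-browse", "-a", "-t", "-p"], capture_output=True, text=True).stdout`; `sum(counts.values())` then gives the total across all types and `counts["_matter._tcp"]` the Matter operational services.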
Conclusion:
- As avahi and the Discovery app always see and resolve all services in both the ~50 and ~120 service cases, the publishing part (bridge accessory) cannot be at fault. It must be the DNS-SD browsing side in the controller that is not seeing the published services.
- Assuming the same DNS-SD stack in macOS as in tvOS (@bzbarsky-apple: true?), this seems more likely to be a problem in the Matter layers on top of the Darwin platform DNS-SD, which is why I am posting the issue here.
- As the error is “timeout”, there might be hard-coded assumptions somewhere about how long service discovery or address resolution may take at most. The time actually needed might depend on the total number of services, leading to “giving up too early”.
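One way to gather evidence for or against the timing hypothesis is to measure how long a browse-and-resolve of `_matter._tcp` takes in each LAN. The sketch below times `avahi-browse -r -t -p _matter._tcp` from a Linux host. It measures the avahi side only, not the controller's DNS-SD stack, so it gives a baseline for comparing the two LANs rather than a reproduction of the failure. The function name, the 60-second default, and the injectable `run` parameter are assumptions made for illustration.

```python
import subprocess
import time


def time_matter_resolve(timeout_s=60.0, run=subprocess.run):
    """Time a browse-and-resolve of _matter._tcp via avahi-browse.

    Returns (elapsed_seconds, resolved_count). resolved_count is None if the
    command did not finish within timeout_s. Hypothetical helper.
    """
    cmd = ["avahi-browse", "-r", "-t", "-p", "_matter._tcp"]
    start = time.monotonic()
    try:
        proc = run(cmd, capture_output=True, text=True, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        return time.monotonic() - start, None
    elapsed = time.monotonic() - start
    # In parsable mode, resolved records are the lines starting with "="
    resolved = sum(1 for line in proc.stdout.splitlines()
                   if line.startswith("="))
    return elapsed, resolved
```

Because `-t` makes avahi-browse exit once it considers the list more or less complete, the elapsed time is only approximate; comparing several runs per LAN would be more meaningful than a single measurement.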
About this issue
- State: closed
- Created a year ago
- Reactions: 1
- Comments: 32 (19 by maintainers)
No worries, I’m happy to help and appreciate both the update and the information so far!
@Abhayakara JFYI: after a day of debugging, this could be a red herring. I still do not understand the root cause, but it now seems likely that it is only indirectly related to DNS-SD activity (a weirdness with the avahi setup on the test platform that I haven’t figured out yet). Also, I’m not back at the clean test setup I had on Friday, because I cannot yet re-add that particular accessory in Home.app due to the unrelated ECDSA hash check problem. As soon as I have more insight, I’ll update this issue.
Anyway, thanks a lot for helping, and sorry for having caused you worries…