connectedhomeip: [BUG] [ReadHandler] Regression in report engine caused by #28104

Reproduction steps

Commit c9ce7f21aff97aa0520ab778ae86d07af8bff507 from PR #28104 causes a regression in subscription handling, which is 100% reproducible in my (Linux Bridge) commissioned into a AppleTV:

  • build with c9ce7f21aff97aa0520ab778ae86d07af8bff507 -> controller endlessly retries subscribing
  • build with preceeding commit fe4c679ace3eb134d7365d3bd0fe6c2fad6ece87 -> subscriptions work ok

Bug prevalence

Always

GitHub hash of the SDK that was being used

c9ce7f21aff97aa0520ab778ae86d07af8bff507

Platform

core

Platform Version(s)

n/a

Anything else?

For some reason, the report engine keeps generating a IM:ReportData message that is rejected by the controller (AppleTV) with INVALID_SUBSCRIPTION. This causes the contoller to send a new IM:SubscribeRequest in a new session, which again causes the invalid IM:ReportData, forever (not good for the flash storage).

@bzbarsky-apple comments in #28104:

[…] Something is very broken here… This message:

[1690835066.750730][3483:3483] CHIP:EM: <<< [E:12134i S:7747 M:65373677] (S) Msg TX to 1:146752179F0AE79A [B49B] --- Type 0001:05 (IM:ReportData)

is wrong. It’s happening before the subscription has been set up, but not on the subscription-setup exchange, which is why the other side is rejecting it: it carries a subscription id which is not for a subscription that “exists” yet.

Here’s a log excerpt of the repeating part: (full log as attachment)

....
[1690835059.326804][3483:3483] CHIP:EM: >>> [E:12133i S:7746 M:156214298 (Ack:190299662)] (S) 
[c9ce7f2 - reporting regression.txt](https://github.com/project-chip/connectedhomeip/files/12222101/c9ce7f2.-.reporting.regression.txt)
Msg RX from 1:146752179F0AE79A [B49B] --- Type 0001:01 (IM:StatusResponse)
[1690835059.327958][3483:3483] CHIP:IM: Received status response, status is 0x7d (INVALID_SUBSCRIPTION)

[1690835059.331730][3483:3483] CHIP:DL: writing settings to file (/flash/p44mbrd_kvs-eModLd)
[1690835059.480666][3483:3483] CHIP:DL: renamed tmp file to file (/flash/p44mbrd_kvs)
[1690835059.502327][3483:3483] CHIP:EM: <<< [E:12133i S:7746 M:190299665 (Ack:156214298)] (S) Msg TX to 1:146752179F0AE79A [B49B] --- Type 0000:10 (SecureChannel:StandaloneAck)
[1690835059.503050][3483:3483] CHIP:IN: (S) Sending msg 190299665 on secure session with LSID: 7746
[1690835059.504382][3483:3483] CHIP:EM: >>> [E:34561r S:7746 M:156214299 (Ack:190299663)] (S) Msg RX from 1:146752179F0AE79A [B49B] --- Type 0001:01 (IM:StatusResponse)
[1690835059.505544][3483:3483] CHIP:IM: Received status response, status is 0x00 (SUCCESS)
[1690835059.506646][3483:3483] CHIP:EM: <<< [E:34561r S:7746 M:190299666 (Ack:156214299)] (S) Msg TX to 1:146752179F0AE79A [B49B] --- Type 0001:04 (IM:SubscribeResponse)
[1690835059.507473][3483:3483] CHIP:IN: (S) Sending msg 190299666 on secure session with LSID: 7746
[1690835059.510134][3483:3483] CHIP:EM: >>> [E:34561r S:7746 M:156214300 (Ack:190299666)] (S) Msg RX from 1:146752179F0AE79A [B49B] --- Type 0000:10 (SecureChannel:StandaloneAck)
[1690835065.946340][3483:3483] CHIP:EM: >>> [E:34562r S:0 M:49355875] (U) Msg RX from 0:0512CA591F6CAE18 [0000] --- Type 0000:30 (SecureChannel:CASE_Sigma1)
[1690835065.946659][3483:3483] CHIP:IN: CASE Server received Sigma1 message . Starting handshake. EC 0x6be5b0
[1690835065.946778][3483:3483] CHIP:SC: Received Sigma1 msg
[1690835065.949108][3483:3483] CHIP:EM: <<< [E:34562r S:0 M:132618995 (Ack:49355875)] (U) Msg TX to 0:0000000000000000 [0000] --- Type 0000:33 (SecureChannel:CASE_Sigma2Resume)
[1690835065.949478][3483:3483] CHIP:IN: (U) Sending msg 132618995 to IP address 'UDP:[fe80::cb0:4d11:63e1:f960%eth0]:51521'
[1690835065.952904][3483:3483] CHIP:EM: >>> [E:34562r S:0 M:49355876 (Ack:132618995)] (U) Msg RX from 0:0512CA591F6CAE18 [0000] --- Type 0000:40 (SecureChannel:StatusReport)
[1690835065.953837][3483:3483] CHIP:SC: Success status report received. Session was established
[1690835065.956631][3483:3483] CHIP:DL: writing settings to file (/flash/p44mbrd_kvs-oKjjhg)
[1690835066.095636][3483:3483] CHIP:DL: renamed tmp file to file (/flash/p44mbrd_kvs)
[1690835066.097282][3483:3483] CHIP:DL: writing settings to file (/flash/p44mbrd_kvs-DkPbMl)
[1690835066.234423][3483:3483] CHIP:DL: renamed tmp file to file (/flash/p44mbrd_kvs)
[1690835066.236145][3483:3483] CHIP:DL: writing settings to file (/flash/p44mbrd_kvs-GnkEcB)
[1690835066.413213][3483:3483] CHIP:DL: renamed tmp file to file (/flash/p44mbrd_kvs)
[1690835066.413715][3483:3483] CHIP:SC: SecureSession[0x779419e0, LSID:7747]: State change 'kEstablishing' --> 'kActive'
[1690835066.413871][3483:3483] CHIP:IN: CASE Session established to peer: <146752179F0AE79A, 1>
[1690835066.414342][3483:3483] CHIP:EM: <<< [E:34562r S:0 M:132618996 (Ack:49355876)] (U) Msg TX to 0:0000000000000000 [0000] --- Type 0000:10 (SecureChannel:StandaloneAck)
[1690835066.414691][3483:3483] CHIP:IN: (U) Sending msg 132618996 to IP address 'UDP:[fe80::cb0:4d11:63e1:f960%eth0]:51521'
[1690835066.416382][3483:3483] CHIP:EM: >>> [E:34563r S:7747 M:113090551] (S) Msg RX from 1:146752179F0AE79A [B49B] --- Type 0001:03 (IM:SubscribeRequest)
[1690835066.418659][3483:3483] CHIP:DMG: Registered a ReadHandler that will schedule a report between system Timestamp: 41430293 and system Timestamp 41430293.
[1690835066.423998][3483:3483] CHIP:DMG: Final negotiated min/max parameters: Min = 0s, Max = 2s
[1690835066.452039][3483:3483] CHIP:DL: writing settings to file (/flash/p44mbrd_kvs-BhAaJd)
[1690835066.745570][3483:3483] CHIP:DL: renamed tmp file to file (/flash/p44mbrd_kvs)
[1690835066.750730][3483:3483] CHIP:EM: <<< [E:12134i S:7747 M:65373677] (S) Msg TX to 1:146752179F0AE79A [B49B] --- Type 0001:05 (IM:ReportData)
[1690835066.751488][3483:3483] CHIP:IN: (S) Sending msg 65373677 on secure session with LSID: 7747
[1690835066.756634][3483:3483] CHIP:EM: <<< [E:34563r S:7747 M:65373678 (Ack:113090551)] (S) Msg TX to 1:146752179F0AE79A [B49B] --- Type 0001:05 (IM:ReportData)
[1690835066.757360][3483:3483] CHIP:IN: (S) Sending msg 65373678 on secure session with LSID: 7747
[1690835066.759274][3483:3483] CHIP:EM: >>> [E:34563r S:7747 M:113090551] (S) Msg RX from 1:146752179F0AE79A [B49B] --- Type 0001:03 (IM:SubscribeRequest)
[1690835066.760237][3483:3483] CHIP:EM: <<< [E:34563r S:7747 M:65373679 (Ack:113090551)] (S) Msg TX to 1:146752179F0AE79A [B49B] --- Type 0000:10 (SecureChannel:StandaloneAck)
[1690835066.762392][3483:3483] CHIP:IN: (S) Sending msg 65373679 on secure session with LSID: 7747
[1690835066.763714][3483:3483] CHIP:EM: >>> [E:34562r S:0 M:49355876 (Ack:132618995)] (U) Msg RX from 0:0512CA591F6CAE18 [0000] --- Type 0000:40 (SecureChannel:StatusReport)
[1690835066.764682][3483:3483] CHIP:EM: <<< [E:34562r S:0 M:132618997 (Ack:49355876)] (U) Msg TX to 0:0000000000000000 [0000] --- Type 0000:10 (SecureChannel:StandaloneAck)
[1690835066.765626][3483:3483] CHIP:IN: (U) Sending msg 132618997 to IP address 'UDP:[fe80::cb0:4d11:63e1:f960%eth0]:51521'
[1690835066.767370][3483:3483] CHIP:EM: >>> [E:34563r S:7747 M:113090551] (S) Msg RX from 1:146752179F0AE79A [B49B] --- Type 0001:03 (IM:SubscribeRequest)
[1690835066.768331][3483:3483] CHIP:EM: <<< [E:34563r S:7747 M:65373680 (Ack:113090551)] (S) Msg TX to 1:146752179F0AE79A [B49B] --- Type 0000:10 (SecureChannel:StandaloneAck)
[1690835066.769167][3483:3483] CHIP:IN: (S) Sending msg 65373680 on secure session with LSID: 7747
[1690835066.770940][3483:3483] CHIP:EM: >>> [E:12134i S:7747 M:113090552 (Ack:65373677)] (S) Msg RX from 1:146752179F0AE79A [B49B] --- Type 0001:01 (IM:StatusResponse)
[1690835066.772102][3483:3483] CHIP:IM: Received status response, status is 0x7d (INVALID_SUBSCRIPTION)

[1690835066.776877][3483:3483] CHIP:DL: writing settings to file (/flash/p44mbrd_kvs-GJJBnm)
[1690835067.094920][3483:3483] CHIP:DL: renamed tmp file to file (/flash/p44mbrd_kvs)
.... repeating endlessly 

c9ce7f2 - reporting regression.txt

About this issue

  • Original URL
  • State: closed
  • Created a year ago
  • Comments: 16 (16 by maintainers)

Commits related to this issue

Most upvoted comments

which causes the first incremental report not to be sent, and no new timer set up.

That is definitely a bug in the scheduler.

Which is a reasonable assumption

No, it’s not. This assumption definitely does not hold on some platforms. Filed https://github.com/project-chip/connectedhomeip/issues/28929. This is a must-fix for 1.2, @lpbeliveau-silabs

@lpbeliveau-silabs I found the problem, finally 😅!

All of it was caused by the report timer firing a few milliseconds too early.

Yes, it had to do with my setup, which is that it uses a libev mainloop and timers (similar to how Darwin uses dispatch). So my fault, basically 😞

Here’s what happened (log from the Rpi bf0b45a build, with some extra instrumentation log lines (‼️ marked):

[2023-08-17 22:31:42.477 N] ✅ CHIP:EM  ➡️ <<< [E:5300r S:36346 M:90461106 (Ack:108617764)] (S) Msg TX to 1:146752179F0AE79A [B49B] --- Type 0001:05 (IM:ReportData)
[2023-08-17 22:31:42.482 N] ✅ CHIP:IN  ➡️ (S) Sending msg 90461106 on secure session with LSID: 36346
[2023-08-17 22:31:42.485 D] 🛠️ CHIP:DMG ➡️ IM RH moving to [AwaitingReportResponse]
[2023-08-17 22:31:42.487 D] 🛠️ CHIP:DMG ➡️ <RE> ReportsInFlight = 1 with readHandler 12, RE has no more messages
[2023-08-17 22:31:42.489 D] 🛠️ CHIP:DMG ➡️ All ReadHandler-s are clean, clear GlobalDirtySet
[2023-08-17 22:31:42.545 N] ✅ CHIP:EM  ➡️ >>> [E:5300r S:36346 M:108617765 (Ack:90461106)] (S) Msg RX from 1:146752179F0AE79A [B49B] --- Type 0001:01 (IM:StatusResponse)
[2023-08-17 22:31:42.547 D] 🛠️ CHIP:EM  ➡️ Found matching exchange: 5300r, Delegate: 0xb6bba0fc
[2023-08-17 22:31:42.548 D] 🛠️ CHIP:EM  ➡️ Rxd Ack; Removing MessageCounter:90461106 from Retrans Table on exchange 5300r
[2023-08-17 22:31:42.550 N] ✅ CHIP:IM  ➡️ Received status response, status is 0x00 (SUCCESS)
[2023-08-17 22:31:42.553 N] ✅ CHIP:EM  ➡️ <<< [E:5300r S:36346 M:90461107 (Ack:108617765)] (S) Msg TX to 1:146752179F0AE79A [B49B] --- Type 0001:04 (IM:SubscribeResponse)
[2023-08-17 22:31:42.555 N] ✅ CHIP:IN  ➡️ (S) Sending msg 90461107 on secure session with LSID: 36346
[2023-08-17 22:31:42.561 N] ✅ CHIP:DMG ➡️ Registered a ReadHandler that will schedule a report between system Timestamp: 525396849 and system Timestamp 525398849.
[2023-08-17 22:31:42.564 D] 🛠️ CHIP:DMG ➡️ <RE> OnReportConfirm: NumReports = 0
[2023-08-17 22:31:42.565 D] 🛠️ CHIP:DMG ➡️ IM RH moving to [CanStartReporting]
[2023-08-17 22:31:42.567 N] ✅ CHIP:DMG ➡️ ‼️ ReadHandlerNode::IsReportableNow()=0 : canstart=1, dirty=0, now=525396855, mMinTimestamp=525396849 (-6), mMaxTimestamp=525398849 (+1994), mSyncTimestamp=525398849 (+1994)
[2023-08-17 22:31:42.570 N] ✅ CHIP:DMG ➡️ ‼️ Starting Report Timer for node/context = 0x0xb6b40a10 with timeout = 1994
[2023-08-17 22:31:42.573 N] ✅ CHIP:EM  ➡️ >>> [E:5300r S:36346 M:108617766 (Ack:90461107)] (S) Msg RX from 1:146752179F0AE79A [B49B] --- Type 0000:10 (SecureChannel:StandaloneAck)
[2023-08-17 22:31:42.574 D] 🛠️ CHIP:EM  ➡️ Found matching exchange: 5300r, Delegate: 0
[2023-08-17 22:31:42.576 D] 🛠️ CHIP:EM  ➡️ Rxd Ack; Removing MessageCounter:90461107 from Retrans Table on exchange 5300r
[2023-08-17 22:31:43.934 N] ✅ CHIP:DL  ➡️ Re-trying resolve
[2023-08-17 22:31:44.540 N] ✅ CHIP:DMG ➡️ ‼️ TimerFired() called on node = 0x0xb6b40a10
[2023-08-17 22:31:44.540 N] ✅ CHIP:DMG ➡️ ‼️ ReadHandlerNode::IsReportableNow()=0 : canstart=1, dirty=0, now=525398828, mMinTimestamp=525396849 (-1979), mMaxTimestamp=525398849 (+21), mSyncTimestamp=525398849 (+21)
[2023-08-17 22:31:44.541 D] 🛠️ CHIP:DMG ➡️ All ReadHandler-s are clean, clear GlobalDirtySet
[2023-08-17 22:31:48.947 N] ✅ CHIP:DL  ➡️ Re-trying resolve
[2023-08-17 22:31:49.004 N] ✅ CHIP:EM  ➡️ >>> [E:5301r S:0 M:202841598] (U) Msg RX from 0:FF325B00396BEFBC [0000] --- Type 0000:30 (SecureChannel:CASE_Sigma1)
[2023-08-17 22:31:49.004 D] 🛠️ CHIP:EM  ➡️ Handling via exchange: 5301r, Delegate: 0x24a9e0
[2023-08-17 22:31:49.005 N] ✅ CHIP:IN  ➡️ CASE Server received Sigma1 message . Starting handshake. EC 0xb6cddd20
  • the timeout is calculated to be 1994 (“Starting Report Timer for node/context…”)
  • but the timer actually fires 21mS earlier already ("ReadHandlerNode::IsReportableNow()=0 … mMaxTimestamp=525398849 (+21))
  • which causes the first incremental report not to be sent, and no new timer set up.
  • so nothing happens until the AppleTV considers the subscription dead, and restarts with a new subscription.
  • at that time, ReadHandlerNode::IsReportableNow() of the still pending readhandler is called again, and generates the report that should have been triggered by the timer, many seconds too late.

So it boils down to the ReportEngine absolutely relying on timers NEVER firing even only a bit too early. Which is a reasonable assumption, and apparently true for other timer implementations, but not libev’s.

As adding libev was my contribution (#24232), I should have realized that. My bad, sorry. I just posted a fix: #28740

On the other hand, with timers always comes the question of tolerance. If implemenations may rely on absolutely none in the negative direction, it maybe should be stated in the header for System::Layer::StartTimer().