connectedhomeip: [BUG] [ReadHandler] Regression in report engine caused by #28104
Reproduction steps
Commit c9ce7f21aff97aa0520ab778ae86d07af8bff507 from PR #28104 causes a regression in subscription handling, which is 100% reproducible in my (Linux Bridge) commissioned into a AppleTV:
- build with c9ce7f21aff97aa0520ab778ae86d07af8bff507 -> controller endlessly retries subscribing
- build with preceeding commit fe4c679ace3eb134d7365d3bd0fe6c2fad6ece87 -> subscriptions work ok
Bug prevalence
Always
GitHub hash of the SDK that was being used
c9ce7f21aff97aa0520ab778ae86d07af8bff507
Platform
core
Platform Version(s)
n/a
Anything else?
For some reason, the report engine keeps generating a IM:ReportData message that is rejected by the controller (AppleTV) with INVALID_SUBSCRIPTION. This causes the contoller to send a new IM:SubscribeRequest in a new session, which again causes the invalid IM:ReportData, forever (not good for the flash storage).
@bzbarsky-apple comments in #28104:
[…] Something is very broken here… This message:
[1690835066.750730][3483:3483] CHIP:EM: <<< [E:12134i S:7747 M:65373677] (S) Msg TX to 1:146752179F0AE79A [B49B] --- Type 0001:05 (IM:ReportData)is wrong. It’s happening before the subscription has been set up, but not on the subscription-setup exchange, which is why the other side is rejecting it: it carries a subscription id which is not for a subscription that “exists” yet.
Here’s a log excerpt of the repeating part: (full log as attachment)
....
[1690835059.326804][3483:3483] CHIP:EM: >>> [E:12133i S:7746 M:156214298 (Ack:190299662)] (S)
[c9ce7f2 - reporting regression.txt](https://github.com/project-chip/connectedhomeip/files/12222101/c9ce7f2.-.reporting.regression.txt)
Msg RX from 1:146752179F0AE79A [B49B] --- Type 0001:01 (IM:StatusResponse)
[1690835059.327958][3483:3483] CHIP:IM: Received status response, status is 0x7d (INVALID_SUBSCRIPTION)
[1690835059.331730][3483:3483] CHIP:DL: writing settings to file (/flash/p44mbrd_kvs-eModLd)
[1690835059.480666][3483:3483] CHIP:DL: renamed tmp file to file (/flash/p44mbrd_kvs)
[1690835059.502327][3483:3483] CHIP:EM: <<< [E:12133i S:7746 M:190299665 (Ack:156214298)] (S) Msg TX to 1:146752179F0AE79A [B49B] --- Type 0000:10 (SecureChannel:StandaloneAck)
[1690835059.503050][3483:3483] CHIP:IN: (S) Sending msg 190299665 on secure session with LSID: 7746
[1690835059.504382][3483:3483] CHIP:EM: >>> [E:34561r S:7746 M:156214299 (Ack:190299663)] (S) Msg RX from 1:146752179F0AE79A [B49B] --- Type 0001:01 (IM:StatusResponse)
[1690835059.505544][3483:3483] CHIP:IM: Received status response, status is 0x00 (SUCCESS)
[1690835059.506646][3483:3483] CHIP:EM: <<< [E:34561r S:7746 M:190299666 (Ack:156214299)] (S) Msg TX to 1:146752179F0AE79A [B49B] --- Type 0001:04 (IM:SubscribeResponse)
[1690835059.507473][3483:3483] CHIP:IN: (S) Sending msg 190299666 on secure session with LSID: 7746
[1690835059.510134][3483:3483] CHIP:EM: >>> [E:34561r S:7746 M:156214300 (Ack:190299666)] (S) Msg RX from 1:146752179F0AE79A [B49B] --- Type 0000:10 (SecureChannel:StandaloneAck)
[1690835065.946340][3483:3483] CHIP:EM: >>> [E:34562r S:0 M:49355875] (U) Msg RX from 0:0512CA591F6CAE18 [0000] --- Type 0000:30 (SecureChannel:CASE_Sigma1)
[1690835065.946659][3483:3483] CHIP:IN: CASE Server received Sigma1 message . Starting handshake. EC 0x6be5b0
[1690835065.946778][3483:3483] CHIP:SC: Received Sigma1 msg
[1690835065.949108][3483:3483] CHIP:EM: <<< [E:34562r S:0 M:132618995 (Ack:49355875)] (U) Msg TX to 0:0000000000000000 [0000] --- Type 0000:33 (SecureChannel:CASE_Sigma2Resume)
[1690835065.949478][3483:3483] CHIP:IN: (U) Sending msg 132618995 to IP address 'UDP:[fe80::cb0:4d11:63e1:f960%eth0]:51521'
[1690835065.952904][3483:3483] CHIP:EM: >>> [E:34562r S:0 M:49355876 (Ack:132618995)] (U) Msg RX from 0:0512CA591F6CAE18 [0000] --- Type 0000:40 (SecureChannel:StatusReport)
[1690835065.953837][3483:3483] CHIP:SC: Success status report received. Session was established
[1690835065.956631][3483:3483] CHIP:DL: writing settings to file (/flash/p44mbrd_kvs-oKjjhg)
[1690835066.095636][3483:3483] CHIP:DL: renamed tmp file to file (/flash/p44mbrd_kvs)
[1690835066.097282][3483:3483] CHIP:DL: writing settings to file (/flash/p44mbrd_kvs-DkPbMl)
[1690835066.234423][3483:3483] CHIP:DL: renamed tmp file to file (/flash/p44mbrd_kvs)
[1690835066.236145][3483:3483] CHIP:DL: writing settings to file (/flash/p44mbrd_kvs-GnkEcB)
[1690835066.413213][3483:3483] CHIP:DL: renamed tmp file to file (/flash/p44mbrd_kvs)
[1690835066.413715][3483:3483] CHIP:SC: SecureSession[0x779419e0, LSID:7747]: State change 'kEstablishing' --> 'kActive'
[1690835066.413871][3483:3483] CHIP:IN: CASE Session established to peer: <146752179F0AE79A, 1>
[1690835066.414342][3483:3483] CHIP:EM: <<< [E:34562r S:0 M:132618996 (Ack:49355876)] (U) Msg TX to 0:0000000000000000 [0000] --- Type 0000:10 (SecureChannel:StandaloneAck)
[1690835066.414691][3483:3483] CHIP:IN: (U) Sending msg 132618996 to IP address 'UDP:[fe80::cb0:4d11:63e1:f960%eth0]:51521'
[1690835066.416382][3483:3483] CHIP:EM: >>> [E:34563r S:7747 M:113090551] (S) Msg RX from 1:146752179F0AE79A [B49B] --- Type 0001:03 (IM:SubscribeRequest)
[1690835066.418659][3483:3483] CHIP:DMG: Registered a ReadHandler that will schedule a report between system Timestamp: 41430293 and system Timestamp 41430293.
[1690835066.423998][3483:3483] CHIP:DMG: Final negotiated min/max parameters: Min = 0s, Max = 2s
[1690835066.452039][3483:3483] CHIP:DL: writing settings to file (/flash/p44mbrd_kvs-BhAaJd)
[1690835066.745570][3483:3483] CHIP:DL: renamed tmp file to file (/flash/p44mbrd_kvs)
[1690835066.750730][3483:3483] CHIP:EM: <<< [E:12134i S:7747 M:65373677] (S) Msg TX to 1:146752179F0AE79A [B49B] --- Type 0001:05 (IM:ReportData)
[1690835066.751488][3483:3483] CHIP:IN: (S) Sending msg 65373677 on secure session with LSID: 7747
[1690835066.756634][3483:3483] CHIP:EM: <<< [E:34563r S:7747 M:65373678 (Ack:113090551)] (S) Msg TX to 1:146752179F0AE79A [B49B] --- Type 0001:05 (IM:ReportData)
[1690835066.757360][3483:3483] CHIP:IN: (S) Sending msg 65373678 on secure session with LSID: 7747
[1690835066.759274][3483:3483] CHIP:EM: >>> [E:34563r S:7747 M:113090551] (S) Msg RX from 1:146752179F0AE79A [B49B] --- Type 0001:03 (IM:SubscribeRequest)
[1690835066.760237][3483:3483] CHIP:EM: <<< [E:34563r S:7747 M:65373679 (Ack:113090551)] (S) Msg TX to 1:146752179F0AE79A [B49B] --- Type 0000:10 (SecureChannel:StandaloneAck)
[1690835066.762392][3483:3483] CHIP:IN: (S) Sending msg 65373679 on secure session with LSID: 7747
[1690835066.763714][3483:3483] CHIP:EM: >>> [E:34562r S:0 M:49355876 (Ack:132618995)] (U) Msg RX from 0:0512CA591F6CAE18 [0000] --- Type 0000:40 (SecureChannel:StatusReport)
[1690835066.764682][3483:3483] CHIP:EM: <<< [E:34562r S:0 M:132618997 (Ack:49355876)] (U) Msg TX to 0:0000000000000000 [0000] --- Type 0000:10 (SecureChannel:StandaloneAck)
[1690835066.765626][3483:3483] CHIP:IN: (U) Sending msg 132618997 to IP address 'UDP:[fe80::cb0:4d11:63e1:f960%eth0]:51521'
[1690835066.767370][3483:3483] CHIP:EM: >>> [E:34563r S:7747 M:113090551] (S) Msg RX from 1:146752179F0AE79A [B49B] --- Type 0001:03 (IM:SubscribeRequest)
[1690835066.768331][3483:3483] CHIP:EM: <<< [E:34563r S:7747 M:65373680 (Ack:113090551)] (S) Msg TX to 1:146752179F0AE79A [B49B] --- Type 0000:10 (SecureChannel:StandaloneAck)
[1690835066.769167][3483:3483] CHIP:IN: (S) Sending msg 65373680 on secure session with LSID: 7747
[1690835066.770940][3483:3483] CHIP:EM: >>> [E:12134i S:7747 M:113090552 (Ack:65373677)] (S) Msg RX from 1:146752179F0AE79A [B49B] --- Type 0001:01 (IM:StatusResponse)
[1690835066.772102][3483:3483] CHIP:IM: Received status response, status is 0x7d (INVALID_SUBSCRIPTION)
[1690835066.776877][3483:3483] CHIP:DL: writing settings to file (/flash/p44mbrd_kvs-GJJBnm)
[1690835067.094920][3483:3483] CHIP:DL: renamed tmp file to file (/flash/p44mbrd_kvs)
.... repeating endlessly
About this issue
- Original URL
- State: closed
- Created a year ago
- Comments: 16 (16 by maintainers)
Commits related to this issue
- Use matter SDK without problematic c9ce7f2 - [ReadHandler] Integration of ReportScheduler into the ReadHandler and IM engine. (#28104) - this commit causes endless re-subscriptions and thus flash wea... — committed to plan44/p44mbrd by plan44 a year ago
- SystemLayerImplSelect: libev: avoid timers firing slightly too early (#28434) This fixes a ReportEngine problem that was caused by libev-based timers firing slightly (in the range of 20mS) too early,... — committed to plan44/connectedhomeip by plan44 10 months ago
- [PR pending] SystemLayerImplSelect: libev: avoid timers firing slightly too early (#28434) This fixes a ReportEngine problem that was caused by libev-based timers firing slightly (in the range of 20m... — committed to plan44/connectedhomeip by plan44 10 months ago
- SystemLayerImplSelect: libev: avoid timers firing slightly too early (#28434) This fixes a ReportEngine problem that was caused by libev based timers firing slightly (in the range of 20mS) too early,... — committed to plan44/connectedhomeip by plan44 10 months ago
- [PR pending] SystemLayerImplSelect: libev: avoid timers firing slightly too early (#28434) This fixes a ReportEngine problem that was caused by libev-based timers firing slightly (in the range of 20m... — committed to plan44/connectedhomeip by plan44 10 months ago
- SystemLayerImplSelect: libev: avoid timers firing slightly too early (#28434) This fixes a ReportEngine problem that was caused by libev based timers firing slightly (in the range of 20mS) too early,... — committed to plan44/connectedhomeip by plan44 10 months ago
- SystemLayerImplSelect: libev: avoid timers firing early (#28434) (#28740) * SystemLayerImplSelect: libev: avoid timers firing slightly too early (#28434) This fixes a ReportEngine problem that was c... — committed to project-chip/connectedhomeip by plan44 10 months ago
- SystemLayerImplSelect: libev: avoid timers firing early (#28434) (#28740) * SystemLayerImplSelect: libev: avoid timers firing slightly too early (#28434) This fixes a ReportEngine problem that was c... — committed to HunsupJung/connectedhomeip by plan44 10 months ago
That is definitely a bug in the scheduler.
No, it’s not. This assumption definitely does not hold on some platforms. Filed https://github.com/project-chip/connectedhomeip/issues/28929. This is a must-fix for 1.2, @lpbeliveau-silabs
@lpbeliveau-silabs I found the problem, finally 😅!
All of it was caused by the report timer firing a few milliseconds too early.
Yes, it had to do with my setup, which is that it uses a libev mainloop and timers (similar to how Darwin uses dispatch). So my fault, basically 😞
Here’s what happened (log from the Rpi bf0b45a build, with some extra instrumentation log lines (‼️ marked):
ReadHandlerNode::IsReportableNow()of the still pending readhandler is called again, and generates the report that should have been triggered by the timer, many seconds too late.So it boils down to the ReportEngine absolutely relying on timers NEVER firing even only a bit too early. Which is a reasonable assumption, and apparently true for other timer implementations, but not libev’s.
As adding libev was my contribution (#24232), I should have realized that. My bad, sorry. I just posted a fix: #28740
On the other hand, with timers always comes the question of tolerance. If implemenations may rely on absolutely none in the negative direction, it maybe should be stated in the header for
System::Layer::StartTimer().