lorawan-stack: Basic Station Integration: Race condition in re-connection handling causes permanent failure of uplink forwarding
Summary
The Basic Station protocol is based on TCP. Occasionally the client drops this connection without executing a clean TCP connection close sequence. This may occur if link/net layer connectivity suddenly disappears and causes the TCP layer to reset and retry (e.g. a gateway switches from Ethernet backhaul to 3G backhaul because Ethernet went away, or a gateway sees an unexpected power cycle and boots up quickly, a common unplug/replug scenario for TTIG). If Basic Station establishes a new connection within a certain time after the last uplink on the old connection was forwarded, the LNS permanently stops processing uplinks from this gateway. This window appears to be about 60 seconds. Given the symptoms and the fact that this does not happen on the v3 stack, it looks like this issue is related to #1729.
This issue has also been discussed in the TTN forums.
Steps to Reproduce
- To simulate unclean TCP connection termination, introduce an iptables rule which blocks TCP FIN packets:
  iptables -A OUTPUT -d 52.169.76.203 --protocol tcp --tcp-flags FIN FIN -j DROP
- Start Basic Station on the machine where the iptables rule is active. Make sure you are in an environment of regular uplinks (every 10 seconds or so). Observe https://console.thethingsnetwork.org/gateways/<GATEWAYID>/traffic for incoming traffic.
- After a few uplinks are forwarded, stop the station process with CTRL+C (the server will see an unclean TCP termination, because of the missing FIN packet). Let us define the time T as the time where the last uplink message was forwarded before the station process was killed.
- Shortly after, start the station process again (Basic Station will connect and forward uplinks).
- At time T + 60 s, the error condition kicks in.
What do you see now?
The error condition: The gateway console https://console.thethingsnetwork.org/gateways/<GATEWAYID>/traffic stops showing uplinks while the connection between Basic Station and the server is kept alive (TCP keep-alive messages are exchanged) and Basic Station continues to receive uplinks. A TCP/IP packet capture shows that the uplinks are actually transferred over the websocket and that the TCP packets are acknowledged by the server, i.e. the server definitely receives the uplink messages but does not show them in the gateway console.
What do you want to see instead?
Uplink messages should continue to be processed by the LNS.
Environment
Basic Station (latest version), TTN community network.
(On the v3 stack this does not happen - hence the suspicion that it has something to do with the inactive connection termination heuristic).
How do you propose to implement this?
This is hard to judge without having access to the code. From the symptoms it looks like this issue is tied to the inactive connection termination heuristic as discussed in issue #1729. Probably the server is seeing two connections because a new connection is established before the old one is cleanly closed. Maybe the connection termination heuristic detects inactivity on the ‘dead’ connection and destroys context related to the gateway without considering that a second connection requires this context to forward uplinks up the stack. Obviously this is a guess, but it could explain the symptoms.
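The guess above can be made concrete with a small simulation. This is only a sketch of the suspected pattern, not the actual lorawan-stack code: `GatewayRegistry`, `expire_naive` and `expire_guarded` are hypothetical names, and the assumption is that per-gateway context is keyed by gateway ID alone.

```python
import itertools


class Connection:
    """One TCP/websocket connection from a gateway (hypothetical model)."""
    _ids = itertools.count(1)

    def __init__(self, gateway_id):
        self.gateway_id = gateway_id
        self.conn_id = next(self._ids)


class GatewayRegistry:
    """Hypothetical per-gateway context table, keyed by gateway ID."""

    def __init__(self):
        self.active = {}  # gateway_id -> Connection that owns the context

    def connect(self, gateway_id):
        conn = Connection(gateway_id)
        self.active[gateway_id] = conn  # a reconnect replaces the old entry
        return conn

    def expire_naive(self, conn):
        # Suspected bug: the inactivity timer of the dead connection removes
        # the context by gateway ID, even if a newer connection owns it now.
        self.active.pop(conn.gateway_id, None)

    def expire_guarded(self, conn):
        # Possible guard: only remove the context if this connection still owns it.
        if self.active.get(conn.gateway_id) is conn:
            del self.active[conn.gateway_id]

    def forward_uplink(self, conn):
        # An uplink is only processed if the connection's context still exists.
        return self.active.get(conn.gateway_id) is conn


def simulate(expire_name):
    reg = GatewayRegistry()
    old = reg.connect("gw-1")        # first connection, later dies uncleanly
    new = reg.connect("gw-1")        # quick reconnect before the old one expires
    getattr(reg, expire_name)(old)   # inactivity timeout fires for the old one
    return reg.forward_uplink(new)   # is the new connection still served?


print(simulate("expire_naive"))    # False: uplinks are silently dropped
print(simulate("expire_guarded"))  # True: the new connection keeps working
```

In the naive variant the new connection stays open and keeps delivering data, but nothing is forwarded, which matches the observed symptoms.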
About this issue
- Original URL
- State: closed
- Created 5 years ago
- Reactions: 1
- Comments: 16 (4 by maintainers)
Hi Krishna, the issue is not related to server-side disconnects but to the way the server handles unclean client-side disconnects followed by immediate reconnects. The issue affects all Basic Station based gateways and can be reliably reproduced with the instructions above.
The error pattern is like this:
Increasing the timeout from 60s to 600s made the issue more severe: With 60s, a user who unplugged their TTIG just had to wait for 60s before plugging it back in to avoid running into the issue. This may have happened often enough, especially when people do a factory reset. Now with 600s, a larger percentage will run into the issue, which is also what can be observed.
Would it hurt to deactivate this timeout altogether? Basic Station does TCP keep alive by default (https://github.com/lorabasics/basicstation/blob/c29b8502f8c715daecec6666835da6e981dc820a/src/sys.c#L637). Doesn’t that suffice to check connection aliveness?
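For reference, this is roughly what enabling TCP keep-alive on a socket looks like. It is a generic sketch, not taken from Basic Station or lorawan-stack; the helper name `enable_keepalive` and the timing values are assumptions chosen for illustration, and the `TCP_KEEP*` options are platform-specific, so they are only set where the platform exposes them.

```python
import socket


def enable_keepalive(sock, idle_s=60, interval_s=15, probes=4):
    """Turn on TCP keep-alive for sock (timing values are illustrative only)."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # Fine-grained timing options are not available on every platform,
    # so set each one only if the socket module defines it.
    for name, value in (
        ("TCP_KEEPIDLE", idle_s),       # idle seconds before the first probe
        ("TCP_KEEPINTVL", interval_s),  # seconds between probes
        ("TCP_KEEPCNT", probes),        # failed probes before the peer is declared dead
    ):
        opt = getattr(socket, name, None)
        if opt is not None:
            sock.setsockopt(socket.IPPROTO_TCP, opt, value)
    return sock


sock = enable_keepalive(socket.socket(socket.AF_INET, socket.SOCK_STREAM))
print(sock.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE) != 0)
sock.close()
```

With keep-alive active, the TCP layer itself reports a dead peer, so an additional application-level inactivity timer may not be needed to detect stale connections.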
We’re releasing v3.5.3 today containing the fix, and are likely to be able to deploy that to the servers that TTIG connects to. Hopefully that will be resolved very soon.
The device I am monitoring transmits every 5 minutes.
02:42 RX Frame
02:47 RX Frame
02:53 RX Frame
02:56 Reconnect
02:58 RX Frame
no frames
07:30 Unplug/replug TTIG
07:36 RX Frame
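This log is consistent with the T + timeout behaviour described in the issue, under two assumptions: that the 600 s timeout mentioned earlier in the thread was in effect, and that T is the 02:53 frame (the last uplink on the old connection). The small helper below (a hypothetical name, for illustration only) does the arithmetic.

```python
from datetime import datetime, timedelta


def failure_onset(last_uplink_old_conn, timeout_s):
    """Return the HH:MM time at which the old connection's timeout would fire."""
    t = datetime.strptime(last_uplink_old_conn, "%H:%M")
    return (t + timedelta(seconds=timeout_s)).strftime("%H:%M")


print(failure_onset("02:53", 600))  # 03:03
```

Under these assumptions the 02:58 frame still arrives before the timeout fires and is forwarded, while every later frame is lost until the gateway is unplugged and replugged.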