thingsboard: [Bug] CoAP observation drops

Bug

The CoAP observation relationship is “lost” ~1 min after being established, while trying to observe coap://host/api/v1/$ACCESS_TOKEN/rpc.

Server

Confirmed on

  • demo.thingsboard.io and
  • ThingsBoard PE 3.3.0 running on Ubuntu 20.04.3 (Docker monolith)

Your Device

  • Connectivity
    • CoAP: Reproducible with “coap-client” from libcoap 4.3.0 running on Linux (Ubuntu 20.04.3)

To Reproduce

Steps to reproduce the behavior:

  1. Using the sample “send rpc” widget on a dashboard, attempt to send an RPC command to coap-client, which is subscribed by launching the process with the following syntax: ./coap-client -m get coap://demo.thingsboard.io/api/v1/$ACCESS_TOKEN/rpc -s 720 -B 720
  2. Click “send rpc”, which should result in the following output being printed (stdout) by the coap-client process: {"id":1,"method":"rpcCommand","params":{}}
  3. Wait ~1 min or longer and click “send rpc” again. This command never reaches the coap-client.
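For reference, the observe registration that coap-client sends in step 1 is a confirmable GET carrying an Observe option with value 0 (RFC 7641). A minimal sketch of the wire encoding (our own toy encoder, not libcoap's implementation; function names are made up):

```python
def _encode_option(delta, value):
    """Encode one CoAP option (RFC 7252 section 3.1), supporting
    extended delta/length nibbles for values up to 268."""
    def nibble(n):
        if n < 13:
            return n, b""
        if n < 269:
            return 13, bytes([n - 13])
        raise ValueError("field too large for this sketch")
    d, d_ext = nibble(delta)
    l, l_ext = nibble(len(value))
    return bytes([(d << 4) | l]) + d_ext + l_ext + value

def build_observe_get(msg_id, token, path_segments):
    """Build a confirmable CoAP GET registering an observation
    (Observe option number 6, value 0 = zero-length per RFC 7641)."""
    # Byte 0: version 1 (01), type CON (00), token length nibble.
    header = bytes([0x40 | len(token), 0x01]) + msg_id.to_bytes(2, "big") + token
    opts = b""
    prev = 0
    opts += _encode_option(6 - prev, b"")  # Observe (register)
    prev = 6
    for seg in path_segments:              # Uri-Path, option number 11
        opts += _encode_option(11 - prev, seg.encode())
        prev = 11
    return header + opts
```

The server then keeps the token and notifies the client by reusing it in each notification, which is exactly the relation the log below shows being established and later cancelled.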

Expected behavior

The CoAP observation should not be dropped without notifying the observer.

Relevant logs

2021-09-21 13:37:38,274 [DefaultTransportService-22-6] INFO o.e.californium.core.CoapResource - successfully established observe relation between 172.19.0.1:36342#BEEFFEED and resource /api/v1 (Exchange[R1132], size 33)

after failing to send RPC to device

2021-09-21 13:45:06,488 [CoapServer(main)#2] INFO o.e.c.c.network.stack.ObserveLayer - notification for token [Token=BEEFFEED] timed out. Canceling all relations with source [/172.19.0.1:36342]
2021-09-21 13:45:06,489 [CoapServer(main)#2] INFO o.e.californium.core.CoapResource - remove observe relation between 172.19.0.1:36342#BEEFFEED and resource /api/v1 (Exchange[R1132, complete], size 32)
2021-09-21 13:45:06,489 [CoapServer(main)#2] ERROR o.e.c.c.n.stack.ReliabilityLayer - Exception for Exchange[R1132, complete] in MessageObserver: null
java.lang.NullPointerException: null
	at org.thingsboard.server.transport.coap.client.DefaultCoapClientContext.cancelRpcSubscription(DefaultCoapClientContext.java:741)
	at org.thingsboard.server.transport.coap.client.DefaultCoapClientContext.deregisterObserveRelation(DefaultCoapClientContext.java:176)
	at org.thingsboard.server.transport.coap.CoapTransportResource$CoapResourceObserver.removedObserveRelation(CoapTransportResource.java:504)
	at org.eclipse.californium.core.CoapResource.removeObserveRelation(CoapResource.java:778)
	at org.eclipse.californium.core.observe.ObserveRelation.cancel(ObserveRelation.java:151)
	at org.eclipse.californium.core.observe.ObservingEndpoint.cancelAll(ObservingEndpoint.java:74)
	at org.eclipse.californium.core.observe.ObserveRelation.cancelAll(ObserveRelation.java:162)
	at org.eclipse.californium.core.network.stack.ObserveLayer$NotificationController.onTimeout(ObserveLayer.java:233)
	at org.eclipse.californium.core.coap.Message.setTimedOut(Message.java:954)
	at org.eclipse.californium.core.network.Exchange.setTimedOut(Exchange.java:707)
	at org.eclipse.californium.core.network.stack.ReliabilityLayer$RetransmissionTask.retry(ReliabilityLayer.java:524)
	at org.eclipse.californium.core.network.stack.ReliabilityLayer$RetransmissionTask.access$200(ReliabilityLayer.java:430)
	at org.eclipse.californium.core.network.stack.ReliabilityLayer$RetransmissionTask$1.run(ReliabilityLayer.java:467)
	at org.eclipse.californium.elements.util.SerialExecutor$1.run(SerialExecutor.java:289)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
	at java.base/java.lang.Thread.run(Thread.java:834)

Additional context

Somewhat noteworthy: while monitoring the packets in/out, we could not observe any outgoing packets for the failed RPC to the device. Additionally, this interaction is over IPv6, and yet the log shows the observation mapped to an IPv4 address.
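One plausible explanation for the IPv6-to-IPv4 discrepancy (an assumption, not confirmed in this thread) is the Docker networking layer: in a Docker monolith, the userland proxy can terminate the external IPv6 flow and re-originate it from the bridge gateway, so Californium only ever sees an IPv4 peer such as 172.19.0.1. A small diagnostic sketch (the 172.19.0.0/16 bridge subnet is inferred from the log; the helper name is hypothetical):

```python
import ipaddress

# Assumed Docker bridge subnet, matching the 172.19.0.1 seen in the logs.
DOCKER_BRIDGE = ipaddress.ip_network("172.19.0.0/16")

def describe_peer(addr: str) -> str:
    """Classify the peer address that the CoAP server logs."""
    ip = ipaddress.ip_address(addr)
    if ip.version == 6 and ip.ipv4_mapped:
        return f"IPv4-mapped IPv6 ({ip.ipv4_mapped})"
    if ip.version == 4 and ip in DOCKER_BRIDGE:
        return "Docker bridge address (likely the proxy, not the real client)"
    return "direct peer"
```

If the logged peer is the bridge gateway rather than the device's real address, any address-based observe bookkeeping is already one NAT hop removed from the client.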

About this issue

  • Original URL
  • State: open
  • Created 3 years ago
  • Comments: 15 (4 by maintainers)

Most upvoted comments

Hi @jairohg , I will provide more comments about this issue tomorrow morning. This is indeed related to NAT but not only to NAT. It is also about the routing tables on many load balancers. It is a long story but we have a solution. Stay tuned for updates.

Hi @WillNilges, my 2 cents about our “coap.thingsboard.cloud” setup: at the moment the LB is installed on AWS Ubuntu VMs with elastic IPs. It forwards the traffic to LwM2M pods using NodePort. The LB “remembers” a routing table, which consists of: A) the source IP and port of the device; B) the destination IP and port of the node. The LB is configured to remember sessions for 1 hour. So, when the node has an update, we make sure we push it from the correct LB IP and port, and not from the AWS NAT Gateway.
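The routing-table behavior described above can be sketched as a map keyed by the device's source address, with entries expiring after one hour of inactivity (class and method names here are hypothetical, not the actual LB code):

```python
import time

SESSION_TTL = 3600.0  # "remember the sessions for 1 hour"

class UdpSessionTable:
    """Sketch of the LB routing table described above: maps
    (device src IP, src port) -> (node IP, node port), dropping
    entries after SESSION_TTL seconds of inactivity."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._table = {}  # src -> (node, last_seen)

    def upstream(self, src, node):
        """Record/refresh a device->node mapping on uplink traffic."""
        self._table[src] = (node, self._clock())

    def downstream(self, src):
        """Look up where a downlink for `src` should go; returns None
        once the session has expired (the failure mode in this issue)."""
        entry = self._table.get(src)
        if entry is None:
            return None
        node, last_seen = entry
        if self._clock() - last_seen > SESSION_TTL:
            del self._table[src]
            return None
        return node
```

The key point is that the downlink must leave from the same LB IP and port the device originally talked to, or the device (and any NAT in between) will drop it.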

Before the LB, we were still publishing the update from the node, but it was sent from the wrong IP (not from the LB IP that received the packet, but from the AWS NAT Gateway). The client ignored the update, since the source IP of the CoAP/UDP packet was different.

The ThingsBoard UDP LB is only 8 days old. Is it ready for production use?

We have faced a similar issue while testing some NB-IoT devices that communicate over CoAP and LwM2M. Even when the device is not behind NAT, the issue with the load-balancer implementation is present.

AWS NLB

During our experiments, we found that the default AWS NLB implementation “remembers” the route between the client IP+port and the k8s instance for 2 minutes. After ~2-3 minutes of inactivity, whatever downlink we send to the client is not delivered. We have not found any configuration property for the session lifetime that works for our case. In an attempt to solve the problem, we decided to try the Nginx LB, which has many configuration options.

Nginx LB

We allocated an Nginx LB with a static IP address and forwarded traffic to our K8S services in the same VPC, exposed as NodePorts. This solution allows you to configure the “session” expiration time (i.e., the routing-table lifetime) to 24 hours or more. This works well until your target servers (ThingsBoard CoAP Transport) are restarted due to an upgrade or outage. In such a case, the load balancer should forward the traffic for existing “sessions” to the new server. Unfortunately, this was not the case: traffic was forwarded for new clients, but not for old ones. Old clients’ traffic was forwarded to the wrong server and was never delivered. Theoretically, it is possible to configure Nginx to close the session using the proxy_responses setting and to close the connections using proxy_session_drop (which, by the way, is part of the commercial subscription). But for our use case these settings are not perfect, for multiple reasons:

  1. Clients may want to use non-confirmable packets for telemetry to optimize traffic consumption. In such a case, the proxy_responses setting does not work for us: the load balancer simply does not know whether it should close the session or not.
  2. The amount of traffic should not increase due to server failures or restarts. That is why the only viable solution we found was to create our own UDP load balancer. It may not be as fast as Nginx or its competitors, but the amount of traffic from NB-IoT devices is relatively low; we simply need more control over “session” management.
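To illustrate point 1: whether a CoAP request expects a response at all is encoded in the 2-bit Type field of the first header byte, which a generic UDP proxy (counting replies via proxy_responses) never inspects. A minimal sketch:

```python
# CoAP message types from the 2-bit Type field (RFC 7252 section 3).
COAP_TYPES = {0: "CON", 1: "NON", 2: "ACK", 3: "RST"}

def coap_type(datagram: bytes) -> str:
    """Read the CoAP message type from the first header byte.
    A NON request expects no reply, so a byte-agnostic LB that waits
    for N responses before closing the session will wait forever."""
    if not datagram or (datagram[0] >> 6) != 1:  # top 2 bits: version
        raise ValueError("not a CoAP version-1 message")
    return COAP_TYPES[(datagram[0] >> 4) & 0x3]
```

A CoAP-aware LB could use this to keep NON-only telemetry sessions alive without ever expecting return traffic.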

ThingsBoard UDP LB

A simple LB implementation which is open-source and powered by great frameworks. It has certain limitations but works best for our use case. A single server may handle up to 50K concurrent UDP sessions (since each session occupies one client port on the load balancer). The server reacts to DNS address changes and updates the routing table, so data is forwarded to the new server(s). The LwM2M transport remembers all active sessions, so the clients are not affected. What we plan to do next:

  1. Cache the routing table in Redis or local files. So the restart of the load balancer will not affect our setup.
  2. Cache the subscription info on the ThingsBoard CoAP Transport, same as for LwM2M, to remember the RPC/Attribute subscriptions for each client. This way we optimize the traffic, and the devices may be super constrained without any complex logic for reconnects, etc.
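Plan item 1 (caching the routing table so an LB restart is transparent) could look roughly like this, persisting entries to a local JSON file. The on-disk format and function names here are our assumptions for illustration, not the actual ThingsBoard implementation:

```python
import json

def save_routing_table(table, path):
    """Persist src -> (node, last_seen) entries as JSON so a restarted
    LB can restore its sessions instead of starting empty."""
    serializable = {
        f"{ip}:{port}": {"node": list(node), "last_seen": last_seen}
        for (ip, port), (node, last_seen) in table.items()
    }
    with open(path, "w") as f:
        json.dump(serializable, f)

def load_routing_table(path):
    """Inverse of save_routing_table: rebuild the in-memory table."""
    with open(path) as f:
        raw = json.load(f)
    table = {}
    for key, entry in raw.items():
        ip, port = key.rsplit(":", 1)
        table[(ip, int(port))] = (tuple(entry["node"]), entry["last_seen"])
    return table
```

Redis would serve the same purpose with the added benefit of surviving the host itself, at the cost of an extra dependency on the LB path.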