chirpstack-gateway-bridge: Subscriptions are not re-obtained after MQTT disconnects or connection resets

  • [x] The issue is present in the latest release.
  • I have searched the issues of this repository and believe that this is not a duplicate.

What happened?

I see some unexpected behavior when connections to the MQTT broker (EMQX v4.1.0 in my case) are interrupted.

Intermittently, following a connection error such as one of these:

time="2020-10-22T17:11:09Z" level=error msg="mqtt: connection error" error="read tcp IP:38074->IP:31709: read: connection reset by peer"

time="2020-10-22T18:36:01Z" level=error msg="mqtt: connection error" error=EOF

time="2020-10-22T18:36:17Z" level=error msg="mqtt: connection error" error="write tcp IP:33194->IP:31709: write: broken pipe"

The subscriptions for each gateway topic on the MQTT broker are not re-established. When I enabled debug and paho logging, I could see the attempt to re-add the subscriptions, but it would log this line and then never log it for the second gateway, and neither subscription would be obtained.

What did you expect?

I would expect the subscriptions to be re-obtained. This may be a bug with the paho client.

I am currently testing gateway bridge at master + paho client at master.

Steps to reproduce this issue

Steps:

Could you share your log output?


Your Environment

Component | Version
Application Server | v?.?.?
Network Server | -
Gateway Bridge | master
Chirpstack API | -
Geolocation | -
Concentratord | -

About this issue

  • Original URL
  • State: closed
  • Created 4 years ago
  • Comments: 23 (14 by maintainers)

Most upvoted comments

I deployed an EMQX cluster after I read about it in this thread and will test it, too. 😃

Thanks @JohnRoesler for testing this!

That fits with how the timing needed to line up perfectly for the mutex to lock up.

For this reason, I have split up the mutex into different variables. It was used both for connection purposes and to guard against concurrent access to the gateways map.
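
To make the split concrete, here is a minimal sketch of what two separate locks could look like. The `backend` type, the field names `connMux` and `gatewaysMux`, and the helper methods are hypothetical stand-ins, not the actual gateway bridge code.

```go
package main

import (
	"fmt"
	"sync"
)

// backend is a hypothetical stand-in for the MQTT backend.
// Field and method names are assumptions made for illustration.
type backend struct {
	connMux sync.Mutex // serializes connect / reconnect only

	gatewaysMux sync.RWMutex    // guards the gateways map only
	gateways    map[string]bool // gateway ID -> subscribed
}

func newBackend() *backend {
	return &backend{gateways: make(map[string]bool)}
}

// setSubscribed records the subscription state. It never touches connMux,
// so an in-progress connect cannot block it.
func (b *backend) setSubscribed(id string, subscribed bool) {
	b.gatewaysMux.Lock()
	defer b.gatewaysMux.Unlock()
	if subscribed {
		b.gateways[id] = true
	} else {
		delete(b.gateways, id)
	}
}

// isSubscribed reads the map under the read lock.
func (b *backend) isSubscribed(id string) bool {
	b.gatewaysMux.RLock()
	defer b.gatewaysMux.RUnlock()
	return b.gateways[id]
}

func main() {
	b := newBackend()

	// Simulate a connect that is still in progress.
	b.connMux.Lock()
	b.setSubscribed("0102030405060708", true) // does not wait for connMux
	fmt.Println(b.isSubscribed("0102030405060708"))
	b.connMux.Unlock()
}
```

With this layout, map access and connection handling no longer wait on each other.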

FYI: I have just merged in some other improvements. During another project we found some bottlenecks in how the channels were set up. These channels have been removed, and callbacks executed in goroutines are now used. This does not solve this issue, however. I’m currently testing various scenarios and some modifications, and I might have found the potential cause.
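
As a rough illustration of the callback approach, the sketch below runs a handler in its own goroutine for every incoming message instead of pushing the message onto a channel. The `dispatcher` type, `messageHandler`, and `handleAll` are hypothetical names, and the `WaitGroup` is only there so the example can wait for the handlers to finish.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// messageHandler is a hypothetical callback type for incoming payloads.
type messageHandler func(payload []byte)

// dispatcher invokes the handler in a new goroutine per message, so a slow
// handler does not hold up the code that receives messages.
type dispatcher struct {
	wg      sync.WaitGroup
	handler messageHandler
}

// onMessage starts the callback and returns immediately.
func (d *dispatcher) onMessage(payload []byte) {
	d.wg.Add(1)
	go func() {
		defer d.wg.Done()
		d.handler(payload)
	}()
}

// wait blocks until all started callbacks have returned.
func (d *dispatcher) wait() { d.wg.Wait() }

// handleAll dispatches n messages and returns how many were handled.
func handleAll(n int) int64 {
	var handled int64
	d := &dispatcher{handler: func(payload []byte) {
		atomic.AddInt64(&handled, 1)
	}}
	for i := 0; i < n; i++ {
		d.onMessage([]byte{byte(i)})
	}
	d.wait()
	return atomic.LoadInt64(&handled)
}

func main() {
	fmt.Println(handleAll(10)) // prints 10
}
```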

Note that in the original implementation, the mutex might not be the issue. While SetGatewaySubscription would hold the lock until the (un)subscribe is completed, this function should finish once re-connected. This means that onConnected will be able to acquire the lock only after SetGatewaySubscription has been completed, but that is fine. The potential race is with connect, as this also tries to acquire a lock. So if SetGatewaySubscription is not able to (un)subscribe because the client is disconnected, then connect is blocked forever and there is a deadlock.
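
The scenario described above can be sketched with plain goroutines. This is a simplified model, not the gateway bridge code: it assumes the (un)subscribe call blocks while the client is disconnected, and `connectCompletes` plus the 200 ms timeout are made up for the demonstration.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// connectCompletes reports whether the connect stand-in finishes within the
// timeout while a (un)subscribe stand-in is waiting for the connection.
func connectCompletes(sharedLock bool) bool {
	var connMux, gatewaysMux sync.Mutex
	subLock := &gatewaysMux
	if sharedLock {
		subLock = &connMux // one mutex used for both purposes
	}

	connected := make(chan struct{})
	holding := make(chan struct{})

	// SetGatewaySubscription stand-in: takes its lock, then blocks until the
	// client is connected again (an assumption of this model).
	go func() {
		subLock.Lock()
		defer subLock.Unlock()
		close(holding)
		<-connected
	}()
	<-holding

	// connect stand-in: needs connMux before it can restore the connection.
	done := make(chan struct{})
	go func() {
		connMux.Lock()
		defer connMux.Unlock()
		close(connected)
		close(done)
	}()

	select {
	case <-done:
		return true
	case <-time.After(200 * time.Millisecond):
		return false // connect is stuck behind the subscription lock
	}
}

func main() {
	fmt.Println("shared mutex, connect completes:", connectCompletes(true))
	fmt.Println("split mutexes, connect completes:", connectCompletes(false))
}
```

With the shared mutex the connect stand-in never gets the lock, because the subscription stand-in holds it while waiting for a connection that only connect can restore. With separate mutexes the connect goes through and the subscription call is released.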

I’m going to make some modifications and let you know as soon as I have something to test with 😃