nats.deno: Suggestion: possibility of higher resilience for connect-once, never-disconnect use cases.
Hi,
This is not an issue but a suggestion, or maybe a question, and I'm posting it here instead of nats.ws as it seems to be related to some logic within the base client.
We're using nats.js / nats.ws on unstable edge networks (powered by 4G or 3G); the application connects and subscribes once, and is expected to stay connected forever (and to reconnect quickly).
Thus, a relatively aggressive pinging / reconnecting config is used here:
{
  noEcho: true,
  noRandomize: true,
  maxReconnectAttempts: -1,
  waitOnFirstConnect: true,
  reconnectTimeWait: 500,
  pingInterval: 3 * 1000,
  maxPingOut: 3
}
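For reference, this is roughly how such options are passed to the client; a minimal sketch assuming the nats.ws connect() entry point and a placeholder server URL:

import { connect } from "nats.ws";

// Placeholder endpoint; the aggressive ping/reconnect settings are the ones above.
const nc = await connect({
  servers: "wss://nats.example.com:443",
  noEcho: true,
  noRandomize: true,
  maxReconnectAttempts: -1,   // never stop trying to reconnect
  waitOnFirstConnect: true,
  reconnectTimeWait: 500,     // retry every 500 ms
  pingInterval: 3 * 1000,     // ping every 3 s
  maxPingOut: 3,              // declare the connection stale after 3 missed pongs
});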
Lately we've noticed that when devices do PUBs, sometimes the client is in the closed state, which causes an error, and I'm sure there's no application logic that calls close() on the client. It seems that _close() is being called from the inside.
In this state all subs are lost, as close() cleans those up, and the heartbeat also stops… If the dev wants to create a never-ending connection, he or she must re-init the connection manually and re-init all subs manually, which introduces a 'loop like' structure:
async function init() {
  await connect();
  // ...all subs
}
on("closed", init);
This forces all subs into a function, or introduces an event-emitter-like structure into end users' code.
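To make the pseudocode above concrete, here is a minimal sketch of the loop end users currently have to write, using the public connect(), subscribe() and closed() APIs; the subject name and the resubscribeAll helper are placeholders:

import { connect } from "nats.ws";

let nc;

async function resubscribeAll() {
  // Placeholder: re-create every subscription the application needs.
  const sub = nc.subscribe("telemetry.>");
  (async () => {
    for await (const m of sub) {
      // handle m ...
    }
  })();
}

async function init() {
  nc = await connect({ maxReconnectAttempts: -1, waitOnFirstConnect: true });
  await resubscribeAll();
  // closed() resolves once the client has torn itself down; start over from scratch.
  nc.closed().then(() => init());
}

init().catch(console.error);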
My suggestion is: if maxReconnectAttempts == -1, which means 'it never ends', can we keep the client instance intact no matter what, and keep retrying and pinging until the end of the application? This would simplify edge-app complexity.
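In contrast to the loop above, here is a sketch of what application code could look like if the client object stayed intact whenever maxReconnectAttempts is -1; the status() iterator is the existing API for observing reconnects, and the subject is a placeholder:

const nc = await connect({ maxReconnectAttempts: -1, waitOnFirstConnect: true });

// Subscribe exactly once; the subscription would survive any network trouble.
const sub = nc.subscribe("telemetry.>");
(async () => {
  for await (const m of sub) {
    // handle m ...
  }
})();

// Purely informational: watch disconnect/reconnect events, no re-subscribing needed.
(async () => {
  for await (const s of nc.status()) {
    console.log("nats status:", s.type);
  }
})();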
For now, we're monkey patching _close and heartbeats.cancel:

// Swallow the internal close so the client instance is never torn down.
io.protocol._close = async () => {
  // it never dies!
  reset("Underlying structure breaks");
};

// Wrap heartbeats.cancel: when the connection is declared stale but the
// protocol still believes it is connected, force our own reset instead.
const cc = io.protocol.heartbeats.cancel.bind(io.protocol.heartbeats);
io.protocol.heartbeats.cancel = (stale) => {
  cc(stale);
  if (stale && io.protocol.connected) {
    reset("Protocol hidden crash!");
  }
};
let reset_busy = false;

// Re-dial using the client's internals without ever letting it reach the
// closed state, so the client instance and its subscriptions stay intact.
async function reset(e) {
  try {
    if (reset_busy) {
      console.warn("io busy", e);
      return;
    }
    reset_busy = true;
    try {
      io.protocol.transport.close();
    } catch (err) {
      // the old transport may already be gone; ignore
    }
    io.protocol.prepare();
    await io.protocol.dialLoop();
    io._closed = false;
    io.protocol._closed = false;
    reset_busy = false;
  } catch (err) {
    console.error(err);
  }
}
The above is some horrible logic, but it keeps the problem away…
Thank you for your attention.
About this issue
- Original URL
- State: closed
- Created 3 years ago
- Comments: 19 (8 by maintainers)
I will be doing a release very soon now (but it may be a few days), but I am wondering if the issue is related to this:
https://github.com/nats-io/nats.deno/pull/201
It was possible for the client to start processing a partial frame during connect, which meant that it would fail during connection because the full connect JSON was not available even though it was expected to be there.
If you are willing, I can release a beta that you can try to see if the issue persists.
All the clients were updated: https://github.com/nats-io/nats.ws/pull/114 https://github.com/nats-io/nats.js/pull/456
I'm not sure if this is 100% related, but the issues you're having could be related to the waitOnFirstConnect option. I've been experiencing similar issues with durable connections on some services we run which can take a bit of time to initialize and start accepting connections. If the service exceeds the configured timeout (20s default), it looks like the initial connection will occur and everything will work as expected for upwards of 30 minutes. However, eventually we end up seeing a cycle of:
Followed shortly after by a Nats Protocol Error, which causes our services to cycle. After removing the waitOnFirstConnect option and upping the timeout so our services have enough time to initialize, I'm no longer seeing the same instability as before. It's a very hard issue to pinpoint as it's only on some services and appears to be somewhat variable.
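For anyone hitting the same thing, the workaround amounts to something like the following; a sketch assuming 60 s is enough for the service to come up (timeout is in milliseconds, default 20000), with a placeholder server URL:

import { connect } from "nats";

// No waitOnFirstConnect; instead give the initial connect a longer deadline.
const nc = await connect({
  servers: "nats://nats.example.com:4222",
  timeout: 60 * 1000,
  maxReconnectAttempts: -1,
  reconnectTimeWait: 500,
});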