stan.net: Cannot re-connect to server after nats-streaming-server is restarted

I’ve encountered the following issue while trying to reconnect to nats-streaming-server. Here are the steps to reproduce:

  1. nats-streaming-server is running
  2. the app (see demo code below) connects to nats-streaming-server and starts publishing messages
  3. the nats-streaming-server is killed
  4. on each subsequent Publish, STAN.Client.StanConnectionException: The NATS connection is reconnecting is printed
  5. after a while (server still not up), on each Publish, the app starts to receive STAN.Client.StanConnectionClosedException: Connection closed.
  6. I restart nats-streaming-server
  7. The app’s ReconnectedEventHandler is fired, detecting nats connection restore.
  8. But the app’s StanConnection never recovers and the application can never resume successful publishing.

What can I do in order to allow the app to recover publishing?

Running on OSX Catalina and using latest nuget packages available at the current time:

    <ItemGroup>
        <PackageReference Include="NATS.Client" Version="0.10.0" />
        <PackageReference Include="STAN.Client" Version="0.2.0" />
    </ItemGroup>

Code used for testing:

using System;
using System.Diagnostics;
using System.Threading;
using NATS.Client;
using STAN.Client;

namespace StanTest
{
    class Program
    {
        static void Main(string[] args)
        {
            var cf = new ConnectionFactory();
            var sf = new StanConnectionFactory();
            var natsConnection = cf.CreateConnection(GetOpts());

            var stanOpts = StanOptions.GetDefaultOptions();
            stanOpts.ConnectTimeout = 4000;
            stanOpts.NatsConn = natsConnection;
            stanOpts.PubAckWait = 40000;

            var stanConnection = sf.CreateConnection("test-cluster", "uniq123", stanOpts);
            var watch = Stopwatch.StartNew();

            while (true)
            {
                try
                {
                    stanConnection.Publish("test", new byte[0x1]);
                    Console.WriteLine("{0}. Published message", watch.Elapsed);
                }
                catch (Exception e)
                {
                    Console.WriteLine("{0} - Type:{1}. On Publish: exception message: {2}", watch.Elapsed, e.GetType(), e.Message);
                }
                finally
                {
                    Thread.Sleep(1000);
                }
            }
        }
        
        private static Options GetOpts()
        {
            var opts = ConnectionFactory.GetDefaultOptions();
            opts.Url = "nats://localhost:4222";
            opts.AllowReconnect = true;
            opts.PingInterval = 5000;
            opts.MaxPingsOut = 2;
            opts.MaxReconnect = Options.ReconnectForever;
            opts.ReconnectWait = 1000;
            opts.Timeout = 4000;

            opts.ServerDiscoveredEventHandler += (sender, args) => Console.WriteLine("NATS server discovered");

            opts.ReconnectedEventHandler +=
                (sender, args) =>  Console.WriteLine( "NATS server reconnected.");
            opts.ClosedEventHandler +=
                (sender, args) => Console.WriteLine("NATS connection closed");
            opts.DisconnectedEventHandler += (sender, args) =>
                Console.WriteLine("NATS connection disconnected");
            opts.AsyncErrorEventHandler +=
                (sender, args) => Console.WriteLine("NATS async error: {0}, Message={1}, Subject={2}", args.Conn.ConnectedUrl,
                    args.Error, args.Subscription.Subject);

            return opts;
        }
    }   
}

And here are the messages printed to console while running:

/usr/local/share/dotnet/dotnet /Users/robert/Sandbox/StanTest/bin/Debug/netcoreapp3.0/StanTest.dll
00:00:00.0237181. Published message
00:00:01.0443621. Published message
NATS connection disconnected
00:00:02.0483280 - Type:STAN.Client.StanConnectionException. On Publish: exception message: The NATS connection is reconnecting
00:00:03.0528216 - Type:STAN.Client.StanConnectionException. On Publish: exception message: The NATS connection is reconnecting
00:00:04.0580304 - Type:STAN.Client.StanConnectionException. On Publish: exception message: The NATS connection is reconnecting
00:00:05.0612458 - Type:STAN.Client.StanConnectionException. On Publish: exception message: The NATS connection is reconnecting
00:00:06.0655462 - Type:STAN.Client.StanConnectionException. On Publish: exception message: The NATS connection is reconnecting
00:00:07.0696808 - Type:STAN.Client.StanConnectionException. On Publish: exception message: The NATS connection is reconnecting
00:00:08.0721798 - Type:STAN.Client.StanConnectionException. On Publish: exception message: The NATS connection is reconnecting
00:00:09.0763581 - Type:STAN.Client.StanConnectionException. On Publish: exception message: The NATS connection is reconnecting
00:00:10.0803322 - Type:STAN.Client.StanConnectionException. On Publish: exception message: The NATS connection is reconnecting
00:00:11.0852381 - Type:STAN.Client.StanConnectionException. On Publish: exception message: The NATS connection is reconnecting
00:00:12.0898777 - Type:STAN.Client.StanConnectionException. On Publish: exception message: The NATS connection is reconnecting
00:00:13.0904445 - Type:STAN.Client.StanConnectionException. On Publish: exception message: The NATS connection is reconnecting
00:00:14.0946184 - Type:STAN.Client.StanConnectionException. On Publish: exception message: The NATS connection is reconnecting
00:00:15.0953681 - Type:STAN.Client.StanConnectionException. On Publish: exception message: The NATS connection is reconnecting
00:00:16.0963306 - Type:STAN.Client.StanConnectionException. On Publish: exception message: The NATS connection is reconnecting
00:00:17.1017706 - Type:STAN.Client.StanConnectionException. On Publish: exception message: The NATS connection is reconnecting
00:00:18.1047256 - Type:STAN.Client.StanConnectionException. On Publish: exception message: The NATS connection is reconnecting
00:00:19.1077148 - Type:STAN.Client.StanConnectionException. On Publish: exception message: The NATS connection is reconnecting
00:00:20.1114949 - Type:STAN.Client.StanConnectionClosedException. On Publish: exception message: Connection closed.
00:00:21.1162681 - Type:STAN.Client.StanConnectionClosedException. On Publish: exception message: Connection closed.
00:00:22.1206395 - Type:STAN.Client.StanConnectionClosedException. On Publish: exception message: Connection closed.
00:00:23.1238276 - Type:STAN.Client.StanConnectionClosedException. On Publish: exception message: Connection closed.
NATS server reconnected.
00:00:24.1287867 - Type:STAN.Client.StanConnectionClosedException. On Publish: exception message: Connection closed.
00:00:25.1324792 - Type:STAN.Client.StanConnectionClosedException. On Publish: exception message: Connection closed.
00:00:26.1373635 - Type:STAN.Client.StanConnectionClosedException. On Publish: exception message: Connection closed.
00:00:27.1406061 - Type:STAN.Client.StanConnectionClosedException. On Publish: exception message: Connection closed.
00:00:28.1452916 - Type:STAN.Client.StanConnectionClosedException. On Publish: exception message: Connection closed.
00:00:29.1501845 - Type:STAN.Client.StanConnectionClosedException. On Publish: exception message: Connection closed.
^C

About this issue

  • Original URL
  • State: closed
  • Created 5 years ago
  • Comments: 30 (13 by maintainers)

Most upvoted comments

I have encountered the same situation in version 0.3.0. when my Stan connection entered to Connection Closed Mode it never comes to connect state and all publish requests throw the Connection Close Exception.

this is my code to register stan Connection:

public static class Extensions
    {
        private static IStanConnection _stanConnection; 
        public static IServiceCollection AddStanClient(this IServiceCollection services, StanSettings settings)
        {

            var stanOptions = StanOptions.GetDefaultOptions();
            var natsOptions = ConnectionFactory.GetDefaultOptions();
            var stanClientId = $"{settings.ClientId}-{Guid.NewGuid()}";
            var stanClusterId = settings.ClusterId;
            natsOptions.Servers = settings.NatsServers.ToArray();
            natsOptions.AllowReconnect = true;
            natsOptions.MaxReconnect = Options.ReconnectForever;
            natsOptions.PingInterval = 2000;
            natsOptions.MaxPingsOut = 2;
            natsOptions.ReconnectWait = 1000;
            natsOptions.ReconnectBufferSize = Options.ReconnectBufferSizeUnbounded;
            natsOptions.Timeout = 4000;
             
            natsOptions.ClosedEventHandler += (object sender, ConnEventArgs arg) =>
            {
                Console.WriteLine($"connection lost on connection {arg.Conn.ConnectedId}" +
                                  $"{Environment.NewLine} {arg.Error.StackTrace}");
            };

            natsOptions.ReconnectedEventHandler += (object sender, ConnEventArgs arg) =>
            {
                stanOptions.NatsConn = arg.Conn;
                Console.WriteLine($"connection Recovered by connection id {arg.Conn.ConnectedId}");
            }; 
            
            
            stanOptions.NatsConn = new ConnectionFactory().CreateConnection(natsOptions);
            stanOptions.ConnectionLostEventHandler += (object sender, StanConnLostHandlerArgs arg) => {
                Console.WriteLine($"connection lost on connection {arg.Connection.NATSConnection.ConnectedId}" +
                                  $"{Environment.NewLine} {arg.ConnectionException.StackTrace}"); 
            };

            _stanConnection = new StanConnectionFactory().CreateConnection(stanClusterId, stanClientId, stanOptions); 
            services.AddSingleton<IStanConnection>( _stanConnection);

            return services;
        }
    }

and I use this connection like this :

static Task Main(string[] args)
        {
            var host = CreateHostBuilder(args).Build();

            var scope = host.Services.CreateScope();
            

            var watch = new Stopwatch();
            watch.Start();
            while (true)
            {
                var stanConnection = scope.ServiceProvider.GetRequiredService<IStanConnection>();
                try
                {
                    stanConnection.Publish("test", new byte[0x1]);
                    Console.WriteLine("{0}. Published message", watch.Elapsed);
                }
                catch (Exception e)
                {
                    Console.WriteLine("{0} - Type:{1}. On Publish: exception message: {2}", watch.Elapsed, e.GetType(), e.Message);
                }
                finally
                {
                    Thread.Sleep(1000);
                }
            }
            Console.WriteLine("Hello World!");

            return host.StartAsync();
        }


        static IHostBuilder CreateHostBuilder(string[] args) =>
            Host.CreateDefaultBuilder(args)
                .ConfigureServices((_, services) =>
                    services.AddStanClient(new StanSettings
                    {
                        ClientId = "uniq123",
                        ClusterId = "test-cluster",
                        NatsServers = new List<string>{"nats://localhost:4222"}
                    })); 

    }
    ```
    and this is the output of my code when I stop the Stan service and after a while, I start it again :
    
00:00:13.7476360. Published message
00:00:14.7542784. Published message
00:00:15.7602945. Published message
00:00:16.7658160. Published message
00:00:17.7862428. Published message
00:00:18.8041595. Published message
00:00:20.6005542 - Type:STAN.Client.StanConnectionException. On Publish: exception message: The NATS connection is reconnecting
00:00:21.6487055 - Type:STAN.Client.StanConnectionException. On Publish: exception message: The NATS connection is reconnecting
00:00:22.7012163 - Type:STAN.Client.StanConnectionException. On Publish: exception message: The NATS connection is reconnecting
00:00:23.7444252 - Type:STAN.Client.StanConnectionException. On Publish: exception message: The NATS connection is reconnecting
00:00:24.7905985 - Type:STAN.Client.StanConnectionException. On Publish: exception message: The NATS connection is reconnecting
00:00:25.8348325 - Type:STAN.Client.StanConnectionException. On Publish: exception message: The NATS connection is reconnecting
00:00:26.8816522 - Type:STAN.Client.StanConnectionException. On Publish: exception message: The NATS connection is reconnecting
00:00:27.9295129 - Type:STAN.Client.StanConnectionException. On Publish: exception message: The NATS connection is reconnecting
00:00:28.9737369 - Type:STAN.Client.StanConnectionException. On Publish: exception message: The NATS connection is reconnecting
00:00:30.0154864 - Type:STAN.Client.StanConnectionException. On Publish: exception message: The NATS connection is reconnecting
00:00:31.0607351 - Type:STAN.Client.StanConnectionException. On Publish: exception message: The NATS connection is reconnecting
00:00:32.1057689 - Type:STAN.Client.StanConnectionException. On Publish: exception message: The NATS connection is reconnecting
00:00:33.1495434 - Type:STAN.Client.StanConnectionException. On Publish: exception message: The NATS connection is reconnecting
00:00:34.1919119 - Type:STAN.Client.StanConnectionException. On Publish: exception message: The NATS connection is reconnecting
00:00:35.2404412 - Type:STAN.Client.StanConnectionClosedException. On Publish: exception message: Connection closed.
00:00:36.3016650 - Type:STAN.Client.StanConnectionClosedException. On Publish: exception message: Connection closed.
00:00:37.3464719 - Type:STAN.Client.StanConnectionClosedException. On Publish: exception message: Connection closed.
00:00:38.3887135 - Type:STAN.Client.StanConnectionClosedException. On Publish: exception message: Connection closed.
00:00:39.4637382 - Type:STAN.Client.StanConnectionClosedException. On Publish: exception message: Connection closed.
00:00:40.5152720 - Type:STAN.Client.StanConnectionClosedException. On Publish: exception message: Connection closed.
00:00:41.5576260 - Type:STAN.Client.StanConnectionClosedException. On Publish: exception message: Connection closed.
connection Recovered by connection id NDIVMUUHKJN644WRMAFRWW7GB5TD3ZPYVE2CIJWFMCLJDXCWIBDURO6L
00:00:42.6050295 - Type:STAN.Client.StanConnectionClosedException. On Publish: exception message: Connection closed.
00:00:43.6454686 - Type:STAN.Client.StanConnectionClosedException. On Publish: exception message: Connection closed.
00:00:44.6864267 - Type:STAN.Client.StanConnectionClosedException. On Publish: exception message: Connection closed.
00:00:45.7310466 - Type:STAN.Client.StanConnectionClosedException. On Publish: exception message: Connection closed.

Many thanks for your time spent on this issue! Is kind of odd and I can never imagine a practical situation in which I would not like that a STAN client to recover when ping failed but the enclosed NATS connection is alive. I was led by the impression that STAN connection should be long lived (and act as a singleton) similar to NATS connection object. In my prod app, I was registering the STAN connection as a singleton with the DI container. So, to conclude: you are telling me that:

Either:

  1. disable (set very high timeout/no of pings) ping on STAN connection, but leave it on for NATS connection
  2. or, recreate STAN connection every time when the underlying connection to NATS is closed.

Could you reconsider the architectural decision of not recovering STAN connection in case that NATS streaming server is running in embedded mode and STAN client using a not self-managed NATS connection and make somehow transparent for the STAN client how the NATS server is ran?

I will try to reproduce this test using the java streaming server client and see if the behaviour is similar.