nats.rs: all async operations block forever at 500 concurrent subscriptions

Under the hood, asynchronous subscriptions just wrap synchronous calls with blocking::unblock, which moves the work onto a thread pool of at most 500 threads.

This creates an obvious bottleneck: if you create 500 subscriptions, every other asynchronous operation you attempt will just block forever (or until those subscriptions receive data or are closed).
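
To see the cap in isolation, here is a minimal sketch (assuming the blocking and futures crates as direct dependencies; this is not the nats.rs internals): 500 closures that never return occupy the entire pool, so a 501st unblock call never gets a thread and its future never resolves.

```rust
// Cargo dependencies assumed for this sketch: blocking = "1", futures = "0.3".
use std::time::Duration;

fn main() {
    futures::executor::block_on(async {
        // Simulate 500 "subscriptions", each parking a pool thread the way a
        // blocking next() on a sync subscription would.
        let _stuck: Vec<_> = (0..500)
            .map(|_| blocking::unblock(|| std::thread::sleep(Duration::from_secs(3600))))
            .collect();

        // This closure is queued behind the 500 blocked threads and never
        // starts, so the await below never completes.
        blocking::unblock(|| println!("this line is never reached")).await;
    });
}
```

As far as I can tell, the pool size can be raised through the BLOCKING_MAX_THREADS environment variable, but the default cap is what an unsuspecting user will hit.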

Perhaps the blocking crate isn’t the best fit here. And perhaps there should be a bigger warning about everything breaking if you hit this limit.

About this issue

  • State: closed
  • Created 3 years ago
  • Comments: 19 (14 by maintainers)

Most upvoted comments

Like @ccbrown, I also wonder about the suboptimal way in which async has been implemented for nats. But I’m even more surprised that this implementation is considered OK and good enough until users come up with complex benchmarks to prove otherwise.

According to the README:

Rust may be the most interesting new language the NATS ecosystem has seen. We believe this client will have a large impact on NATS, distributed systems, and embedded and IoT environments. With Rust we wanted to be as idiomatic as we could be and lean into the strengths of the language. We moved many things that would have been runtime checks and errors to the compiler, most notably options on connections, and having subscriptions generate multiple styles of iterators, since iterators are a first class citizen in Rust. We also wanted to be aligned with the NATS philosophy of simple, secure, and fast!

At best, I find the current implementation of async not to be aligned with those goals. At worst, the general description of the goals may be misleading for some users.

Rust async is far more than just “benchmarks”. This library claims to be Rust-idiomatic and yet there are a few areas where I disagree.

From what I could gather by reading the async module:

  • There is a hard-coded limit of 500 concurrent tasks by using the blocking crate
  • Management of subscriptions does not follow the principle of least astonishment for Rust users. They would normally expect that when a subscription gets dropped there is no risk of memory leaks. It seems that the manual management of threads leads to exactly the kind of gotchas that users seek to avoid by switching to Rust. I find the lack of consideration for https://github.com/nats-io/nats.rs/pull/209 shocking
  • This library may not benefit from the new io_uring kernel async APIs for network requests that are coming soon to libraries like tokio
  • There is a risk of deadlocks. The Nats protocol is implemented as sync (i.e. no async TCP sockets) and there is the possibility that the internals of pub/sub don’t yield when something has gone wrong. This could literally bring the whole process to a halt. This risk is quite considerable and the Tokio team has issued warnings against “asyncified” approaches like the one in this library:

The spawn_blocking function can be used to “asyncify” these sorts of tasks by running them on a thread pool where blocking is allowed. Tokio does not, and will not attempt to detect blocking tasks and automatically compensate by adding threads to the scheduler. This question has come up a number of times in the past, so allow me to elaborate.

The quote comes from https://tokio.rs/blog/2020-04-preemption and it’s a very good read
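
To make that warning concrete, here is a hedged sketch in plain tokio (none of this is nats.rs code): a blocking std::thread::sleep executed directly on a single-threaded runtime starves an unrelated task for a full second, while the same work moved through spawn_blocking runs on tokio’s dedicated blocking pool and stays out of the way.

```rust
// Assumes tokio = { version = "1", features = ["full"] }.
use std::time::Duration;

#[tokio::main(flavor = "current_thread")] // a single worker thread makes the problem obvious
async fn main() {
    // BAD: std::thread::sleep blocks the only async worker thread, so the
    // "tick" task below cannot make progress while it sleeps.
    let blocking_task = tokio::spawn(async {
        std::thread::sleep(Duration::from_secs(1));
    });

    // BETTER: spawn_blocking moves the blocking work to a separate
    // blocking-thread pool, leaving the async worker free.
    let offloaded_task = tokio::task::spawn_blocking(|| {
        std::thread::sleep(Duration::from_secs(1));
    });

    let heartbeat = tokio::spawn(async {
        for _ in 0..10 {
            tokio::time::sleep(Duration::from_millis(100)).await;
            println!("tick"); // stalls while the BAD task holds the worker thread
        }
    });

    let _ = tokio::join!(blocking_task, offloaded_task, heartbeat);
}
```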

I can understand having limited development resources and I’m well aware that sync performs better than async. I certainly never questioned either of those things.

What I am questioning is:

Why is there zero mention of the crate’s crippling async limitations in the README?

The Rust ecosystem has a diverse set of options for async programming. This client library can be used with any async runtime out of the box, such as async-std and tokio.

The async interface provided by this library is implemented as just a thin wrapper around its sync interface. Those two interface styles look very similar, and you’re free to choose whichever works best for your application.

If the limitations are known, but not something you can address and not something you’re open to PRs for, the least you could do for the community is warn them:

The async interface is known to have serious limitations that may cause leaks or deadlocks, and should never be used in production environments.

Maybe also bold it or add some ⚠️’s because that’s the first thing that anyone thinking about using the crate should know.

Is the team open to community discussions or contributions for issues like these?

You have said a lot here about best practices, the crate, and its users. What you have not done is expressed an openness towards improving.

You immediately closed https://github.com/nats-io/nats.rs/pull/209 with a lesson on using threads. In a collaborative environment, a peer would have voiced their concerns in a comment (without closing the PR) or used GitHub’s “request changes” function. It is absolutely possible to solve that problem in a way that addresses your concern, and I am doing so now in my own system. But now it’s every man or woman for themselves.

Now, this issue isn’t about https://github.com/nats-io/nats.rs/pull/209 at all. That’s a separate issue. But my impression based on your comments so far is that you’re too protective of this crate to risk letting others muck with it.

And so my question is, what’s the contribution policy? As much as I’d like to give back to the community, I’d also like to avoid wasting your and my time.

I have a question, should we consider doing a specific tokio based offering for the async?

I started with a short answer and then my notes got longer, so I guess my answer would be a dreadful “it depends”. Here’s what I have:

About async

Rust’s async is quite special when compared with other languages. One of the reasons may be that Rust’s core/standard library is the opposite of Golang’s, in the sense that it’s kept very lean and small. It includes only the most basic functionality and leaves the implementation of the rest to the community. This philosophy is very much up for debate, but it’d be fair to say that it’s shaped by a unix/systems programming mindset.

The case of async is just another example of this philosophy. The Future trait lives in the standard library, but it’s up to the community to implement and maintain runtimes for it.

There are three major implementations of async runtimes:

  • Tokio (more than 2 million monthly downloads)
  • Async-std (around 300K monthly downloads)
  • Smol (around 42K monthly downloads). The nats library currently uses a crate maintained by the smol team.

In theory, users can pick and choose which one to use. In practice, it’s a bit more complicated than that. If you’re building an application you would normally assess what libraries you need and which runtime they support. Because Tokio is so widely adopted, it is usually what every new project starts with.

For crates like nats, it’s a bit more complicated, because by picking one runtime you could alienate users of another. This is where having a look at the options may help.

Async for libraries

From what I can gather, the async parts in Nats will mostly deal with network calls and the handling of TCP connections.

In order to make the library work asynchronously, there are different approaches:

1. Asyncify

One could “asyncify” an existing codebase; that is, a crate which only operates in blocking mode could be wrapped with code provided by the async runtimes so that concurrency becomes possible. All runtimes provide their own methods for this. For example, tokio has spawn_blocking and block_in_place.
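
As a rough illustration of what “asyncify” means in practice, here is a hedged sketch around a hypothetical SyncClient type (it is not the nats.rs API), wrapped once with spawn_blocking and once with block_in_place:

```rust
// Assumes tokio = { version = "1", features = ["full"] }.
// SyncClient and fetch are hypothetical stand-ins for any blocking codebase.
use tokio::task;

struct SyncClient;

impl SyncClient {
    fn fetch(&self, key: &str) -> String {
        // Imagine a blocking socket read here.
        format!("value for {key}")
    }
}

// Option A: run the blocking call on tokio's blocking thread pool.
async fn fetch_via_spawn_blocking(key: String) -> String {
    task::spawn_blocking(move || SyncClient.fetch(&key))
        .await
        .expect("blocking task panicked")
}

// Option B: block in place on the current worker thread (multi-threaded
// runtime only); the scheduler moves other tasks to different workers.
async fn fetch_via_block_in_place(key: &str) -> String {
    task::block_in_place(|| SyncClient.fetch(key))
}
```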

The biggest benefit of this approach is that it’s extremely quick and easy to implement.

This is the approach the NATS crate currently takes. In this case, there is a second big benefit. By using the blocking crate from smol, this library is already runtime agnostic. Unlike tokio’s “asyncify” methods, the blocking crate provides a method that works with any runtime. I’m using it with tokio in my project just fine at the moment.

However, this approach has one major drawback which led me to write my last comment. It works, but it isn’t great. It’s not the best, it’s not terrible. It just works. I won’t repeat here what I’ve said already, just see my comment above.

Some Rust developers may even consider “asyncified” code bases a bad smell for technical debt. For example, the great diesel crate is extremely popular but the request to support async is long and painful. This issue provides an excruciating read about how complex this can get. The community has created crates like tokio-diesel, and they have the same issues already discussed in this thread: concurrency with caps and limits, risks of deadlocks, good performance on small servers but unable to scale, etc.

This approach may be good for community projects which lack resources to do otherwise. They do work after all. But I personally find it’s a big contradiction for a crate like nats, which describes itself as highly performant and efficient.

2. “Native” async

In contrast to diesel, the authors of sqlx took an entirely different approach: async from the bottom up. This allows them to claim that sqlx is a Rust crate which allows concurrency without limits or gotchas. That’s what you get when async is fully supported. This doesn’t come for free, though: sqlx has to support both tokio and async-std if it wants to please everybody.

The redis crate is very similar in that they have the sync version, and then users can opt in to async methods with either tokio or async-std.

Redis is also a great learning case, since the official tokio docs are essentially a learn-by-example tutorial that guides the user towards building an async Redis client in Rust.

What then

At this point it would be up to the NATS team to decide what the priorities would be in terms of async support:

  1. The first option is to just carry on with the current approach. If this is the case, as a user, I would prefer that the README and project scope be revisited to warn users about the gotchas: concurrency caps, partial support of the Drop trait where the user has to manage subscriptions manually, etc. This would be more honest and would set the right expectations for new users. Otherwise I’d be surprised if a new GitHub issue about async isn’t created every month by a user who gets caught out by this.

  2. Rewrite the async module taking something like the Redis crate as inspiration. From what I can tell, the Redis crate is extremely similar to Nats: no serialization, no big CPU operations, and a lot of connection management and network IO operations. This would be the ideal scenario from the user’s perspective. It’s also the biggest amount of work.

  3. If resources are limited, do like (2) but focus only on one runtime like Tokio. I think this could be a good compromise for now. With the right abstractions, support for other runtimes could be planned in the long term or left for community contributions.

Ideally nats would be totally runtime agnostic, but it’s unfortunately still very difficult to write good runtime-agnostic libraries (see https://github.com/rust-lang/wg-async-foundations/issues/45). It’s especially difficult when timers and networking are involved and you want good performance.

I believe currently the best approach for high performance libraries is to create an abstraction over a few of the most popular runtimes and let the user choose one via feature flags.

See, for example, the Redis crate, which I believe has pretty similar needs. It defines a RedisRuntime trait and provides implementations of that trait for Tokio and async-std:

https://github.com/mitsuhiko/redis-rs/blob/ca5019dbe76cc56c93eaecb5721de8fcf74d1641/src/aio.rs#L55
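
For illustration, that kind of abstraction could look roughly like the sketch below. The AsyncRuntime trait, the type names and the feature names are all hypothetical (this is not code from redis-rs or nats.rs); the point is simply that the client only needs a few primitives, such as spawning and timers, from whichever runtime the user enables.

```rust
// Hypothetical sketch; trait, types and feature names are made up for illustration.
use std::future::Future;
use std::pin::Pin;
use std::time::Duration;

// The client codes against this trait; each supported runtime implements it
// behind its own Cargo feature flag.
trait AsyncRuntime: Send + Sync + 'static {
    type Sleep: Future<Output = ()> + Send;

    fn spawn<F>(fut: F)
    where
        F: Future<Output = ()> + Send + 'static;

    fn sleep(dur: Duration) -> Self::Sleep;
}

#[cfg(feature = "tokio-runtime")]
struct TokioRuntime;

#[cfg(feature = "tokio-runtime")]
impl AsyncRuntime for TokioRuntime {
    type Sleep = tokio::time::Sleep;

    fn spawn<F>(fut: F)
    where
        F: Future<Output = ()> + Send + 'static,
    {
        tokio::spawn(fut);
    }

    fn sleep(dur: Duration) -> Self::Sleep {
        tokio::time::sleep(dur)
    }
}

#[cfg(feature = "async-std-runtime")]
struct AsyncStdRuntime;

#[cfg(feature = "async-std-runtime")]
impl AsyncRuntime for AsyncStdRuntime {
    // async-std's sleep future is an opaque type, so box it.
    type Sleep = Pin<Box<dyn Future<Output = ()> + Send>>;

    fn spawn<F>(fut: F)
    where
        F: Future<Output = ()> + Send + 'static,
    {
        async_std::task::spawn(fut);
    }

    fn sleep(dur: Duration) -> Self::Sleep {
        Box::pin(async_std::task::sleep(dur))
    }
}
```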

Tokio is by far the most prolific async runtime, so a Tokio implementation would definitely be the place to start.

@Owez Let us know if you have any issues with the new async-based client…

0.15.2 addresses https://github.com/nats-io/nats.rs/issues/213, but not this. There is still a hidden limit of 500 outstanding async operations, at which point any further calls to async functions will block forever. I believe this should be re-opened.

FWIW, I would support a rewrite/refactorization to have an async-native version of the nats crate. In my use case I don’t think I’ll hit the limit, but we deal with very sensitive information (I am working in the HIS of the hospital I work for), so we need the maximum reliability and performance.

In order to reduce the maintenance burden, there’s also the possibility of building the sync crate as a facade over the async one (i.e., the opposite of what you’ve done). There are some crates that follow that strategy.
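
As a rough sketch of that strategy (all type names are hypothetical, and a tokio runtime is assumed purely for illustration), the blocking client would own a small runtime and call block_on around the async client’s methods:

```rust
// Assumes tokio = { version = "1", features = ["full"] }; none of these types exist in nats.rs.
struct AsyncClient;

impl AsyncClient {
    async fn publish(&self, subject: &str, payload: &[u8]) -> std::io::Result<()> {
        // A real implementation would write to an async TCP connection here.
        let _ = (subject, payload);
        Ok(())
    }
}

struct BlockingClient {
    inner: AsyncClient,
    runtime: tokio::runtime::Runtime,
}

impl BlockingClient {
    fn new() -> std::io::Result<Self> {
        Ok(Self {
            inner: AsyncClient,
            // A small current-thread runtime is enough to drive the futures.
            runtime: tokio::runtime::Builder::new_current_thread()
                .enable_all()
                .build()?,
        })
    }

    fn publish(&self, subject: &str, payload: &[u8]) -> std::io::Result<()> {
        // The sync method is just a thin blocking shim over the async one.
        self.runtime.block_on(self.inner.publish(subject, payload))
    }
}
```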