openssl: RW locks are bad for performance

OpenSSL uses read-write locks (e.g., pthread_rwlock_t on POSIX systems). Often these locks are used to protect data structures that should not change often, like providers lists.

Read-write locks are not a good thread synchronization mechanism. Readers block whenever a writer holds the lock, and good rw lock implementations will not allow writer starvation, which means that whenever a thread is waiting to write, threads that want to read must block behind it.

For example, the code highlighted in the stack trace on #20698 takes a read-write lock for writing just in case it has to remove providers from the providers list, but this serializes all readers.
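To sketch the usual fix (illustrative only; the helper functions here are hypothetical stand-ins, not the actual code around #20698): take the read lock for the common path, and re-acquire for writing only when a removal is actually needed, re-checking the condition under the write lock.

    /* Avoiding "write lock just in case" with pthread rwlocks.
     * needs_removal()/do_lookup()/do_removal() are hypothetical
     * stand-ins for the real provider-list operations. */
    #include <pthread.h>
    #include <stdbool.h>

    static pthread_rwlock_t lock = PTHREAD_RWLOCK_INITIALIZER;

    extern bool needs_removal(void);
    extern void do_lookup(void);
    extern void do_removal(void);

    void lookup_maybe_remove(void)
    {
        pthread_rwlock_rdlock(&lock);   /* common path: readers run in parallel */
        bool removal = needs_removal();
        do_lookup();
        pthread_rwlock_unlock(&lock);

        if (removal) {                  /* rare path: re-acquire for writing */
            pthread_rwlock_wrlock(&lock);
            if (needs_removal())        /* re-check: state may have changed */
                do_removal();
            pthread_rwlock_unlock(&lock);
        }
    }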

For global data such as configuration (e.g., trust anchors), providers lists, engines, a much better technique is read-copy-update (RCU), as is used in the Linux kernel. Then readers need never block due to writers. Writers are still serialized, but mostly not blocked by readers.
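A minimal sketch of the idea, with hypothetical names and the grace-period machinery elided:

    /* RCU-style read/publish with C11 atomics; illustrative only. A real
     * implementation must wait out a grace period before freeing the old
     * snapshot, and g_providers must be initialized with an empty
     * snapshot at startup (both elided here). */
    #include <stdatomic.h>
    #include <stdlib.h>

    struct provider_list {
        size_t count;
        const char *names[8];
    };

    static _Atomic(struct provider_list *) g_providers;

    /* Reader: one atomic load; never blocks, never writes shared state. */
    const struct provider_list *providers_snapshot(void)
    {
        return atomic_load_explicit(&g_providers, memory_order_acquire);
    }

    /* Writer: copy, modify the private copy, publish. Writers must be
     * serialized among themselves (e.g. by a mutex, not shown). */
    void providers_add(const char *name)
    {
        struct provider_list *old = atomic_load(&g_providers);
        struct provider_list *new = malloc(sizeof(*new));

        *new = *old;                        /* copy the current snapshot */
        new->names[new->count++] = name;    /* update the copy */
        atomic_store_explicit(&g_providers, new, memory_order_release);
        /* wait for the grace period, then free(old) -- elided */
    }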

Until recently there were not a lot of RCU implementations for use in user-land, and I’m not sure if there is a portable one that is license-compatible with OpenSSL. I’ve been using a C library for years that I maintain and that has similar performance characteristics to user-land RCU, a compatible license, and a very simple API. RCU proper can be faster and better, but anything better than read-write locks is going to be a win for OpenSSL.

With any RCU-like scheme it is critical that values be immutable once published. Global configuration information, providers lists, etc. should all be amenable to being treated as immutable once produced. Immutable values can be and must be destroyed when no references to them remain, naturally, but they must not be modified between publication and the disappearance of the last reference.

My “thread-safe variable” API is just init/destroy and get/release/set. A get returns the current value of the “variable”, and that value remains safe to use in the same thread until that thread’s next call to get/release/set (or until the thread exits); no value is destructed until all references to it are gone, no reader blocks any writer, and no writer blocks any reader (writers are serialized among themselves). A get atomically reads the variable and acquires a reference on its current value, releasing the reference on the value previously obtained in the same thread, if there was one. There are similar mechanisms in Clojure, in Haskell, etc., but the one I’m referring to is a C library with a permissive license.
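Usage might look roughly like this; the function names match those discussed later in this thread, but the prototypes are guesses for illustration, not the library’s documented signatures:

    /* Hypothetical usage of the thread-safe variable API; the prototypes
     * are assumptions made for this sketch. */
    typedef struct thread_safe_var *thread_safe_var;

    extern int  thread_safe_var_init(thread_safe_var *vp, void (*dtor)(void *));
    extern int  thread_safe_var_get(thread_safe_var v, void **valuep);
    extern int  thread_safe_var_set(thread_safe_var v, void *value);
    extern void thread_safe_var_release(thread_safe_var v);

    extern void use_config(void *cfg);      /* stand-in for real work */

    void reader(thread_safe_var config_var)
    {
        void *cfg;

        if (thread_safe_var_get(config_var, &cfg) == 0) {
            /* cfg is safe to use in this thread until the next
             * get/release/set in this thread (or thread exit). */
            use_config(cfg);
        }
    }

    void writer(thread_safe_var config_var, void *new_cfg)
    {
        /* Publishes new_cfg; the old value is destroyed only once all
         * readers' references are gone. Writers are serialized. */
        thread_safe_var_set(config_var, new_cfg);
    }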

About this issue

  • State: closed
  • Created a year ago
  • Comments: 152 (85 by maintainers)

Most upvoted comments

@nicowilliams et al

I have a subsequent test branch for anyone interested: https://github.com/nhorman/openssl/tree/lhash-rcu

In this branch I’ve started testing practical application of rcu in openssl. It adds support for rcu deferred callbacks (to be called after the next synchronize_rcu call, so it’s easier to free data after the grace period has expired).

It also fixes several bugs I’ve found when starting to use rcu in various places. Specifically, this branch creates a per-hash-node refcounted variant of DEFINE_LHASH_OF_EX (called DEFINE_REFCNT_LHASH_OF_EX), which implements this refcounting, as well as internal rcu locking. With this, I’ve converted the ossl_namemap api to use this variant of the hash table, and in so doing been able to remove the surrounding read/write lock in the api (i.e. our critical sections are smaller, and the read side never blocks).

I don’t have performance data on it yet of course, as it’s still failing 3 unit tests, and I’d like to convert the other hash tables to this variant before I start measuring anything.

But if anyone wants to play around with it and share any notes, I’d appreciate it.

> I don’t think we insert it into the provider list until after the self-test runs. The self-test runs very early and has to assume nothing is set up yet.

In that case, it’s probably fine w.r.t. @nicowilliams’s concerns. The unloading complication here is just that you need to synchronize with another thread which might be in the middle of using the provider.

Though, considering dlclose is traditionally implemented as a #define dlclose please_make_my_process_deadlock_and_corrupt_random_global_state, you may wish to avoid it for other reasons. 😉

The provider implementation itself also seems to be a tower of mutable caches and lazy initialization and whatnot. Synchronized, lazy initialization is a bit of a pattern in OpenSSL, and one that has caused performance problems before in our experience. See #5158.

I suspect breaking away from that pattern would have helped the provider system. Rather than lazily load things from the providers, just construct all the EVP_MDs and EVP_CIPHERs at provider load time, and build some index that’s efficient to query. Then keep it immutable. At the end of the day, this is just cryptography. We play on easy and just transform byte strings to byte strings in a finite handful of well-defined ways. One shouldn’t need complex caches and synchronization just to look up what “SHA-256” is.
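For instance, a minimal sketch of such an index (not the EVP internals): build a sorted, read-only table once at provider load and answer lookups with bsearch, with no synchronization needed after publication.

    /* Immutable name index built once at provider load. Hypothetical
     * types; the impl pointers would be the real EVP_MD/EVP_CIPHERs. */
    #include <stdlib.h>
    #include <string.h>

    struct algo {
        const char *name;
        const void *impl;
    };

    /* Built once, then treated as read-only. Must stay sorted by name. */
    static const struct algo algo_index[] = {
        { "MD5",     NULL },
        { "SHA-256", NULL },
        { "SHA-512", NULL },
    };

    static int cmp(const void *key, const void *elem)
    {
        return strcmp((const char *)key, ((const struct algo *)elem)->name);
    }

    const struct algo *algo_find(const char *name)
    {
        return bsearch(name, algo_index,
                       sizeof(algo_index) / sizeof(algo_index[0]),
                       sizeof(algo_index[0]), cmp);
    }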

I tried to see whether you all had closed the door on this with the 3.0 public API, and I think fixing this may still be viable. Though I’m not positive as this plugin system was far too complex for me to follow fully in one sitting. There’s some underdocumented business with no_store / no_cache outputs to query_operation (with very unclear lifetimes!) that may pose a challenge. But broadly that seems a much better direction.

Of course, doing the initialization eagerly does mean contending with any inefficiencies in your provider conventions. I noticed several inefficient patterns (unstructured OSSL_DISPATCH tables, needing extra virtual calls to look up key metadata), which may need some work. But I think someone would need to try it to find out if they’re actually an issue in practice.

Just keep in mind that a truly accurate “just construct it and make it immutable” comparison will require rewriting most of the core provider bits. The current implementation is very, very mutation-heavy.

If anyone just wants to see the graphs directly:

[three performance graphs]

It’s been mentioned before, but a call that effectively freezes the state of a libctx & its store could avoid lots of locking.

The 3.x provider design was too flexible in some areas… (over-engineered is an apt description in some ways)

More updates. It’s still a bit of a mess, but I managed to convert the OSSL_PROVIDER list to a hash table, which enabled me to remove some of the refcounting and locking from that API (though not entirely). As a result, new performance data from the providerdoall perf test, which just iteratively runs OSSL_PROVIDER_do_all, counting the number of providers found:

[graph: providerdoall_performance]

It’s still not very parallelized, but its linear increase is smaller, and it offers a consistent performance improvement (between 5% at low thread counts up to ~40% at higher thread counts).

There seems to be a lot of behavioral minutiae in the provider code at the moment (lots of locking/unlocking to avoid recursive deadlocks when we init child providers, for instance). I think this goes above and beyond just doing a lock replacement. This is going to be a significant work effort, I think.

Ok, some more statistics in response to @paulidale’s request for more real-world performance measurement. It’s not, strictly speaking, “real world” per se, but this should give us a better view of how openssl performs using higher-level calls commonly made in real-world applications.

Change summary: Based on the rcu implementation here: https://github.com/nhorman/openssl/tree/rcu-implementation

I’ve rewritten our hash table implementation, and the internal hash table users to do 3 major things:

  1. It (the hash table) is now internally locked, using an rcu lock
  2. It (the hash table) is also now internally refcounted. This is necessary, as any read/write operation drops the lock prior to returning. This necessitates the addition of a lh_<type>_obj put method for the hash table to release the held refcount on any data objects returned from an insert/delete/lookup operation (see the sketch after this list)
  3. The users of the hash table have had their surrounding rwlocks (which were previously protecting parallel access to their corresponding hash tables) removed, as they are no longer needed
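A rough sketch of the pattern in point 2, with hypothetical names (not the actual DEFINE_REFCNT_LHASH_OF_EX expansion):

    /* Objects returned from the internally-locked table carry a
     * reference that the caller must drop with a put. */
    #include <stdatomic.h>
    #include <stdlib.h>

    struct obj {
        atomic_int refs;    /* starts at 1 when inserted */
        int payload;
    };

    /* Called by the table, under its rcu read lock, before returning. */
    static struct obj *obj_get(struct obj *o)
    {
        atomic_fetch_add_explicit(&o->refs, 1, memory_order_relaxed);
        return o;
    }

    /* Called by the user when done with a returned object. With rcu the
     * actual free would be deferred past the grace period via the
     * deferred-callback support mentioned above. */
    static void obj_put(struct obj *o)
    {
        if (atomic_fetch_sub_explicit(&o->refs, 1, memory_order_acq_rel) == 1)
            free(o);
    }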

The whole thing is available here for any interested parties: https://github.com/nhorman/openssl/tree/lhash-rcu-implementation

Test methodology: Our tools repository here: https://github.com/openssl/tools Contains several performance measuring utilties:

  • handshake - Iteratively establishes an SSL connection with a peer through a bio
  • newrawkey - Iteratively calls EVP_PKEY_new_raw_public_key_ex on an in-memory public key string
  • pemread - Iteratively calls PEM_read_bio_PrivateKey from an in-memory BIO PEM string
  • providerdoall - Iteratively traverses the provider list, counting entries
  • randbytes - Iteratively calls RAND_bytes
  • rsasign - Iteratively calls EVP_PKEY_sign on an in-memory key and message buffer
  • sslnew - Iteratively calls SSL_new to create a new SSL session from an SSL_CTX

Each utility measures the amount of time taken, in usecs, to complete a set of these commands, and is configurable to run an arbitrary number of threads conducting these operations in parallel.

For each command, I’ve run it on differing numbers of threads, multiple times (100), and graphed their reported average timings for both openssl with the above lhash changes and without. Results and comments below.

Equipment: I’m running all of these tests on a 20-core (40 hyperthread) Intel Xeon system with 64GB of RAM

Results:

Handshake test: [graph: handshake_performance]

SSL handshake time seems to increase almost linearly with the number of threads in use, which likely bears further investigation, but this occurs in both the rwlock and rcu cases, with rcu providing approximately a 20% reduction in execution time for a set of SSL_connect/SSL_accept calls

newrawkey test: [graph: newrawkey_performance]

As with the handshake test, rcu seems to provide approximately a 20% reduction in execution time per call to EVP_PKEY_new_raw_public_key_ex. The decrease in average time as thread count increases in both the rwlock and rcu cases suggests to me that this code path has reasonably good parallelism and doesn’t encounter significant additional lock contention

Pemread test: [graph: pemread_performance]

This one was interesting. At higher thread counts, the rcu implementation seems to do significantly better, but at around 15-20 threads there was a spike in both test cases, in which execution time increased, with rcu increasing significantly more. I have no explanation for this yet, though the fact that both test cases encountered it leads me to hypothesize that there is significant lock contention (or some other contention) external to the changes made. More investigation needed here

ProviderDoAll test: [graph: providerdoall_performance]

Performance seems to be at about parity here, with rcu doing just slightly better at high thread counts. This isn’t surprising given that, in both cases during the PROVIDER_do_all call, the rcu or rwlock is held only once per thread, so no contention is likely

Randbytes test: [graph: randbytes_performance]

As with the provider test, performance is identical here, as the RAND API makes no use of internal hash tables. This is effectively just here for completeness

Rsasign test: [graph: rsasign_performance]

Performance is at parity here, with execution time increasing in both cases with higher thread counts, suggesting some level of contention external to the hash table changes. Will investigate further

Sslnew test: [graph: sslnew_performance]

Again, performance is at parity here, not surprising given that hash tables aren’t frequently consulted in this path. Decreasing execution time as thread count increases suggests that parallelism is good in this path and doesn’t suffer from lock contention

I think, given the improvements in the tests that make heavier use of hash tables, I’m going to get back to the implementation of the Windows api

@nicowilliams to your previous question (which I had missed): why does read-side throughput continue to decrease as writers increase? I don’t have a definitive answer for that, I’m afraid. Currently my working theory is that I have a 40-way system that I am testing on here, and as we increase the writer count above and beyond the cpu limit of my system, we start to see scheduler effects in which the readers just happen, statistically, to be on the cpu more often. That’s just a wild guess, mind you, but it does seem to be in keeping with the fact that at/near 40 cpus writer starvation/latency becomes a clear issue.

As I’ve noted previously several times, I’d like to see some real-world performance figures so we can properly target these changes. Churning multi-threaded code unnecessarily is a recipe for CVEs. How many handshakes per second do we achieve under locking versus what we might possibly achieve with RCU? We simply do not currently know, and without that information we cannot make an informed decision as to what to make a priority. We don’t even know if bulk crypto is impacted – I’d really like to believe it isn’t but it’s not quantified at this point.

That 3.0 had performance issues isn’t all that surprising. Performance was not on the radar during the very significant refactoring. That it hurt some of our communities has been evident, and we’re working on it amongst a haystack of other priority items. I am confident that 3.1 has improved matters and 3.2 should do better again. There will continue to be further improvements going forward, I’m sure. The core team isn’t large, we can only spread so thinly, and there is so much to get done.

@nicowilliams there are several notes in the changelog about why RUN_ONCE/pthread_once couldn’t be used here. It amounts to the fact that the population of a namemap instance needs a pointer to the libctx and/or namemap structure itself, but pthread_once and all the macros around it don’t accept a data pointer, so we’re left not knowing what to initialize. But if you can think of a way to make pthread_once or something similar work here, I’d love to hear it.

I figured, and I wasn’t objecting. I’ve built a once executor that takes a data argument to pass to the callback. But I like this scheme because it’s easy and doesn’t require a callback.
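For illustration, such a once-with-data executor could look roughly like this (my own sketch, not code from either branch): double-checked locking around a done flag, passing the data pointer through to the init callback.

    #include <pthread.h>
    #include <stdatomic.h>

    typedef struct {
        atomic_int done;
        pthread_mutex_t lock;
    } once_data_t;

    #define ONCE_DATA_INIT { 0, PTHREAD_MUTEX_INITIALIZER }

    /* Runs init(arg) once; a failed attempt may be retried by a later
     * caller. Returns init's result (1 on success). */
    int run_once_with_arg(once_data_t *once, int (*init)(void *), void *arg)
    {
        int ok = 1;

        if (atomic_load_explicit(&once->done, memory_order_acquire))
            return 1;                   /* fast path: already initialized */

        pthread_mutex_lock(&once->lock);
        if (!atomic_load_explicit(&once->done, memory_order_relaxed)) {
            ok = init(arg);             /* e.g. populate namemap from libctx */
            if (ok)
                atomic_store_explicit(&once->done, 1, memory_order_release);
        }
        pthread_mutex_unlock(&once->lock);
        return ok;
    }

    /* Usage: static once_data_t once = ONCE_DATA_INIT;
     *        run_once_with_arg(&once, populate_namemap, libctx); */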

One of the most annoying-to-diagnose bugs I’ve ever dealt with was code that was dlclose()ing a library immediately before calling into a function pointer in that library. (Funnily enough, it involved ENGINE in a much, much older OpenSSL. The function pointer was an ex_data callback registered by that engine.)

It turns out lots of things get confused when you do this. The stack trace got reported based on the old mappings and GDB through gdbserver (at least, the gdbserver that one used on Android devices many, many years ago) would start acting completely erratically.

You also could defer inserting it into the provider list until the self-test has passed. That’s the actual source of contention with the other threads, and I assume you don’t actually want the other threads to see the FIPS provider until after the self-test anyway.

> Just keep in mind that a truly accurate “just construct it and make it immutable” comparison will require rewriting most of the core provider bits. The current implementation is very, very mutation-heavy.

Realistically I think OpenSSL could eagerly and immediately construct a provider list with builtin providers, but other providers will have to be loadable after the application starts making use of the builtin providers. Thus the best approach will have to be to make the providers list either copy-on-write (accessed w/ RCU) or lock-less insert-only.
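The lock-less insert-only option can be quite small. A sketch, assuming providers are never removed (hypothetical types, not OpenSSL’s):

    /* Insert-only lock-free list: readers traverse with acquire loads,
     * writers CAS new nodes onto the head. Nothing is ever unlinked, so
     * there is no reclamation problem. */
    #include <stdatomic.h>
    #include <stdlib.h>
    #include <string.h>

    struct node {
        const char *name;
        struct node *_Atomic next;
    };

    static struct node *_Atomic head;

    void providers_insert(const char *name)
    {
        struct node *n = malloc(sizeof(*n));
        struct node *old = atomic_load(&head);

        n->name = name;
        do {
            atomic_store_explicit(&n->next, old, memory_order_relaxed);
        } while (!atomic_compare_exchange_weak(&head, &old, n));
    }

    struct node *providers_find(const char *name)
    {
        struct node *p = atomic_load_explicit(&head, memory_order_acquire);

        for (; p != NULL; p = atomic_load_explicit(&p->next, memory_order_acquire))
            if (strcmp(p->name, name) == 0)
                return p;
        return NULL;
    }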

A lock-less-for-reading providers list should definitely be possible, and it should even be possible to support provider unloading with a lock-less-for-reading implementation at the cost of having to atomically incref/decref providers. But it would be much simpler and better to just not support provider unloading.

(Some think that at exit() time every resource should be released, including unloading DLLs. I am not among those who think that. It’s much better to never unload DLLs except in specialized applications, and OpenSSL isn’t one of those.)

@mattcaswell no we don’t, I’m looking to see if there are any at the moment.

And as you note, it largely depends on what you’re doing. The test suite, for instance, by and large does single iterations of various operations, and many are single-threaded, so lock contention is not likely to show up there.

I’m looking at some of the open performance issues in 3.0 right now to try to identify a use case to work with.

Having the library maintain a global SSL_CTX would be problematic, for much the same reason, I imagine, that requests doesn’t maintain an internal global Session. There are actual semantics here that have gotten lost in the idea that SSL_CTX sharing is how you solve OpenSSL 3.x’s performance issues.

At the TLS level, different connections may share state for session resumption. But applications need to have opinions about when to do this; session resumption correlates connections and also bypasses some parts of TLS negotiation. For example, in a web browser, doing session resumption across different profiles would be an obvious privacy issue.

At the HTTP(S) level, different requests share all kinds of state, from a cache to cookies to connection pooling and, of course, the TLS session resumption above. Applications need much the same opinions on when this correlation is allowed. Thus, any HTTP(S) library also needs some kind of context-like object.

I assume this is what requests.Session is for. Likewise, this is what SSL_CTX is for. Now, TLS has much less shared state (really just resumption) and OpenSSL’s API has enough callbacks that one can also partition the session cache within an SSL_CTX. But it takes some work and requires the application know about everything that uses an SSL_CTX. It also has other tradeoffs around shared configuration, etc. It’s not the default expectation.

Yes, SSL_CTX is designed to be shared. But it is designed to be shared with particular semantics, which may not universally apply to every pair of SSLs in an application. We cannot retroactively change the semantics to “shared for the sake of sharing”, as a workaround for OpenSSL 3.x’s perf issues. That would not be backwards-compatible.

(Though, as an aside, unlike in TLS, where resumption is not that critical of an optimization, the kinds of optimizations you get within a correlation boundary in HTTP(S) are a big deal. I would suggest then that requests.get is maybe not a good API. But that doesn’t change the semantics for SSL_CTX here.)

The issue is, more or less, that passing an SSL_CTX around is non-obvious in higher-level APIs. Consider, for the sake of argument, something like Python’s requests module, which is used roughly like

with requests.Session() as session:
    response = session.get("https://example.com")
    ...

There are in fact helper functions requests.get() etc. whose implementation is to instantiate a session just for the one request. So it’s a straightforward programming pattern for each thread to create its own context objects for everything involved in making an HTTPS request, including an SSL_CTX. Notably, the HTTP parts of this are not expensive. 😃

Most libraries that use HTTPS internally (e.g. API clients) will often use the requests.get() helpers themselves or at best create a Session for themselves. They don’t usually accept a Session as an argument from the caller, because that would be a more awkward API and it would also turn into an API promise to keep using requests internally as opposed to some other library. They really don’t have a way to accept an SSL_CTX in particular.

Some of these libraries could perhaps make an internal global variable that holds an SSL_CTX, though for requests, some SSL parameters like trusted certificates can vary, so they would have to implement some mechanism for mapping parameters to a context. And then you’d get right back to the original question of how you handle concurrent reads and writes to that internal global variable.

All of this feels like things that really should be below the level of abstraction of an HTTPS (or other TLS-protected protocol) client and certainly below the level of abstraction of a user of HTTPS. It’s kind of like if, say, libm said “you need to pass a math context around or your square roots will be slow.” No APIs are prepared to pass a math context around nor to create their own and handle locking, and it would imply a whole lot of changes for libraries that call libraries that call libraries that do the occasional square root.

It occurs to me, by the way, that if the argument is really “you should mostly need a single SSL_CTX, unless you’re customizing e.g. trust stores or ciphers, and if you customize them the same way you can use the same one,” I would counter-argue that this should be handled by OpenSSL itself! Give me an SSL_CTX_get function that takes some popular parameters about verify locations etc. (uncommon things can be punted to the existing SSL_CTX_new workflow) and returns a pointer to an already-created, read-only context. And figure out some scheme (like the RCU proposed in this thread) for managing the library-wide, process-wide cache of contexts that has good performance.
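Something like the following, say (an entirely hypothetical API and implementation; a mutex keeps the sketch simple, where the real thing would want the RCU-style publication discussed in this thread):

    #include <openssl/ssl.h>
    #include <pthread.h>
    #include <stdio.h>
    #include <string.h>

    #define CACHE_SLOTS 16

    static pthread_mutex_t cache_lock = PTHREAD_MUTEX_INITIALIZER;
    static struct { char ca[256]; SSL_CTX *ctx; } cache[CACHE_SLOTS];

    /* Returns a shared, effectively read-only SSL_CTX for the given CA
     * file, creating it on first use. Returns NULL on failure or if the
     * cache is full. Callers must not reconfigure the returned context. */
    SSL_CTX *SSL_CTX_get_cached(const char *ca_file)
    {
        SSL_CTX *ret = NULL;
        size_t i;

        pthread_mutex_lock(&cache_lock);
        for (i = 0; i < CACHE_SLOTS && cache[i].ctx != NULL; i++) {
            if (strcmp(cache[i].ca, ca_file) == 0) {
                ret = cache[i].ctx;
                goto out;
            }
        }
        if (i < CACHE_SLOTS) {
            SSL_CTX *ctx = SSL_CTX_new(TLS_method());

            if (ctx != NULL
                && SSL_CTX_load_verify_locations(ctx, ca_file, NULL) == 1) {
                snprintf(cache[i].ca, sizeof(cache[i].ca), "%s", ca_file);
                cache[i].ctx = ret = ctx;
            } else {
                SSL_CTX_free(ctx);
            }
        }
    out:
        pthread_mutex_unlock(&cache_lock);
        return ret;
    }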

@nicowilliams I’m glad it’s somewhat understandable. It’s still not 100% correct (there’s a corner case in which we free a qp too early that I’m tracking down)

I haven’t tried building it, but I doubt it compiles, because I can’t find a definition of USER_MASK, and I also think some of the macros may be wrong. But the gist looks right, and in fact it resembles what I do in the “slot-pair” implementation of thread-safe variables (particularly the part where a reader may have to wake a writer), which makes me think it’s likely on the right path. You’ll need to make sure you run the tests with the thread sanitizer and with helgrind, and you’ll need [some build option to?] leave out some atomic operations to demonstrate that tsan/helgrind find the resulting races (I did that by hand in ctp recently, not as build options), so as to be certain that the tests do exercise the potential races and that non-detection is due to the implementation being robust. Ideally we’d use a theorem prover to prove that the design is correct, but that’s a ton of work.

(What would be truly ideal here is if we had a time machine and could make sure that user-land RCU was in POSIX from the beginning. Alas, the existence of UTF-16 is proof that either time machines do not exist or that only evil people have them.)

As for the futex, yeah, it’s not ideal on the unlock side, as it introduces a potential kernel trap, but we could look at replacing that with a busy loop on the write side, I think

I’ve used both the “slot-pair” and the hazard pointer + GC implementations of thread-safe variables in production, and the occasional brief reader-side system call to wake a writer is not that big a deal. I guarantee you it will be better than the current state of the world with RW locks.

The way in which the Linux kernel RCU is implemented is not amenable to user-land, sadly, because we can’t do anything like guarantee non-preemption during the RCU critical region in readers.

As for the direction to go in terms of API, I think that’s exactly the question we need to look at now. My thought with the RCU api was that it mapped a bit more easily to RW lock semantics, but it might make more sense to do the extra work to use the ctp api if the overall improvement is better. If you had cycles, it might be worth finding a use case in openssl to convert to using ctp just to measure the performance improvement. I can do the same with the RCU implementation and we can compare notes

RCU is more general but imposes a higher cognitive load. Thread-safe variables are easier to use but less general.

The clearest use for ctp in OpenSSL would be the providers list. It’s the other uses that are less clear. I’ll set aside a few hours on Friday to look at the state of the 3.3 work to sample RW lock uses and see which are amenable to ctp and which are more easily handled with RCU.

Indeed, being designed to be shared is not the same as sharing always being appropriate.

@nhorman thanks for the link!

IIUC CRYPTO_THREAD_rcu_read_lock() is non-blocking, and CRYPTO_THREAD_rcu_read_unlock() is also non-blocking but may call fpost() (which calls the futex syscall). CRYPTO_THREAD_synchronize_rcu() waits on the futex if there are any readers that started before it.

This RCU implementation is very easy to understand, for which I’m very grateful. Its simplicity helps me understand RCU better. I believe this can be ported by replacing futexes with condition variables where futexes are not available. A reader that has to wake a possibly-waiting writer would have to acquire a lock, but that lock should be uncontended since a) no other readers would want to race to acquire it, b) there could be only one writer and that writer would immediately drop the lock when it waits on the condition variable, so the unlucky reader would never really block. This occasional slow path for readers is almost certainly not a big deal, and anyways much better than the rw lock contention issues.
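A sketch of that port, with hypothetical names (not the code in the branch): the writer waits on a condition variable instead of the futex, and the last reader in the grace period briefly takes the lock to signal it.

    /* Futex-free fallback: the writer waits on a condvar; the unlucky
     * reader that ends the grace period takes w_lock briefly to signal.
     * The increment of readers_in_grace at grace-period start is not
     * shown. */
    #include <pthread.h>

    static pthread_mutex_t w_lock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  w_cond = PTHREAD_COND_INITIALIZER;
    static int readers_in_grace;    /* protected by w_lock */

    void reader_unlock_slowpath(void)
    {
        pthread_mutex_lock(&w_lock);    /* expected uncontended, see above */
        if (--readers_in_grace == 0)
            pthread_cond_signal(&w_cond);
        pthread_mutex_unlock(&w_lock);
    }

    void synchronize_portable(void)
    {
        pthread_mutex_lock(&w_lock);
        while (readers_in_grace > 0)
            pthread_cond_wait(&w_cond, &w_lock);  /* drops w_lock while waiting */
        pthread_mutex_unlock(&w_lock);
    }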

If we need to do RCU-ish tasks beyond merely acquiring references to objects, then a proper RCU API would be desirable, and then you could just ignore ctp. So please don’t abandon your work yet – we should first understand if we’ll need the full power of RCU.

If we would only use RCU to acquire references to objects then the ctp thread-safe variable API is plenty and also easier to understand and use (writers thread_safe_var_set(), readers thread_safe_var_get()).

So I think the key here is to figure out if we’d need the more general RCU interface or if the thread_safe_var API is sufficient.

For the hazard pointer and garbage collection implementation of thread-safe variables in ctp readers never block or even make any system calls. The core of that is a loop where you read from the variable, write to the thread-local, then read the variable again to check that the thread-local’s value still matches the variable’s. Only atomic reads and writes are needed for this, so it’s pretty fast.
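Roughly, that loop (a sketch; ctp’s actual code differs in detail):

    /* Hazard-pointer read: publish the value you read to a per-thread
     * slot that the collector scans, then confirm the variable still
     * holds that value. Only atomic loads/stores, no system calls. */
    #include <stdatomic.h>

    void *tsv_get(_Atomic(void *) *var, _Atomic(void *) *my_hazard_slot)
    {
        void *v;

        for (;;) {
            v = atomic_load_explicit(var, memory_order_acquire);
            atomic_store_explicit(my_hazard_slot, v, memory_order_seq_cst);
            /* If the variable still matches, the collector is guaranteed
             * to see our slot before it could free v. */
            if (atomic_load_explicit(var, memory_order_seq_cst) == v)
                return v;
            /* Variable changed between read and publish: retry. */
        }
    }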

If we were happy with a stale value from the thread-local for up to some time limit, then we could even dispense with those atomics when the thread-local value is not too stale, and then the only overhead is a call to gettimeofday() (which on Linux w/ VDSO is not a system call). For the provider list, for example, having providers list alterations take 50ms to be seen by readers might be just fine, as long as the writer (thread_safe_var_set() specifically) waits for that much time if it wants to not return until the write is visible to all readers. This then would resemble the RCU grace period.

So we could in fact make reading quite fast in the common case.

@nicowilliams sure, that would be great, I’ll post it in the am when I get back home. Fair warning, it’s embarrassingly hacked up.

@mattcaswell are you still interested in having me at an OTC meeting?

Hi Nico.

We will soon be looking at all the proposals for things to work on in 3.3 and trying to figure out priorities. Not everything on the list of proposals will necessarily be selected as a priority for 3.3. See

https://github.com/orgs/openssl/projects/2/views/28

One of those things on the above list is “Performance improvements” which could (potentially) incorporate this work.

Also see this blog post for some background:

https://www.openssl.org/blog/blog/2023/08/27/steps-forward/

Anyway - to answer your question: We’re still interested in talking to you at some point, but we’re not quite in a position to do that right now.

@hlandau

> In principle the mutex fallback requirement isn’t that much of a big deal. What makes it worse is that we also assume mutex locking can fail.

For a user-land RCU thing with fallback-to-mutexes-when-there’s-no-atomics-here, a lock would be taken then immediately dropped, so what errors could there be? Not deadlocks. Not recursive lock taking. If one does thread cancellation wrong then one could have owner-dead errors, but, you know, “don’t do that” is sound advice to give to applications. On Linux there’s also this:

   `EINVAL` The mutex was created with the protocol attribute having the value `PTHREAD_PRIO_PROTECT` and the calling thread's priority is higher than the mutex's current priority ceiling.

But we wouldn’t use that attribute for fallback mutexes.

So I think this is a non-concern.

One of the nice things about lock-free data structures is that they can compose w/o having to worry about lock taking order and so on. A user-land RCU built with a fallback-to-mutexes-because-there’s-no-atomics-here would still compose because the lock taking and dropping would happen behind one function call.

So while mutexes can fail, that’s not really a concern for what would be lock-free data structures but for the fallback-to-mutexes-because-there’s-no-atomics-here thing 😃

Worth noting that the linked RCU implementation uses a pthread key per RCU item. This is a problem since OS implementations can impose a limit on the number of keys which can be created in a process. musl libc imposes a limit of 128 (and also takes a rwlock during key creation). This is a problem in itself, but we also need to consider any impact on the application using OpenSSL — if we significantly increase our pthread key consumption, we could cause an application to hit this limit.

The reason this uses a key per instance is just laziness, and my application needing only a handful of these things.

> if we significantly increase our pthread key consumption, it could cause problems by causing an application to hit this limit.

I’ve also hit this limit in testing – apparently glibc has up to 256, which is precious few.

I will fix this. Just one pthread key should be sufficient – its purpose is to drive destructor calling.

I also need to add a function for calling from DllMain().

FYI I recently made sure that ctp is thread-sanitizer- and Helgrind-clean. (There had been a bug in the thread_safe_var_init() functions that doesn’t bite in production, and also two assert()s had races that also didn’t bite in production because they were disabled.)

Thanks for taking a look at this!

> It’s been mentioned before, but a call that effectively freezes the state of a libctx & its store could avoid lots of locking.

I believe this is #15600.

> Until recently there were not a lot of RCU implementations for use in user-land, and I’m not sure if there is a portable one that is license-compatible with OpenSSL.

https://paulmck.livejournal.com/69622.html discusses a few RCUish libraries in Rust, all of which are license-compatible (dual Apache 2/MIT) but not language-compatible. I mention it because they might be starting points if you’re working on adapting Nico’s library or writing your own, and also because that blog post is a good overview of what the various APIs and design choices might be. I’m going to guess they all assume atomics, though!

In addition to rwlocks performing badly once you have any writers, reader-only workloads aren’t free. Even a good rwlock implementation fundamentally has to write somewhere when taking a read lock. This means you still have some cacheline contention in heavily threaded servers.
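Concretely, a simplified sketch of a rwlock’s read fast path (not any particular implementation): even uncontended, it’s an atomic read-modify-write on a counter that every core shares.

    /* Even an uncontended read lock writes shared memory. */
    #include <stdatomic.h>

    typedef struct {
        atomic_int readers;     /* writer handling elided */
    } toy_rwlock_t;

    void toy_read_lock(toy_rwlock_t *l)
    {
        /* Every reader on every core does an RMW on this one cacheline. */
        atomic_fetch_add_explicit(&l->readers, 1, memory_order_acquire);
    }

    void toy_read_unlock(toy_rwlock_t *l)
    {
        atomic_fetch_sub_explicit(&l->readers, 1, memory_order_release);
    }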

We didn’t introduce the vast sea of global, mutable state that OpenSSL 3.x did, but even without that we see some impacts of this in BoringSSL and have plans to rework things. E.g., #5158 discussed a common pattern in OpenSSL that tends to cause this, which a better API would have avoided more naturally. And with OpenSSL 3.x levels of mutable state, I expect these kinds of problems would only compound.