openssl: RW locks are bad for performance
OpenSSL uses read-write locks (e.g., pthread_rwlock_t on POSIX systems). Often these locks are used to protect data structures that should not change often, like providers lists.
Read-write locks are not a good thread synchronization mechanism. Readers block while a writer holds the lock, and good rw-lock implementations prevent writer starvation, which means that as soon as one thread is waiting to write, newly arriving readers must block behind it.
For example, the code highlighted in the stack trace on #20698 takes a read-write lock for writing just in case it has to remove providers from the providers list, but this serializes all readers.
For global data such as configuration (e.g., trust anchors), providers lists, engines, a much better technique is read-copy-update (RCU), as is used in the Linux kernel. Then readers need never block due to writers. Writers are still serialized, but mostly not blocked by readers.
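As a rough illustration of the idea (a minimal sketch using C11 atomics with invented names, not OpenSSL code): the reader side is just an atomic pointer load of an immutable snapshot, while a writer copies, modifies the copy, and republishes.

```c
#include <stdatomic.h>
#include <stdlib.h>

/* Invented names for illustration; snapshots are immutable once published. */
struct snapshot {
    size_t count;
    const void *items[16];      /* fixed size to keep the sketch short */
};

static _Atomic(struct snapshot *) current;

/* Reader: never blocks; just an acquire load of the current snapshot. */
const struct snapshot *snapshot_read(void)
{
    return atomic_load_explicit(&current, memory_order_acquire);
}

/* Writer (assumed serialized by a separate mutex): copy, modify the copy,
 * publish with a release store. The old snapshot may only be freed after
 * a grace period, once no reader can still hold it -- that reclamation
 * step is the part RCU implementations provide. */
void snapshot_add(const void *item)
{
    struct snapshot *old = atomic_load(&current);
    struct snapshot *new = calloc(1, sizeof(*new));

    if (old != NULL)
        *new = *old;
    new->items[new->count++] = item;
    atomic_store_explicit(&current, new, memory_order_release);
    /* ... defer free(old) until after the grace period ... */
}
```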
Until recently there were not a lot of RCU implementations for use in user-land, and I’m not sure if there is a portable one that is license-compatible with OpenSSL. I’ve been using a C library for years that I maintain and that has similar performance characteristics to user-land RCU, a compatible license, and a very simple API. RCU can be faster and better, but anything that is better than read-write locks is going to work well for OpenSSL.
With any RCU-like scheme it is critical to have values be immutable once published. Global configuration information, providers lists, etc. should all be amenable to being treated as immutable once produced. Immutable values can be and must be destroyed when there are no references left to them, naturally, but must not be modified between publication and the last reference disappearing.
My “thread-safe variable” API is just init/destroy and get/release/set, where a get returns the current value of the “variable” and that value is then safe to use in the same thread until the next call to get/release/set in the same thread (or the thread exits), and no value is destructed until all references are gone, with no reader blocking any writer, and no writer blocking any reader (writers get serialized). get reads the variable and acquires a reference on its current value (doing both operations atomically), and it releases the reference on the previously-obtained value in the same thread, if one had been. There are similar mechanisms in Clojure, and in Haskell, etc., but the one I’m referring to is a C library with a permissive license.
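In outline, usage of such a thread-safe variable would look something like this (a sketch based on the description above; the actual ctp signatures may differ):

```c
/* Sketched from the description above; the real ctp signatures may differ. */
typedef struct thread_safe_var_s *thread_safe_var;

int  thread_safe_var_init(thread_safe_var *vp, void (*value_dtor)(void *));
void thread_safe_var_destroy(thread_safe_var v);
int  thread_safe_var_get(thread_safe_var v, void **valuep);  /* takes a ref */
void thread_safe_var_release(thread_safe_var v);             /* drops it    */
int  thread_safe_var_set(thread_safe_var v, void *new_value);

static thread_safe_var config_var;

/* Reader: get the current value; it stays valid for this thread until the
 * next get/release/set in this thread (or thread exit), never blocking. */
void reader(void)
{
    void *cfg;

    if (thread_safe_var_get(config_var, &cfg) == 0) {
        /* ... use cfg; it cannot be destroyed out from under us ... */
    }
}

/* Writer: publish a new immutable value. Writers serialize against each
 * other but never wait for readers; the old value is destroyed only when
 * the last reference to it is gone. */
void writer(void *new_cfg)
{
    thread_safe_var_set(config_var, new_cfg);
}
```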
About this issue
- State: closed
- Created a year ago
- Comments: 152 (85 by maintainers)
@nicowilliams et al
I have a subsequent test branch for anyone interested: https://github.com/nhorman/openssl/tree/lhash-rcu
In this branch I’ve started testing practical application of rcu in openssl. It adds support for rcu deferred callbacks (to be called after the next synchronize_rcu call, so it’s easier to free data after the grace period has expired).
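In the usual call_rcu style, a deferred callback lets a writer schedule a free that runs only after the grace period. A sketch with assumed names (modeled on the `CRYPTO_THREAD_rcu_*` functions mentioned later in this thread and the kernel’s `call_rcu`; the branch’s actual deferred-callback API may differ):

```c
#include <stdlib.h>

struct widget { int data; };

static void widget_free_cb(void *arg)
{
    free(arg);                    /* runs only after the grace period */
}

/* Writer side, assumed serialized; real code would publish the new
 * pointer with an atomic store. */
static void widget_replace(struct widget **slot, struct widget *replacement)
{
    struct widget *old = *slot;

    *slot = replacement;          /* publish; readers may still hold 'old' */

    /* Defer the free: the callback fires after the next
     * CRYPTO_THREAD_synchronize_rcu(), when no reader can still be
     * referencing 'old'. (Assumed name for the deferred-callback hook.) */
    CRYPTO_THREAD_rcu_call(widget_free_cb, old);
}
```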
It also fixes several bugs I’ve found when starting to use rcu in various places. Specifically, this branch creates a per-hash-node refcounted variant of DEFINE_LHASH_OF_EX (called DEFINE_REFCNT_LHASH_OF_EX), which implements this refcounting, as well as internal rcu locking. With this, I’ve converted the ossl_namemap API to use this variant of the hash table, and in so doing been able to remove the surrounding read/write lock in the API (i.e. our critical sections are smaller, and the read side never blocks).
I don’t have performance data on it yet, of course, as it’s still failing 3 unit tests, and I’d like to convert the other hash tables to this variant before I start measuring anything.
But if anyone wants to play around with it and share any notes, I’d appreciate it.
In that case, it’s probably fine w.r.t. @nicowilliams’s concerns. The unloading complication here is just that you need to synchronize with another thread which might be in the middle of using the provider.
Though, considering `dlclose` is traditionally implemented as a `#define dlclose please_make_my_process_deadlock_and_corrupt_random_global_state`, you may wish to avoid it for other reasons. 😉

The provider implementation itself also seems to be a tower of mutable caches and lazy initialization and whatnot. Synchronized, lazy initialization is a bit of a pattern in OpenSSL, and one that has caused performance problems before in our experience. See #5158.

I suspect breaking away from that pattern would have helped the provider system. Rather than lazily load things from the providers, just construct all the `EVP_MD`s and `EVP_CIPHER`s at provider load time, and build some index that’s efficient to query. Then keep it immutable. At the end of the day, this is just cryptography. We play on easy and just transform byte strings to byte strings in a finite handful of well-defined ways. One shouldn’t need complex caches and synchronization just to look up what “SHA-256” is.

I tried to see whether you all had closed the door on this with the 3.0 public API, and I think fixing this may still be viable. Though I’m not positive, as this plugin system was far too complex for me to follow fully in one sitting. There’s some underdocumented business with `no_store`/`no_cache` outputs to `query_operation` (with very unclear lifetimes!) that may pose a challenge. But broadly that seems a much better direction.

Of course, doing the initialization eagerly does mean contending with any inefficiencies in your provider conventions. I noticed several inefficient patterns (unstructured `OSSL_DISPATCH` tables, needing extra virtual calls to look up key metadata), which may need some work. But I think someone would need to try it to find out if they’re actually an issue in practice.

Just keep in mind that a truly accurate “just construct it and make it immutable” comparison will require rewriting most of the core provider bits. The current implementation is very, very mutation-heavy.
If anyone just wants to see the graphs directly:
It’s been mentioned before, but a call that effectively freezes the state of a libctx and its store could avoid lots of locking.
The 3.x provider design was too flexible in some areas… (over engineered is an apt description in some ways)
See also https://github.com/openssl/project/issues/428
More updates. It’s still a bit of a mess, but I managed to convert the OSSL_PROVIDER list to a hash table, which enabled me to remove some of the refcounting and locking from that API (though not entirely). As a result, new performance data from the providerdoall perf test, which just iteratively runs OSSL_PROVIDER_do_all, counting the number of providers found:
It’s still not very parallelized, but its linear increase is smaller, and it offers a consistent performance improvement (between 5% at low thread counts up to ~40% at higher thread counts).
There seems to be a lot of behavioral minutiae in the provider code at the moment (lots of locking/unlocking to avoid recursive deadlocks when we init child providers, for instance). I think this goes above and beyond just doing a lock replacement. This is going to be a significant work effort, I think.
Ok, some more statistics in response to @paulidale’s request for more real-world performance measurement. It’s not, strictly speaking, “real world” per se, but this should give us a better view of how openssl performs using higher-level calls commonly made in real-world applications.
Change summary: Based on the rcu implementation here: https://github.com/nhorman/openssl/tree/rcu-implementation
I’ve rewritten our hash table implementation, and the internal hash table users to do 3 major things:
The whole thing is available here for any interested parties: https://github.com/nhorman/openssl/tree/lhash-rcu-implementation
Test methodology: Our tools repository here: https://github.com/openssl/tools contains several performance-measuring utilities:
Each utility measures the amount of time taken, in microseconds, to complete a set of these commands, and is configurable to run an arbitrary number of threads conducting these operations in parallel.
For each command, I’ve run it on differing numbers of threads, multiple times (100), and graphed the reported average timings for openssl both with the above lhash changes and without. Results and comments below.
Equipment: I’m running all of these tests on a 20-core (40-hyperthread) Intel Xeon system with 64 GB of RAM.
Results:
Handshake test:
SSL handshake time seems to increase almost linearly with the number of threads in use, which likely bears further investigation, but this occurs in both the rwlock and rcu cases, with rcu providing approximately a 20% reduction in execution time for a set of SSL_connect/SSL_accept calls.
newrawkey test:
As with the handshake test, rcu seems to provide approximately a 20% reduction in execution time per call to EVP_PKEY_new_raw_public_key_ex. The decrease in average time as thread count increases in both the rwlock and rcu cases suggests to me that this code path has reasonably good parallelism and doesn’t encounter significant additional lock contention.
Pemread test:
This one was interesting. At higher thread counts, the rcu implementation seems to do significantly better, but at around 15-20 threads there was a spike in both test cases, in which execution time increased, with rcu increasing significantly more. I have no explanation for this yet, though the fact that both test cases encountered it leads me to hypothesize that there is significant lock contention (or some other contention) external to the changes made. More investigation is needed here.
ProviderDoAll test:
Performance seems to be at about parity here, with rcu doing just slightly better at high thread counts. This isn’t surprising given that, in both cases, during the OSSL_PROVIDER_do_all call the rcu or rwlock is taken only once per thread, so contention is unlikely.
Randbytes test:
As with the provider test, performance is identical here, as the RAND API makes no use of internal hash tables. This is effectively just here for completeness.
Rsasign test:
Performance is at parity here, with execution time increasing in both cases at higher thread counts, suggesting some level of contention external to the hash table changes. I will investigate further.
SSLnew test:
Again, performance is at parity here, which is not surprising given that hash tables aren’t frequently consulted in this path. Decreasing execution time as thread count increases suggests that parallelism is good in this path and doesn’t suffer from lock contention.
I think, given the improvements in the tests that make heavier use of hash tables, I’m going to get back to the implementation of the Windows API.
@nicowilliams, to your previous question (which I had missed): why does read-side throughput continue to decrease as writers increase? I don’t have a definitive answer for that, I’m afraid. Currently my working theory is that I have a 40-way system that I am testing on here, and as we increase the writer count beyond the CPU limit of my system, we start to see scheduler effects in which the readers just happen, statistically, to be on the CPU more often. That’s just a wild guess, mind you, but it does seem to be in keeping with the fact that at/near 40 CPUs writer starvation/latency becomes a clear issue.
As I’ve noted previously several times, I’d like to see some real-world performance figures so we can properly target these changes. Churning multi-threaded code unnecessarily is a recipe for CVEs. How many handshakes per second do we achieve under locking versus what we might achieve with RCU? We simply do not currently know, and without that information we cannot make an informed decision as to what to prioritize. We don’t even know if bulk crypto is impacted – I’d really like to believe it isn’t, but it’s not quantified at this point.
That 3.0 had performance issues isn’t all that surprising. Performance was not on the radar during the very significant refactoring. That it hurt some of our communities has been evident, and we’re working on it amongst a haystack of other priority items. I am confident that 3.1 has improved matters and 3.2 should do better again. There will continue to be further improvements going forward, I’m sure. The core team isn’t large; we can only spread so thinly, and there is so much to get done.
I figured, and I wasn’t objecting. I’ve built a once executor that takes a data argument to pass to the callback. But I like this scheme because it’s easy and doesn’t require a callback.
One of the most annoying-to-diagnose bugs I’ve ever dealt with was code that was `dlclose`ing a library immediately before calling into a function pointer in that library. (Funnily enough, it involved `ENGINE` in a much, much older OpenSSL. The function pointer was an `ex_data` callback registered by that engine.)

It turns out lots of things get confused when you do this. The stack trace got reported based on the old mappings, and GDB through gdbserver (at least, the gdbserver that one used on Android devices many, many years ago) would start acting completely erratically.
You also could defer inserting it into the provider list until the self-test has passed. That’s the actual source of contention with the other threads, and I assume you don’t actually want the other threads to see the FIPS provider until after the self-test anyway.
Realistically I think OpenSSL could eagerly and immediately construct a provider list with builtin providers, but other providers will have to be loadable after the application starts making use of the builtin providers. Thus the best approach will have to be to make the providers list either copy-on-write (accessed w/ RCU) or lock-less insert-only.
A lock-less-for-reading providers list should definitely be possible, and it should even be possible to support provider unloading with a lock-less-for-reading implementation at the cost of having to atomically incref/decref providers. But it would be much simpler and better to just not support provider unloading.
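For the lock-less insert-only flavor, an atomic list head is enough (a generic sketch, not OpenSSL code): readers traverse without locks, writers push with a compare-and-swap, and since nodes are never removed, readers need no reference counts.

```c
#include <stdatomic.h>
#include <stdlib.h>

struct prov_node {
    void *provider;
    struct prov_node *next;   /* immutable once the node is published */
};

static _Atomic(struct prov_node *) prov_head;

/* Insert: build the node, then CAS it onto the head. On CAS failure,
 * 'head' is refreshed to the current head, so we just retry. */
void prov_insert(void *provider)
{
    struct prov_node *n = malloc(sizeof(*n));
    struct prov_node *head = atomic_load_explicit(&prov_head,
                                                  memory_order_relaxed);

    n->provider = provider;
    do {
        n->next = head;
    } while (!atomic_compare_exchange_weak_explicit(
                 &prov_head, &head, n,
                 memory_order_release, memory_order_relaxed));
}

/* Read: plain lock-free traversal; nodes are never freed or modified. */
void *prov_find(int (*match)(void *item, void *arg), void *arg)
{
    struct prov_node *n = atomic_load_explicit(&prov_head,
                                               memory_order_acquire);
    for (; n != NULL; n = n->next)
        if (match(n->provider, arg))
            return n->provider;
    return NULL;
}
```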
(Some think that at `exit()` time every resource should be released, including unloading DLLs. I am not among those who think that. It’s much better to never unload DLLs except in specialized applications, and OpenSSL isn’t one of those.)

@mattcaswell no we don’t, I’m looking to see if there are any at the moment.
And as you note, it largely depends on what you’re doing. The test suite, for instance, by and large does single iterations of various operations, and many are single-threaded, so lock contention is not likely to show up there.
I’m looking at some of the open performance issues in 3.0 right now to try to identify a use case to work with.
Having the library maintain a global `SSL_CTX` would be problematic, for the same reason I imagine why `requests` doesn’t maintain an internal global `Session`. There are actual semantics here that have gotten lost in the idea that `SSL_CTX` sharing is how you solve OpenSSL 3.x’s performance issues.

At the TLS level, different connections may share state for session resumption. But applications need to have opinions about when to do this; session resumption correlates connections and also bypasses some parts of TLS negotiation. For example, in a web browser, doing session resumption across different profiles would be an obvious privacy issue.

At the HTTP(S) level, different requests share all kinds of state, from a cache to cookies to connection pooling and, of course, TLS session resumption above. Applications need much the same opinions on when this correlation is allowed. Thus, any HTTP(S) library also needs some kind of context-like object.

I assume this is what `requests.Session` is for. Likewise, this is what `SSL_CTX` is for. Now, TLS has much less shared state (really just resumption), and OpenSSL’s API has enough callbacks that one can also partition the session cache within an `SSL_CTX`. But it takes some work and requires that the application know about everything that uses an `SSL_CTX`. It also has other tradeoffs around shared configuration, etc. It’s not the default expectation.

Yes, `SSL_CTX` is designed to be shared. But it is designed to be shared with particular semantics, which may not universally apply to every pair of `SSL`s in an application. We cannot retroactively change the semantics to “shared for the sake of sharing” as a workaround for OpenSSL 3.x’s perf issues. That would not be backwards-compatible.

(Though, as an aside, unlike in TLS, where resumption is not that critical of an optimization, the kinds of optimizations you get within a correlation boundary in HTTP(S) are a big deal. I would suggest then that `requests.get` is maybe not a good API. But that doesn’t change the semantics for `SSL_CTX` here.)

The issue is, more or less, that passing an `SSL_CTX` around is non-obvious in higher-level APIs. Consider, for the sake of argument, something like Python’s `requests` module, which is used roughly like `requests.get("https://...")`. There are in fact helper functions `requests.get()` etc. whose implementation is to instantiate a session just for the one request. So it’s a straightforward programming pattern for each thread to create its own context objects for everything involved in making an HTTPS request, including an `SSL_CTX`. Notably, the HTTP parts of this are not expensive. 😃

Most libraries that use HTTPS internally (e.g. API clients) will often use the `requests.get()` helpers themselves or at best create a `Session` for themselves. They don’t usually accept a `Session` as an argument from the caller, because that would be a more awkward API and it would also turn into an API promise to keep using `requests` internally as opposed to some other library. They really don’t have a way to accept an `SSL_CTX` in particular.

Some of these libraries could perhaps make an internal global variable that holds an `SSL_CTX`, though for `requests`, some SSL parameters like trusted certificates can vary, so they would have to implement some mechanism for mapping parameters to a context. And then you’d get right back to the original question of how you handle concurrent reads and writes to that internal global variable.

All of this feels like things that really should be below the level of abstraction of an HTTPS (or other TLS-protected protocol) client, and certainly below the level of abstraction of a user of HTTPS. It’s kind of like if, say, libm said “you need to pass a math context around or your square roots will be slow.” No APIs are prepared to pass a math context around, nor to create their own and handle locking, and it would imply a whole lot of changes for libraries that call libraries that call libraries that do the occasional square root.

It occurs to me, by the way, that if the argument is really “you should mostly need a single `SSL_CTX`, unless you’re customizing e.g. trust stores or ciphers, and if you customize them the same way you can use the same one,” I would counter-argue that this should be handled by OpenSSL itself! Give me an `SSL_CTX_get` function that takes some popular parameters about verify locations etc. (uncommon things can be punted to the existing `SSL_CTX_new` workflow) and returns a pointer to an already-created, read-only context. And figure out some scheme (like the RCU proposed in this thread) for managing the library-wide, process-wide cache of contexts that has good performance.
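A purely hypothetical sketch of what that could look like (no such `SSL_CTX_get` exists in OpenSSL; the parameter set here is invented for illustration):

```c
#include <openssl/ssl.h>

/* Hypothetical API for the proposal above; not a real OpenSSL function,
 * and the parameter set is invented for illustration. */
typedef struct {
    const char *ca_file;        /* NULL: use the default trust store */
    const char *ca_dir;
    int min_proto_version;      /* 0: library default */
} SSL_CTX_GET_PARAMS;

/*
 * Return a shared, read-only SSL_CTX matching the given parameters,
 * creating and caching it on first use. The library would own the
 * process-wide cache and manage it with something RCU-like, so lookups
 * by readers never contend with cache updates.
 */
const SSL_CTX *SSL_CTX_get(const SSL_CTX_GET_PARAMS *params);
```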
I haven’t tried building it, but I doubt it compiles because I can’t find a definition of `USER_MASK`, and also I think some of the macros may be wrong. But the gist looks right, and in fact it resembles what I do in the “slot-pair” implementation of thread-safe variables (particularly the part where a reader may have to wake a writer), which makes me think it’s likely on the right path. You’ll need to make sure you run the tests with the thread sanitizer and with helgrind, and you’ll need [some build option to?] leave out some atomics operations to demonstrate that tsan/helgrind find the resulting races (I did that by hand in ctp recently, not as build options) so as to be certain that the tests do exercise the potential races and that non-detection is due to the implementation being robust. Ideally we’d use a theorem prover to prove that the design is correct, but that’s a ton of work.

(What would be truly ideal here is if we had a time machine and could make sure that user-land RCU was in POSIX from the beginning. Alas, the existence of UTF-16 is proof that either time machines do not exist or that only evil people have them.)
I’ve used both the “slot-pair” and the hazard pointer + GC implementations of thread-safe variables in production, and the occasional brief reader-side system call to wake a writer is not that big a deal. I guarantee you it will be better than the current state of the world with RW locks.
The way in which the Linux kernel RCU is implemented is not amenable to user-land, sadly, because we can’t do anything like guarantee non-preemption during the RCU critical region in readers.
RCU is more general but imposes higher cognitive load. Thread-safe variables are easier to use but less general.
The clearest use for ctp in OpenSSL would be the providers list. It’s the other uses that are less clear. I’ll set aside a few hours on Friday to look at the state of the 3.3 work to sample RW lock uses and see which are amenable to ctp and which are more easily handled with RCU.
Indeed, being designed to be shared is not the same as sharing always being appropriate.
@nhorman thanks for the link!
IIUC `CRYPTO_THREAD_rcu_read_lock()` is non-blocking, `CRYPTO_THREAD_rcu_read_unlock()` is also non-blocking but may call `fpost()` (which calls the futex syscall), and `CRYPTO_THREAD_synchronize_rcu()` waits on the futex if there are any readers that started before it.

This RCU implementation is very easy to understand, for which I’m very grateful. Its simplicity helps me understand RCU better. I believe this can be ported by replacing futexes with condition variables where futexes are not available. A reader that has to wake a possibly-waiting writer would have to acquire a lock, but that lock should be uncontended since a) no other readers would want to race to acquire it, and b) there could be only one writer, and that writer would immediately drop the lock when it waits on the condition variable, so the unlucky reader would never really block. This occasional slow path for readers is almost certainly not a big deal, and anyway much better than the rw-lock contention issues.
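The condition-variable fallback might look roughly like this (a sketch of the porting idea described above, not the actual OpenSSL code; names are invented):

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Sketch of replacing the futex post/wait with a condition variable. */
static pthread_mutex_t rcu_mtx  = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  rcu_cond = PTHREAD_COND_INITIALIZER;
static atomic_ulong    readers;        /* readers the writer must wait out */
static atomic_bool     writer_waiting;

/* Reader unlock slow path: runs only when a writer may be waiting. */
static void reader_wake_writer(void)
{
    if (atomic_fetch_sub(&readers, 1) == 1 && atomic_load(&writer_waiting)) {
        /* Uncontended in practice: at most one writer, and it drops the
         * mutex as soon as it waits on the condition variable. */
        pthread_mutex_lock(&rcu_mtx);
        pthread_cond_signal(&rcu_cond);
        pthread_mutex_unlock(&rcu_mtx);
    }
}

/* Writer side of synchronize: wait for pre-existing readers to drain.
 * Signaling always happens under the mutex, so no wakeup can be lost
 * between the predicate check and the wait. */
static void writer_wait_for_readers(void)
{
    atomic_store(&writer_waiting, true);
    pthread_mutex_lock(&rcu_mtx);
    while (atomic_load(&readers) != 0)
        pthread_cond_wait(&rcu_cond, &rcu_mtx);
    pthread_mutex_unlock(&rcu_mtx);
    atomic_store(&writer_waiting, false);
}
```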
If we need to do RCU-ish tasks beyond merely acquiring references to objects, then a proper RCU API would be desirable, and then you could just ignore `ctp`. So please don’t abandon your work yet – we should first understand if we’ll need the full power of RCU.

If we would only use RCU to acquire references to objects, then the `ctp` thread-safe variable API is plenty, and also easier to understand and use (writers call `thread_safe_var_set()`, readers `thread_safe_var_get()`).

So I think the key here is to figure out if we’d need the more general RCU interface or if the `thread_safe_var` API is sufficient.
ctpreaders never block or even make any system calls. The core of that is a loop where you read from the variable, write to the thread-local, then read the variable again to check that the thread-local’s value still matches the variable’s. Only atomic reads and writes are needed for this, so it’s pretty fast.If we were happy with a stale value from the thread-local for up to some time limit, then we could even dispense with those atomics when the thread-local value is not too stale, and then the only overhead is a call to
gettimeofday()(which on Linux w/ VDSO is not a system call). For the provider list, for example, having providers list alterations take50msto be seen by readers might be just fine, as long as the writer (thread_safe_var_set()specifically) waits for that much time if it wants to not return until the write is visible to all readers. This then would resemble the RCU grace period.So we could could make reading quite fast in fact in the common case.
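The read loop described above might look schematically like this (not ctp’s actual code):

```c
#include <stdatomic.h>

/* Schematic sketch of the lock-free read: 'var' holds the current value;
 * 'hazard' is this thread's slot, registered with and scanned by the GC,
 * which never frees a value any thread has published as in-use. */
static _Atomic(void *) var;
static _Thread_local _Atomic(void *) hazard;

void *tsv_read(void)
{
    for (;;) {
        void *v = atomic_load(&var);      /* 1: read the variable      */
        atomic_store(&hazard, v);         /* 2: publish it as in-use   */
        if (atomic_load(&var) == v)       /* 3: unchanged? safe to use */
            return v;
        /* A writer replaced the value between steps 1 and 3; retry so
         * the hazard pointer never pins a value the GC already passed. */
    }
}
```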
@nicowilliams sure, that would be great, I’ll post it in the am when I get back home. Fair warning, it’s embarrassingly hacked up.
Hi Nico.
We will soon be looking at all the proposals for things to work on in 3.3 and trying to figure out priorities. Not everything on the list of proposals will necessarily be selected as a priority for 3.3. See
https://github.com/orgs/openssl/projects/2/views/28
One of those things on the above list is “Performance improvements” which could (potentially) incorporate this work.
Also see this blog post for some background:
https://www.openssl.org/blog/blog/2023/08/27/steps-forward/
Anyway - to answer your question: We’re still interested in talking to you at some point, but we’re not quite in a position to do that right now.
@hlandau
For a user-land RCU thing with fallback-to-mutexes-when-there’s-no-atomics-here, a lock would be taken and then immediately dropped, so what errors could there be? Not deadlocks. Not recursive lock taking. If one does thread cancellation wrong, then one could have owner-dead errors, but, you know, “don’t do that” is sound advice to give to applications. On Linux there’s also this:
But we wouldn’t use that attribute for fallback mutexes.
So I think this is a non-concern.
One of the nice things about lock-free data structures is that they can compose w/o having to worry about lock taking order and so on. A user-land RCU built with a fallback-to-mutexes-because-there’s-no-atomics-here would still compose because the lock taking and dropping would happen behind one function call.
So while mutexes can fail, that’s not really a concern for what would be lock-free data structures but for the fallback-to-mutexes-because-there’s-no-atomics-here thing 😃
The reason this uses a key per instance is just laziness, and my application needing only a handful of these things.
I’ve also hit this limit in testing – apparently glibc has up to 256, which is precious few.
I will fix this. Just one pthread key should be sufficient – its purpose is to drive destructor calling.
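That single-key pattern is the standard pthreads idiom (a generic sketch of the idea, not ctp’s code): create one key whose destructor tears down all of the exiting thread’s state.

```c
#include <pthread.h>
#include <stdlib.h>

/* Generic sketch: one pthread key drives all per-thread destruction. */
static pthread_key_t  tsd_key;
static pthread_once_t tsd_once = PTHREAD_ONCE_INIT;

struct per_thread_state {
    void *slots;   /* all of a thread's per-"variable" state hangs off this */
};

static void tsd_destructor(void *arg)
{
    struct per_thread_state *st = arg;  /* called at thread exit */
    /* ... release any references held by the exiting thread ... */
    free(st);
}

static void tsd_make_key(void)
{
    (void) pthread_key_create(&tsd_key, tsd_destructor);
}

static struct per_thread_state *tsd_get(void)
{
    struct per_thread_state *st;

    pthread_once(&tsd_once, tsd_make_key);
    if ((st = pthread_getspecific(tsd_key)) == NULL) {
        st = calloc(1, sizeof(*st));
        (void) pthread_setspecific(tsd_key, st);
    }
    return st;
}
```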
I also need to add a function for calling from `DllMain()`.

FYI I recently made sure that `ctp` is thread-sanitizer- and Helgrind-clean. (There had been a bug in the `thread_safe_var_init()` functions that doesn’t bite in production, and also two `assert()`s had races that also didn’t bite in production because they were disabled.)

Thanks for taking a look at this!
I believe this is #15600.
https://paulmck.livejournal.com/69622.html discusses a few RCUish libraries in Rust, all of which are license-compatible (dual Apache 2/MIT) but not language-compatible. I mention it because they might be starting points if you’re working on adapting Nico’s library or writing your own, and also because that blog post is a good overview of what the various APIs and design choices might be. I’m going to guess they all assume atomics, though!
In addition to rwlocks performing badly once you have any writers, reader-only workloads aren’t free. Even a good rwlock implementation fundamentally has to write somewhere when taking a read lock. This means you still have some cacheline contention in heavily threaded servers.
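To see why, consider the crudest possible rwlock (a toy illustration, not any real implementation): even with zero writers, every read acquisition does an atomic read-modify-write on shared state, so all readers bounce the same cacheline between cores.

```c
#include <stdatomic.h>

/* Toy rwlock fragment illustrating the point above: even "read-only"
 * acquisition must write shared state somewhere. */
typedef struct {
    atomic_int readers;   /* < 0 means a writer holds the lock */
} toy_rwlock;

void toy_rdlock(toy_rwlock *l)
{
    for (;;) {
        int r = atomic_load(&l->readers);
        if (r < 0)
            continue;                     /* writer active: spin/park   */
        /* The successful CAS below WRITES l->readers: that store is the
         * unavoidable cacheline traffic, even on a reader-only workload. */
        if (atomic_compare_exchange_weak(&l->readers, &r, r + 1))
            return;
    }
}
```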
We didn’t introduce the vast sea of global, mutable state that OpenSSL 3.x did, but even without that we see some impacts of this in BoringSSL and have plans to rework things. E.g., #5158 discussed a common pattern in OpenSSL that tends to cause this, which a better API would have avoided more naturally. And with OpenSSL 3.x levels of mutable state, I expect these kinds of problems would only compound.