openssl: 3.0 performance degraded due to locking
Using some internal QUIC tests, we see this timing in 1.1:
$ USE_GQUIC_VERSIONS=1 perf record -F 999 -g ./quic_lib_tests --gtest_repeat=100 --gtest_filter=*ZeroRttDisabled*
$ perf report
+ 31.73% 31.55% quic_lib_tests quic_lib_tests [.] bn_sqr8x_internal
+ 9.28% 9.28% quic_lib_tests quic_lib_tests [.] mul4x_internal
+ 4.91% 4.91% quic_lib_tests quic_lib_tests [.] sha256_block_data_order_avx
In 3.0 we see this:
$ USE_GQUIC_VERSIONS=1 perf record -F 999 -g ./quic_lib_tests --gtest_repeat=100 --gtest_filter=*ZeroRttDisabled*
$ perf report
+ 11.02% 10.99% quic_lib_tests quic_lib_tests [.] bn_sqr8x_internal
+ 8.38% 8.08% quic_lib_tests libpthread-2.31.so [.] __pthread_rwlock_rdlock
+ 7.65% 7.51% quic_lib_tests libpthread-2.31.so [.] __pthread_rwlock_unlock
+ 4.98% 4.78% quic_lib_tests quic_lib_tests [.] getrn
+ 4.14% 4.11% quic_lib_tests quic_lib_tests [.] mul4x_internal
+ 3.37% 2.57% quic_lib_tests quic_lib_tests [.] ossl_tolower
3.30% 3.30% quic_lib_tests quic_lib_tests [.] ossl_lh_strcasehash
+ 2.72% 2.13% quic_lib_tests quic_lib_tests [.] OPENSSL_strcasecmp
+ 2.29% 2.05% quic_lib_tests quic_lib_tests [.] ossl_lib_ctx_get_data
+ 1.93% 1.93% quic_lib_tests quic_lib_tests [.] sha256_block_data_order_avx
This seems to be part of OSSL_DECODER_CTX_new_for_pkey: 16% of the time is spent in locking, on a single-threaded binary, and another 10% in a string hashtable lookup.
If anyone on the project is going to look at this, I will try to get a small reproducer. But the time for our QUIC tests is doubling.
About this issue
- Original URL
- State: open
- Created a year ago
- Reactions: 4
- Comments: 183 (99 by maintainers)
Commits related to this issue
- Avoid taking a write lock in ossl_provider_doall_activated() We refactor ossl_provider_doall_activated() so that we only need to take a read lock instead of a write lock for the flag_lock. This shoul... — committed to mattcaswell/openssl by mattcaswell a year ago
- Avoid taking a write lock in RAND_get_rand_method() The function RAND_get_rand_method() is called every time RAND_bytes() or RAND_priv_bytes() is called. We were obtaining a write lock in order to fi... — committed to mattcaswell/openssl by mattcaswell a year ago
- Don't take a write lock when freeing an EVP_PKEY When freeing the last reference to an EVP_PKEY there is no point in taking the lock for the key. It is the last reference and is being freed so must o... — committed to mattcaswell/openssl by mattcaswell a year ago
- When we're just reading EX_CALLBACK data just get a read lock The crypto_ex_data code was always obtaining a write lock in all functions regardless of whether we were only reading EX_CALLBACK data or... — committed to mattcaswell/openssl by mattcaswell a year ago
- Don't get a write lock when querying the parent reseed_count or strength A read lock should be sufficient when all we are doing is querying the parent DRBG reseed_count or strength. Partially fixes ... — committed to mattcaswell/openssl by mattcaswell a year ago
- Modify ENGINE_pkey_asn1_find_str() to use a read lock instead of a write ENGINE_pkey_asn1_find_str() does not make any modifications to fields controlled by the global_engine_lock. The only change ma... — committed to mattcaswell/openssl by mattcaswell a year ago
- Avoid an unneccessary lock if we didn't add anything to the store Partially fixes #20286 — committed to mattcaswell/openssl by mattcaswell a year ago
- Don't take a write lock to retrieve a value from a stack ossl_x509_store_ctx_get_by_subject() was taking a write lock for the store, but was only (usually) retrieving a value from the stack of object... — committed to mattcaswell/openssl by mattcaswell a year ago
- Optimise locking in rsa_get_blinding() We optimise locking in rsa_get_blinding() so that we normally take a read lock, and only fallback to a write lock if we need to. This will be very slightly slow... — committed to mattcaswell/openssl by mattcaswell a year ago
- Refactor the DRBG implementations to manage locking themselves Previously the EVP layer would call lock and unlock functions on the underlying DRBG implementation to say when a lock should be acquire... — committed to mattcaswell/openssl by mattcaswell a year ago
- Enable obtaining certain DRBG params without a lock Even if a DRBG has locking enabled on it, there are certain parameters which are still safe to obtain even without a lock. The max_request value is... — committed to mattcaswell/openssl by mattcaswell a year ago
- Avoid taking a write lock in ossl_provider_doall_activated() We refactor ossl_provider_doall_activated() so that we only need to take a read lock instead of a write lock for the flag_lock. This shoul... — committed to quictls/openssl by mattcaswell a year ago
So I have the following:
Am I missing any? I am working on merging this into a branch of quictls/openssl so that people can test.
Here is a small reproducer in C++, written by an Akamai colleague.
The code is at https://gist.github.com/richsalz/31a98a3095fa36ab6a66082a0c028a6f
Here are the compile commands and timing. This small reproducer is 17 times slower under 3.0. When used in our QUIC implementation, small transfers run twice as slow on average. Microsoft’s MSQUIC (their production-quality open-source stack on Windows and Azure) avoids 3.0 for this exact reason; they use the 1.1 branch of @quictls/openssl.
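For orientation, the gist above is the actual reproducer; the sketch below only illustrates the kind of loop that hammers this path. The key type, loop count, and build line are my own illustrative choices, not taken from the gist:

```c
/* Build (assumption): cc repro.c -lcrypto */
#include <openssl/decoder.h>
#include <openssl/evp.h>

int main(void)
{
    for (int i = 0; i < 100000; i++) {
        EVP_PKEY *pkey = NULL;
        /* In 3.0 this constructor walks the provider/namemap tables under
         * rwlocks, which is where the perf report above shows time going. */
        OSSL_DECODER_CTX *dctx = OSSL_DECODER_CTX_new_for_pkey(
            &pkey, "PEM", NULL, "ED25519",
            EVP_PKEY_KEYPAIR, NULL, NULL);
        OSSL_DECODER_CTX_free(dctx);
        EVP_PKEY_free(pkey);
    }
    return 0;
}
```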
Imagine that struct ssl_ctx_st has a field of type struct ssl_ctx_st * that points to an immutable version of itself, and that you use RCU or similar to a) replace that pointer with a pointer to the next new immutable version, and b) read it.
There can be only one user doing this in a process, and it must do this before all uses of OpenSSL. If two users try this, then the first must win and the second must lose, and the first user’s lock callbacks must be the only ones that get used. If any user causes OpenSSL [self-]initialization to happen then all subsequent attempts to initialize OpenSSL must be no-ops.
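A minimal sketch of that snapshot-pointer idea in C11 atomics (names are hypothetical, and reclamation of old snapshots is deliberately elided):

```c
#include <stdatomic.h>

struct snapshot { int example_param; };   /* immutable once published */

static _Atomic(struct snapshot *) g_current;

static struct snapshot *snapshot_read(void)
{
    /* pairs with the writer's release store below */
    return atomic_load_explicit(&g_current, memory_order_acquire);
}

static void snapshot_publish(struct snapshot *next)
{
    /* 'next' must be fully built before this store */
    atomic_store_explicit(&g_current, next, memory_order_release);
    /* freeing the previous snapshot needs a grace period
     * (RCU, epochs, hazard pointers, ...) -- elided here */
}
```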
And I strenuously object to libraries not having self-initialization. Optional initialization when you’re the first, that’s OK but lame, but really, no, OpenSSL’s threading performance should not depend on you providing better lock callbacks. The way to make OpenSSL thread performance not suck is to make OpenSSL do more shared-nothing-mutable, as that would greatly reduce the amount of lock contention.
Yes, @richsalz, we are in sync on this list.
Or, if deactivation would not actually ever unload the provider, just make it unavailable for further fetches, we could just atomically increment or decrement the activation count. It actually does NOT matter what any other thread does in such case - an application cannot expect consistent behavior if multiple threads are randomly loading/unloading providers in the same library context.
We could truly unload the provider only when a libctx is being torn down.
Since the team seems to no longer be spending any effort on reducing locking, downstream users have little choice.
But I disagree with your comment, @t8m: lightweight locking is a completely reasonable option, especially if the heavy guarantees are not needed.
The other thing to bear in mind is that there are numerous performance improvements in 3.1. It’s still not at 1.1.1 levels, but it’s better than 3.0. If you do re-run the exercise, doing it against 3.1 would be most helpful for us to figure out what we still have left to do.
On our side, we spent some time trying to use a simple HTTP client / load generator on Ubuntu 22 and were stumped to see it saturate at around 400 connections per second for 48 threads, while on other machines it worked fine. We noticed that there was something completely odd regarding locking, as it was impossible for the process to use much more than one CPU and most of the time was spent in futex_wake() and friends. After trying many things we finally built 1.1.1 locally and linked against it, and suddenly the performance reached 74000 cps, or almost 200 times more.
Additionally, on the server side we had already noticed an extreme use of locking, with chains starting like this one and involving pthread_rwlocks:
I instrumented the code in the hope of finding “the” culprit and found something like 80 calls to any flavor of pthread_rwlock_*() in a single accept/get/response/close cycle! 1.1.1 already had quite a bunch of them (I don’t remember how many) but significantly less overhead, as if they were used on shorter chains. In the end we reimplemented our own locks (same principle as the lock callbacks that were unfortunately dropped in 1.1) and could more than double the server-side performance thanks to using much lighter locks.
So yes, there definitely is a big locking problem there. The rule of thumb for a call to any pthread_* function should normally be “am I really willing to give up the CPU and switch to another task, possibly thrashing all the L1 cache here?”. Of course that’s not exact, but in terms of cost and impact it’s extremely close to reality. pthread_* functions should only be for high-level code (browsers etc.) and rare cases. It’s way too expensive for low-level or library code.
I failed to find a single occurrence of an atomic operation in the whole codebase, which makes me think that all of them are instead implemented by hand around a lock. That’s a dead end when you consider that scalable code normally tries hard to avoid atomic ops because they slow things down, and that pthread_* functions do involve atomic ops, plus a syscall in the contended case that is at least 100 times slower than the atomic op itself…
For now we’re condemned to stick to 1.1.1, as 3.0 is not just “degraded”, it’s simply not realistically usable at all for anything beyond the openssl command-line utility for us. This will pose security issues in the field over time once 1.1.1 stops being supported, as power users are only starting to discover the big regression now by progressively switching to new distros. But for now we have no other choice.
Dear @nibanks, I’d split things apart.
The redesign to providers causes a significant slowdown in some specific areas. It’s a problem to be solved at the OpenSSL level, for corner cases or in general.
The architectural difference between the 1.1.1 and 3.0 series is significant in itself, so the approaches that worked well for 1.1.1 should be reconsidered for 3.0. @paulidale provided a totally relevant recommendation, and ignoring it looks like a suboptimal decision to me.
Replacing LHASH, both API and implementation, would get a thumbs up from me. The existing LHASH code can be left as a near-deprecated public API for applications which still use it. Unfortunately LHASH is used as part of our public API in places so we can’t realistically formally deprecate it.
I certainly have some ideas on the design of a replacement (internal) API.
I’d like to see any replacement minimise allocations as much as possible and maybe even support moves for objects which can support it (partially intrusive design). I also want it to generate code which allows hash and equality function pointer calls to be devirtualised and inlined.
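To make the devirtualisation point concrete, here is a toy sketch (not a proposal for the actual API): a type-specialised map whose hash and equality functions are static inline, so the compiler can inline them instead of calling through function pointers the way LHASH does. Fixed size, no deletion, purely illustrative:

```c
#include <stddef.h>
#include <string.h>

#define MAP_SLOTS 256

struct str_map { const char *keys[MAP_SLOTS]; int vals[MAP_SLOTS]; };

static inline size_t str_hash(const char *s)
{
    size_t h = 5381;
    while (*s)
        h = h * 33 + (unsigned char)*s++;
    return h;
}

static inline int str_eq(const char *a, const char *b)
{
    return strcmp(a, b) == 0;
}

/* Open addressing with linear probing; both helpers above can be
 * inlined because no function pointer is involved. */
static int *str_map_get(struct str_map *m, const char *key, int insert)
{
    size_t i = str_hash(key) % MAP_SLOTS;

    for (size_t n = 0; n < MAP_SLOTS; n++, i = (i + 1) % MAP_SLOTS) {
        if (m->keys[i] == NULL) {
            if (!insert)
                return NULL;
            m->keys[i] = key;        /* caller owns the key's lifetime */
            return &m->vals[i];
        }
        if (str_eq(m->keys[i], key))
            return &m->vals[i];
    }
    return NULL;                     /* table full */
}
```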
Some food for thought:
I’ve tried the test_25519 reproducer with all the current 3.x branches and 1.1.1. Here are the results:
With 3.2, I think we are pretty near 1.1.1 performance on this particular benchmark, which is basically testing repeated decoding of a 25519 key.
[moved reply comment to https://github.com/openssl/openssl/issues/21918]
Regarding OpenSSL’s excessive use of shared mutable state: wholly agree. This was an architectural mistake and we need to back out of it and find a way to do it that doesn’t break API. So I’m basically in support of the architectural destination you’re elaborating on here.
If, and only if, only applications ever set lock callbacks (and never ever ever libraries that use OpenSSL), and still only if OpenSSL self-initializes everything on first need, then that would be almost OK.
The program (the application, main()) may not even know that it has to initialize OpenSSL, because libraries that the program uses use OpenSSL. For example, just using the name service switch without an nscd can cause this to happen via name service switch plugins that run in the application’s processes without the application authors even knowing.
Not only that, but OpenSSL uses versioned symbols now, and multiple versions can co-exist in the same process. Yes, I’ve got apps that indirectly have OpenSSL 1.1 and 3.0 loaded in the same process, and never the twain meet. But each object gets to link with only one version of OpenSSL, so the only way the application could set lock callbacks for all versions would be for it to dlopen() them and use dlsym(). (Sure, if I needed to set different lock primitives, I’d probably only have to do it for one version of OpenSSL, so that might work.)
Going back to global lock callbacks would be a disaster if any OpenSSL-using libraries ever install their own lock callbacks. And I guarantee that some developer somewhere will misunderstand and try to initialize OpenSSL lock callbacks in their library.
You also have ELF symbol interposition to help you in this case. Symbol interposition is actually a very nice solution because you don’t even have to change the application if you don’t want to or can’t – just use LD_PRELOAD.
OpenSSL just has to get its design for threaded scalability right. There should be no need to switch to lock primitives that use spinlocks or anything of the sort.
Read-write locks are a performance disaster. So are mutexes for that matter. The correct answer is for OpenSSL to switch to a shared-nothing-mutable model, but that’s clearly not happening before 3.4 now 😢.
In general 100% yes. It is for an application not a library to decide how to handle errors, and calling abort would preempt this. The only exception as I see it is assertion failures — and that is the sole circumstance where OpenSSL (correctly) uses abort today. (Allowing an application to provide a custom assertion failure handler callback would also be reasonable, but after that point any attempt to reenter the library should rightly be considered UB.)
The only cases where acquiring a standard POSIX mutex can fail are cases of caller error, i.e. a bug in our code in which an expected invariant is not being met. These situations are correctly addressed with an assertion (where they are even detectable), as there is no guarantee that anything about our internal state is valid in this case.
It should also be noted that the usage errors which can be made with a POSIX mutex (relock, unlock when not owner) also constitute undefined behaviour. There is no guarantee that these operations will fail ‘nicely’ when they are misused and if they are misused something has gone seriously wrong and we don’t have any kind of guarantee of sane behaviour.
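In code, the policy described above amounts to something like this sketch (illustrative, not OpenSSL’s actual wrapper):

```c
#include <assert.h>
#include <pthread.h>

/* A lock operation can only fail here because our own invariants are
 * broken (relock, unlock-when-not-owner, bad init), so assert rather
 * than attempt recovery from an unknowable internal state. */
static void must_lock(pthread_mutex_t *m)
{
    int rc = pthread_mutex_lock(m);

    assert(rc == 0);   /* EINVAL/EDEADLK here means a bug in our code */
    (void)rc;          /* keep NDEBUG builds warning-free */
}
```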
This will be a nice improvement; time-based releases are essential to keep the project healthy and to kick out work that cannot be completed in time and is going to cause trouble.
We have implemented numerous lock related improvements in the master branch (which will become 3.2 when it is released). Mostly this is removal of unnecessary write locks (sometimes converting them to read locks, or sometimes removing them completely/avoiding them). We are currently targeting October for the 3.2 release. Beyond 3.2 we are shifting to a “time-based” release policy, so the next release after 3.2 (presumably 3.3) will be April 2024, and the one after that October 2024.
For the 3.2 release we are currently forming the release steering committee mentioned in the policy.
I still suspect that not creating lots of SSL_CTXs will be a decent performance win.
It probably suggests that, for the MSQUIC workload test at least, the performance issues are not related to lock contention (or not the locks I have addressed, anyway). In master, in a default handshake, I was not encountering any write locks other than the ones resolved in my patches (after the initial caches have been primed etc.). Possibly 3.1/3.0 use slightly different write locks from master - I haven’t checked that. I wonder how many threads were in use for the MSQUIC test?
From 1.1 to 3.1 handshakes/sec dropped by 18%. The other tests show negligible change.
I have created a QuicTLS branch at https://github.com/quictls/openssl/tree/openssl-3.1.0%2Bquic%2Blocks that has all of the locking PR’s. @wtarreau, @nibanks please take a look. And anyone else, of course.
And in any case it’s mostly a matter of asking. As a smaller open-source project, in haproxy we’re very frequently offered hardware to test from CPU vendors, NIC & crypto accelerator vendors, or cloud providers, and in fact a few times a year we have to decline for lack of developer time. Given how critical the OpenSSL code’s scalability is for many vendors, I see no reason why the project couldn’t benefit from such offers: it will dictate whether or not their hardware will be used correctly in the next few years. Simply making a public statement on the web site that you need frequent (daily? weekly?) access to large-scale hardware for performance regression testing should be sufficient to get you what you need and to validate the effect of intuitive changes. At least we never commit performance-related changes that were not measured, and that’s fortunate, because more often than not the intuition is wrong.
Same here, at the very least when running quick tests on shared lab machines, we must not spend 10 minutes figuring which patches to download and apply on top of what, so a branch, tag or commit ID is really essential.
FYI, we (MsQuic) have dedicated hardware and existing tests that can be used to fairly easily measure perf improvements. The only problem (and possibly point of contention) is that we need a branch of quictls to point MsQuic to, to test. If someone can make that available, I can run the tests and have the data within an hour.
I think there are some more patches to come yet…
Random state is per-thread, re-seeded from the global when needed. See https://www.openssl.org/docs/man3.1/man7/EVP_RAND.html
I implemented my idea in #20927.
Actually… probably we could just decrement activatecnt not under lock. Only if the result is then low enough to trigger an actual deactivation would we take hold of the write lock and really deactivate it. We would have to be careful to handle the situation where activatecnt is incremented again between it dropping low enough for a deactivation and us taking the write lock.
I believe I understand the reasoning as to why that write lock is there. If we can find a way to refactor things to get rid of it that would be great.
The purpose of ossl_provider_doall_activated is to call a callback (cb) on all providers that are currently activated. However we could have the situation where another thread attempts to deactivate a provider while ossl_provider_doall_activated is running - thus resulting in the cb actually being called on a deactivated provider in error. In order to avoid this we iterate through all the providers and take hold of the flag_lock to allow us to query whether prov->flag_activated is set. This much could be done with a read lock. However, if it is set then we next call provider_activate. This has the effect of writing to prov->activatecnt (in order to increment the number of “activation” counts) and prov->flag_activated (in order to set it - even though we know it is already set). This is the reason that the write lock is obtained. Subsequently the write lock is then released.
The result is a list of providers which we know cannot be deactivated until activatecnt is decremented again, and therefore we are free to call the callback not under lock, without fear of another thread deactivating the provider.
The final step is then to call provider_deactivate on all those providers where we incremented the activatecnt, to decrement it (and actually deactivate the provider if the activatecnt has fallen to a low enough value).
It occurs to me that we could optimise this. provider_activate() does a number of things but most of them are irrelevant in this situation. If we took a read lock on the flag_lock and determined that prov->flag_activated is set, then all we actually need to do is increment prov->activatecnt. This could probably be done as an atomic.
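A hypothetical sketch of that fast path, plus the locked fallback for the final decrement (the struct and function names just echo the description above; this is not OpenSSL’s actual code):

```c
#include <pthread.h>
#include <stdatomic.h>

struct provider {
    atomic_int activatecnt;
    pthread_rwlock_t flag_lock;
    int flag_activated;
};

/* What ossl_provider_doall_activated needs: just bump the count. */
static void provider_ref(struct provider *prov)
{
    atomic_fetch_add_explicit(&prov->activatecnt, 1, memory_order_acq_rel);
}

static void provider_unref(struct provider *prov)
{
    if (atomic_fetch_sub_explicit(&prov->activatecnt, 1,
                                  memory_order_acq_rel) > 1)
        return;                    /* still activated: lock-free exit */

    /* The count may have hit zero; take the write lock and re-check,
     * since another thread could have re-activated it in the window. */
    pthread_rwlock_wrlock(&prov->flag_lock);
    if (atomic_load_explicit(&prov->activatecnt, memory_order_acquire) == 0)
        prov->flag_activated = 0;  /* real deactivation work goes here */
    pthread_rwlock_unlock(&prov->flag_lock);
}
```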
There remains a problem in the final provider_deactivate step, because deactivating the provider involves taking another write lock on prov->flag_lock, and it is less clear to me how we could avoid that one.
I don’t need to hear this from the OMC. I only need this fixed. The project needs this fixed. I’m going to leave the politics out.
It’s certainly not good that people can’t move to 3.x due to the performance problems.
Just never remove any providers, only mark them disabled with a memory barrier.
@hlandau
The model where the application provides its own lock functions is and should stay dead. Let’s never do that again in any project please.
The problem isn’t that OpenSSL’s lock functions are slow! There is nothing that an application could provide that would go faster!
The problem is as described in #20698 and #20715:
which means that:
In #20715 I propose using one of any number of “user-land RCU”-like APIs to make all readers non-blocking/wait-free and thus fast. There seems to be some acceptance of that idea there. We should definitely explore that.
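For the flavour of it, here is what the read and update paths look like with userspace-rcu (liburcu); this is illustrative of the proposal, not OpenSSL code, and each reader thread must call rcu_register_thread() once at startup:

```c
#include <stdlib.h>
#include <urcu.h>               /* userspace-rcu; link with -lurcu */

struct table { int dummy; };    /* immutable after publication */

static struct table *live_table;

static void reader_lookup(void)
{
    rcu_read_lock();            /* wait-free: never blocks on the writer */
    struct table *t = rcu_dereference(live_table);
    /* ... read-only use of t ... */
    (void)t;
    rcu_read_unlock();
}

static void writer_replace(struct table *next)
{
    struct table *old = live_table;

    rcu_assign_pointer(live_table, next);
    synchronize_rcu();          /* wait out readers still seeing 'old' */
    free(old);
}
```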
But the all-readers-are-now-writers-surprise!-haha bug should be fixed no matter what because a) that’s the fastest path to modest victory here, b) using a user-land RCU thing with all-readers-are-writers isn’t going to be any better than mutexes anyways.
The bug was, near as I can tell, trying to edit the provider list while looking for a suitable provider. This makes… anything that involves crypto slow in threaded processes.
@paulidale Alright, if you want to go there…
The project’s “we’ve a lot of differing priorities to balance” can be summed up in one four-letter word: QUIC. Nobody publicly wants that. The project should fix what it broke. The project, and its paid staff, should make fixing the performance problems a priority. That’s my opinion, and I’m sorry if you don’t like it, but that’s a thing with open source: you have to take complaints as well as the kudos.
It’s never too late to change direction.
See also #20698. I think that’s probably a dup of this one.
@t8m I strongly disagree with you. Using a random lock implementation “just because it’s available” will remain a problem even when there is just one single lock left in the end, if there are corner cases where this lock can be strongly contended. Do you even know that pthread_rwlocks do not implement exponential back-off? On the machine where I met these issues, our watchdog was sometimes triggering, crashing the process inside OpenSSL. Hint: it only triggers if a thread has made zero progress for 2 seconds. This means that some of the locks OpenSSL takes behind my back, having chosen to ignore the ones I provide, are provoking pauses of more than two seconds of traffic on a single lock, just because they were never designed to be abused like this. Locking must always be the last resort, and when locks are used they must be carefully tailored to match the application’s needs. There’s no one-size-fits-all in locking, and these ones are definitely a serious problem, as shown in the flame graph by the way.
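For anyone unfamiliar with the term, here is a toy sketch of exponential back-off on a trivial test-and-set lock (just the principle, not what our plock implementation actually does):

```c
#include <sched.h>
#include <stdatomic.h>

static void lock_with_backoff(atomic_flag *l)
{
    unsigned spins = 1;

    while (atomic_flag_test_and_set_explicit(l, memory_order_acquire)) {
        /* spin locally instead of hammering the shared cache line */
        for (unsigned i = 0; i < spins; i++)
            atomic_signal_fence(memory_order_seq_cst); /* cheap pause; real
                                                        * code would use a
                                                        * CPU relax hint */
        if (spins < (1u << 14))
            spins <<= 1;        /* back off harder as contention persists */
        else
            sched_yield();      /* give up the CPU only as a last resort */
    }
}

static void unlock_after_backoff(atomic_flag *l)
{
    atomic_flag_clear_explicit(l, memory_order_release);
}
```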
Users absolutely hate having to rebuild OpenSSL in the field. They always fear missing an update and being at risk; this is the reason why we’re still having difficulty making some of them switch to QuicTLS. And nowadays, given the pace at which wolfSSL is progressing, the huge bump in performance and the support for the QUIC API, for sure if one has to think about rebuilding in the field, the opportunity to switch to another stack comes to mind.
I have worked on embedded systems with multiple levels of threading, and removal of the locking API in favor of pthread would have definitely impacted that (I left before 1.1.1 was released).
Hi @mattcaswell
I could finally connect to the machine today and run the various tests for you. I compared openssl versions 1.1.1t, 3.0.8, 3.1.1-dev (commit 004bd8f97d), as well as wolfssl 5.5.4 (which uses very little locking). These were running on haproxy-2.8-dev8-c05d30, with regular pthread_rwlocks and with our own lighter rwlocks (plock). The machine is an intel Sapphire Rapids dual 8480+ (2 x 56 cores), hence 224 threads total. The tests only involved server mode (i.e. haproxy accepting connections) because the client mode in 3.0 is not even usable at all beyond just command-line tools.
The results are the following, in number of HTTPS connections per second, TLSv1.3, rsa2k cert:
- openssl 1.1.1: pthread: 44k, plock: 53k
- openssl 3.0.8: pthread: 10-15k, plock: 19k (3.0 is unstable with pthread)
- openssl 3.1.1: pthread: 26k, plock: 28k
- wolfssl 5.5.4: pthread: 120-122k, plock: 120-122k (reached the load generator's limits)
On a single core, all of them show the same performance, so the only difference is locking. I re-enabled the lock counting in the plock version; since we redefine the pthread_rwlock functions, it's easy to do. These were measured over 1 million total connections. The results below are the number of calls to the lock functions per connection:
- openssl 1.1.1: 120 rdlock, 276 wrlock, 396 unlock
- openssl 3.0.8: 635 rdlock, 56 wrlock, 691 unlock
- openssl 3.1.1: 164 rdlock, 57 wrlock, 220 unlock
- wolfssl 5.5.4: 0 rdlock, 1 wrlock, 1 unlock
As you see it is not complicated: the performance is inversely proportional to the number of lock calls. It's very likely that the extra rdlocks in 3.1 compared to 1.1.1 are still far too many, because they are performed on very thin operations. Also, what is not seen here are the atomic ops; maybe 3.1.1 still has a lot more shared data that is also accessed via atomic ops. And based on the performance numbers above, even though 3.1 recovers from some of the massive defects of 3.0, it still requires twice as many servers as 1.1.1 to process the same workload.
As promised, I got a call graph on 3.1. It's huge so I'm compressing it. I'm also attaching the flame graph, but it's not really exploitable beyond seeing what the main direct callers of the locks are. The pthread_rwlock_* here are our functions. pl_wait_unlock_long is the relaxation function that waits for someone else to release the lock, which is why it appears a lot. I'm well aware that when you're dealing with 220 locks on a single connection, these won't be removed in the blink of an eye, as I'm certain it already took time to add them 😦
I really do think that one of the biggest mistakes in 1.1 was the removal of the locking callbacks, because in addition to adding an overdose of locks, newer versions don't even offer the possibility to switch to something lighter than pthread anymore, and that's extremely frustrating. I tend to think that if there is one low-hanging fruit for the short term, it's definitely re-enabling a locking API. And please, not one using a generic unlock, but separate read_unlock and write_unlock functions, which the user may both map to the same generic unlock function if they want. Generally speaking, having to read a lock just to figure out how to unlock it doubles the unlocking cost for purely API reasons, and that's exactly the type of thing that ends up with a heavy mechanism like you have in pthread.
Hoping this helps. ossl31-pthemu-perfreport.txt.gz
Hi Matt,
most of the notes I had were regarding 1.1.1, which we'll stick to, and for now I don't have access to this machine. But I think I will again relatively soon, so as time permits I'll try to collect more info. The thing is, it's difficult to observe the call sites because the time wasted calling locks is not evenly spread, due to context switches and sleeps in the kernel. In my case it was by redefining my own pthread_rwlock_*() functions that I had the idea of just adding a call counter, and I was amazed by the number of calls per connection. I do have some variants of these functions for debugging purposes which disable exponential back-off and more easily permit collecting call graphs using perf; they're just a bit less easy to use and less precise.
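The counting trick is easy to reproduce with an LD_PRELOAD interposer; a sketch (build as cc -shared -fPIC -o count.so count.c -ldl, then run with LD_PRELOAD=./count.so):

```c
#define _GNU_SOURCE
#include <dlfcn.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

static atomic_ulong rdlock_calls;

/* Count, then forward to the real libpthread implementation. */
int pthread_rwlock_rdlock(pthread_rwlock_t *l)
{
    static int (*real)(pthread_rwlock_t *);

    if (real == NULL)
        real = (int (*)(pthread_rwlock_t *))
               dlsym(RTLD_NEXT, "pthread_rwlock_rdlock");
    atomic_fetch_add_explicit(&rdlock_calls, 1, memory_order_relaxed);
    return real(l);
}

/* Same pattern applies to wrlock/tryrdlock/trywrlock/unlock. */

__attribute__((destructor))
static void report(void)
{
    fprintf(stderr, "pthread_rwlock_rdlock calls: %lu\n",
            atomic_load_explicit(&rdlock_calls, memory_order_relaxed));
}
```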
@wtarreau - I wonder if you have any documentation or notes from that time which might help us identify which specific locks were causing the problem? I’m assuming the “80 calls” were not from 80 different call sites, but certain calls being repeated multiple times?
@wtarreau, there was a bug in 3.0 where multiple threads ended up completely serialised which killed multiprocessor performance and, from your description, I think you hit this. This was fixed in #18151 which was merged to later 3.0 releases.
OpenSSL does use atomic operations for reference counting (mostly). It was quite a performance win when it was introduced.
@wfurt, the rule of thumb is pre-load what you're going to use. Fetch is an additional cost over 1.1.1, so pre-fetch. Certificate/key loading is slower, so pre-load where possible. Avoid the convenience calls like EVP_sha256() because these do a (late) fetch internally.
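The pre-fetch pattern looks like this (a sketch assuming a single default library context: fetch once at startup, reuse on the hot path, free with EVP_MD_free() at shutdown):

```c
#include <openssl/evp.h>

static EVP_MD *g_sha256;   /* fetched once, shared read-only afterwards */

int digests_init(void)
{
    g_sha256 = EVP_MD_fetch(NULL, "SHA-256", NULL);
    return g_sha256 != NULL;
}

int sha256_oneshot(const void *in, size_t len, unsigned char out[32])
{
    unsigned int outlen;

    /* no implicit late fetch here, unlike passing EVP_sha256() */
    return EVP_Digest(in, len, out, &outlen, g_sha256, NULL);
}
```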
@richsalz, the post-quantum provider is using provider-based decoders. I am glad that the flexibility was included to enable such.
cc @wfurt to possibly help here
Perf with 3.0 is really a huge issue with MsQuic and I’m not sure we can really adopt it seriously as is. We can’t just artificially work around things just for the sake of making a perf test look good. We need to make sure to exercise real world scenarios.
The project went too far. In order to support things that aren’t currently available, the project is hurting – a lot – current users. That is the wrong trade-off. (This kind of “make it flexible, we can’t see future needs” has been an endemic problem in OpenSSL code since the beginning, going back to SSLeay.)
@nibanks you might want to take a look at Pauli’s suggestions above. My feeling, however, is that “quite a bit heavier” for flexibility that nobody can currently use is a mistake.
This is on my to look at list but I’m pretty busy following last week’s face to face and I’ve a huge pile of other stuff to look at.
@beldmit, I have difficulty believing that strndup/free is the culprit. Locking is more likely as @richsalz reported. 3.1 should have improved things in this regard by avoiding the more egregious locking problems. I’m hopeful but I cannot say if any others remain at this point. Time will tell.
However, I’m not certain that locking fixes alone will be sufficient. @richsalz’s code directly encounters one of the known problems: decoders are slower in 3.0 due to the provider architecture. We need to be able to support a provider loading a key in a format that libcrypto doesn’t know about and this means multiple parsing attempts. There simply isn’t any other option. That doesn’t mean we can’t streamline things further – we intend to look more deeply into this but contractor time is a limited resource and there are many other priorities. Rest assured that performance is definitely on the backlog and is important to the project.
As for MSQUIC, there is a quick (sic) win available by refactoring the QUIC code to not create a new SSL_CTX on every connection. Doing this in 1.1.1 is a moderately heavy operation, doing it in 3.0 is quite a bit heavier. I suspect that avoiding this will improve things significantly, both for 1.1.1 and 3.0/3.1. Unfortunately, I do not see myself getting time to contribute such a patch to MSQUIC any time soon.
Volunteers who can contribute such changes are welcomed of course.
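The shape of that refactor, sketched with hypothetical function names: one long-lived SSL_CTX configured at startup and shared by every connection, instead of paying for SSL_CTX_new() plus certificate/key loading per connection:

```c
#include <openssl/ssl.h>

static SSL_CTX *g_ctx;

int tls_global_init(const char *cert_file, const char *key_file)
{
    g_ctx = SSL_CTX_new(TLS_method());
    if (g_ctx == NULL)
        return 0;
    /* all the expensive one-time setup happens here, once */
    if (SSL_CTX_use_certificate_chain_file(g_ctx, cert_file) <= 0
            || SSL_CTX_use_PrivateKey_file(g_ctx, key_file,
                                           SSL_FILETYPE_PEM) <= 0) {
        SSL_CTX_free(g_ctx);
        g_ctx = NULL;
        return 0;
    }
    return 1;
}

SSL *tls_new_connection(void)
{
    return SSL_new(g_ctx);   /* cheap: shares the configured context */
}
```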
ossl_namemap_name2num_n with descendants (including locks/unlocks) looks like a hotspot to me - 45% according to callgrind in master. @paulidale @hlandau - could you please take a look?
I have some good news. Current master (and 3.1, I presume) is 4 times faster than 3.0. I will still look to see if there is something obvious to fix in 3.1.