openssl: 3.0 performance degraded due to locking
Using some internal QUIC tests, we see this timing in 1.1:
$ USE_GQUIC_VERSIONS=1 perf record -F 999 -g ./quic_lib_tests --gtest_repeat=100 --gtest_filter=*ZeroRttDisabled*
$ perf report
+ 31.73% 31.55% quic_lib_tests quic_lib_tests [.] bn_sqr8x_internal
+ 9.28% 9.28% quic_lib_tests quic_lib_tests [.] mul4x_internal
+ 4.91% 4.91% quic_lib_tests quic_lib_tests [.] sha256_block_data_order_avx
In 3.0 we see this:
$ USE_GQUIC_VERSIONS=1 perf record -F 999 -g ./quic_lib_tests --gtest_repeat=100 --gtest_filter=*ZeroRttDisabled*
$ perf report
+ 11.02% 10.99% quic_lib_tests quic_lib_tests [.] bn_sqr8x_internal
+ 8.38% 8.08% quic_lib_tests libpthread-2.31.so [.] __pthread_rwlock_rdlock
+ 7.65% 7.51% quic_lib_tests libpthread-2.31.so [.] __pthread_rwlock_unlock
+ 4.98% 4.78% quic_lib_tests quic_lib_tests [.] getrn
+ 4.14% 4.11% quic_lib_tests quic_lib_tests [.] mul4x_internal
+ 3.37% 2.57% quic_lib_tests quic_lib_tests [.] ossl_tolower
3.30% 3.30% quic_lib_tests quic_lib_tests [.] ossl_lh_strcasehash
+ 2.72% 2.13% quic_lib_tests quic_lib_tests [.] OPENSSL_strcasecmp
+ 2.29% 2.05% quic_lib_tests quic_lib_tests [.] ossl_lib_ctx_get_data
+ 1.93% 1.93% quic_lib_tests quic_lib_tests [.] sha256_block_data_order_avx
This seems to be part of OSSL_DECODER_CTX_new_for_pkey: 16% of the time is spent in locking, on a single-threaded binary, and another 10% in a string hashtable lookup.
If anyone on the project is going to look at this, I will try to get a small reproducer. But the time for our QUIC tests is doubling.
About this issue
- Original URL
- State: open
- Created a year ago
- Reactions: 4
- Comments: 183 (99 by maintainers)
Commits related to this issue
- Avoid taking a write lock in ossl_provider_doall_activated() We refactor ossl_provider_doall_activated() so that we only need to take a read lock instead of a write lock for the flag_lock. This shoul... — committed to mattcaswell/openssl by mattcaswell a year ago
- Avoid taking a write lock in RAND_get_rand_method() The function RAND_get_rand_method() is called every time RAND_bytes() or RAND_priv_bytes() is called. We were obtaining a write lock in order to fi... — committed to mattcaswell/openssl by mattcaswell a year ago
- Don't take a write lock when freeing an EVP_PKEY When freeing the last reference to an EVP_PKEY there is no point in taking the lock for the key. It is the last reference and is being freed so must o... — committed to mattcaswell/openssl by mattcaswell a year ago
- When we're just reading EX_CALLBACK data just get a read lock The crypto_ex_data code was always obtaining a write lock in all functions regardless of whether we were only reading EX_CALLBACK data or... — committed to mattcaswell/openssl by mattcaswell a year ago
- Don't get a write lock when querying the parent reseed_count or strength A read lock should be sufficient when all we are doing is querying the parent DRBG reseed_count or strength. Partially fixes ... — committed to mattcaswell/openssl by mattcaswell a year ago
- Modify ENGINE_pkey_asn1_find_str() to use a read lock instead of a write ENGINE_pkey_asn1_find_str() does not make any modifications to fields controlled by the global_engine_lock. The only change ma... — committed to mattcaswell/openssl by mattcaswell a year ago
- Avoid an unneccessary lock if we didn't add anything to the store Partially fixes #20286 — committed to mattcaswell/openssl by mattcaswell a year ago
- Don't take a write lock to retrieve a value from a stack ossl_x509_store_ctx_get_by_subject() was taking a write lock for the store, but was only (usually) retrieving a value from the stack of object... — committed to mattcaswell/openssl by mattcaswell a year ago
- Optimise locking in rsa_get_blinding() We optimise locking in rsa_get_blinding() so that we normally take a read lock, and only fallback to a write lock if we need to. This will be very slightly slow... — committed to mattcaswell/openssl by mattcaswell a year ago
- Refactor the DRBG implementations to manage locking themselves Previously the EVP layer would call lock and unlock functions on the underlying DRBG implementation to say when a lock should be acquire... — committed to mattcaswell/openssl by mattcaswell a year ago
- Enable obtaining certain DRBG params without a lock Even if a DRBG has locking enabled on it, there are certain parameters which are still safe to obtain even without a lock. The max_request value is... — committed to mattcaswell/openssl by mattcaswell a year ago
- Avoid taking a write lock in ossl_provider_doall_activated() We refactor ossl_provider_doall_activated() so that we only need to take a read lock instead of a write lock for the flag_lock. This shoul... — committed to quictls/openssl by mattcaswell a year ago
So I have the following:
Am I missing any? I am working on merging this into a branch of quictls/openssl so that people can test.
Here is a small reproducer in C++, written by an Akamai colleague.
The code is at https://gist.github.com/richsalz/31a98a3095fa36ab6a66082a0c028a6f
Here are the compile commands and timing. This small reproducer is 17 times slower under 3.0. When used in our QUIC implementation, small transfers run twice as slow on average. Microsoft’s MSQUIC (their production-quality open-source stack on Windows and Azure) avoids 3.0 for this exact reason; they use the 1.1 branch of @quictls/openssl.
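For orientation, the gist above is the actual reproducer; the sketch below only illustrates the kind of loop that hammers this path. The key type, loop count, and build line are my own illustrative choices, not taken from the gist:

```c
/* Build (assumption): cc repro.c -lcrypto */
#include <openssl/decoder.h>
#include <openssl/evp.h>

int main(void)
{
    for (int i = 0; i < 100000; i++) {
        EVP_PKEY *pkey = NULL;
        /* In 3.0 this constructor walks the provider/namemap tables under
         * rwlocks, which is where the perf report above shows time going. */
        OSSL_DECODER_CTX *dctx = OSSL_DECODER_CTX_new_for_pkey(
            &pkey, "PEM", NULL, "ED25519",
            EVP_PKEY_KEYPAIR, NULL, NULL);
        OSSL_DECODER_CTX_free(dctx);
        EVP_PKEY_free(pkey);
    }
    return 0;
}
```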
Imagine that struct ssl_ctx_st has a field of type struct ssl_ctx_st * that points to an immutable version of itself, and that you use RCU or similar to a) replace that pointer with a pointer to the next new immutable version, and b) read it.
There can be only one user doing this in a process, and it must do this before all uses of OpenSSL. If two users try this, then the first must win and the second must lose, and the first user’s lock callbacks must be the only ones that get used. If any user causes OpenSSL [self-]initialization to happen then all subsequent attempts to initialize OpenSSL must be no-ops.
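A minimal sketch of that snapshot-pointer idea in C11 atomics (names are hypothetical, and reclamation of old snapshots is deliberately elided):

```c
#include <stdatomic.h>

struct snapshot { int example_param; };   /* immutable once published */

static _Atomic(struct snapshot *) g_current;

static struct snapshot *snapshot_read(void)
{
    /* pairs with the writer's release store below */
    return atomic_load_explicit(&g_current, memory_order_acquire);
}

static void snapshot_publish(struct snapshot *next)
{
    /* 'next' must be fully built before this store */
    atomic_store_explicit(&g_current, next, memory_order_release);
    /* freeing the previous snapshot needs a grace period
     * (RCU, epochs, hazard pointers, ...) -- elided here */
}
```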
And I strenuously object to libraries not having self-initialization. Optional initialization when you’re the first, that’s OK but lame, but really, no, OpenSSL’s threading performance should not depend on you providing better lock callbacks. The way to make OpenSSL thread performance not suck is to make OpenSSL do more shared-nothing-mutable, as that would greatly reduce the amount of lock contention.
Yes, @richsalz, we are in sync on this list.
Or, if deactivation would not actually ever unload the provider, just make it unavailable for further fetches, we could just atomically increment or decrement the activation count. It actually does NOT matter what any other thread does in such case - an application cannot expect consistent behavior if multiple threads are randomly loading/unloading providers in the same library context.
We could truly unload the provider only when a libctx is being torn down.
Since the team seems to no longer be spending any effort on reducing locking, downstream users have little choice.
But I disagree with your comment, @t8m: lightweight locking is a completely reasonable option, especially if the heavy guarantees are not needed.
The other thing to bear in mind is that there are numerous performance improvements in 3.1. It’s still not at 1.1.1 levels, but it’s better than 3.0. If you do re-run the exercise, doing it against 3.1 would be most helpful for us to figure out what we still have left to do.
On our side, we spent some time trying to use a simple HTTP client / load generator on Ubuntu 22 and were stumped to see it saturate at around 400 connections per second for 48 threads, while on other machines it worked fine. We noticed that there was something completely odd regarding locking, as it was impossible for the process to use much more than one CPU and most of the time was spent in futex_wake() and friends. After trying many things we finally built 1.1.1 locally and linked against it, and suddenly the performance reached 74000 cps, or almost 200 times more.
Additionally, on the server side we had already noticed an extreme use of locking, with chains starting like this one and involving pthread_rwlocks:
I instrumented the code in the hope of finding “the” culprit and found something like 80 calls to any flavor of pthread_rwlock_*() in a single accept/get/response/close cycle! 1.1.1 already had quite a bunch of them (I don’t remember how many) but significantly less overhead, as if they were used on shorter chains. In the end we reimplemented our own locks (same principle as the lock callbacks that were unfortunately dropped in 1.1) and could more than double the server-side performance thanks to using much lighter locks.
So yes, there definitely is a big locking problem there. The rule of thumb for a call to any pthread_* function should normally be “am I really willing to give up the CPU and switch to another task, possibly thrashing all the L1 cache here?”. Of course that’s not exact, but in terms of cost and impact it’s extremely close to reality. pthread_* functions should only be for high-level code (browsers etc.) and rare cases. It’s way too expensive for low-level or library code.
I failed to find a single occurrence of an atomic operation in the whole codebase, which makes me think that all of them are instead implemented by hand around a lock. That’s a dead end when you consider that scalable code normally tries hard to avoid atomic ops because they slow things down, and that pthread_* functions do involve atomic ops, plus a syscall in the contended case that is at least 100 times slower than the atomic op itself…
For now we’re condemned to stick to 1.1.1, as 3.0 is not just “degraded”, it’s simply not realistically usable at all for anything beyond the openssl command-line utility for us. This will pose security issues in the field over time once 1.1.1 stops being supported, as power users are only starting to discover the big regression now by progressively switching to new distros. But for now we have no other choice.
Dear @nibanks, I’d split things apart.
The redesign to providers causes a significant slowdown in some specific areas. It’s a problem to be solved at the OpenSSL level, for corner cases or in general.
The architectural difference between the 1.1.1 and 3.0 series is significant in itself, so the approaches that worked well for 1.1.1 should be reconsidered for 3.0. @paulidale provided a totally relevant recommendation, and ignoring it looks like a suboptimal decision to me.
Replacing LHASH, both API and implementation, would get a thumbs up from me. The existing LHASH code can be left as a near-deprecated public API for applications which still use it. Unfortunately LHASH is used as part of our public API in places so we can’t realistically formally deprecate it.
I certainly have some ideas on the design of a replacement (internal) API.
I’d like to see any replacement minimise allocations as much as possible and maybe even support moves for objects which can support it (partially intrusive design). I also want it to generate code which allows hash and equality function pointer calls to be devirtualised and inlined.
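To make the devirtualisation point concrete, here is a toy sketch (not a proposal for the actual API): a type-specialised map whose hash and equality functions are static inline, so the compiler can inline them instead of calling through function pointers the way LHASH does. Fixed size, no deletion, purely illustrative:

```c
#include <stddef.h>
#include <string.h>

#define MAP_SLOTS 256

struct str_map { const char *keys[MAP_SLOTS]; int vals[MAP_SLOTS]; };

static inline size_t str_hash(const char *s)
{
    size_t h = 5381;
    while (*s)
        h = h * 33 + (unsigned char)*s++;
    return h;
}

static inline int str_eq(const char *a, const char *b)
{
    return strcmp(a, b) == 0;
}

/* Open addressing with linear probing; both helpers above can be
 * inlined because no function pointer is involved. */
static int *str_map_get(struct str_map *m, const char *key, int insert)
{
    size_t i = str_hash(key) % MAP_SLOTS;

    for (size_t n = 0; n < MAP_SLOTS; n++, i = (i + 1) % MAP_SLOTS) {
        if (m->keys[i] == NULL) {
            if (!insert)
                return NULL;
            m->keys[i] = key;        /* caller owns the key's lifetime */
            return &m->vals[i];
        }
        if (str_eq(m->keys[i], key))
            return &m->vals[i];
    }
    return NULL;                     /* table full */
}
```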
Some food for thought:
I’ve tried the test_25519 reproducer with all the current 3.x branches and 1.1.1. Here are the results:
With 3.2, I think we are pretty near 1.1.1 performance on this particular benchmark, which is basically testing repeated decoding of a 25519 key.
[moved reply comment to https://github.com/openssl/openssl/issues/21918]
Regarding OpenSSL’s excessive use of shared mutable state: wholly agree. This was an architectural mistake and we need to back out of it and find a way to do it that doesn’t break API. So I’m basically in support of the architectural destination you’re elaborating on here.
If, and only if, only applications ever set lock callbacks (and never ever ever libraries that use OpenSSL), and still only if OpenSSL self-initializes everything on first need, then that would be almost OK.
The program (the application, main()) may not even know that it has to initialize OpenSSL, because libraries that the program uses use OpenSSL. For example, just using the name service switch without an nscd can cause this to happen via name service switch plugins that run in the application’s processes without the application authors even knowing.
Not only that, but OpenSSL uses versioned symbols now, and multiple versions can co-exist in the same process. Yes, I’ve got apps that indirectly have OpenSSL 1.1 and 3.0 loaded in the same process, and never the twain meet. But each object gets to link with only one version of OpenSSL, so the only way the application could set lock callbacks for all versions would be for it to dlopen() them and use dlsym(). (Sure, if I needed to set different lock primitives, I’d probably only have to do it for one version of OpenSSL, so that might work.)
Going back to global lock callbacks would be a disaster if any OpenSSL-using libraries ever install their own lock callbacks. And I guarantee that some developer somewhere will misunderstand and try to initialize OpenSSL lock callbacks in their library.
You also have ELF symbol interposition to help you in this case. Symbol interposition is actually a very nice solution because you don’t even have to change the application if you don’t want to or can’t – just use LD_PRELOAD.
OpenSSL just has to get its design for threaded scalability right. There should be no need to switch to lock primitives that use spinlocks or anything of the sort.
Read-write locks are a performance disaster. So are mutexes for that matter. The correct answer is for OpenSSL to switch to a shared-nothing-mutable model, but that’s clearly not happening before 3.4 now 😢.
In general 100% yes. It is for an application not a library to decide how to handle errors, and calling abort would preempt this. The only exception as I see it is assertion failures — and that is the sole circumstance where OpenSSL (correctly) uses abort today. (Allowing an application to provide a custom assertion failure handler callback would also be reasonable, but after that point any attempt to reenter the library should rightly be considered UB.)
The only cases where acquiring a standard POSIX mutex can fail are cases of caller error, i.e. a bug in our code in which an expected invariant is not being met. These situations are correctly addressed with an assertion (where they are even detectable), as there is no guarantee that anything about our internal state is valid in this case.
It should also be noted that the usage errors which can be made with a POSIX mutex (relock, unlock when not owner) also constitute undefined behaviour. There is no guarantee that these operations will fail ‘nicely’ when they are misused and if they are misused something has gone seriously wrong and we don’t have any kind of guarantee of sane behaviour.
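In code, the policy described above amounts to something like this sketch (illustrative, not OpenSSL’s actual wrapper):

```c
#include <assert.h>
#include <pthread.h>

/* A lock operation can only fail here because our own invariants are
 * broken (relock, unlock-when-not-owner, bad init), so assert rather
 * than attempt recovery from an unknowable internal state. */
static void must_lock(pthread_mutex_t *m)
{
    int rc = pthread_mutex_lock(m);

    assert(rc == 0);   /* EINVAL/EDEADLK here means a bug in our code */
    (void)rc;          /* keep NDEBUG builds warning-free */
}
```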
This will be a nice improvement; time-based releases are essential to keep the project healthy and to kick out work that cannot be completed in time and is going to cause trouble.
We have implemented numerous lock related improvements in the master branch (which will become 3.2 when it is released). Mostly this is removal of unnecessary write locks (sometimes converting them to read locks, or sometimes removing them completely/avoiding them). We are currently targeting October for the 3.2 release. Beyond 3.2 we are shifting to a “time-based” release policy, so the next release after 3.2 (presumably 3.3) will be April 2024, and the one after that October 2024.
For the 3.2 release we are currently forming the release steering committee mentioned in the policy.
I still suspect that not creating lots of SSL_CTXs will be a decent performance win.
It probably suggests that, for the MSQUIC workload test at least, the performance issues are not related to lock contention (or not the locks I have addressed, anyway). In master, in a default handshake, I was not encountering any write locks other than the ones resolved in my patches (after the initial caches have been primed etc.). Possibly 3.1/3.0 use slightly different write locks from master - I haven’t checked that. I wonder how many threads were in use for the MSQUIC test?
From 1.1 to 3.1 handshakes/sec dropped by 18%. The other tests show negligible change.
I have created a QuicTLS branch at https://github.com/quictls/openssl/tree/openssl-3.1.0%2Bquic%2Blocks that has all of the locking PR’s. @wtarreau, @nibanks please take a look. And anyone else, of course.
And in any case it’s mostly a matter of asking. As a smaller open-source project, in haproxy we’re very frequently offered hardware to test from CPU vendors, NIC & crypto accelerator vendors, or cloud providers, and in fact a few times a year we have to decline for lack of developer time. Given how critical the OpenSSL code’s scalability is for many vendors, I see no reason why the project couldn’t benefit from such offers: it will dictate whether or not their hardware will be used correctly in the next few years. Simply making a public statement on the web site that you need frequent (daily? weekly?) access to large-scale hardware for performance regression testing should be sufficient to get you what you need and to validate the effect of intuitive changes. At least we never commit performance-related changes that were not measured, and that’s fortunate, because more often than not the intuition is wrong.
Same here, at the very least when running quick tests on shared lab machines, we must not spend 10 minutes figuring which patches to download and apply on top of what, so a branch, tag or commit ID is really essential.
FYI, we (MsQuic) have dedicated hardware and existing tests that can be used to fairly easily measure perf improvements. The only problem (and possibly point of contention) is that we need a branch of quictls to point MsQuic to, to test. If someone can make that available, I can run the tests and have the data within an hour.
I think there are some more patches to come yet…
Random state is per-thread, re-seeded from the global when needed. See https://www.openssl.org/docs/man3.1/man7/EVP_RAND.html
I implemented my idea in #20927.
Actually… probably we could just decrement activatecnt not under lock. Only if the result is then low enough to trigger an actual deactivation would we take hold of the write lock and really deactivate it. We would have to be careful to handle the situation where activatecnt is incremented again between it dropping low enough for a deactivation and us taking the write lock.
I believe I understand the reasoning as to why that write lock is there. If we can find a way to refactor things to get rid of it that would be great.
The purpose of ossl_provider_doall_activated is to call a callback (cb) on all providers that are currently activated. However we could have the situation where another thread attempts to deactivate a provider while ossl_provider_doall_activated is running - thus resulting in the cb actually being called on a deactivated provider in error. In order to avoid this we iterate through all the providers and take hold of the flag_lock to allow us to query whether prov->flag_activated is set. This much could be done with a read lock. However, if it is set then we next call provider_activate. This has the effect of writing to prov->activatecnt (in order to increment the number of “activation” counts) and prov->flag_activated (in order to set it - even though we know it is already set). This is the reason that the write lock is obtained. Subsequently the write lock is then released.
The result is a list of providers which we know cannot be deactivated until activatecnt is decremented again, and therefore we are free to call the callback not under lock, without fear of another thread deactivating the provider.
The final step is then to call provider_deactivate on all those providers where we incremented the activatecnt, to decrement it (and actually deactivate the provider if the activatecnt has fallen to a low enough value).
It occurs to me that we could optimise this. provider_activate() does a number of things but most of them are irrelevant in this situation. If we took a read lock on the flag_lock and determined that prov->flag_activated is set, then all we actually need to do is increment prov->activatecnt. This could probably be done as an atomic.
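A hypothetical sketch of that fast path, plus the locked fallback for the final decrement (the struct and function names just echo the description above; this is not OpenSSL’s actual code):

```c
#include <pthread.h>
#include <stdatomic.h>

struct provider {
    atomic_int activatecnt;
    pthread_rwlock_t flag_lock;
    int flag_activated;
};

/* What ossl_provider_doall_activated needs: just bump the count. */
static void provider_ref(struct provider *prov)
{
    atomic_fetch_add_explicit(&prov->activatecnt, 1, memory_order_acq_rel);
}

static void provider_unref(struct provider *prov)
{
    if (atomic_fetch_sub_explicit(&prov->activatecnt, 1,
                                  memory_order_acq_rel) > 1)
        return;                    /* still activated: lock-free exit */

    /* The count may have hit zero; take the write lock and re-check,
     * since another thread could have re-activated it in the window. */
    pthread_rwlock_wrlock(&prov->flag_lock);
    if (atomic_load_explicit(&prov->activatecnt, memory_order_acquire) == 0)
        prov->flag_activated = 0;  /* real deactivation work goes here */
    pthread_rwlock_unlock(&prov->flag_lock);
}
```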
There remains a problem in the final provider_deactivate step, because deactivating the provider involves taking another write lock on prov->flag_lock, and it is less clear to me how we could avoid that one.
I don’t need to hear this from the OMC. I only need this fixed. The project needs this fixed. I’m going to leave the politics out.
It’s certainly not good that people can’t move to 3.x due to the performance problems.
Just never remove any providers, only mark them disabled with a memory barrier.
@hlandau
The model where the application provides its own lock functions is and should stay dead. Let’s never do that again in any project please.
The problem isn’t that OpenSSL’s lock functions are slow! There is nothing that an application could provide that would go faster!
The problem is as described in #20698 and #20715:
which means that:
In #20715 I propose using one of any number of “user-land RCU”-like APIs to make all readers non-blocking/wait-free and thus fast. There seems to be some acceptance of that idea there. We should definitely explore that.
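For the flavour of it, here is what the read and update paths look like with userspace-rcu (liburcu); this is illustrative of the proposal, not OpenSSL code, and each reader thread must call rcu_register_thread() once at startup:

```c
#include <stdlib.h>
#include <urcu.h>               /* userspace-rcu; link with -lurcu */

struct table { int dummy; };    /* immutable after publication */

static struct table *live_table;

static void reader_lookup(void)
{
    rcu_read_lock();            /* wait-free: never blocks on the writer */
    struct table *t = rcu_dereference(live_table);
    /* ... read-only use of t ... */
    (void)t;
    rcu_read_unlock();
}

static void writer_replace(struct table *next)
{
    struct table *old = live_table;

    rcu_assign_pointer(live_table, next);
    synchronize_rcu();          /* wait out readers still seeing 'old' */
    free(old);
}
```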
But the all-readers-are-now-writers-surprise!-haha bug should be fixed no matter what because a) that’s the fastest path to modest victory here, b) using a user-land RCU thing with all-readers-are-writers isn’t going to be any better than mutexes anyways.
The bug was, near as I can tell, trying to edit the provider list while looking for a suitable provider. This makes… anything that involves crypto slow in threaded processes.
@paulidale Alright, if you want to go there…
The project’s “we’ve a lot of differing priorities to balance” can be summed up in one four-letter word: QUIC. Nobody publicly wants that. The project should fix what it broke. The project, and its paid staff, should make fixing the performance problems a priority. That’s my opinion, and I’m sorry if you don’t like it, but that’s a thing with open source: you have to take complaints as well as the kudos.
It’s never too late to change direction.
See also #20698. I think that’s probably a dup of this one.
@t8m I strongly disagree with you. Using a random lock implementation “just because it’s available” will remain a problem even when there is just one single lock left in the end, if there are corner cases where this lock can be strongly contended. Do you even know that pthread_rwlocks do not implement exponential back-off? On the machine where I met these issues, our watchdog was sometimes triggering, crashing the process inside OpenSSL. Hint: it only triggers if a thread has made zero progress for 2 seconds. This means that some of the locks OpenSSL takes behind my back, having chosen to ignore the ones I provide, are provoking pauses of more than two seconds of traffic on a single lock, just because they were never designed to be abused like this. Locking must always be the last resort, and when locks are used they must be carefully tailored to match the application’s needs. There’s no one-size-fits-all in locking, and these ones are definitely a serious problem, as shown in the flame graph by the way.
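For anyone unfamiliar with the term, here is a toy sketch of exponential back-off on a trivial test-and-set lock (just the principle, not what our plock implementation actually does):

```c
#include <sched.h>
#include <stdatomic.h>

static void lock_with_backoff(atomic_flag *l)
{
    unsigned spins = 1;

    while (atomic_flag_test_and_set_explicit(l, memory_order_acquire)) {
        /* spin locally instead of hammering the shared cache line */
        for (unsigned i = 0; i < spins; i++)
            atomic_signal_fence(memory_order_seq_cst); /* cheap pause; real
                                                        * code would use a
                                                        * CPU relax hint */
        if (spins < (1u << 14))
            spins <<= 1;        /* back off harder as contention persists */
        else
            sched_yield();      /* give up the CPU only as a last resort */
    }
}

static void unlock_after_backoff(atomic_flag *l)
{
    atomic_flag_clear_explicit(l, memory_order_release);
}
```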
Users absolutely hate having to rebuild OpenSSL in the field. They always fear missing an update and being at risk; this is the reason why we’re still having difficulty making some of them switch to QuicTLS. And nowadays, given the pace at which wolfSSL is progressing, the huge bump in performance and the support for the QUIC API, for sure if one has to think about rebuilding in the field, the opportunity to switch to another stack comes to mind.
I have worked on embedded systems with multiple levels of threading, and removal of the locking API in favor of pthread would have definitely impacted that (I left before 1.1.1 was released).
Hi @mattcaswell
I could finally connect to the machine today and run the various tests for you. I compared openssl versions 1.1.1t, 3.0.8, 3.1.1-dev (commit 004bd8f97d), as well as wolfssl 5.5.4 (which uses very little locking). These were running on haproxy-2.8-dev8-c05d30, with regular pthread_rwlocks and with our own lighter rwlocks (plock). The machine is an intel Sapphire Rapids dual 8480+ (2 x 56 cores), hence 224 threads total. The tests only involved server mode (i.e. haproxy accepting connections) because the client mode in 3.0 is not even usable at all beyond just command-line tools.
The results are the following, in number of HTTPS connections per second, TLSv1.3, rsa2k cert:
- openssl 1.1.1: pthread: 44k, plock: 53k
- openssl 3.0.8: pthread: 10-15k, plock: 19k (3.0 is unstable with pthread)
- openssl 3.1.1: pthread: 26k, plock: 28k
- wolfssl 5.5.4: pthread: 120-122k, plock: 120-122k (reached the load generator's limits)
On a single core, all of them show the same performance, so the only difference is locking. I re-enabled the lock counting in the plock version; since we redefine the pthread_rwlock functions, it's easy to do. These were measured over 1 million total connections. The results below are the number of calls to the lock functions per connection:
- openssl 1.1.1: 120 rdlock, 276 wrlock, 396 unlock
- openssl 3.0.8: 635 rdlock, 56 wrlock, 691 unlock
- openssl 3.1.1: 164 rdlock, 57 wrlock, 220 unlock
- wolfssl 5.5.4: 0 rdlock, 1 wrlock, 1 unlock
As you see it is not complicated: the performance is inversely proportional to the number of lock calls. It's very likely that the extra rdlocks in 3.1 compared to 1.1.1 are still far too many, because they are performed on very thin operations. Also, what is not seen here are the atomic ops; maybe 3.1.1 still has a lot more shared data that is also accessed via atomic ops. And based on the performance numbers above, even though 3.1 recovers from some of the massive defects of 3.0, it still requires twice as many servers as 1.1.1 to process the same workload.
As promised, I got a call graph on 3.1. It's huge so I'm compressing it. I'm also attaching the flame graph, but it's not really exploitable beyond seeing what the main direct callers of the locks are. The pthread_rwlock_* here are our functions. pl_wait_unlock_long is the relaxation function that waits for someone else to release the lock, which is why it appears a lot. I'm well aware that when you're dealing with 220 locks on a single connection, these won't be removed in the blink of an eye, as I'm certain it already took time to add them 😦
I really do think that one of the biggest mistakes in 1.1 was the removal of the locking callbacks, because in addition to adding an overdose of locks, newer versions don't even offer the possibility to switch to something lighter than pthread anymore, and that's extremely frustrating. I tend to think that if there is one low-hanging fruit for the short term, it's definitely re-enabling a locking API. And please, not one using a generic unlock, but separate read_unlock and write_unlock functions, which the user may both map to the same generic unlock function if they want. Generally speaking, having to read a lock just to figure out how to unlock it doubles the unlocking cost for purely API reasons, and that's exactly the type of thing that ends up with a heavy mechanism like you have in pthread.
Hoping this helps. ossl31-pthemu-perfreport.txt.gz
Hi Matt,
most of the notes I had were regarding 1.1.1, which we'll stick to, and for now I don't have access to this machine. But I think I will again relatively soon, so as time permits I'll try to collect more info. The thing is, it's difficult to observe the call sites because the time wasted calling locks is not evenly spread, due to context switches and sleeps in the kernel. In my case it was by redefining my own pthread_rwlock_*() functions that I had the idea of just adding a call counter, and I was amazed by the number of calls per connection. I do have some variants of these functions for debugging purposes which disable exponential back-off and more easily permit collecting call graphs using perf; they're just a bit less easy to use and less precise.
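The counting trick is easy to reproduce with an LD_PRELOAD interposer; a sketch (build as cc -shared -fPIC -o count.so count.c -ldl, then run with LD_PRELOAD=./count.so):

```c
#define _GNU_SOURCE
#include <dlfcn.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

static atomic_ulong rdlock_calls;

/* Count, then forward to the real libpthread implementation. */
int pthread_rwlock_rdlock(pthread_rwlock_t *l)
{
    static int (*real)(pthread_rwlock_t *);

    if (real == NULL)
        real = (int (*)(pthread_rwlock_t *))
               dlsym(RTLD_NEXT, "pthread_rwlock_rdlock");
    atomic_fetch_add_explicit(&rdlock_calls, 1, memory_order_relaxed);
    return real(l);
}

/* Same pattern applies to wrlock/tryrdlock/trywrlock/unlock. */

__attribute__((destructor))
static void report(void)
{
    fprintf(stderr, "pthread_rwlock_rdlock calls: %lu\n",
            atomic_load_explicit(&rdlock_calls, memory_order_relaxed));
}
```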
@wtarreau - I wonder if you have any documentation or notes from that time which might help us identify which specific locks were causing the problem? I’m assuming the “80 calls” were not from 80 different call sites, but certain calls being repeated multiple times?
@wtarreau, there was a bug in 3.0 where multiple threads ended up completely serialised which killed multiprocessor performance and, from your description, I think you hit this. This was fixed in #18151 which was merged to later 3.0 releases.
OpenSSL does use atomic operations for reference counting (mostly). It was quite a performance win when it was introduced.
@wfurt, the rule of thumb is pre-load what you're going to use. Fetch is an additional cost over 1.1.1, so pre-fetch. Certificate/key loading is slower, so pre-load where possible. Avoid the convenience calls like EVP_sha256() because these do a (late) fetch internally.
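The pre-fetch pattern looks like this (a sketch assuming a single default library context: fetch once at startup, reuse on the hot path, free with EVP_MD_free() at shutdown):

```c
#include <openssl/evp.h>

static EVP_MD *g_sha256;   /* fetched once, shared read-only afterwards */

int digests_init(void)
{
    g_sha256 = EVP_MD_fetch(NULL, "SHA-256", NULL);
    return g_sha256 != NULL;
}

int sha256_oneshot(const void *in, size_t len, unsigned char out[32])
{
    unsigned int outlen;

    /* no implicit late fetch here, unlike passing EVP_sha256() */
    return EVP_Digest(in, len, out, &outlen, g_sha256, NULL);
}
```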
@richsalz, the post-quantum provider is using provider-based decoders. I am glad that the flexibility was included to enable such.
cc @wfurt to possibly help here
Perf with 3.0 is really a huge issue with MsQuic and I’m not sure we can really adopt it seriously as is. We can’t just artificially work around things just for the sake of making a perf test look good. We need to make sure to exercise real world scenarios.
The project went too far. In order to support things that aren’t currently available, the project is hurting – a lot – current users. That is the wrong trade-off. (This kind of “make it flexible, we can’t see future needs” has been an endemic problem in OpenSSL code since the beginning, going back to SSLeay.)
@nibanks you might want to take a look at Pauli’s suggestions above. My feeling, however, is that “quite a bit heavier” for flexibility that nobody can currently use is a mistake.
This is on my to look at list but I’m pretty busy following last week’s face to face and I’ve a huge pile of other stuff to look at.
@beldmit, I have difficulty believing that strndup/free is the culprit. Locking is more likely as @richsalz reported. 3.1 should have improved things in this regard by avoiding the more egregious locking problems. I’m hopeful but I cannot say if any others remain at this point. Time will tell.
However, I’m not certain that locking fixes alone will be sufficient. @richsalz’s code directly encounters one of the known problems: decoders are slower in 3.0 due to the provider architecture. We need to be able to support a provider loading a key in a format that libcrypto doesn’t know about and this means multiple parsing attempts. There simply isn’t any other option. That doesn’t mean we can’t streamline things further – we intend to look more deeply into this but contractor time is a limited resource and there are many other priorities. Rest assured that performance is definitely on the backlog and is important to the project.
As for MSQUIC, there is a quick (sic) win available by refactoring the QUIC code to not create a new SSL_CTX on every connection. Doing this in 1.1.1 is a moderately heavy operation, doing it in 3.0 is quite a bit heavier. I suspect that avoiding this will improve things significantly, both for 1.1.1 and 3.0/3.1. Unfortunately, I do not see myself getting time to contribute such a patch to MSQUIC any time soon.
Volunteers who can contribute such changes are welcomed of course.
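The shape of that refactor, sketched with hypothetical function names: one long-lived SSL_CTX configured at startup and shared by every connection, instead of paying for SSL_CTX_new() plus certificate/key loading per connection:

```c
#include <openssl/ssl.h>

static SSL_CTX *g_ctx;

int tls_global_init(const char *cert_file, const char *key_file)
{
    g_ctx = SSL_CTX_new(TLS_method());
    if (g_ctx == NULL)
        return 0;
    /* all the expensive one-time setup happens here, once */
    if (SSL_CTX_use_certificate_chain_file(g_ctx, cert_file) <= 0
            || SSL_CTX_use_PrivateKey_file(g_ctx, key_file,
                                           SSL_FILETYPE_PEM) <= 0) {
        SSL_CTX_free(g_ctx);
        g_ctx = NULL;
        return 0;
    }
    return 1;
}

SSL *tls_new_connection(void)
{
    return SSL_new(g_ctx);   /* cheap: shares the configured context */
}
```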
ossl_namemap_name2num_n with descendants (including locks/unlocks) looks like a hotspot to me - 45% according to callgrind in master. @paulidale @hlandau - could you please take a look?
I have some good news. Current master (and 3.1, I presume) is 4 times faster than 3.0. I will still look to see if there is something obvious to fix in 3.1.