openssl: Memory leaks with shared windows builds
I’ve been helping maintain openssl & cryptography in the conda-forge ecosystem, and one of our biggest stumbling blocks in that regard has been upgrading to OpenSSL 3. In large part that’s due to me not being confident enough to raise an issue because I didn’t feel I understood the failures well enough at the time.
After sorting out most build errors, we’ve been stuck that the very thorough cryptography test suite notices memory leaks (only) on windows. At the time I opened https://github.com/pyca/cryptography/issues/7379, but upstream cryptography cannot really help with debugging this due to infrastructural reasons (they compile against a prepared static build, we need to build against shared libs).
Even more curiously (and another reason that kept me from opening this issue here) is that the failures only show up for CPython 3.7, 3.9 & 3.10, but not for CPython 3.8 or PyPy 3.8, 3.9. This makes it look like some sort of interaction with the Python runtime is happening, but truth be told, I don’t know. Maybe the memory leaks are always there and being hidden through other interferences?
In any case, I wanted to kindly ask for help as this state of affairs keeps us in an uncomfortable in-between-state - essentially we’re halfway through a so-called “migration” across all our packages, currently building everything for 1.1.1 & 3.0.0, until the remaining packages (all dependents of cryptography) have finished moving to 3.0.0, and we can then drop the 1.1.1-flavoured builds.
The windows builds script that we’re using is here, config step reproduced here for convenience:
:: Configure step
perl configure VC-WIN64A ^
--prefix=%LIBRARY_PREFIX% ^
--openssldir=%LIBRARY_PREFIX% ^
enable-legacy ^
no-fips ^
no-module ^
no-zlib ^
shared
The failures can then be seen in e.g. https://github.com/conda-forge/cryptography-feedstock/pull/106 (previously https://github.com/conda-forge/cryptography-feedstock/pull/80), with a corresponding CI run here (please ignore the failures on linux-aarch64 & linux-ppc64le, those are unrelated cross-compilation woes on the Rust side).
The failures all point to a memory leak occurring at crypto/provider_core.c:1616, though not all tests from cryptography/tests/hazmat/backends/test_openssl_memleak.py are failing, only:
=========================== short test summary info ===========================
FAILED tests/hazmat/backends/test_openssl_memleak.py::TestOpenSSLMemoryLeaks::test_x509_csr_extensions
FAILED tests/hazmat/backends/test_openssl_memleak.py::TestOpenSSLMemoryLeaks::test_x25519_pubkey_from_private_key
FAILED tests/hazmat/backends/test_openssl_memleak.py::TestOpenSSLMemoryLeaks::test_load_pkcs12_key_and_certificates[pkcs12/cert-aes256cbc-no-key.p12]
FAILED tests/hazmat/backends/test_openssl_memleak.py::TestOpenSSLMemoryLeaks::test_load_pkcs12_key_and_certificates[pkcs12/cert-key-aes256cbc.p12]
FAILED tests/hazmat/backends/test_openssl_memleak.py::TestOpenSSLMemoryLeaks::test_write_pkcs12_key_and_certificates
= 5 failed, 2816 passed, 56 skipped, 296 warnings, 85319 subtests passed in 340.54s (0:05:40) =
Happy to provide further information (or test patches on our infra), but I’ve reached the end of the line here, and would be very grateful for some help. 😅
About this issue
- Original URL
- State: closed
- Created 2 years ago
- Comments: 16 (10 by maintainers)
Oh! I figured out what’s going on. All of the info you need is up above and in the test harness implementation, if you want to play along at home!
The memory leak tester tries to only look for leaks incurred during the test. So, before the test code actually runs, it records the “heap” map of allocations:
And then when the test completes, it looks for pointers that weren’t there before:
What’s happening here is that the leak that’s getting reported is something that’s allocated in the “start heap”, but realloced during the test. Because of that, the realloced pointer shows up as leaking, even though if it were never realloced it would be ignored.
If I edit the test harness so that the realloc handler adds the newly-reallocated pointer to
start_heapif the original pointer was instart_heap— telling the leak-checker to ignore the new “version” of the pointer — the test suite passes.In sum, I’d say that this is a bug in the Cryptography test harness.
Digging into this. Uh, far be it from me to try to poke holes in the OpenSSL code, but should https://github.com/openssl/openssl/blob/master/crypto/provider_core.c#L1211 be a write lock rather than a read lock?
AFAIU the cryptography folks are still interested how
ossl_provider_set_operation_bitis initialized, but I think this issue has served its purpose.Thanks everyone for helping finally fix this. 🙃
Also, correct me if I’m wrong @h-vetinari, but I think we’re pretty sure that the issue has something to do with building OpenSSL as a shared library on Windows, because the Cryptography CI works on Windows and that seems to be the biggest difference between our setups, but I don’t think we’ve confirmed that 100%.
Furthermore, there’s a chance that the problem lies in the Cryptography layer rather than OpenSSL per se, if it’s maybe cleaning things up incorrectly. Based on the results of the test suite, which shows a leak of only some provider’s
operation_bitspointer, I think there must be some provider that is allocated and eventually freed inossl_provider_free, but never getsflag_initializedset?Or maybe the test suite has a problem — it’s detecting leaks with an exciting memory tracker that works at the Python level. I could see that maybe getting confused if a pointer address is reused or something.