openssl: Memory leaks with shared windows builds

I’ve been helping maintain openssl & cryptography in the conda-forge ecosystem, and one of our biggest stumbling blocks in that regard has been upgrading to OpenSSL 3. In large part that’s due to me not being confident enough to raise an issue because I didn’t feel I understood the failures well enough at the time.

After sorting out most build errors, we’ve been stuck that the very thorough cryptography test suite notices memory leaks (only) on windows. At the time I opened https://github.com/pyca/cryptography/issues/7379, but upstream cryptography cannot really help with debugging this due to infrastructural reasons (they compile against a prepared static build, we need to build against shared libs).

Even more curiously (and another reason that kept me from opening this issue here) is that the failures only show up for CPython 3.7, 3.9 & 3.10, but not for CPython 3.8 or PyPy 3.8, 3.9. This makes it look like some sort of interaction with the Python runtime is happening, but truth be told, I don’t know. Maybe the memory leaks are always there and being hidden through other interferences?

In any case, I wanted to kindly ask for help as this state of affairs keeps us in an uncomfortable in-between-state - essentially we’re halfway through a so-called “migration” across all our packages, currently building everything for 1.1.1 & 3.0.0, until the remaining packages (all dependents of cryptography) have finished moving to 3.0.0, and we can then drop the 1.1.1-flavoured builds.

The windows builds script that we’re using is here, config step reproduced here for convenience:

:: Configure step
perl configure VC-WIN64A          ^
    --prefix=%LIBRARY_PREFIX%     ^
    --openssldir=%LIBRARY_PREFIX% ^
    enable-legacy                 ^
    no-fips                       ^
    no-module                     ^
    no-zlib                       ^
    shared

The failures can then be seen in e.g. https://github.com/conda-forge/cryptography-feedstock/pull/106 (previously https://github.com/conda-forge/cryptography-feedstock/pull/80), with a corresponding CI run here (please ignore the failures on linux-aarch64 & linux-ppc64le, those are unrelated cross-compilation woes on the Rust side).

The failures all point to a memory leak occurring at crypto/provider_core.c:1616, though not all tests from cryptography/tests/hazmat/backends/test_openssl_memleak.py are failing, only:

=========================== short test summary info ===========================
FAILED tests/hazmat/backends/test_openssl_memleak.py::TestOpenSSLMemoryLeaks::test_x509_csr_extensions
FAILED tests/hazmat/backends/test_openssl_memleak.py::TestOpenSSLMemoryLeaks::test_x25519_pubkey_from_private_key
FAILED tests/hazmat/backends/test_openssl_memleak.py::TestOpenSSLMemoryLeaks::test_load_pkcs12_key_and_certificates[pkcs12/cert-aes256cbc-no-key.p12]
FAILED tests/hazmat/backends/test_openssl_memleak.py::TestOpenSSLMemoryLeaks::test_load_pkcs12_key_and_certificates[pkcs12/cert-key-aes256cbc.p12]
FAILED tests/hazmat/backends/test_openssl_memleak.py::TestOpenSSLMemoryLeaks::test_write_pkcs12_key_and_certificates
= 5 failed, 2816 passed, 56 skipped, 296 warnings, 85319 subtests passed in 340.54s (0:05:40) =

Happy to provide further information (or test patches on our infra), but I’ve reached the end of the line here, and would be very grateful for some help. 😅

About this issue

Original URL
State: closed
Created 2 years ago
Comments: 16 (10 by maintainers)

Most upvoted comments

Oh! I figured out what’s going on. All of the info you need is up above and in the test harness implementation, if you want to play along at home!

The memory leak tester tries to only look for leaks incurred during the test. So, before the test code actually runs, it records the “heap” map of allocations:

# heap is a hashtable of `ptr => allocation info`
start_heap = set(heap)

And then when the test completes, it looks for pointers that weren’t there before:

remaining = set(heap) - start_heap

if remaining:
    # fail

What’s happening here is that the leak that’s getting reported is something that’s allocated in the “start heap”, but realloced during the test. Because of that, the realloced pointer shows up as leaking, even though if it were never realloced it would be ignored.

If I edit the test harness so that the realloc handler adds the newly-reallocated pointer to start_heap if the original pointer was in start_heap — telling the leak-checker to ignore the new “version” of the pointer — the test suite passes.

In sum, I’d say that this is a bug in the Cryptography test harness.

pkgw on Oct 21, 2022

Digging into this. Uh, far be it from me to try to poke holes in the OpenSSL code, but should https://github.com/openssl/openssl/blob/master/crypto/provider_core.c#L1211 be a write lock rather than a read lock?

pkgw on Oct 21, 2022

AFAIU the cryptography folks are still interested how ossl_provider_set_operation_bit is initialized, but I think this issue has served its purpose.

Thanks everyone for helping finally fix this. 🙃

h-vetinari on Oct 23, 2022

Also, correct me if I’m wrong @h-vetinari, but I think we’re pretty sure that the issue has something to do with building OpenSSL as a shared library on Windows, because the Cryptography CI works on Windows and that seems to be the biggest difference between our setups, but I don’t think we’ve confirmed that 100%.

Furthermore, there’s a chance that the problem lies in the Cryptography layer rather than OpenSSL per se, if it’s maybe cleaning things up incorrectly. Based on the results of the test suite, which shows a leak of only some provider’s operation_bits pointer, I think there must be some provider that is allocated and eventually freed in ossl_provider_free, but never gets flag_initialized set?

Or maybe the test suite has a problem — it’s detecting leaks with an exciting memory tracker that works at the Python level. I could see that maybe getting confused if a pointer address is reused or something.

pkgw on Oct 21, 2022