conscrypt: Severe memory leak when setUseEngineSocketByDefault(true)
I’ve included below the source code of a simple test that creates 100k localhost SSL connections.
If it is run without conscrypt, it uses approx 330MB of Linux res mem. With conscrypt, it uses approx 3.7GB res mem, despite Xmx being set to 512M.
I’ve tested this both on Ubuntu 18 and Centos 7, on two different machines. One machine is a Ubuntu VM with 2 virtual cores on an Apple laptop with 6 cores, and the other machine has dual Xeons, each with 8 cores. Both tests were with OpenJDK 14.0.1 as downloaded from https://jdk.java.net/14/ and with conscrypt-openjdk-2.4.0-linux-x86_64 from the maven repo.
(Note that this memory leak happens when testing with both conscrypt 2.4.0 and 2.2.1.)
To make the code run, a keystore must first be created using the command: keytool -genkey -keyalg EC -keystore keystore -groupname secp256r1 -alias localhost -keypass password -storepass password -dname "CN=localhost,OU=X,O=X,L=X,S=X,C=X"
The conscypt jar has to be available, which can be retrieved with wget https://repo1.maven.org/maven2/org/conscrypt/conscrypt-openjdk/2.4.0/conscrypt-openjdk-2.4.0-linux-x86_64.jar
The code is compiled using javac Main.java -cp conscrypt-openjdk-2.4.0-linux-x86_64.jar
The code is run using java -Xmx512M -cp conscrypt-openjdk-2.4.0-linux-x86_64.jar:. -Djavax.net.ssl.keyStore=keystore -Djavax.net.ssl.keyStorePassword=password -Djavax.net.ssl.trustStore=keystore -Djavax.net.ssl.trustStorePassword=password Main
The code, to be placed inside Main.java:
import org.conscrypt.Conscrypt;
import org.conscrypt.OpenSSLProvider;
import java.io.*;
import javax.net.*;
import javax.net.ssl.*;
import java.net.*;
import java.security.Security;
import java.util.Scanner;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
public class Main {
public static void main(String[] args) throws Exception {
System.out.println("pid: " + ProcessHandle.current().pid());
System.out.println("Use conscrypt? (y/n)");
boolean useConscrypt = new Scanner(System.in).nextLine().toLowerCase().startsWith("y");
if(useConscrypt) {
Conscrypt.setUseEngineSocketByDefault(true);
Security.insertProviderAt(new OpenSSLProvider(), 1);
}
int listenPort = 12345;
ServerSocketFactory serverSocketFactory = SSLServerSocketFactory.getDefault();
ServerSocket serverSocket = serverSocketFactory.createServerSocket(listenPort);
new Thread(()->{
try {
while(true) {
try(var socket = serverSocket.accept()) {
try(var out = new PrintWriter(new BufferedWriter(new OutputStreamWriter(socket.getOutputStream())))) {
out.println("Hello");
out.flush();
}
}
}
} catch (IOException e) {
e.printStackTrace();
}
}).start();
SSLSocketFactory factory = (SSLSocketFactory) SSLSocketFactory.getDefault();
int concurrentThreads = Runtime.getRuntime().availableProcessors();
ExecutorService executor = Executors.newFixedThreadPool(concurrentThreads);
var count = new AtomicInteger(0);
var totalConnections = 100*1000;
new Thread(()->{
try {
while(true) {
Thread.sleep(1000);
System.out.println("Count: " + count.get());
}
}
catch (InterruptedException e) {}
}).start();
for(int i=0; i<totalConnections; i++) {
executor.submit(() -> {
try {
SSLSocket socket = (SSLSocket) factory.createSocket("localhost", listenPort);
socket.startHandshake();
BufferedReader in = new BufferedReader(new InputStreamReader(socket.getInputStream()));
while ((in.readLine()) != null);
in.close();
socket.close();
count.incrementAndGet();
}
catch (IOException e) {
e.printStackTrace();
}
});
}
executor.shutdown();
executor.awaitTermination(Long.MAX_VALUE, TimeUnit.DAYS);
}
}
Thanks in advance for any assistance.
About this issue
- Original URL
- State: open
- Created 4 years ago
- Reactions: 3
- Comments: 52 (14 by maintainers)
Commits related to this issue
- Fix ConscryptEngineSocket.close(). Previousy, this method closed the underlying socket first and then the SSLEngine. However closing a connected SSLEngine queues a TLS close notification which obvio... — committed to prbprbprb/conscrypt by prbprbprb 4 years ago
- Fix ConscryptEngineSocket.close(). Previousy, this method closed the underlying socket first and then the SSLEngine. However closing a connected SSLEngine queues a TLS close notification which obvio... — committed to prbprbprb/conscrypt by prbprbprb 4 years ago
- Fix ConscryptEngineSocket.close(). Previousy, this method closed the underlying socket first and then the SSLEngine. However closing a connected SSLEngine queues a TLS close notification which obvio... — committed to prbprbprb/conscrypt by prbprbprb 4 years ago
- Fix ConscryptEngineSocket.close(). Previousy, this method closed the underlying socket first and then the SSLEngine. However closing a connected SSLEngine queues a TLS close notification which obvio... — committed to prbprbprb/conscrypt by prbprbprb 4 years ago
- Fix ConscryptEngineSocket.close(). Previousy, this method closed the underlying socket first and then the SSLEngine. However closing a connected SSLEngine queues a TLS close notification which obvio... — committed to prbprbprb/conscrypt by prbprbprb 4 years ago
- Fix ConscryptEngineSocket.close(). Previousy, this method closed the underlying socket first and then the SSLEngine. However closing a connected SSLEngine queues a TLS close notification which obvio... — committed to prbprbprb/conscrypt by prbprbprb 4 years ago
- Fix ConscryptEngineSocket.close(). Previoly, this method closed the underlying socket first and then the SSLEngine. However closing a connected SSLEngine queues a TLS close notification which obviou... — committed to prbprbprb/conscrypt by prbprbprb 4 years ago
- Ensure networkBio is closed in ConscryptEngine Since BioWrapper does not have a finalizer, we need to ensure that the native resources are freed when it goes away. In all paths, "networkBio.close()" ... — committed to kruton/conscrypt by kruton 4 years ago
- Ensure networkBio is closed in ConscryptEngine Since BioWrapper does not have a finalizer, we need to ensure that the native resources are freed when it goes away. In all paths, "networkBio.close()" ... — committed to kruton/conscrypt by kruton 4 years ago
Sorry for the late join in.
We also noticed this happening in production. Our services send only out going connections and usually the memory usage spikes when the targets timeout the http request or fail to connect in general.
While testing I also noticed a weird trend which might be or might not be related, that the memory usage and especially the build up of “leaked memory” is much lower on Oracle JDK opposed to AdoptOpenJDK (Both 11).
This could be a coincidence but this is a pattern we observed where the AdoptOpenJDK instances would need to be restarted or get OOM killed every 10-15hours where as the JVM ones would be able to run for a couple of days without much trouble (still leaking memory ofc)
I have setup a proper test environment with as much similar test conditions as possible and will report back if that wasn’t just a fluke in our observations
@yschimke
Yes, because if I simply remove Conscrypt as an OpenSSLProvider on the server, there is no longer a memory leak. Additionally, you will see earlier in this thread that I’ve used jemalloc to demonstrate that the leak is as a result of calls to native code (OPENSSL_malloc) rather than Java object leaks.
FYI my real-world code doesn’t look anything like Main11. Instead, it retrieves images from a list of specified web pages. That code uses Apache HTTPClient, and in turn will use Conscrypt if it is registered as a provider. It has Xmx set to 3GB, so garbage collection should have kicked in long before it exceeded 10GB of res mem.
It’s possible that the leak happens due to unexpected connection issues, or a variety of uncommon circumstances that may only show themselves in the real world. It’s annoying that the leak is so slow, which makes it much harder to iteratively narrow down a list.
Nice work. I’d focused on real world scenarios which mostly had successful connections.
This is a chart of hourly recorded memory usage of a Jetty web server over a period of 470 hours, starting from the time when the Jetty process was first launched. The JVM Xmx is set to 3GB, and it’s running on Ubuntu 20.04 LTS.
The web server uses Conscrypt both for incoming and outgoing connections, including using Conscrypt for outgoing TLS JDBC connections (by installing Conscrypt as the default security provider).
It is using the CI build based on commit 52f3cf1 (January 28, 2021).
You can see that suddenly, in the last few hours, there was a 2.6 GB spike in res mem usage (from about 8.3GB to 10.9GB).
This coincided with a database upgrade, which involved the DB being shut down twice (each time for about 10 minutes). This would have meant that new JDBC connections via Conscrypt could not have been made to the DB.
Logs show that approx. 1700
Socket fail to connect to host:[redacted].com, port:[redacted]. Connection refusedexceptions occurred during the DB downtime. The JDBC connection pool reported a few dozenBroken pipe/Connection is closed/Connection was killedexceptions.It’s possible that the leak is due to existing outgoing JDBC connections being suddenly dropped, or due to new outgoing JDBC connections failing to be established.
It’s also possible that the leak was caused by the server accepting lots of incoming HTTPS requests which were blocking due to the JDBC connection pool locking up as it waited to re-establish connections to the DB.
Note that several hours have passed since the DB downtime happened, and although the Jetty server has gone back to normal, and is able to serve HTTPS connections and communicate with the DB via JDBC, its res mem footprint has stayed permanently high.
@yschimke Thanks for looking into this. I could not replicate the memory leak with a simple loopback test, which is why I wrote Main11.java. Main11 will start at wikipedia and follow links outward to as many different domains as possible, so that it encounters a wide variety of TLS implementations/configurations.
Note that it will take few hours for the memory leak to show. My server that uses conscrypt to make outbound connections leaks about 10GB a week (so I just restart the process once a week, and that’s a workable solution).
The server socket leak was fixed by https://github.com/google/conscrypt/pull/893 which is not yet merged, as far as I can see.
I’m very happy to deploy https://github.com/google/conscrypt/pull/942 to my server and leave it running for a week, to see if I get the leak. I just need to know which boringssl commit to use when building (I use the build method I documented here: https://github.com/google/conscrypt/issues/835#issuecomment-692451664 )
Yeah, I realized that I was redirecting stderr so I missed some problems my initial patch caused. It would try to double-close the BIO sometimes. It also uncovered trying to use the
sslwithout locking it. I think it’s ready to try out now.Also would be worth setting
-Dorg.conscrypt.native.deleteLibAfterLoading=falseto help with symbol fetching.Just a thought: From looking at the conscrypt code, I think this memory leak could be due to BioWrapper.close() not always being called. And perhaps the reason this leak only shows up in production is because only in production is it the case that SSL connections are not terminated cleanly, thus not causing BioWrapper.close() to always be called?
Maybe the problem is that it is sometimes the case that
pendingInboundCleartextBytes() != 0, on isInboundDone() orpendingOutboundEncryptedBytes() != 0on isOutboundDone(), thus preventing memory from being freed.I notice that there is a drainOutgoingQueue() attempt followed by a reattempt to close and free the socket, but there is no equivalent to attempt to clear the pendingInboundCleartextBytes and then retry the closing and freeing of the socket.
Maybe the answer is to complete the
// TODO: Close SSL sockets using a background thread so they close gracefully.mentioned here.Aha, some googling has turned up the following:
Notice in the hex, there is a sequence
C0 2B C0 2C CC A9 C0 2F C0 30 CC A8 C0 09 C0 0A C0 13 C0 14 00 9C 00 9D 00 2F 00 35, which corresponds to what looks like+.,.../.0.............../.5in ascii.These are TLS cipher suites. See e.g. https://security.stackexchange.com/questions/166556/ssl-tls-cipher-suites-order-for-windows-2016-hosted-https-sites and https://www.ibm.com/support/knowledgecenter/en/SSLTBW_2.2.0/com.ibm.zos.v2r2.gska100/csdcwh.htm
So it looks like the 100k repetition in memory is the contents of a buffer used for the TLS handshake. This solves the question of why
30 39(the hex of the port number 12345) is not part of the data - “localhost” is specified in the handshake for SNI, and the port number is not necessary for that purpose.Note that this is part of the
ClientHellofrom the client. I’m not sure, but I don’t think there are 100k copies of theServerHelloin memory, since I searched for the certificate serial number sent by the server, and only see that a few hundred times in the data. It’s possible though that some parts of the server reponse are part of the memory leak, and other parts (like the certificate number) aren’t part of the leak.