cloud-sql-jdbc-socket-factory: `bad_certificate` errors intermittently preventing connection to cloud sql
Bug Description
When using HikariCP database connection pooling with Cloud SQL, we're intermittently seeing an issue, usually after a few hours, where connections become unavailable with a `bad_certificate`
error. We see this with both MySQL and Postgres Cloud SQL connections authenticating with username/password, and with differing Hikari configurations (most use the Hikari default maxLifetime of 30 minutes).
Example code (or command)
No response
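(The reporter did not attach code. For orientation, here is a minimal sketch of the kind of setup described above, following the documented HikariCP + Cloud SQL socket factory pattern; the database name, instance connection name, and credentials are placeholders.)

```java
import com.zaxxer.hikari.HikariConfig;
import com.zaxxer.hikari.HikariDataSource;

public class CloudSqlPoolExample {
    // Builds a Hikari pool that dials Cloud SQL through the Postgres socket
    // factory with username/password auth, matching the setup described in
    // this report. All identifiers here are placeholders.
    static HikariDataSource createDataSource() {
        HikariConfig config = new HikariConfig();
        config.setJdbcUrl("jdbc:postgresql:///mydb"); // no host/port: the socket factory handles dialing
        config.setUsername("db-user");
        config.setPassword("db-password");
        config.addDataSourceProperty("socketFactory",
            "com.google.cloud.sql.postgres.SocketFactory");
        config.addDataSourceProperty("cloudSqlInstance",
            "my-project:my-region:my-instance");
        config.setMaxLifetime(1_800_000); // Hikari's 30-minute default, as noted above
        return new HikariDataSource(config);
    }
}
```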
Stacktrace
....
Caused by: java.sql.SQLTransientConnectionException: HikariPool-3 - Connection is not available, request timed out after 30000ms.
at com.zaxxer.hikari.pool.HikariPool.createTimeoutException(HikariPool.java:695)
at com.zaxxer.hikari.pool.HikariPool.getConnection(HikariPool.java:197)
at com.zaxxer.hikari.pool.HikariPool.getConnection(HikariPool.java:162)
at com.zaxxer.hikari.HikariDataSource.getConnection(HikariDataSource.java:100)
at com.company.config.CloseableDataSourceHikari.getConnection(CloseableDataSourceHikari.java:25)
at org.jdbi.v3.core.Jdbi.open(Jdbi.java:319)
... 18 common frames omitted
Caused by: org.postgresql.util.PSQLException: The connection attempt failed.
at org.postgresql.core.v3.ConnectionFactoryImpl.openConnectionImpl(ConnectionFactoryImpl.java:354)
at org.postgresql.core.ConnectionFactory.openConnection(ConnectionFactory.java:54)
at org.postgresql.jdbc.PgConnection.<init>(PgConnection.java:263)
at org.postgresql.Driver.makeConnection(Driver.java:443)
at org.postgresql.Driver.connect(Driver.java:297)
at com.zaxxer.hikari.util.DriverDataSource.getConnection(DriverDataSource.java:138)
at com.zaxxer.hikari.pool.PoolBase.newConnection(PoolBase.java:358)
at com.zaxxer.hikari.pool.PoolBase.newPoolEntry(PoolBase.java:206)
at com.zaxxer.hikari.pool.HikariPool.createPoolEntry(HikariPool.java:477)
at com.zaxxer.hikari.pool.HikariPool.access$100(HikariPool.java:71)
at com.zaxxer.hikari.pool.HikariPool$PoolEntryCreator.call(HikariPool.java:725)
at com.zaxxer.hikari.pool.HikariPool$PoolEntryCreator.call(HikariPool.java:711)
at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
... 3 common frames omitted
Caused by: javax.net.ssl.SSLHandshakeException: Received fatal alert: bad_certificate
at java.base/sun.security.ssl.Alert.createSSLException(Alert.java:131)
at java.base/sun.security.ssl.Alert.createSSLException(Alert.java:117)
at java.base/sun.security.ssl.TransportContext.fatal(TransportContext.java:358)
at java.base/sun.security.ssl.Alert$AlertConsumer.consume(Alert.java:293)
at java.base/sun.security.ssl.TransportContext.dispatch(TransportContext.java:204)
at java.base/sun.security.ssl.SSLTransport.decode(SSLTransport.java:172)
at java.base/sun.security.ssl.SSLSocketImpl.decode(SSLSocketImpl.java:1510)
at java.base/sun.security.ssl.SSLSocketImpl.readHandshakeRecord(SSLSocketImpl.java:1425)
at java.base/sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:455)
at java.base/sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:426)
at com.google.cloud.sql.core.CoreSocketFactory.createSslSocket(CoreSocketFactory.java:339)
at com.google.cloud.sql.core.CoreSocketFactory.connect(CoreSocketFactory.java:201)
at com.google.cloud.sql.postgres.SocketFactory.createSocket(SocketFactory.java:77)
at org.postgresql.core.PGStream.createSocket(PGStream.java:231)
at org.postgresql.core.PGStream.<init>(PGStream.java:98)
at org.postgresql.core.v3.ConnectionFactoryImpl.tryConnect(ConnectionFactoryImpl.java:132)
at org.postgresql.core.v3.ConnectionFactoryImpl.openConnectionImpl(ConnectionFactoryImpl.java:258)
... 15 common frames omitted
Steps to reproduce?
- Connect to Cloud SQL, either Postgres or MySQL, using username/password ("Allow only SSL connections" does not need to be enabled in Cloud SQL).
- Let your app run for a while; the issue usually presents itself within a few hours. I have been unable to reproduce it locally with differing Hikari settings.
Environment
- OS type and version: Docker Linux arm64
- Java SDK version: 17
- Cloud SQL Java Socket Factory version: postgres-socket-factory 1.11.2; jdbi-cloudsql 2.3.5
Additional Details
My initial reaction was that it looks similar to the closed issue here: https://github.com/GoogleCloudPlatform/cloud-sql-jdbc-socket-factory/issues/472, and it is possibly related to this Draft PR.
My colleague has already opened an incident with Google support for this issue (Case 45213975), but I think it is worthwhile to open an issue here too.
About this issue
- Original URL
- State: closed
- Created a year ago
- Comments: 22 (10 by maintainers)
Commits related to this issue
- wip: attempt to reproduce cloud-sql-jdbc-socket-factory #1314 see https://github.com/GoogleCloudPlatform/cloud-sql-jdbc-socket-factory/issues/1314 — committed to GoogleCloudPlatform/cloud-sql-proxy-operator by hessjcg a year ago
- fix: Increase threadpool count to avoid deadlock and reduce forceRefresh churn #1314 — committed to GoogleCloudPlatform/cloud-sql-jdbc-socket-factory by hessjcg a year ago
- fix: Increase threadpool count to avoid deadlocks (#1391) To refresh a Cloud SQL Instance's certificates, the current algorithm uses 2 threads from the thread pool for each instance. Because the thr... — committed to GoogleCloudPlatform/cloud-sql-jdbc-socket-factory by hessjcg a year ago
- fix: remove race condition bug in refresh logic (#1390) Update the logic in forceRefresh() to reduce the churn on the thread pool when the certificate refresh API calls are failing. New forceRefre... — committed to GoogleCloudPlatform/cloud-sql-jdbc-socket-factory by hessjcg a year ago
This has been fixed in v1.13.1. Thank you for all your help debugging this, folks.
Hey folks, we found an edge case where the connector, when configured with multiple instances, would flood the thread pool in such a way that the threads were unable to make progress. We've identified a few related improvements in #1391 and #1390. We'll cut a patch release this week for people affected by this.
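(Not the connector's actual refresh code, but a generic Java sketch of that failure mode: tasks on a small shared executor blocking on sub-tasks submitted to the same executor, until every thread is parked and nothing can progress.)

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class PoolStarvationSketch {
    public static void main(String[] args) {
        // A deliberately tiny shared pool, standing in for a refresh executor.
        ExecutorService pool = Executors.newFixedThreadPool(2);

        // More concurrent "instances" than pool threads.
        for (int i = 0; i < 4; i++) {
            pool.submit(() -> {
                // Each outer task holds a pool thread while it waits on an
                // inner task that also needs a pool thread to run.
                Future<String> cert = pool.submit(() -> "fresh-certificate");
                try {
                    return cert.get(); // once all threads park here, nothing progresses
                } catch (InterruptedException | ExecutionException e) {
                    throw new IllegalStateException(e);
                }
            });
        }
        // With 4 outer tasks competing for 2 threads, both threads end up
        // blocked in get() while the inner tasks sit in the queue behind the
        // remaining outer tasks: a classic thread-pool deadlock.
    }
}
```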
After testing, I found that the version in which the problem was introduced is `1.11.2`. In the affected versions (> `1.11.1`), my (Spring Boot) app successfully connects to the database (using Cloud SQL IAM auth) during application start. Within an hour or two, the `SSLHandshakeException`/`bad_certificate` exception is thrown. Except for a restart, the app is unable to (even temporarily) recover from this "broken database connection" state.

As mentioned by @Ragge-dev, we also experience the exact same problem. Version `1.11.1` works fine.

Hello @hessjcg. I was just about to open a new issue when I saw this one; we are seeing this exact issue (same stacktrace, with the error starting after the app has run for a few hours) on both versions `1.11.2` and `1.12.0`, with version `1.11.1` being stable for us. The only difference is that we are using IAM to authenticate towards Cloud SQL. We suspect that this commit in some way changed the behaviour of the certificate refresh logic and causes the error described by @msammarco; at least we can't find another relevant commit between versions `1.11.1` and `1.11.2`. Is this something you are able to look into? I currently have not reproduced it locally, although I could try.
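(For reference, the IAM-auth setup these commenters describe differs from the username/password sketch earlier only in a few data source properties, per the connector's documentation. The user and instance names below are placeholders.)

```java
import com.zaxxer.hikari.HikariConfig;

public class CloudSqlIamAuthExample {
    // Same pool setup as the earlier sketch, but using IAM database
    // authentication: an IAM principal as the user, no password,
    // enableIamAuth=true, and sslmode=disable because the socket factory
    // itself provides the encrypted tunnel. Names are placeholders.
    static HikariConfig iamConfig() {
        HikariConfig config = new HikariConfig();
        config.setJdbcUrl("jdbc:postgresql:///mydb");
        config.setUsername("my-iam-user@my-project.iam"); // placeholder IAM principal
        config.addDataSourceProperty("socketFactory",
            "com.google.cloud.sql.postgres.SocketFactory");
        config.addDataSourceProperty("cloudSqlInstance",
            "my-project:my-region:my-instance");
        config.addDataSourceProperty("enableIamAuth", "true");
        config.addDataSourceProperty("sslmode", "disable");
        return config;
    }
}
```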
I've been running a little Spring Boot app in my GKE cluster that connects to two databases in one instance. My app uses two separate connection pools and has Hikari set up to refresh all connections every minute (with the thought that connection creation will trigger this bug). After half a day, I don't see the bad certificate yet. I'm going to downgrade this to a P2 to signal that this bug isn't as pervasive as we originally thought. But we're still working on identifying the root cause.
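(A sketch of that kind of churn setup, assumed rather than the maintainer's actual test app: setting maxLifetime to one minute makes Hikari retire and recreate every connection continuously, which repeatedly exercises the socket factory's certificate path.)

```java
import com.zaxxer.hikari.HikariConfig;
import com.zaxxer.hikari.HikariDataSource;

public class ConnectionChurnSketch {
    // Assumed repro harness: retire every pooled connection after one minute
    // so that new connections (the code path that goes through the socket
    // factory) are created continuously. Placeholder credentials; dbName and
    // instance are supplied by the caller.
    static HikariDataSource churningPool(String dbName, String instance) {
        HikariConfig config = new HikariConfig();
        config.setJdbcUrl("jdbc:postgresql:///" + dbName);
        config.setUsername("db-user");
        config.setPassword("db-password");
        config.addDataSourceProperty("socketFactory",
            "com.google.cloud.sql.postgres.SocketFactory");
        config.addDataSourceProperty("cloudSqlInstance", instance);
        config.setMaxLifetime(60_000); // retire after 1 minute (default is 30 minutes)
        return new HikariDataSource(config);
    }
}
```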
Yes
We'll cut a release tomorrow that might fix this, as we've made some further improvements to the factoring of our background refresh operations, and the logging will be part of it. Meanwhile, we'll work on reproducing the issue or confirming it's been fixed.
Where are you running the Java Connector? Is this all in GKE?