cloud-sql-jdbc-socket-factory: `bad_certificate` errors intermittently preventing connection to cloud sql

Bug Description

When using Hikari connection pooling with Cloud SQL, we intermittently see connections become unavailable with a bad_certificate error, usually after the app has been running for a few hours. We see this with both MySQL and Postgres Cloud SQL connections that authenticate with username/password, and across differing Hikari configurations (most use the Hikari default maxLifetime of 30 minutes).

Example code (or command)

No response

Stacktrace

....
Caused by: java.sql.SQLTransientConnectionException: HikariPool-3 - Connection is not available, request timed out after 30000ms.
	at com.zaxxer.hikari.pool.HikariPool.createTimeoutException(HikariPool.java:695)
	at com.zaxxer.hikari.pool.HikariPool.getConnection(HikariPool.java:197)
	at com.zaxxer.hikari.pool.HikariPool.getConnection(HikariPool.java:162)
	at com.zaxxer.hikari.HikariDataSource.getConnection(HikariDataSource.java:100)
	at com.company.config.CloseableDataSourceHikari.getConnection(CloseableDataSourceHikari.java:25)
	at org.jdbi.v3.core.Jdbi.open(Jdbi.java:319)
	... 18 common frames omitted
Caused by: org.postgresql.util.PSQLException: The connection attempt failed.
	at org.postgresql.core.v3.ConnectionFactoryImpl.openConnectionImpl(ConnectionFactoryImpl.java:354)
	at org.postgresql.core.ConnectionFactory.openConnection(ConnectionFactory.java:54)
	at org.postgresql.jdbc.PgConnection.<init>(PgConnection.java:263)
	at org.postgresql.Driver.makeConnection(Driver.java:443)
	at org.postgresql.Driver.connect(Driver.java:297)
	at com.zaxxer.hikari.util.DriverDataSource.getConnection(DriverDataSource.java:138)
	at com.zaxxer.hikari.pool.PoolBase.newConnection(PoolBase.java:358)
	at com.zaxxer.hikari.pool.PoolBase.newPoolEntry(PoolBase.java:206)
	at com.zaxxer.hikari.pool.HikariPool.createPoolEntry(HikariPool.java:477)
	at com.zaxxer.hikari.pool.HikariPool.access$100(HikariPool.java:71)
	at com.zaxxer.hikari.pool.HikariPool$PoolEntryCreator.call(HikariPool.java:725)
	at com.zaxxer.hikari.pool.HikariPool$PoolEntryCreator.call(HikariPool.java:711)
	at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
	... 3 common frames omitted
Caused by: javax.net.ssl.SSLHandshakeException: Received fatal alert: bad_certificate
	at java.base/sun.security.ssl.Alert.createSSLException(Alert.java:131)
	at java.base/sun.security.ssl.Alert.createSSLException(Alert.java:117)
	at java.base/sun.security.ssl.TransportContext.fatal(TransportContext.java:358)
	at java.base/sun.security.ssl.Alert$AlertConsumer.consume(Alert.java:293)
	at java.base/sun.security.ssl.TransportContext.dispatch(TransportContext.java:204)
	at java.base/sun.security.ssl.SSLTransport.decode(SSLTransport.java:172)
	at java.base/sun.security.ssl.SSLSocketImpl.decode(SSLSocketImpl.java:1510)
	at java.base/sun.security.ssl.SSLSocketImpl.readHandshakeRecord(SSLSocketImpl.java:1425)
	at java.base/sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:455)
	at java.base/sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:426)
	at com.google.cloud.sql.core.CoreSocketFactory.createSslSocket(CoreSocketFactory.java:339)
	at com.google.cloud.sql.core.CoreSocketFactory.connect(CoreSocketFactory.java:201)
	at com.google.cloud.sql.postgres.SocketFactory.createSocket(SocketFactory.java:77)
	at org.postgresql.core.PGStream.createSocket(PGStream.java:231)
	at org.postgresql.core.PGStream.<init>(PGStream.java:98)
	at org.postgresql.core.v3.ConnectionFactoryImpl.tryConnect(ConnectionFactoryImpl.java:132)
	at org.postgresql.core.v3.ConnectionFactoryImpl.openConnectionImpl(ConnectionFactoryImpl.java:258)
	... 15 common frames omitted
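
For anyone who wants to alert on this specific failure rather than on generic pool timeouts, here is a minimal sketch of walking the cause chain shown in the stacktrace above. `BadCertificateDetector` is a hypothetical helper, not part of HikariCP, pgjdbc, or the socket factory. It uses only JDK exception types and assumes the alert text stays `bad_certificate`.

```java
import java.sql.SQLTransientConnectionException;
import javax.net.ssl.SSLHandshakeException;

/** Hypothetical helper for illustration; not an API of any library in the stacktrace. */
final class BadCertificateDetector {

    private BadCertificateDetector() {}

    /**
     * Returns true if any exception in the cause chain is an SSLHandshakeException
     * whose message mentions bad_certificate. The depth cap guards against
     * cyclic cause chains.
     */
    static boolean isBadCertificate(Throwable error) {
        int depth = 0;
        for (Throwable t = error; t != null && depth < 32; t = t.getCause(), depth++) {
            if (t instanceof SSLHandshakeException
                    && t.getMessage() != null
                    && t.getMessage().contains("bad_certificate")) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Rebuild the same three-level chain as in the stacktrace, using JDK types only.
        SSLHandshakeException tls =
                new SSLHandshakeException("Received fatal alert: bad_certificate");
        Exception driver = new Exception("The connection attempt failed.", tls);
        SQLTransientConnectionException pool = new SQLTransientConnectionException(
                "HikariPool-3 - Connection is not available, request timed out after 30000ms.",
                driver);
        System.out.println(isBadCertificate(pool)); // prints true
    }
}
```

A check like this could be called from wherever the application already catches the pool's SQLException, to log or count the bad_certificate case separately.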

Steps to reproduce?

  1. Connect to Cloud SQL (either Postgres or MySQL) using username/password. “Allow only SSL connections” does not need to be enabled in Cloud SQL.
  2. Let the app run for a while. The issue usually presents itself after a few hours or so. I have been unable to reproduce it locally with differing Hikari settings.
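
The steps above can be sketched roughly as follows. This is not the reporter's code, only an assumed setup: the database name, instance connection name, and credentials are placeholders, and `ReproPoolSettings` is a hypothetical class. The sketch builds the JDBC URL with the socket factory parameters and a `Properties` object using HikariCP's property names, so it runs with the JDK alone. With HikariCP on the classpath, the properties could be passed to `new HikariConfig(properties)`.

```java
import java.util.Properties;

/** Hypothetical sketch of a pool configuration matching the report; all values are placeholders. */
final class ReproPoolSettings {

    private ReproPoolSettings() {}

    /** Builds a Postgres JDBC URL that routes connections through the Cloud SQL socket factory. */
    static String jdbcUrl(String database, String instanceConnectionName) {
        return "jdbc:postgresql:///" + database
                + "?cloudSqlInstance=" + instanceConnectionName
                + "&socketFactory=com.google.cloud.sql.postgres.SocketFactory";
    }

    /** Collects HikariCP settings under the names HikariConfig reads from a Properties object. */
    static Properties hikariProperties(String jdbcUrl, String user, String password) {
        Properties p = new Properties();
        p.setProperty("jdbcUrl", jdbcUrl);
        p.setProperty("username", user);
        p.setProperty("password", password);
        // 30 minutes in milliseconds: the Hikari default mentioned in the bug description.
        p.setProperty("maxLifetime", "1800000");
        // 30 seconds: matches "request timed out after 30000ms" in the stacktrace.
        p.setProperty("connectionTimeout", "30000");
        return p;
    }

    public static void main(String[] args) {
        String url = jdbcUrl("my-database", "my-project:my-region:my-instance");
        Properties props = hikariProperties(url, "my-user", "my-password");
        System.out.println(props.getProperty("jdbcUrl"));
    }
}
```

Because the pool replaces connections as they reach maxLifetime, new TLS handshakes keep happening long after startup, which is when the failure in the stacktrace shows up.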

Environment

  1. OS type and version: Docker Linux arm64
  2. Java SDK version: 17
  3. Cloud SQL Java Socket Factory version: postgres-socket-factory 1.11.2 - jdbi-cloudsql 2.3.5

Additional Details

My initial reaction was that it looks similar to the closed issue https://github.com/GoogleCloudPlatform/cloud-sql-jdbc-socket-factory/issues/472 and is possibly related to this Draft PR.

My colleague has already opened an incident with Google support for this issue (case 45213975), but I think it is worthwhile to open an issue here too.

About this issue

  • State: closed
  • Created a year ago
  • Comments: 22 (10 by maintainers)

Most upvoted comments

This has been fixed in v1.13.1. Thank you for all your help debugging this folks.

Hey folks, we found an edge case where the connector, when configured with multiple instances, would flood the thread pool in such a way that the threads would be unable to progress. We’ve identified a few related improvements in #1391 and #1390. We’ll cut a patch release this week for people affected by this.
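
To illustrate the general failure mode described here (this is not the connector's actual code, and the fix in #1391 and #1390 may look quite different), here is a minimal sketch of a fixed thread pool that stops making progress once every worker is blocked waiting on work queued to the same pool. `PoolStarvationDemo`, the pool size, and the timeout are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Generic illustration of thread pool starvation; names and sizes are hypothetical. */
final class PoolStarvationDemo {

    private PoolStarvationDemo() {}

    /** Returns true if the outer tasks cannot finish within the timeout. */
    static boolean starves(int poolSize, long timeoutMillis) {
        ExecutorService pool = Executors.newFixedThreadPool(poolSize);
        CountDownLatch allStarted = new CountDownLatch(poolSize);
        List<Future<String>> outers = new ArrayList<>();
        try {
            for (int i = 0; i < poolSize; i++) {
                outers.add(pool.submit(() -> {
                    // Wait until every worker thread is occupied by an outer task.
                    allStarted.countDown();
                    allStarted.await();
                    // The inner task is queued behind the outer tasks on the same pool,
                    // so no free thread is left to run it.
                    Future<String> inner = pool.submit(() -> "refreshed");
                    return inner.get();
                }));
            }
            for (Future<String> outer : outers) {
                outer.get(timeoutMillis, TimeUnit.MILLISECONDS);
            }
            return false;
        } catch (TimeoutException e) {
            return true; // no outer task could complete: the pool is starved
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdownNow(); // interrupts the blocked workers so the JVM can exit
        }
    }

    public static void main(String[] args) {
        System.out.println("starved: " + starves(2, 500));
    }
}
```

In this sketch one outer task per worker is enough to stall the pool. The more outer tasks there are relative to the pool size, the easier it is to occupy every worker at once, which would be consistent with the report that multiple configured instances triggered the problem.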

After testing, I found that the version in which the problem was introduced is 1.11.2. In the affected versions (> 1.11.1), my (Spring Boot) app successfully connects to the database (using Cloud SQL IAM auth) during application start. Within an hour or two, the SSLHandshakeException/bad_certificate exception is thrown. Short of a restart, the app cannot recover, even temporarily, from this “broken database connection” state.

As mentioned by @Ragge-dev, we also experience the exact same problem. Version 1.11.1 works fine.
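
Putting the version reports in this thread together (1.11.1 stable, 1.11.2 and 1.12.0 affected, fixed in v1.13.1), a small check for whether a deployed socket factory version falls in the reported range might look like the sketch below. It assumes that every release after 1.11.1 and before 1.13.1 is affected, which the thread does not confirm for each individual release. `AffectedVersions` is a hypothetical helper and only handles plain numeric versions such as 1.12.0.

```java
/** Hypothetical helper; the affected range is an assumption drawn from reports in this thread. */
final class AffectedVersions {

    private AffectedVersions() {}

    /** Compares dotted numeric versions component by component; missing components count as 0. */
    static int compare(String a, String b) {
        String[] x = a.split("\\.");
        String[] y = b.split("\\.");
        for (int i = 0; i < Math.max(x.length, y.length); i++) {
            int xi = i < x.length ? Integer.parseInt(x[i]) : 0;
            int yi = i < y.length ? Integer.parseInt(y[i]) : 0;
            if (xi != yi) {
                return Integer.compare(xi, yi);
            }
        }
        return 0;
    }

    /** True for versions newer than 1.11.1 (last reported stable) and older than 1.13.1 (fix). */
    static boolean reportedAffected(String version) {
        return compare(version, "1.11.1") > 0 && compare(version, "1.13.1") < 0;
    }

    public static void main(String[] args) {
        for (String v : new String[] {"1.11.1", "1.11.2", "1.12.0", "1.13.1"}) {
            System.out.println(v + " affected: " + reportedAffected(v));
        }
    }
}
```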

Hello @hessjcg. I was just about to open a new issue when I saw this one. We are seeing this exact issue (same stacktrace, with the error starting after the app has run for a few hours) on both versions 1.11.2 and 1.12.0, with version 1.11.1 being stable for us. The only difference is that we are using IAM to authenticate towards Cloud SQL.

We suspect that this commit in some way changed the behaviour of the certificate refresh logic and causes the error described by @msammarco. At least, we can’t find another relevant commit between versions 1.11.1 and 1.11.2.

Is this something you are able to look into? I have not yet reproduced it locally, although I could try.

I’ve been running a little Spring Boot app that connects to two databases in one instance in my GKE cluster. My app uses two separate connection pools and has Hikari set up to refresh all connections every minute (with the thought that connection creation will trigger this bug). After half a day, I don’t see the bad certificate yet. I’m going to downgrade this to a p2 to signal that this bug isn’t as pervasive as we originally thought. But we’re still working on identifying the root cause.

Where are you running the Java Connector? Is this all in GKE?

Yes

We’ll cut a release tomorrow that might fix this, as we’ve made some further improvements to the factoring of our background refresh operations, and the logging will be a part of it. Meanwhile, we’ll work on reproducing the issue or confirming it’s been fixed.
