cloud-sql-jdbc-socket-factory: `bad_certificate` errors intermittently preventing connection to cloud sql
Bug Description
When using HikariCP database connection pooling with Cloud SQL, we're intermittently seeing an issue, usually after a few hours, where connections become unavailable with a `bad_certificate`
error. We see this with both MySQL and Postgres Cloud SQL connections authenticating with username/password, and with differing Hikari configurations (most use the Hikari default maxLifetime of 30 minutes).
Example code (or command)
No response
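(The reporter did not attach code. For orientation, here is a minimal sketch of the kind of setup described above, following the documented HikariCP + Cloud SQL socket factory pattern; the database name, instance connection name, and credentials are placeholders.)

```java
import com.zaxxer.hikari.HikariConfig;
import com.zaxxer.hikari.HikariDataSource;

public class CloudSqlPoolExample {
    // Builds a Hikari pool that dials Cloud SQL through the Postgres socket
    // factory with username/password auth, matching the setup described in
    // this report. All identifiers here are placeholders.
    static HikariDataSource createDataSource() {
        HikariConfig config = new HikariConfig();
        config.setJdbcUrl("jdbc:postgresql:///mydb"); // no host/port: the socket factory handles dialing
        config.setUsername("db-user");
        config.setPassword("db-password");
        config.addDataSourceProperty("socketFactory",
            "com.google.cloud.sql.postgres.SocketFactory");
        config.addDataSourceProperty("cloudSqlInstance",
            "my-project:my-region:my-instance");
        config.setMaxLifetime(1_800_000); // Hikari's 30-minute default, as noted above
        return new HikariDataSource(config);
    }
}
```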
Stacktrace
....
Caused by: java.sql.SQLTransientConnectionException: HikariPool-3 - Connection is not available, request timed out after 30000ms.
at com.zaxxer.hikari.pool.HikariPool.createTimeoutException(HikariPool.java:695)
at com.zaxxer.hikari.pool.HikariPool.getConnection(HikariPool.java:197)
at com.zaxxer.hikari.pool.HikariPool.getConnection(HikariPool.java:162)
at com.zaxxer.hikari.HikariDataSource.getConnection(HikariDataSource.java:100)
at com.company.config.CloseableDataSourceHikari.getConnection(CloseableDataSourceHikari.java:25)
at org.jdbi.v3.core.Jdbi.open(Jdbi.java:319)
... 18 common frames omitted
Caused by: org.postgresql.util.PSQLException: The connection attempt failed.
at org.postgresql.core.v3.ConnectionFactoryImpl.openConnectionImpl(ConnectionFactoryImpl.java:354)
at org.postgresql.core.ConnectionFactory.openConnection(ConnectionFactory.java:54)
at org.postgresql.jdbc.PgConnection.<init>(PgConnection.java:263)
at org.postgresql.Driver.makeConnection(Driver.java:443)
at org.postgresql.Driver.connect(Driver.java:297)
at com.zaxxer.hikari.util.DriverDataSource.getConnection(DriverDataSource.java:138)
at com.zaxxer.hikari.pool.PoolBase.newConnection(PoolBase.java:358)
at com.zaxxer.hikari.pool.PoolBase.newPoolEntry(PoolBase.java:206)
at com.zaxxer.hikari.pool.HikariPool.createPoolEntry(HikariPool.java:477)
at com.zaxxer.hikari.pool.HikariPool.access$100(HikariPool.java:71)
at com.zaxxer.hikari.pool.HikariPool$PoolEntryCreator.call(HikariPool.java:725)
at com.zaxxer.hikari.pool.HikariPool$PoolEntryCreator.call(HikariPool.java:711)
at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
... 3 common frames omitted
Caused by: javax.net.ssl.SSLHandshakeException: Received fatal alert: bad_certificate
at java.base/sun.security.ssl.Alert.createSSLException(Alert.java:131)
at java.base/sun.security.ssl.Alert.createSSLException(Alert.java:117)
at java.base/sun.security.ssl.TransportContext.fatal(TransportContext.java:358)
at java.base/sun.security.ssl.Alert$AlertConsumer.consume(Alert.java:293)
at java.base/sun.security.ssl.TransportContext.dispatch(TransportContext.java:204)
at java.base/sun.security.ssl.SSLTransport.decode(SSLTransport.java:172)
at java.base/sun.security.ssl.SSLSocketImpl.decode(SSLSocketImpl.java:1510)
at java.base/sun.security.ssl.SSLSocketImpl.readHandshakeRecord(SSLSocketImpl.java:1425)
at java.base/sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:455)
at java.base/sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:426)
at com.google.cloud.sql.core.CoreSocketFactory.createSslSocket(CoreSocketFactory.java:339)
at com.google.cloud.sql.core.CoreSocketFactory.connect(CoreSocketFactory.java:201)
at com.google.cloud.sql.postgres.SocketFactory.createSocket(SocketFactory.java:77)
at org.postgresql.core.PGStream.createSocket(PGStream.java:231)
at org.postgresql.core.PGStream.<init>(PGStream.java:98)
at org.postgresql.core.v3.ConnectionFactoryImpl.tryConnect(ConnectionFactoryImpl.java:132)
at org.postgresql.core.v3.ConnectionFactoryImpl.openConnectionImpl(ConnectionFactoryImpl.java:258)
... 15 common frames omitted
Steps to reproduce?
- Connect to Cloud SQL, either Postgres or MySQL, using username/password ("Allow only SSL connections" does not need to be enabled in Cloud SQL).
- Let your app run for a while; the issue usually presents itself within a few hours. I have been unable to reproduce it locally with differing Hikari settings.
Environment
- OS type and version: Docker Linux arm64
- Java SDK version: 17
- Cloud SQL Java Socket Factory version: postgres-socket-factory 1.11.2; jdbi-cloudsql 2.3.5
Additional Details
My initial reaction was that it looks similar to the closed issue here: https://github.com/GoogleCloudPlatform/cloud-sql-jdbc-socket-factory/issues/472, and it is possibly related to this Draft PR.
My colleague has already opened an incident with Google support for this issue (Case 45213975), but I think it is worthwhile to open an issue here too.
About this issue
- Original URL
- State: closed
- Created a year ago
- Comments: 22 (10 by maintainers)
Commits related to this issue
- wip: attempt to reproduce cloud-sql-jdbc-socket-factory #1314 see https://github.com/GoogleCloudPlatform/cloud-sql-jdbc-socket-factory/issues/1314 — committed to GoogleCloudPlatform/cloud-sql-proxy-operator by hessjcg a year ago
- fix: Increase threadpool count to avoid deadlock and reduce forceRefresh churn #1314 — committed to GoogleCloudPlatform/cloud-sql-jdbc-socket-factory by hessjcg a year ago
- fix: Increase threadpool count to avoid deadlocks (#1391) To refresh a Cloud SQL Instance's certificates, the current algorithm uses 2 threads from the thread pool for each instance. Because the thr... — committed to GoogleCloudPlatform/cloud-sql-jdbc-socket-factory by hessjcg a year ago
- fix: remove race condition bug in refresh logic (#1390) Update the logic in forceRefresh() to reduce the churn on the thread pool when the certificate refresh API calls are failing. New forceRefre... — committed to GoogleCloudPlatform/cloud-sql-jdbc-socket-factory by hessjcg a year ago
This has been fixed in v1.13.1. Thank you for all your help debugging this, folks.
Hey folks, we found an edge case where the connector, when configured with multiple instances, would flood the thread pool in such a way that the threads were unable to make progress. We've identified a few related improvements in #1391 and #1390. We'll cut a patch release this week for people affected by this.
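(Not the connector's actual refresh code, but a generic Java sketch of that failure mode: tasks on a small shared executor blocking on sub-tasks submitted to the same executor, until every thread is parked and nothing can progress.)

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class PoolStarvationSketch {
    public static void main(String[] args) {
        // A deliberately tiny shared pool, standing in for a refresh executor.
        ExecutorService pool = Executors.newFixedThreadPool(2);

        // More concurrent "instances" than pool threads.
        for (int i = 0; i < 4; i++) {
            pool.submit(() -> {
                // Each outer task holds a pool thread while it waits on an
                // inner task that also needs a pool thread to run.
                Future<String> cert = pool.submit(() -> "fresh-certificate");
                try {
                    return cert.get(); // once all threads park here, nothing progresses
                } catch (InterruptedException | ExecutionException e) {
                    throw new IllegalStateException(e);
                }
            });
        }
        // With 4 outer tasks competing for 2 threads, both threads end up
        // blocked in get() while the inner tasks sit in the queue behind the
        // remaining outer tasks: a classic thread-pool deadlock.
    }
}
```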
After testing, I found that the version in which the problem was introduced is `1.11.2`. In the affected versions (> `1.11.1`), my (Spring Boot) app successfully connects to the database (using Cloud SQL IAM auth) during application start. Within an hour or two, the `SSLHandshakeException`/`bad_certificate` exception is thrown. Except for a restart, the app is unable to (even temporarily) recover from this "broken database connection" state.

As mentioned by @Ragge-dev, we also experience the exact same problem. Version `1.11.1` works fine.

Hello @hessjcg. I was just about to open a new issue when I saw this one; we are seeing this exact issue (same stacktrace, with the error starting after the app has run for a few hours) on both versions `1.11.2` and `1.12.0`, with version `1.11.1` being stable for us. The only difference is that we are using IAM to authenticate towards Cloud SQL. We suspect that this commit in some way changed the behaviour of the certificate refresh logic and causes the error described by @msammarco; at least we can't find another relevant commit between versions `1.11.1` and `1.11.2`. Is this something you are able to look into? I currently have not reproduced it locally, although I could try.
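(For reference, the IAM-auth setup these commenters describe differs from the username/password sketch earlier only in a few data source properties, per the connector's documentation. The user and instance names below are placeholders.)

```java
import com.zaxxer.hikari.HikariConfig;

public class CloudSqlIamAuthExample {
    // Same pool setup as the earlier sketch, but using IAM database
    // authentication: an IAM principal as the user, no password,
    // enableIamAuth=true, and sslmode=disable because the socket factory
    // itself provides the encrypted tunnel. Names are placeholders.
    static HikariConfig iamConfig() {
        HikariConfig config = new HikariConfig();
        config.setJdbcUrl("jdbc:postgresql:///mydb");
        config.setUsername("my-iam-user@my-project.iam"); // placeholder IAM principal
        config.addDataSourceProperty("socketFactory",
            "com.google.cloud.sql.postgres.SocketFactory");
        config.addDataSourceProperty("cloudSqlInstance",
            "my-project:my-region:my-instance");
        config.addDataSourceProperty("enableIamAuth", "true");
        config.addDataSourceProperty("sslmode", "disable");
        return config;
    }
}
```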
I've been running a little Spring Boot app in my GKE cluster that connects to two databases in one instance. My app uses two separate connection pools and has Hikari set up to refresh all connections every minute (with the thought that connection creation will trigger this bug). After half a day, I don't see the bad certificate yet. I'm going to downgrade this to a P2 to signal that this bug isn't as pervasive as we originally thought. But we're still working on identifying the root cause.
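(A sketch of that kind of churn setup, assumed rather than the maintainer's actual test app: setting maxLifetime to one minute makes Hikari retire and recreate every connection continuously, which repeatedly exercises the socket factory's certificate path.)

```java
import com.zaxxer.hikari.HikariConfig;
import com.zaxxer.hikari.HikariDataSource;

public class ConnectionChurnSketch {
    // Assumed repro harness: retire every pooled connection after one minute
    // so that new connections (the code path that goes through the socket
    // factory) are created continuously. Placeholder credentials; dbName and
    // instance are supplied by the caller.
    static HikariDataSource churningPool(String dbName, String instance) {
        HikariConfig config = new HikariConfig();
        config.setJdbcUrl("jdbc:postgresql:///" + dbName);
        config.setUsername("db-user");
        config.setPassword("db-password");
        config.addDataSourceProperty("socketFactory",
            "com.google.cloud.sql.postgres.SocketFactory");
        config.addDataSourceProperty("cloudSqlInstance", instance);
        config.setMaxLifetime(60_000); // retire after 1 minute (default is 30 minutes)
        return new HikariDataSource(config);
    }
}
```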
Yes
We'll cut a release tomorrow that might fix this, as we've made some further improvements to the factoring of our background refresh operations, and the logging will be part of it. Meanwhile, we'll work on reproducing the issue or confirming it's been fixed.
Where are you running the Java Connector? Is this all in GKE?