cloud-sql-jdbc-socket-factory: Cloud SQL IAM service account authentication failed for user
Bug Description
An otherwise valid configuration will occasionally result in the error "Cloud SQL IAM service account authentication failed for user".
The backend will log "Failed to validate access token" (as opposed to "access token expired").
While customers have seen this error in Cloud Run (possibly suggesting a CPU-throttling issue with the background refresh), it also appears on GKE.
Updating to the latest version has not resolved these occasional errors.
Example code (or command)
No response
Stacktrace
No response
Steps to reproduce?
- Deploy an app that logs in with Auto IAM AuthN
- Wait awhile
- Observe occasional failures
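For reference, a minimal sketch of the kind of configuration the first step refers to, assuming the PostgreSQL driver with the socket factory on the classpath. The class name IamAuthConfig and the instance, database, and user values are hypothetical placeholders; the property names follow the connector's documented settings as I understand them, so check them against the README for your version. The sketch only builds the settings and does not open a connection.

```java
import java.util.Properties;

/** Hypothetical helper; all instance, database, and user values are placeholders. */
final class IamAuthConfig {

    // The host part of the URL is left empty because the socket factory
    // locates the instance through the cloudSqlInstance property.
    static String jdbcUrl(String database) {
        return "jdbc:postgresql:///" + database;
    }

    static Properties connectionProperties(String instanceConnectionName, String iamUser) {
        Properties props = new Properties();
        props.setProperty("socketFactory", "com.google.cloud.sql.postgres.SocketFactory");
        props.setProperty("cloudSqlInstance", instanceConnectionName);
        // Ask the connector to obtain and refresh the IAM token itself (Auto IAM AuthN).
        props.setProperty("enableIamAuth", "true");
        // Assumption: for a service account, the database user is the account
        // email without the ".gserviceaccount.com" suffix.
        props.setProperty("user", iamUser);
        // Assumption: the connector already encrypts the connection, so
        // driver-level SSL is turned off.
        props.setProperty("sslmode", "disable");
        return props;
    }

    public static void main(String[] args) {
        Properties props =
            connectionProperties("my-project:us-central1:my-instance", "my-sa@my-project.iam");
        System.out.println(jdbcUrl("my-db"));
        props.stringPropertyNames().forEach(k -> System.out.println(k + "=" + props.getProperty(k)));
    }
}
```

These values would then be passed to DriverManager.getConnection(url, props) or to a connection pool.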
Environment
- OS type and version: Linux Container
- Java SDK version: ?
- Cloud SQL Java Socket Factory version: v1.10.0
Additional Details
No response
About this issue
- State: closed
- Created a year ago
- Reactions: 2
- Comments: 42 (20 by maintainers)
Commits related to this issue
- fix: throw when Auto IAM AuthN is faulty Related to #1174 — committed to GoogleCloudPlatform/cloud-sql-jdbc-socket-factory by enocom a year ago
- fix: throw when token is expired or empty (#1233) Related to #1174 — committed to GoogleCloudPlatform/cloud-sql-jdbc-socket-factory by enocom a year ago
- fix: log error when token is invalid Fixes #1174 — committed to GoogleCloudPlatform/cloud-sql-jdbc-socket-factory by enocom a year ago
- fix: log error when token is invalid (#1313) Fixes #1174 — committed to GoogleCloudPlatform/cloud-sql-jdbc-socket-factory by enocom a year ago
So quick update – after running on Cloud Run with 100 instances for almost a week, I’m not seeing this error.
Instead of continuing to guess, I’m going to make a change to the connector here to throw on empty tokens or expired tokens (the two most likely problems). From there we can further isolate this to a problem in the credentials code or possibly in the backend (although I’ve only heard about this in Java).
v1.11.1 now has a check for an invalid token. I recommend upgrading to see if these occasional errors are caused by an invalid token.
In my case, this is clearly not resource constrained; it was happening more on applications that get little traffic.
thx @enocom, I don't find the error in the logs anymore 😄
thx a lot, seems fixed for me.
I just merged https://github.com/GoogleCloudPlatform/cloud-sql-jdbc-socket-factory/pull/1233 which will throw an exception if an auth token is empty or expired (the two leading hypotheses). That will go out in our next release on Tuesday and I hope it will narrow down the issue here.
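To illustrate the kind of check described here, a rough sketch follows. It is not the connector's actual code: TokenCheck and Token are hypothetical names, and the token is modeled as a plain value plus expiry time rather than the real credentials type.

```java
import java.time.Instant;

/** Hypothetical illustration of an "empty or expired" token check. */
final class TokenCheck {

    /** Stand-in for an OAuth2 access token: the token string and its expiry. */
    static final class Token {
        final String value;
        final Instant expiresAt; // null means the expiry is unknown

        Token(String value, Instant expiresAt) {
            this.value = value;
            this.expiresAt = expiresAt;
        }
    }

    /** Throws IllegalStateException if the token is missing, blank, or already expired. */
    static void validate(Token token, Instant now) {
        if (token == null || token.value == null || token.value.trim().isEmpty()) {
            throw new IllegalStateException("Access token is empty");
        }
        // A token whose expiry is at or before "now" is treated as expired.
        if (token.expiresAt != null && !token.expiresAt.isAfter(now)) {
            throw new IllegalStateException(
                "Access token expired at " + token.expiresAt + " (now: " + now + ")");
        }
    }

    public static void main(String[] args) {
        Instant now = Instant.now();
        validate(new Token("example-token", now.plusSeconds(3600)), now); // passes
        try {
            validate(new Token("example-token", now.minusSeconds(1)), now);
        } catch (IllegalStateException e) {
            System.out.println("Rejected: " + e.getMessage());
        }
    }
}
```

Failing fast like this separates a client-side problem (a bad token was about to be sent) from a backend rejection of a token that looked valid.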
We are using gcr.io/distroless/java17-debian11
We’re using amazoncorretto:17-alpine
I can share that we were seeing this issue ~1000 times a day across ~14 Cloud Run instances before disabling CPU throttling in Cloud Run. After disabling CPU throttling, we are still seeing this ~200 times a day. So if you have a hard time reproducing, enable CPU throttling (on the assumption that the product should also reliably manage IAM connections for CPU-throttled Cloud Run instances).
Fwiw, given what I have seen, CPU load is not necessary to reproduce it, although it may aggravate things. 24 hours is also a bit of a short timeframe. Maybe I can give you a container with its logic stripped out that you can run on a larger pool of nodes?
Yes. However, in this version we have downgraded to postgres-socket-factory-1.8.3. To be clear, this is the 3rd incident since January and the first since my last message. 5 minutes prior to the exceptions there is a significant increase in logging activity along the lines of:
About 50 of them in total. A handful of requests complete normally, interspersed with this logging, then all requests begin to fail as the pool is exhausted, I suppose. Fwiw, the timestamp printed in the above trace roughly matches the timestamp in the Cloud Run logs (same second).
- 10:39 ET - socket factory cycling begins + first “Cloud SQL IAM service account authentication failed” PG logs
- 10:45 ET - “timeouts” begin
- 11:02 ET - full recovery without intervention