cloud-sql-jdbc-socket-factory: Cloud SQL IAM service account authentication failed for user

Bug Description

An otherwise valid configuration will occasionally fail with “Cloud SQL IAM service account authentication failed for user”.

The backend logs “Failed to validate access token” (as opposed to “access token expired”).

While customers have seen this error in Cloud Run (possibly suggesting a CPU-throttling issue with the background refresh), it also appears on GKE.

Updating to the latest version has not resolved these occasional errors.

Example code (or command)

No response

Stacktrace

No response

Steps to reproduce?

  1. Deploy an app that logs in with Auto IAM AuthN
  2. Wait a while
  3. Observe occasional failures
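For context, an app like the one in step 1 typically enables Auto IAM AuthN through the connector’s documented PostgreSQL JDBC properties. The sketch below shows that setup; the project, instance, and service account names are placeholders, not values from this issue:

```java
import java.util.Properties;

// Sketch of a typical Auto IAM AuthN connection setup with the Cloud SQL
// Java connector (PostgreSQL flavor). Names here are placeholders.
public class IamAuthExample {

  static Properties connectionProps(String instance, String iamUser) {
    Properties props = new Properties();
    // Route the connection through the Cloud SQL socket factory.
    props.setProperty("socketFactory", "com.google.cloud.sql.postgres.SocketFactory");
    props.setProperty("cloudSqlInstance", instance);
    // Turn on automatic IAM database authentication.
    props.setProperty("enableIamAuth", "true");
    // The connector already encrypts traffic, so the driver's own SSL is off.
    props.setProperty("sslmode", "disable");
    props.setProperty("user", iamUser);
    return props;
  }

  public static void main(String[] args) {
    Properties props = connectionProps(
        "my-project:us-central1:my-instance", "sa-name@my-project.iam");
    // In a real app these props would be passed to the driver, e.g.:
    // DriverManager.getConnection("jdbc:postgresql:///mydb", props);
    System.out.println(props.getProperty("enableIamAuth"));  // prints "true"
  }
}
```

With this configuration the connector fetches and refreshes the IAM access token in the background, which is the machinery the failures in this issue point at.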

Environment

  1. OS type and version: Linux Container
  2. Java SDK version: ?
  3. Cloud SQL Java Socket Factory version: v1.10.0

Additional Details

No response

About this issue

  • Original URL
  • State: closed
  • Created a year ago
  • Reactions: 2
  • Comments: 42 (20 by maintainers)

Most upvoted comments

So quick update – after running on Cloud Run with 100 instances for almost a week, I’m not seeing this error.

Instead of continuing to guess, I’m going to make a change to the connector to throw on empty or expired tokens (the two most likely problems). From there we can further isolate this to a problem in the credentials code or possibly in the backend (although I’ve only heard about this in Java).

v1.11.1 now has a check for an invalid token. I recommend upgrading to see if these occasional errors are caused by an invalid token.

In my case, this is clearly not resource constrained; it was happening more on applications that get little traffic.

Thanks @enocom, I don’t find the error in the logs anymore 😄

Thanks a lot, this seems fixed for me.

I just merged https://github.com/GoogleCloudPlatform/cloud-sql-jdbc-socket-factory/pull/1233 which will throw an exception if an auth token is empty or expired (the two leading hypotheses). That will go out in our next release on Tuesday and I hope it will narrow down the issue here.
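The check described above might look roughly like the sketch below. This is a hypothetical illustration, not the connector’s actual code from that PR; the method name and signature are invented for the example:

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical sketch of a token validity guard covering the two leading
// hypotheses: an empty token and an expired token.
public class TokenCheck {

  /**
   * Throws if the token value is missing/empty or already expired at
   * {@code now}, so the failure surfaces in the connector instead of as a
   * confusing backend authentication error.
   */
  static void checkTokenValid(String tokenValue, Instant expiration, Instant now) {
    if (tokenValue == null || tokenValue.isEmpty()) {
      throw new IllegalStateException("Access token is empty; IAM authentication would fail.");
    }
    if (expiration != null && !now.isBefore(expiration)) {
      throw new IllegalStateException("Access token expired at " + expiration);
    }
  }

  public static void main(String[] args) {
    Instant now = Instant.now();
    // A token expiring in an hour passes the check.
    checkTokenValid("example-token", now.plus(Duration.ofHours(1)), now);

    // An empty token is rejected.
    boolean threw = false;
    try {
      checkTokenValid("", now.plus(Duration.ofHours(1)), now);
    } catch (IllegalStateException e) {
      threw = true;
    }
    System.out.println(threw);  // prints "true"
  }
}
```

Failing fast like this distinguishes a bad token produced by the credentials code from a token the backend rejects for other reasons.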

we are using gcr.io/distroless/java17-debian11

In that case, I’m going to adjust my approach and restore the original connector and just run that for a few days.

After running the debug app in Cloud Run and GKE for a week or so, I haven’t seen a single occurrence of this error.

Meanwhile, if folks could let me know what base container image they’re using, that might help.

We’re using amazoncorretto:17-alpine

I can share that we were seeing this issue ~1000 times a day across ~14 Cloud Run instances before disabling CPU throttling in Cloud Run. After disabling CPU throttling, we are still seeing it ~200 times a day. So if you have a hard time reproducing, enable CPU throttling (on the assumption that the product should also reliably manage IAM connections for CPU-throttled Cloud Run instances).

FWIW, given what I have seen, CPU load is not necessary to reproduce it, although maybe it will agitate things. 24 hours is also a bit of a short timeframe. Maybe I can give you a container with its logic stripped out that you can run on a larger pool of nodes?

Yes - however, in this version we have downgraded to postgres-socket-factory 1.8.3. To be clear, this is the 3rd incident since January and the first since my last message.

Five minutes prior to the exceptions, there is a significant increase in logging activity along the lines of:

Mar 15, 2023 2:39:34 PM com.google.cloud.sql.core.CoreSocketFactory connect
INFO: Connecting to Cloud SQL instance [xxxx] via SSL socket.  

About 50 of them in total. A handful of requests complete normally, interspersed with this logging, then all requests begin to fail as the pool is exhausted, I suppose. FWIW, the timestamp printed in the above trace roughly matches the timestamp in the Cloud Run logs - same second.
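The “pool is exhausted” failure mode can be modeled in a few lines. This toy model (not the connector’s or any pool library’s actual code) treats pool slots as semaphore permits: once every slot is held by a connection stuck in a failing authentication handshake, new checkouts time out rather than succeed:

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

// Toy model of connection-pool exhaustion. Each permit stands in for one
// pooled connection; holding all permits simulates connections stuck in a
// failing IAM handshake that are never returned to the pool.
public class PoolExhaustion {

  /** Tries to check one connection slot out of the pool within the timeout. */
  static boolean checkout(Semaphore pool, long timeoutMs) throws InterruptedException {
    return pool.tryAcquire(timeoutMs, TimeUnit.MILLISECONDS);
  }

  public static void main(String[] args) throws InterruptedException {
    Semaphore pool = new Semaphore(5);  // a pool of 5 connections
    // Simulate all 5 slots being held by connections that never complete
    // their (failing) handshake and are never released.
    for (int i = 0; i < 5; i++) {
      checkout(pool, 0);
    }
    // The next request cannot obtain a slot; it times out instead.
    System.out.println(checkout(pool, 10));  // prints "false"
  }
}
```

This matches the reported pattern: a burst of reconnect logging as slots churn, a few requests still completing on surviving connections, then blanket failures once no slot is ever freed.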

10:39 ET - socket factory cycling begins + first “Cloud SQL IAM service account authentication failed” PG logs
10:45 ET - “timeouts” begin
11:02 ET - full recovery without intervention