cloud-sql-jdbc-socket-factory: Cloud SQL IAM service account authentication failed for user

Bug Description

An otherwise valid configuration will occasionally fail with “Cloud SQL IAM service account authentication failed for user”.

The backend logs “Failed to validate access token” (as opposed to “access token expired”).

While customers have seen this error in Cloud Run (possibly suggesting a CPU-throttling issue with the background refresh), it also appears on GKE.

Updating to the latest version has not resolved these occasional errors.

Example code (or command)

No response

Stacktrace

No response

Steps to reproduce?

  1. Deploy an app that logs in with Auto IAM AuthN
  2. Wait a while
  3. Observe occasional failures
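For context, an app like the one in step 1 typically enables Auto IAM AuthN through the connector’s documented PostgreSQL JDBC properties. The sketch below shows that setup; the project, instance, and service account names are placeholders, not values from this issue:

```java
import java.util.Properties;

// Sketch of a typical Auto IAM AuthN connection setup with the Cloud SQL
// Java connector (PostgreSQL flavor). Names here are placeholders.
public class IamAuthExample {

  static Properties connectionProps(String instance, String iamUser) {
    Properties props = new Properties();
    // Route the connection through the Cloud SQL socket factory.
    props.setProperty("socketFactory", "com.google.cloud.sql.postgres.SocketFactory");
    props.setProperty("cloudSqlInstance", instance);
    // Turn on automatic IAM database authentication.
    props.setProperty("enableIamAuth", "true");
    // The connector already encrypts traffic, so the driver's own SSL is off.
    props.setProperty("sslmode", "disable");
    props.setProperty("user", iamUser);
    return props;
  }

  public static void main(String[] args) {
    Properties props = connectionProps(
        "my-project:us-central1:my-instance", "sa-name@my-project.iam");
    // In a real app these props would be passed to the driver, e.g.:
    // DriverManager.getConnection("jdbc:postgresql:///mydb", props);
    System.out.println(props.getProperty("enableIamAuth"));  // prints "true"
  }
}
```

With this configuration the connector fetches and refreshes the IAM access token in the background, which is the machinery the failures in this issue point at.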

Environment

  1. OS type and version: Linux Container
  2. Java SDK version: ?
  3. Cloud SQL Java Socket Factory version: v1.10.0

Additional Details

No response

About this issue

  • Original URL
  • State: closed
  • Created a year ago
  • Reactions: 2
  • Comments: 42 (20 by maintainers)

Most upvoted comments

So quick update – after running on Cloud Run with 100 instances for almost a week, I’m not seeing this error.

Instead of continuing to guess, I’m going to make a change to the connector to throw on empty or expired tokens (the two most likely problems). From there we can further isolate this to a problem in the credentials code or possibly in the backend (although I’ve only heard about this in Java).

v1.11.1 now has a check for an invalid token. I recommend upgrading to see if these occasional errors are caused by an invalid token.

In my case, this is clearly not resource constrained; it was happening more on applications that get little traffic.

Thanks @enocom, I don’t find the error in the logs anymore 😄

Thanks a lot, this seems fixed for me.

I just merged https://github.com/GoogleCloudPlatform/cloud-sql-jdbc-socket-factory/pull/1233 which will throw an exception if an auth token is empty or expired (the two leading hypotheses). That will go out in our next release on Tuesday and I hope it will narrow down the issue here.
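The check described above might look roughly like the sketch below. This is a hypothetical illustration, not the connector’s actual code from that PR; the method name and signature are invented for the example:

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical sketch of a token validity guard covering the two leading
// hypotheses: an empty token and an expired token.
public class TokenCheck {

  /**
   * Throws if the token value is missing/empty or already expired at
   * {@code now}, so the failure surfaces in the connector instead of as a
   * confusing backend authentication error.
   */
  static void checkTokenValid(String tokenValue, Instant expiration, Instant now) {
    if (tokenValue == null || tokenValue.isEmpty()) {
      throw new IllegalStateException("Access token is empty; IAM authentication would fail.");
    }
    if (expiration != null && !now.isBefore(expiration)) {
      throw new IllegalStateException("Access token expired at " + expiration);
    }
  }

  public static void main(String[] args) {
    Instant now = Instant.now();
    // A token expiring in an hour passes the check.
    checkTokenValid("example-token", now.plus(Duration.ofHours(1)), now);

    // An empty token is rejected.
    boolean threw = false;
    try {
      checkTokenValid("", now.plus(Duration.ofHours(1)), now);
    } catch (IllegalStateException e) {
      threw = true;
    }
    System.out.println(threw);  // prints "true"
  }
}
```

Failing fast like this distinguishes a bad token produced by the credentials code from a token the backend rejects for other reasons.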

we are using gcr.io/distroless/java17-debian11

In that case, I’m going to adjust my approach and restore the original connector and just run that for a few days.

After running the debug app in Cloud Run and GKE for a week or so, I haven’t seen a single occurrence of this error.

Meanwhile, if folks could let me know what base container image they’re using, that might help.

We’re using amazoncorretto:17-alpine

I can share that we were seeing this issue ~1000 times a day across ~14 Cloud Run instances before disabling CPU throttling in Cloud Run. After disabling CPU throttling, we are still seeing it ~200 times a day. So if you have a hard time reproducing, enable CPU throttling (on the assumption that the product should also reliably manage IAM connections for CPU-throttled Cloud Run instances).

FWIW, given what I have seen, CPU load is not necessary to reproduce it, although maybe it will agitate things. 24 hours is also a bit of a short timeframe. Maybe I can give you a container with its logic stripped out that you can run on a larger pool of nodes?

Yes - however, in this version we have downgraded to postgres-socket-factory 1.8.3. To be clear, this is the 3rd incident since January and the first since my last message.

Five minutes prior to the exceptions, there is a significant increase in logging activity along the lines of:

Mar 15, 2023 2:39:34 PM com.google.cloud.sql.core.CoreSocketFactory connect
INFO: Connecting to Cloud SQL instance [xxxx] via SSL socket.  

About 50 of them in total. A handful of requests complete normally, interspersed with this logging, then all requests begin to fail as the pool is exhausted, I suppose. FWIW, the timestamp printed in the above trace roughly matches the timestamp in the Cloud Run logs - same second.
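The “pool is exhausted” failure mode can be modeled in a few lines. This toy model (not the connector’s or any pool library’s actual code) treats pool slots as semaphore permits: once every slot is held by a connection stuck in a failing authentication handshake, new checkouts time out rather than succeed:

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

// Toy model of connection-pool exhaustion. Each permit stands in for one
// pooled connection; holding all permits simulates connections stuck in a
// failing IAM handshake that are never returned to the pool.
public class PoolExhaustion {

  /** Tries to check one connection slot out of the pool within the timeout. */
  static boolean checkout(Semaphore pool, long timeoutMs) throws InterruptedException {
    return pool.tryAcquire(timeoutMs, TimeUnit.MILLISECONDS);
  }

  public static void main(String[] args) throws InterruptedException {
    Semaphore pool = new Semaphore(5);  // a pool of 5 connections
    // Simulate all 5 slots being held by connections that never complete
    // their (failing) handshake and are never released.
    for (int i = 0; i < 5; i++) {
      checkout(pool, 0);
    }
    // The next request cannot obtain a slot; it times out instead.
    System.out.println(checkout(pool, 10));  // prints "false"
  }
}
```

This matches the reported pattern: a burst of reconnect logging as slots churn, a few requests still completing on surviving connections, then blanket failures once no slot is ever freed.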

10:39 ET - socket factory cycling begins + first “Cloud SQL IAM service account authentication failed” PG logs
10:45 ET - “timeouts” begin
11:02 ET - full recovery without intervention