google-auth-library-nodejs: Sometimes failing to retrieve Application Default Credentials on GKE Autopilot in googleapis auth library
Mirror question here in Stack Overflow: https://stackoverflow.com/questions/68854640/sometimes-failing-to-retrieve-application-default-credentials-on-gke-autopilot-i
Some pods in my GKE Autopilot cluster aren’t able to grab the Application Default Credentials to call other GCP services.
I will apply a new deployment, and 1 or 2 out of the 3 pods won’t be able to authenticate using the googleapis (google-auth-library) npm package (tried with version v73.0.0 and the latest v84.0.0).
I get:
Error: Could not load the default credentials. Browse to https://cloud.google.com/docs/authentication/getting-started for more information.
    at GoogleAuth.getApplicationDefaultAsync (/node_modules/google-auth-library/build/src/auth/googleauth.js:173:19)
I am using this code and retrying on failure:
const {google} = require('googleapis');

// Resolve after `ms` milliseconds (helper used by the retry below).
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

const setGoogleAuth = async () => {
  try {
    const auth = new google.auth.GoogleAuth({
      // Scopes can be specified either as an array or as a single, space-delimited string.
      scopes: ['https://www.googleapis.com/auth/cloud-platform'],
    });
    // Acquire an auth client, and bind it to all future calls
    const authClient = await auth.getClient();
    google.options({auth: authClient});
  } catch (e) {
    console.error(e);
    // Retry after sleeping for 3 seconds
    await sleep(3000);
    await setGoogleAuth();
  }
};
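The retry above recurses without a limit, so a pod that never obtains credentials keeps retrying forever. A minimal sketch of a bounded retry with exponential backoff follows. `backoffDelayMs` and `retryWithBackoff` are hypothetical helper names, not part of googleapis or google-auth-library, and the attempt count and delays are arbitrary assumptions.

```javascript
// Hypothetical helper: delay before retry number `attempt` (0-based),
// doubling from baseMs and capped at maxMs.
function backoffDelayMs(attempt, baseMs, maxMs) {
  return Math.min(maxMs, baseMs * Math.pow(2, attempt));
}

// Resolve after `ms` milliseconds.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Hypothetical helper: call `fn` until it resolves or `maxAttempts` is reached,
// then rethrow the last error so the caller (or the pod) can fail visibly.
async function retryWithBackoff(fn, options) {
  const opts = Object.assign({maxAttempts: 5, baseMs: 1000, maxMs: 30000}, options);
  let lastError;
  for (let attempt = 0; attempt < opts.maxAttempts; attempt++) {
    try {
      return await fn(attempt);
    } catch (e) {
      lastError = e;
      // Only sleep if another attempt will follow.
      if (attempt < opts.maxAttempts - 1) {
        await sleep(backoffDelayMs(attempt, opts.baseMs, opts.maxMs));
      }
    }
  }
  throw lastError;
}
```

With the code from the question, this could be used as `const authClient = await retryWithBackoff(() => auth.getClient());`. Rethrowing after the last attempt lets the process exit so that Kubernetes can restart the container instead of looping silently.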
Calling the metadata server manually from a troubled pod via
curl --location --request POST 'http://metadata/computeMetadata/v1/instance/service-accounts/default/identity?audience=my-gcp-endpoint' \
  --header 'Metadata-Flavor: Google' \
  --data-raw '{}'
returns a valid token.
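To run the same check from inside the failing Node process rather than from a shell, something like the following sketch could be used. It issues a plain GET with Node's built-in `http` module. `buildIdentityRequest` and `fetchIdentityToken` are hypothetical names, the 3 second timeout is an arbitrary assumption, and the host defaults to `metadata.google.internal` (pass `'metadata'` to match the curl call above).

```javascript
const http = require('http');

// Hypothetical helper: request options for the identity endpoint
// queried by the curl command above.
function buildIdentityRequest(audience, host) {
  return {
    host: host || 'metadata.google.internal',
    path: '/computeMetadata/v1/instance/service-accounts/default/identity' +
      '?audience=' + encodeURIComponent(audience),
    headers: {'Metadata-Flavor': 'Google'},
    timeout: 3000, // assumption: give up after 3 seconds of socket inactivity
  };
}

// Hypothetical helper: resolve with the token body, reject on any failure.
function fetchIdentityToken(audience, host) {
  return new Promise((resolve, reject) => {
    const req = http.get(buildIdentityRequest(audience, host), (res) => {
      let body = '';
      res.setEncoding('utf8');
      res.on('data', (chunk) => { body += chunk; });
      res.on('end', () => {
        if (res.statusCode === 200) {
          resolve(body);
        } else {
          reject(new Error('metadata server returned HTTP ' + res.statusCode));
        }
      });
    });
    // The 'timeout' event does not abort the request by itself.
    req.on('timeout', () => req.destroy(new Error('metadata request timed out')));
    req.on('error', reject);
  });
}
```

Logging the outcome of `fetchIdentityToken('my-gcp-endpoint')` at startup, next to the auth error, would show whether the metadata server is reachable from that process at the moment the credentials lookup fails.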
Sometimes killing the pod and having them recreated works (using minReplicas in Horizontal Pod Autoscaler) and sometimes no matter how many times I kill the troubled pods the issue persists. Other times, I’ll redeploy and have no problems. The behaviour seems very non-deterministic.
I also tried running another Node process inside the troubled pod, and it did not get the error.
Any help would be appreciated, thank you!
Possible related issues
- https://github.com/googleapis/google-auth-library-nodejs/issues/798
- https://github.com/googleapis/google-auth-library-nodejs/issues/786
- https://github.com/googleapis/google-auth-library-nodejs/issues/526
Environment details
- GKE Version: 1.20.8-gke.900
- Node.js version: v10.24.1
- npm version: 6.14.12
- google-auth-library version: tried with v73.0.0 and v84.0.0 (latest)
Steps to reproduce
- Difficult to reproduce, as the behaviour is non-deterministic
About this issue
- State: closed
- Created 3 years ago
- Reactions: 4
- Comments: 16 (6 by maintainers)
@mikegin could you try setting DETECT_GCP_RETRIES=3, and see if this stops the behavior? This will hopefully give GKE Autopilot more time to initialize.

We’ve merged a fix! Happy to share this is fixed as of gcp-metadata v5.1.0, which should be automatically picked up by future installations of this library.

Want to give an update to this issue - this PR should resolve this issue, however we’re syncing internally on its implementation and uniformity across languages: https://github.com/googleapis/gcp-metadata/pull/528
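
For installations that cannot yet pick up the fixed gcp-metadata, the DETECT_GCP_RETRIES suggestion is normally applied as an environment variable in the pod spec. A minimal sketch of applying the same default from inside the application is shown below. `applyMetadataRetryDefault` is a hypothetical helper, and the sketch assumes the variable is read when the credentials lookup runs, so it is called before any auth code is loaded.

```javascript
// Hypothetical helper: set DETECT_GCP_RETRIES unless the deployment
// already provided a value. Returns the value that ends up in effect.
function applyMetadataRetryDefault(env, retries) {
  if (env.DETECT_GCP_RETRIES === undefined || env.DETECT_GCP_RETRIES === '') {
    env.DETECT_GCP_RETRIES = String(retries);
  }
  return env.DETECT_GCP_RETRIES;
}

// Assumption: this runs at the very top of the entry point,
// before require('googleapis') and before the first auth call.
applyMetadataRetryDefault(process.env, 3);
```

Setting the variable in the Deployment manifest remains the more direct option, since it does not depend on when the library reads it.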
Thanks @bcoe. I was facing the same issue with a deployment in a standard GKE cluster. Setting DETECT_GCP_RETRIES=3 did work, but it would be great to have this documented somewhere easier to find, maybe in the GKE Workload Identity troubleshooting section?

Adding a “Troubleshooting” section in the README with the issue and solution would be good. Even mentioning DEBUG_AUTH would be great, as it helped with further diagnosis.

We also hit this today and spent quite a while chasing shadows before eventually finding this thread.
While documentation is good and an improvement…
As a Google Cloud customer, I would like the Google Cloud SDKs to work on Google Cloud by default. It is not ideal to have to find somewhat obscure env vars hidden in the READMEs of internal SDK libraries in order to make things work reliably.
It seems like there is already handling to speed up the metadata checks for those who are running the SDK outside of GCP, e.g. the failFast here. Perhaps defaulting to more tries would be OK?

There are also some (apparently) GCE-specific checks that control the failFast behaviour, here, but from a quick test, the env vars they use don’t seem to be present for GKE workloads (i.e. inside a pod). Maybe those could be extended in some way to detect GKE as well as GCE?
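
As a rough illustration of that last suggestion (not how the library behaves today), the retry count could depend on whether the process appears to run inside a Kubernetes pod. Kubernetes sets `KUBERNETES_SERVICE_HOST` in every container, but that alone does not prove the cluster is GKE. Both function names and the default of 3 are assumptions made for this sketch.

```javascript
// Hypothetical check: true when the standard Kubernetes service
// environment variable is present, i.e. the process runs in a pod.
function looksLikeKubernetesPod(env) {
  return Boolean(env.KUBERNETES_SERVICE_HOST);
}

// Hypothetical policy: an explicit DETECT_GCP_RETRIES always wins;
// otherwise retry a few times inside a pod and not at all elsewhere.
function metadataRetriesFor(env) {
  if (env.DETECT_GCP_RETRIES) {
    return Number(env.DETECT_GCP_RETRIES);
  }
  return looksLikeKubernetesPod(env) ? 3 : 0;
}
```

Keeping zero retries outside a pod preserves the fast failure for local development that the comment above mentions.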