google-api-dotnet-client: Dotnet - Unhandled exception. System.InvalidOperationException: The Application Default Credentials are not available - the environment variable GOOGLE_APPLICATION_CREDENTIALS must be defined pointing to a file defining the credentials

INTRO

  • Our workloads intermittently fail to fetch the token from the “metadata-server” in GKE. Sometimes the application fetches the token on the first attempt; other times it succeeds only after 2 or 3 restarts.
Unhandled exception. System.InvalidOperationException: The Application Default Credentials are not available. They are available if running in Google Compute Engine. Otherwise, the environment variable GOOGLE_APPLICATION_CREDENTIALS must be defined pointing to a file defining the credentials. See https://developers.google.com/accounts/docs/application-default-credentials for more information.
   at Google.Apis.Auth.OAuth2.DefaultCredentialProvider.CreateDefaultCredentialAsync()
   at Google.Api.Gax.Grpc.ChannelPool.CreateChannelCredentialsUncached()
   at Google.Api.Gax.TaskExtensions.WaitWithUnwrappedExceptions(Task task)
   at Google.Api.Gax.TaskExtensions.ResultWithUnwrappedExceptions[T](Task`1 task)
   at Google.Api.Gax.Grpc.ChannelPool.GetChannel(GrpcAdapter grpcAdapter, String endpoint, GrpcChannelOptions channelOptions)
   at Google.Api.Gax.Grpc.ClientBuilderBase`1.CreateCallInvoker()
   at Google.Cloud.Trace.V2.TraceServiceClientBuilder.BuildImpl()
   at Google.Cloud.Trace.V2.TraceServiceClientBuilder.Build()
   at Appspace.Common.Tracing.GoogleCloudTraceExporter.BuildTraceServiceClient(GoogleCloudTraceExporterOptions options)
   at Appspace.Common.Tracing.GoogleCloudTraceExporter..ctor(GoogleCloudTraceExporterOptions options, TraceServiceClient traceServiceClient)
   at Appspace.Common.Tracing.TracingConfigurationExtensions.ConfigureGoogleCloudTraceExporter(Func`2 getConfiguration, TracerProviderBuilder tracerProviderBuilder)
   at Appspace.Common.Tracing.TracingConfigurationExtensions.CreateTracerProvider(Func`2 getConfiguration, Action`1 configure)
   at Appspace.Common.Tracing.TracingConfigurationExtensions.CreateTracerProvider(IConfiguration configuration)
   at Appspace.Screenshot.Services.Extensions.ContainerExtensions.RegisterService(Container container, IConfiguration configuration, ILoggerFactory loggerFactory) in /data/bamboo-agent-4/xml-data/build-dir/ACM-SS101-JOB1/src/Appspace.Screenshot.Services/Extensions/ContainerExtensions.cs:line 44
   at Appspace.Screenshot.Services.Program.<>c__DisplayClass3_0.<Main>b__0(Container container) in /data/bamboo-agent-4/xml-data/build-dir/ACM-SS101-JOB1/src/Appspace.Screenshot.Services/Program.cs:line 51
   at Appspace.Common.Services.Factories.ServiceFactory.CreateContainer()
   at Appspace.Screenshot.Services.Program.Main(String[] args) in /data/bamboo-agent-4/xml-data/build-dir/ACM-SS101-JOB1/src/Appspace.Screenshot.Services/Program.cs:line 56
  • The issue happens when many pods (approximately 100 or more) need to start at the same time.
  • We opened a ticket with GCP Support and they suggested adding an init container to check the metadata-server before starting up the pods. Even with this workaround the issue persists. [3]
  • We built a very basic application with dotnet-client v1.51 and v1.61 and the issue persists.
  • We’ve tried a container that only runs curl in a loop, and we cannot reproduce the issue with it. We suspect that the cause might be something in the dotnet client.
  • After a meeting with the GKE specialist team, they suggested opening an issue here, because apparently, the issue is not about GKE.
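
For illustration, the kind of readiness check that GCP Support suggested could be sketched in C# roughly as follows. This is a minimal sketch, not the reporters' init container: the `MetadataProbe` class, its method and its parameters are hypothetical names, and it assumes the standard metadata server token endpoint and the `Metadata-Flavor: Google` request header. As noted above, a check of this kind did not eliminate the failures for the reporters.

```csharp
using System;
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical helper (not part of any Google library).
public static class MetadataProbe
{
    // Standard metadata server endpoint for the default service account token.
    public const string TokenUrl =
        "http://metadata.google.internal/computeMetadata/v1/instance/service-accounts/default/token";

    // Returns true as soon as the endpoint answers with a success status,
    // or false if every attempt fails.
    public static async Task<bool> WaitForMetadataServerAsync(
        HttpClient http, string url, int maxAttempts, TimeSpan delay,
        CancellationToken ct = default)
    {
        for (int attempt = 1; attempt <= maxAttempts; attempt++)
        {
            try
            {
                using var request = new HttpRequestMessage(HttpMethod.Get, url);
                // The metadata server rejects requests without this header.
                request.Headers.Add("Metadata-Flavor", "Google");
                using var response = await http.SendAsync(request, ct);
                if (response.IsSuccessStatusCode)
                {
                    return true;
                }
            }
            catch (HttpRequestException)
            {
                // Connection refused, DNS failure, etc.: try again.
            }
            catch (TaskCanceledException) when (!ct.IsCancellationRequested)
            {
                // Per-request timeout from HttpClient.Timeout: try again.
            }

            if (attempt < maxAttempts)
            {
                await Task.Delay(delay, ct);
            }
        }
        return false;
    }
}
```

A caller might run `await MetadataProbe.WaitForMetadataServerAsync(http, MetadataProbe.TokenUrl, 30, TimeSpan.FromSeconds(2))` before the first call that resolves Application Default Credentials.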

Environment details

  • Programming language: Project Sdk=“Microsoft.NET.Sdk”
  • OS: GKE 1.27.4-gke.900 - Container-Optimized OS with containerd (cos_containerd)
  • Language runtime version: netcoreapp3.1
  • Package version in the workloads (updated May 20 2022): <PackageReference Include="Google.Cloud.Trace.V2" Version="2.3.0" />
  • Package version in the sample application: <PackageReference Include="Google.Apis.Auth" Version="1.61.0" /> (also tested with 1.51)

Steps to reproduce

  1. Create a GKE cluster with WI.
  2. Follow the documentation [1] and [2] to Configure applications to use Workload Identity.
  3. Build/Push a dotnet application to fetch the token (Tested with Auth version 1.61 and 1.51)
  4. Deploy the k8s manifests: k8s-deployment-ac5.yaml, k8s-deployment-ac6.yaml and k8s-deployment-curl.yaml
  5. Check the previous logs (kubectl logs --previous) of the pods that restarted.

##### DOTNET version 1.61

zdotnet-ac6-6ff84f8d66-r6rgz                   1/1     Running            1 (7s ago)      87s
kubectl logs --previous zdotnet-ac6-6ff84f8d66-r6rgz
  Unhandled exception. System.AggregateException: One or more errors occurred. (Your default credentials were not found. To set up Application Default Credentials, see https://cloud.google.com/docs/authentication/external/set-up-adc.)
   ---> System.InvalidOperationException: Your default credentials were not found. To set up Application Default Credentials, see https://cloud.google.com/docs/authentication/external/set-up-adc.
     at Google.Apis.Auth.OAuth2.DefaultCredentialProvider.CreateDefaultCredentialAsync()
     --- End of inner exception stack trace ---
     at System.Threading.Tasks.Task`1.GetResultCore(Boolean waitCompletionNotification)
     at System.Threading.Tasks.Task`1.get_Result()
     at Google.Apis.Auth.OAuth2.GoogleCredential.GetApplicationDefault()
     at GkeAdcTestApp.Program.Main(String[] args)
     at GkeAdcTestApp.Program.<Main>(String[] args)

kubectl logs  zdotnet-ac6-6ff84f8d66-r6rgz 
  Access Token: XXXXXXXXXX
  Waiting for 10 seconds before fetching the token again...

##### DOTNET version 1.51

zdotnet-ac5-58b5964d49-tn9hh                   1/1     Running            2 (58s ago)     4m6s

kubectl logs --previous zdotnet-ac5-58b5964d49-tn9hh                            
  Unhandled exception. System.AggregateException: One or more errors occurred. (The Application Default Credentials are not available. They are available if running in Google Compute Engine. Otherwise, the environment variable GOOGLE_APPLICATION_CREDENTIALS must be defined pointing to a file defining the credentials. See https://developers.google.com/accounts/docs/application-default-credentials for more information.)
   ---> System.InvalidOperationException: The Application Default Credentials are not available. They are available if running in Google Compute Engine. Otherwise, the environment variable GOOGLE_APPLICATION_CREDENTIALS must be defined pointing to a file defining the credentials. See https://developers.google.com/accounts/docs/application-default-credentials for more information.
     at Google.Apis.Auth.OAuth2.DefaultCredentialProvider.CreateDefaultCredentialAsync()
     --- End of inner exception stack trace ---
     at System.Threading.Tasks.Task`1.GetResultCore(Boolean waitCompletionNotification)
     at System.Threading.Tasks.Task`1.get_Result()
     at Google.Apis.Auth.OAuth2.GoogleCredential.GetApplicationDefault()
     at GkeAdcTestApp.Program.Main(String[] args)
     at GkeAdcTestApp.Program.<Main>(String[] args)

kubectl logs -f zdotnet-ac5-58b5964d49-tn9hh                                        
   Access Token: XXXXXXX
    Waiting for 10 seconds before fetching the token again...

Could you help us? Thanks in advance

[1] https://cloud.google.com/kubernetes-engine/docs/how-to/workload-identity#verify_the_setup [2] https://cloud.google.com/kubernetes-engine/docs/how-to/workload-identity#authenticating_to [3] https://cloud.google.com/kubernetes-engine/docs/troubleshooting/troubleshooting-security#troubleshoot-timeout

About this issue

  • State: closed
  • Created 9 months ago
  • Comments: 17 (7 by maintainers)

Most upvoted comments

@marandalucas , @mduu I’ve merged #2577 that should mitigate your issues. I’ll release new versions of libraries containing this feature on Monday. What this means for you:

  • You should not see issues on startup as we don’t depend on the metadata server being up for identifying Compute credentials as ADC.
  • We still need the metadata server to be available when making API requests (through the libraries or not) because that’s how we obtain the auth token.
  • We have sturdier retries built into the libraries when making API requests. But if those are not enough you can add a stronger retry layer. You can always recover from a failed API request (as opposed to from not detecting ADC).
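
A "stronger retry layer" around API requests could look roughly like the sketch below. It is only an illustration under stated assumptions: `Retry.WithBackoffAsync` is a hypothetical helper, not something provided by the Google libraries, and the attempt count, initial delay and doubling backoff are arbitrary choices.

```csharp
using System;
using System.Threading.Tasks;

// Hypothetical helper (not part of any Google library).
public static class Retry
{
    // Runs the operation, retrying with exponential backoff while the
    // isTransient predicate accepts the exception and attempts remain.
    public static async Task<T> WithBackoffAsync<T>(
        Func<Task<T>> operation,
        Func<Exception, bool> isTransient,
        int maxAttempts = 5,
        TimeSpan? initialDelay = null)
    {
        TimeSpan delay = initialDelay ?? TimeSpan.FromSeconds(1);
        for (int attempt = 1; ; attempt++)
        {
            try
            {
                return await operation();
            }
            catch (Exception ex) when (attempt < maxAttempts && isTransient(ex))
            {
                // Wait, then double the delay before the next attempt.
                await Task.Delay(delay);
                delay = TimeSpan.FromTicks(delay.Ticks * 2);
            }
            // On the last attempt, or for non-transient errors, the filter is
            // false and the exception propagates to the caller.
        }
    }
}
```

The `isTransient` predicate is where the caller decides which failures are worth retrying, for example a gRPC `RpcException` whose status code is `Unavailable`.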

I’ll continue to work on improving the libraries’ resilience to metadata server instability by improving retries and possibly making those retries configurable by users. But this will take longer as we need to look at some aspects more closely. In the meantime, #2577 should make startup and scaling of your deployments much less troublesome.

I’ll close this issue now. Do leave new comments or create new issues if you still have trouble at pod startup, etc.

You don’t need to wait for Google.Cloud.Trace.V2 to be updated - you can just add an explicit dependency to Google.Apis.Auth 1.64.0.

(We don’t generally release all our libraries just because there have been improvements to underlying support libraries, unless it’s for a major version bump or a security issue.)

This code has an issue: there’s currently no recovery from a failed GoogleCredential.GetApplicationDefaultAsync, so the loop and the waits in the case of an InvalidOperationException have no effect. Either you obtain the credentials on the first try or you don’t obtain them at all, no matter how long you wait.

As a temporary workaround, and while we work on #2475, you can force your credentials as follows:

var credentials = GoogleCredential.FromComputeCredential();

Note that this won’t work with Fleet Workload Identity via the credential file, but it seems you cannot use that anyway because of the Retail API. It also won’t work when running outside Google Cloud environments, including local development.

Once you have this workaround in place, your application will initialize correctly, and any issues with the metadata server being available will be bubbled up during API calls, and from those you can recover by retrying.
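
To make the shape of this workaround concrete, here is a minimal sketch that forces the compute credential and then retries the token fetch, which is the step that actually needs the metadata server. The `ForcedComputeCredentialSample` class, the fixed delay and the catch-all exception handling are assumptions for illustration; the exact exception types thrown when the metadata server is unavailable are not confirmed here.

```csharp
using System;
using System.Threading.Tasks;
using Google.Apis.Auth.OAuth2;

// Hypothetical sample class; only the Google.Apis.Auth types are real.
public static class ForcedComputeCredentialSample
{
    public static async Task<string> GetTokenWithRetryAsync(int maxAttempts, TimeSpan delay)
    {
        // Forcing the compute credential skips ADC detection at startup.
        GoogleCredential credential = GoogleCredential.FromComputeCredential();
        ITokenAccess tokenAccess = credential;

        for (int attempt = 1; ; attempt++)
        {
            try
            {
                // This call reaches the metadata server and may fail while
                // the server is not yet available.
                return await tokenAccess.GetAccessTokenForRequestAsync();
            }
            catch (Exception) when (attempt < maxAttempts)
            {
                // Assumption: any failure here is treated as transient.
                // Narrow the catch once the real exception types are known.
                await Task.Delay(delay);
            }
        }
    }
}
```

Unlike a loop around GoogleCredential.GetApplicationDefaultAsync, each iteration here issues a fresh request, so waiting can actually help.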

I’m working on #2475 and I’ll try to get a minimal palliative solution there asap, before ironing out details for better retry configuration, etc.

Is there a difference in Google’s .Net SDK vs. Google’s Python SDK regarding retry, timeout and communication to the metadata-server? Shouldn’t the SDK behave the same in every supported language?

Yes, there are differences across SDKs. Some of that is historical, some is because of idiomaticity concerns, and some is because feature implementation is prioritized differently across languages.

Is it really only a client issue, or is there a problem with GKE and/or metadata-server in general?

The only client-side issue that we have been able to verify, and we’ve been looking hard, is that we need to improve client-side resilience to metadata server instability. Metadata server instability, in particular for GKE and WIF at pod startup, has been documented.

We are currently working on making the .NET Auth library (which is the relevant part of the Google’s .NET SDK for this issue) more resilient to metadata server instability. Follow this issue and mostly #2475 for that.

Note that Google Retail .Net SDK can only use ADC. Others like Datastore, BigQuery, PubSub can use explicitly mapped workload fleet credential files …

I’d like to note that there’s nothing client side that would have this effect, i.e. that would cause Fleet Workload Identity to work with several APIs (Datastore, BigQuery, etc.) but not with others (Retail). This is very likely to be a Retail-related issue, so it’s probably best to follow up on Retail’s API support channels.

… which relaxes the issue somehow.

Yes, in reality you are using ADC in both instances. The difference is that for Fleet Workload Identity, ADC is configured by having a well-known environment variable point to a credential file, whereas if you can’t do that for the Retail API, ADC relies on the metadata server being available. And it being unstable ultimately means that ADC is not available.
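
A small startup diagnostic can show which of these two ADC paths a pod is likely to take. The sketch below is a simplification under stated assumptions: `AdcDiagnostics` and its return values are hypothetical, and it only inspects the GOOGLE_APPLICATION_CREDENTIALS environment variable, ignoring the other locations ADC may check (such as a gcloud well-known file).

```csharp
using System;
using System.IO;

// Hypothetical diagnostic helper (not part of any Google library).
public static class AdcDiagnostics
{
    public const string CredentialsEnvVar = "GOOGLE_APPLICATION_CREDENTIALS";

    // Returns a rough label for where ADC would likely come from.
    public static string DescribeAdcSource()
    {
        string path = Environment.GetEnvironmentVariable(CredentialsEnvVar);
        if (string.IsNullOrEmpty(path))
        {
            // No credential file configured: in a GKE pod, ADC would then
            // depend on the metadata server being reachable.
            return "metadata-server";
        }
        // A credential file is configured; report whether it actually exists.
        return File.Exists(path) ? "credential-file" : "credential-file-missing";
    }
}
```

Logging this value at startup may help tell apart pods that depend on the metadata server from those using a mapped credential file.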

Our tests showed that the metadata-server is not reachable up to 10 minutes or so.

This seems extreme even when factoring in all known issues around metadata server instability. Have you run these tests just with the .NET SDK or also with the Python one (given that you mentioned the Python SDK)? Are the conclusions the same regarding metadata server unavailability with both SDKs? Can you please describe your tests? Providing the minimal but complete .NET (and Python if you have it) code for reproducing would be best.

Same issue on our clusters with .Net workloads using the Google SDKs. It has been getting worse over the last couple of weeks/releases. It also happens when deploying new versions (helm install) as well as when HPA scales out. This means all workload startups, for whatever reason, are affected by this bug, and it can happen even if there is just one pod/instance to be started.

Note that Google Retail .Net SDK can only use ADC. Others like Datastore, BigQuery, PubSub can use explicitly mapped workload fleet credential files which relaxes the issue somehow.

Our tests showed that the metadata-server is not reachable up to 10 minutes or so.

Questions regarding this issue:

  • Is there a difference in Google’s .Net SDK vs. Google’s Python SDK regarding retry, timeout and communication to the metadata-server? Shouldn’t the SDK behave the same in every supported language?
  • Is it really only a client issue, or is there a problem with GKE and/or metadata-server in general?

Stack-Trace we see:

Unhandled exception. System.InvalidOperationException: The Application Default Credentials are not available. They are available if running in Google Compute Engine. Otherwise, the environment variable GOOGLE_APPLICATION_CREDENTIALS must be defined pointing to a file defining the credentials. See https://developers.google.com/accounts/docs/application-default-credentials for more information.
   at Google.Apis.Auth.OAuth2.DefaultCredentialProvider.CreateDefaultCredentialAsync()
   at Google.Api.Gax.Grpc.DefaultChannelCredentialsCache.<.ctor>b__5_1()
   at Google.Api.Gax.TaskExtensions.WaitWithUnwrappedExceptions(Task task)
   at Google.Api.Gax.TaskExtensions.ResultWithUnwrappedExceptions[T](Task`1 task)
   at Google.Api.Gax.Grpc.DefaultChannelCredentialsCache.GetCredentials()
   at Google.Api.Gax.Grpc.ChannelPool.GetChannel(GrpcAdapter grpcAdapter, String endpoint, GrpcChannelOptions channelOptions)
   at Google.Api.Gax.Grpc.ClientBuilderBase`1.CreateCallInvoker()
   at Google.Cloud.SecretManager.V1.SecretManagerServiceClientBuilder.BuildImpl()
   at Google.Cloud.SecretManager.V1.SecretManagerServiceClientBuilder.Build()
   at Google.Cloud.SecretManager.V1.SecretManagerServiceClient.Create()
   at OurLibrary.Extensions.Configuration.GoogleSecretManager.SecretManagerConfigurationProvider.Load()
   at Microsoft.Extensions.Configuration.ConfigurationManager.AddSource(IConfigurationSource source)
   at Microsoft.Extensions.Configuration.ConfigurationManager.ConfigurationSources.Add(IConfigurationSource source)
   at Microsoft.Extensions.Configuration.ConfigurationManager.Microsoft.Extensions.Configuration.IConfigurationBuilder.Add(IConfigurationSource source)
   at OurLibrary.Extensions.Configuration.GoogleSecretManager.ConfigurationBuilderExtensions.AddSecretManager(IConfigurationBuilder configurationBuilder, String gcpProjectId, IDictionary`2 secretMappings)
   at Program.<Main>$(String[] args) in /app/OurApp.Hephaestus/Program.cs:line 28
   at Program.<Main>(String[] args)

Assigned to Amanda who has worked on this a lot. (We were literally just talking about this on a call.)

I believe that creating a ComputeCredential explicitly and using that will get past the configuration part - but then as soon as any actual RPCs are made, they will fail if the code still can’t contact the metadata server. Amanda may well have better suggestions.