kubelogin: Azure CLI Login Method Does Not Work Properly with AAD Enabled AKS Clusters and Multiple Deployment Service Principals

Overview

We are deploying applications to Azure Active Directory (AAD) enabled clusters and are required to use kubelogin. Since these are automated deployment pipelines, we cannot use interactive logins. We deploy using Azure service principals, so we need to use the Azure CLI login method in kubelogin.

The Azure CLI login method does not work properly with AAD enabled AKS clusters when more than one service principal is used by Azure DevOps deployment pipelines on the same build agent. Each pipeline produces a token cache file with the same name even though the pipelines use different service principals. This results in helm and kubectl commands failing with authorization errors.

The problem surfaced when some deployment pipelines were using kubelogin v0.0.14 while others were still using v0.0.10.

Scenario

  1. There are two different Azure Active Directory (AAD) enabled AKS clusters.
  2. Each AAD enabled cluster has its own service principal to deploy applications to the cluster. This is required since the clusters are for different applications and permissions must be assigned appropriately.
  3. For Azure DevOps deployment pipelines, the cluster configuration file is retrieved using the az aks get-credentials Azure CLI command. The configuration then needs to be converted to use Azure CLI logins (non-interactive) as documented here (a pipeline sketch follows below).
  4. An Azure DevOps deployment pipeline is executed for cluster 1 on a given build agent. It runs using that cluster's deployment service principal, SP 1. This results in a token cache file in ~/.kube/cache/kubelogin with a file name like AzurePublicCloud-6dae42f8-4368-4678-94ff-3960e28e3630--.json. NOTE that the last two tokens in the name are BLANK.
  5. An Azure DevOps deployment pipeline is executed for cluster 2 on the same build agent soon after the first pipeline completes. It runs using its own deployment service principal, SP 2. This results in a token cache file in ~/.kube/cache/kubelogin with the same file name as the one generated for the first pipeline.

This causes pipeline 2 to fail with authorization errors since the access token in the cache file is for the wrong service principal.
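
To make steps 3 and 4 concrete, the login step of such a pipeline looks roughly like the sketch below. This is not our actual pipeline definition; the service connection, resource group, and cluster names are placeholders.

steps:
- task: AzureCLI@2
  displayName: Get and convert kubeconfig for cluster 1
  inputs:
    azureSubscription: aks-cluster1-deploy   # placeholder service connection backed by SP 1
    scriptType: pscore
    scriptLocation: inlineScript
    inlineScript: |
      az aks get-credentials --resource-group my-cluster1-rg --name my-cluster1 --overwrite-existing
      # Convert the kubeconfig to the non-interactive Azure CLI login.
      kubelogin convert-kubeconfig -l azurecli
      # Later helm/kubectl steps now cache tokens in ~/.kube/cache/kubelogin as
      # AzurePublicCloud-<server-id>--.json (clientID and tenantID are blank).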

Source and cluster config info

Source code to generate token file

The token cache file name is generated here:

cacheFile := getCacheFileName(o.Environment, o.ServerID, o.ClientID, o.TenantID, o.IsLegacy)

The name is formed as follows:

func getCacheFileName(environment, serverID, clientID, tenantID string, legacy bool) string {
	// format: ${environment}-${server-id}-${client-id}-${tenant-id}[_legacy].json
	cacheFileNameFormat := "%s-%s-%s-%s.json"
	if legacy {
		cacheFileNameFormat = "%s-%s-%s-%s_legacy.json"
	}
	return fmt.Sprintf(cacheFileNameFormat, environment, serverID, clientID, tenantID)
}

The last two tokens in the generated file name are blank, so the ClientID and TenantID must not be set. The user information in the original cluster configuration is:

users:
- name: clusterUser_RG_KUB11614NP01_AZAMERES01KUB11614NP01
  user:
    auth-provider:
      config:
        apiserver-id: 6dae42f8-4368-4678-94ff-3960e28e3630
        client-id: 80faf920-1908-4b52-b5ef-a8e7bedfc67a
        config-mode: '1'
        environment: AzurePublicCloud
        tenant-id: ca56a4a5-e300-406a-98ff-7e36a0baac5b
      name: azure

After kubelogin convert-kubeconfig -l azurecli, it becomes the following (Azure DevOps masks the server-id value as ***):

- name: clusterUser_RG_KUB11614NP01_AZAMERES01KUB11614NP01
  user:
    exec:
      apiVersion: client.authentication.k8s.io/v1beta1
      args:
      - get-token
      - --server-id
      - ***
      - --login
      - azurecli
      command: kubelogin
      env: null
      provideClusterInfo: false

This would explain why the clientID and tenantID are blank in the generated token cache file name: neither the clientID nor the tenantID is passed to kubelogin as an argument.

Changes Between kubelogin v0.0.10 and v0.0.14

It doesn’t look like the code that generates the token cache file name has changed between v0.0.10 and v0.0.14, so I’m not sure how this ever worked for us. Perhaps we were lucky and builds ran on different build agents, so the name collision wasn’t an issue. More likely the token expired and a new one was created (and the cache file updated), so the issue was hidden.

However, the contents of the file DID change.

In v0.0.10, the resource property in the token file is blank:

"resource":""

While in v0.0.14, it has a value:

"resource":"6dae42f8-4368-4678-94ff-3960e28e3630"

It looks like v0.0.14 is putting the server ID (the same GUID as the apiserver-id/--server-id above, not the service principal’s object id) in the resource property. That still won’t work if two different pipelines using different service principals share the same kubelogin token cache file!

We need to update to the latest version of kubelogin, especially since Kubernetes 1.24 requires the use of kubelogin with AAD enabled clusters, and v0.0.14 is needed to handle the new cluster configuration file generated starting in Kubernetes 1.24. We cannot stay on older versions of kubelogin (nor should we, so that we keep up with bug fixes and security fixes).

Possible Workaround?

The kubelogin docs do not mention this (although kubelogin convert-kubeconfig --help does), but the kubelogin executable allows you to specify --client-id <client ID>. In my test YAML pipeline, convert-kubeconfig -l azurecli runs in an Azure CLI task. I made the following changes:

  1. Added addSpnToEnvironment: true to the task properties.
  2. Changed the command to: kubelogin convert-kubeconfig -l azurecli --client-id $env:servicePrincipalId (sketched below)
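
Put together, the Azure CLI task for the workaround looks roughly like this (a sketch only; the service connection name is a placeholder, and addSpnToEnvironment: true is what exposes $env:servicePrincipalId to the inline script):

- task: AzureCLI@2
  displayName: Convert kubeconfig to Azure CLI login with explicit client id
  inputs:
    azureSubscription: aks-cluster1-deploy   # placeholder service connection
    scriptType: pscore
    scriptLocation: inlineScript
    addSpnToEnvironment: true                # exposes $env:servicePrincipalId
    inlineScript: |
      kubelogin convert-kubeconfig -l azurecli --client-id $env:servicePrincipalId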

This results in the following in the converted cluster config file (Azure DevOps masks the service principal id as *** in log output):

users:
- name: clusterUser_RG_KUB11614NP01_AZAMERES01KUB11614NP01
  user:
    exec:
      apiVersion: client.authentication.k8s.io/v1beta1
      args:
      - get-token
      - --server-id
      - 6dae42f8-4368-4678-94ff-3960e28e3630
      - --client-id
      - ***
      - --login
      - azurecli
      command: kubelogin
      env: null
      provideClusterInfo: false

And I get file names like AzurePublicCloud-6dae42f8-4368-4678-94ff-3960e28e3630-***-.json (*** is again the masked service principal id).

So the clientID value is now set in the file name.

Kubelogin also supports -t <tenant ID> but that did not get written to the converted cluster config file.

Is this an appropriate solution to this issue?

kubelogin v0.0.14 Does Not Honor the --client-id Flag with Kubernetes 1.24 Cluster Config Files

With Kubernetes 1.24, kubelogin v0.0.14 is also broken. The cluster config file you get from az aks get-credentials is:

users:
- name: clusterUser_RG_KUB11614NP01_AZAMERES01KUB11614NP01
  user:
    exec:
      apiVersion: client.authentication.k8s.io/v1beta1
      args:
      - get-token
      - --environment
      - AzurePublicCloud
      - --server-id
      - 6dae42f8-4368-4678-94ff-3960e28e3630
      - --client-id
      - 80faf920-1908-4b52-b5ef-a8e7bedfc67a
      - --tenant-id
      - ca56a4a5-e300-406a-98ff-7e36a0baac5b
      - --login
      - devicecode
      command: kubelogin
      env: null

If you convert it to use the Azure CLI login with our earlier workaround, kubelogin convert-kubeconfig -l azurecli --client-id $env:servicePrincipalId, the converted file no longer contains the --client-id argument:

- name: clusterUser_RG_KUB11614NP01_AZAMERES01KUB11614NP01
  user:
    exec:
      apiVersion: client.authentication.k8s.io/v1beta1
      args:
      - get-token
      - --server-id
      - 6dae42f8-4368-4678-94ff-3960e28e3630
      - --login
      - azurecli
      command: kubelogin
      env: null
      provideClusterInfo: false

AND YOU GET DUPLICATE FILE NAMES AGAIN!
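
Until that is fixed, a pipeline can at least fail fast when the conversion silently drops the flag. A minimal sketch, assuming the default ~/.kube/config path and a Linux build agent:

- task: Bash@3
  displayName: Verify converted kubeconfig still carries --client-id
  inputs:
    targetType: inline
    script: |
      # Fail this step if kubelogin convert-kubeconfig dropped the --client-id argument.
      if ! grep -q -- '--client-id' "$HOME/.kube/config"; then
        echo "##vso[task.logissue type=error]--client-id is missing from the converted kubeconfig"
        exit 1
      fi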

Conclusion

The kubelogin code needs to generate unique token cache file names when using the Azure CLI login method. It cannot assume that the same service principal is always being used.

Kevin Kizer, Lead Engineer, MetLife

About this issue

  • State: closed
  • Created 2 years ago
  • Comments: 19 (9 by maintainers)

Most upvoted comments

The fix does exactly what we need.

In my Azure DevOps pipeline I get the cluster configuration using az aks get-credentials and convert it using: kubelogin convert-kubeconfig -l azurecli --token-cache-dir $(Agent.TempDirectory)

and the cluster configuration file contains:

- name: clusterUser_RG_KUB11614NP01_AZAMERES01KUB11614NP01
  user:
    exec:
      apiVersion: client.authentication.k8s.io/v1beta1
      args:
      - get-token
      - --login
      - azurecli
      - --server-id
      - 6dae42f8-4368-4678-94ff-3960e28e3630
      - --token-cache-dir
      - /AzDOAgents/Agent1/_work/_temp
      command: kubelogin
      env: null
      provideClusterInfo: false

It now has the --token-cache-dir argument!

We still need to test the change with multiple builds on the same build machine where each build uses a different service principal. I don’t expect any issues since each build would have its own agent temp directory.
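
For reference, a rough sketch of that step as an Azure CLI task (the service connection, resource group, and cluster names are placeholders):

- task: AzureCLI@2
  displayName: Get and convert kubeconfig with a per-agent token cache dir
  inputs:
    azureSubscription: aks-cluster1-deploy   # placeholder service connection
    scriptType: pscore
    scriptLocation: inlineScript
    inlineScript: |
      az aks get-credentials --resource-group my-cluster1-rg --name my-cluster1 --overwrite-existing
      # Each run caches tokens under its own agent temp directory, so cache files
      # written by different service principals no longer collide.
      kubelogin convert-kubeconfig -l azurecli --token-cache-dir $(Agent.TempDirectory)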