application-gateway-kubernetes-ingress: AGIC 1.7.0 running into segmentation fault when using workload identity
Describe the bug
k8s version: 1.25.6
AGIC version: 1.7.0
I should mention that we upgraded k8s from 1.25.4, although I don't really believe this is related.
We had AGIC running with workload identity before; however, it now runs into a segmentation fault shortly after startup. UPDATE: we identified why it was working before, see below in the repro steps.
I0414 07:44:24.141494 1 utils.go:114] Using verbosity level 3 from environment variable APPGW_VERBOSITY_LEVEL
I0414 07:44:24.176221 1 supported_apiversion.go:70] server version is: 1.25.6
I0414 07:44:24.187027 1 environment.go:294] KUBERNETES_WATCHNAMESPACE is not set. Watching all available namespaces.
I0414 07:44:24.187049 1 main.go:118] Using User Agent Suffix='***' when communicating with ARM
I0414 07:44:24.187126 1 main.go:137] Application Gateway Details: Subscription="***" Resource Group="***" Name="****"
I0414 07:44:24.187137 1 auth.go:58] Creating authorizer using Default Azure Credentials
I0414 07:44:24.187145 1 httpserver.go:57] Starting API Server on :8123
I0414 07:44:24.423737 1 main.go:184] Ingress Controller will observe all namespaces.
I0414 07:44:24.486990 1 context.go:171] k8s context run started
I0414 07:44:24.487037 1 context.go:238] Waiting for initial cache sync
I0414 07:44:24.487117 1 reflector.go:219] Starting reflector *v1.Pod (30s) from pkg/mod/k8s.io/client-go@v0.20.0-beta.1/tools/cache/reflector.go:167
I0414 07:44:24.487413 1 reflector.go:219] Starting reflector *v1.Endpoints (30s) from pkg/mod/k8s.io/client-go@v0.20.0-beta.1/tools/cache/reflector.go:167
I0414 07:44:24.487427 1 reflector.go:255] Listing and watching *v1.Endpoints from pkg/mod/k8s.io/client-go@v0.20.0-beta.1/tools/cache/reflector.go:167
I0414 07:44:24.487431 1 reflector.go:255] Listing and watching *v1.Pod from pkg/mod/k8s.io/client-go@v0.20.0-beta.1/tools/cache/reflector.go:167
I0414 07:44:24.487119 1 reflector.go:219] Starting reflector *v1.Secret (30s) from pkg/mod/k8s.io/client-go@v0.20.0-beta.1/tools/cache/reflector.go:167
I0414 07:44:24.487728 1 reflector.go:255] Listing and watching *v1.Secret from pkg/mod/k8s.io/client-go@v0.20.0-beta.1/tools/cache/reflector.go:167
I0414 07:44:24.488102 1 reflector.go:219] Starting reflector *v1.Ingress (30s) from pkg/mod/k8s.io/client-go@v0.20.0-beta.1/tools/cache/reflector.go:167
I0414 07:44:24.488112 1 reflector.go:255] Listing and watching *v1.Ingress from pkg/mod/k8s.io/client-go@v0.20.0-beta.1/tools/cache/reflector.go:167
I0414 07:44:24.488247 1 reflector.go:219] Starting reflector *v1beta1.AzureApplicationGatewayRewrite (30s) from pkg/mod/k8s.io/client-go@v0.20.0-beta.1/tools/cache/reflector.go:167
I0414 07:44:24.488264 1 reflector.go:255] Listing and watching *v1beta1.AzureApplicationGatewayRewrite from pkg/mod/k8s.io/client-go@v0.20.0-beta.1/tools/cache/reflector.go:167
I0414 07:44:24.488461 1 reflector.go:219] Starting reflector *v1.IngressClass (30s) from pkg/mod/k8s.io/client-go@v0.20.0-beta.1/tools/cache/reflector.go:167
I0414 07:44:24.488474 1 reflector.go:255] Listing and watching *v1.IngressClass from pkg/mod/k8s.io/client-go@v0.20.0-beta.1/tools/cache/reflector.go:167
I0414 07:44:24.487136 1 reflector.go:219] Starting reflector *v1.Service (30s) from pkg/mod/k8s.io/client-go@v0.20.0-beta.1/tools/cache/reflector.go:167
I0414 07:44:24.488982 1 reflector.go:255] Listing and watching *v1.Service from pkg/mod/k8s.io/client-go@v0.20.0-beta.1/tools/cache/reflector.go:167
I0414 07:44:24.587595 1 context.go:251] Initial cache sync done
I0414 07:44:24.587633 1 context.go:252] k8s context run finished
I0414 07:44:24.587758 1 worker.go:39] Worker started
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x1375dcf]
goroutine 166 [running]:
github.com/Azure/application-gateway-kubernetes-ingress/pkg/appgw.(*appGwConfigBuilder).newListener(0xc000842ea0, 0x0?, {0x50, {{0x0, 0x0}, {0x0, 0x0}, {0x0, 0x0}, {0x0, ...}, ...}, ...}, ...)
/azure/pkg/appgw/frontend_listeners.go:155 +0x6f
github.com/Azure/application-gateway-kubernetes-ingress/pkg/appgw.(*appGwConfigBuilder).getListeners(0xc000842ea0, 0xc0006de200)
/azure/pkg/appgw/frontend_listeners.go:39 +0x2f3
github.com/Azure/application-gateway-kubernetes-ingress/pkg/appgw.(*appGwConfigBuilder).Listeners(0xc000842ea0, 0xc0006de200?)
/azure/pkg/appgw/http_listeners.go:11 +0x58
github.com/Azure/application-gateway-kubernetes-ingress/pkg/appgw.(*appGwConfigBuilder).Build(0xc000842ea0, 0x32b2?)
/azure/pkg/appgw/configbuilder.go:119 +0x338
github.com/Azure/application-gateway-kubernetes-ingress/pkg/controller.AppGwIngressController.MutateAppGateway({{0x194b4e0, 0xc00014a000}, {{0xc00004a156, 0x24}, {0xc0000460d5, 0x10}, {0xc00004c00b, 0x14}}, 0xc0002db9b0, 0xc000528180, ...}, ...)
/azure/pkg/controller/mutate_app_gateway.go:128 +0x7b3
github.com/Azure/application-gateway-kubernetes-ingress/pkg/controller.(*AppGwIngressController).ProcessEvent(0xc0001a35e0, {0xc00065af20?, {0x16d5d40?, 0xc0007a4140?}})
/azure/pkg/controller/controller.go:134 +0x32c
github.com/Azure/application-gateway-kubernetes-ingress/pkg/worker.(*Worker).Run(0xc0001b03a0, 0xc0002ed380, 0xc000306de0)
/azure/pkg/worker/worker.go:61 +0x405
created by github.com/Azure/application-gateway-kubernetes-ingress/pkg/controller.(*AppGwIngressController).Start
/azure/pkg/controller/controller.go:83 +0x205
To Reproduce
Steps to reproduce the behavior: start AGIC 1.7.0 with workload identity.
UPDATE: we reproduced an older scenario in which the Application Gateway had first been configured by AGIC 1.6.0 and we then rolled the upgrade to AGIC 1.7.0; that is working now (not sure for how long, though). This suggests 1.7.0 has problems running against an empty Application Gateway.
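For reference, a rough sketch of the upgrade path that worked, using the official AGIC Helm chart; the release name, namespace, and values file below are placeholders, not the actual configuration from this setup:

```sh
# Sketch only: let 1.6.0 populate the (initially empty) Application Gateway first,
# then roll the image forward to 1.7.0 by overriding image.tag.
helm repo add application-gateway-kubernetes-ingress \
  https://appgwingress.blob.core.windows.net/ingress-azure-helm-package/
helm repo update

# 1) Install with the 1.6.0 image so the gateway receives an initial configuration.
helm upgrade --install agic application-gateway-kubernetes-ingress/ingress-azure \
  --namespace agic -f values.yaml --set image.tag=1.6.0

# 2) Once the gateway is no longer empty, bump only the image to 1.7.0.
helm upgrade agic application-gateway-kubernetes-ingress/ingress-azure \
  --namespace agic -f values.yaml --set image.tag=1.7.0
```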
Ingress Controller details
- Output of `kubectl describe pod <ingress controller>`. The <ingress controller> pod name can be obtained by running `helm list`.
Name: agic-7858bff8cf-tjffq
Namespace: agic
Priority: 0
Node: aks-system01-84170117-vmss000004/172.24.0.5
Start Time: Fri, 14 Apr 2023 09:44:02 +0200
Labels: app=agic
azure.workload.identity/use=true
pod-template-hash=7858bff8cf
release=agic
Annotations: checksum/config: dab60ff6dc06a214e2ce0d2cf9f30176a9e2b0af8bad56750f90242c63fbc99d
cni.projectcalico.org/containerID: ce203a623741703aa925b11d4450498559fe6d34857d28f3c551d9c5f9c7afe1
cni.projectcalico.org/podIP: 100.127.0.5/32
cni.projectcalico.org/podIPs: 100.127.0.5/32
prometheus.io/port: 8123
prometheus.io/scrape: true
Status: Running
IP: 100.127.0.5
IPs:
IP: 100.127.0.5
Controlled By: ReplicaSet/agic-7858bff8cf
Containers:
agic:
Container ID: containerd://b158456ea3e24c9cacf50e0ffb194843a3c0f45e7f5f28ba72f1952019ac87db
Image: acr.azurecr.io/mirror/mcr.microsoft.com/azure-application-gateway/kubernetes-ingress:1.7.0
Image ID: acr.azurecr.io/mirror/mcr.microsoft.com/azure-application-gateway/kubernetes-ingress@sha256:2a0b42820413811e9294f38f66ccb8cdc32d829263a620231bd5977a1a464888
Port: <none>
Host Port: <none>
State: Waiting
Reason: CrashLoopBackOff
Last State: Terminated
Reason: Error
Exit Code: 2
Started: Fri, 14 Apr 2023 10:00:12 +0200
Finished: Fri, 14 Apr 2023 10:00:13 +0200
Ready: False
Restart Count: 8
Liveness: http-get http://:8123/health/alive delay=15s timeout=1s period=20s #success=1 #failure=3
Readiness: http-get http://:8123/health/ready delay=5s timeout=1s period=10s #success=1 #failure=3
Environment Variables from:
agic ConfigMap Optional: false
Environment:
AZURE_CLOUD_PROVIDER_LOCATION: /etc/appgw/azure.json
AGIC_POD_NAME: agic-7858bff8cf-tjffq (v1:metadata.name)
AGIC_POD_NAMESPACE: agic (v1:metadata.namespace)
AZURE_CLIENT_ID: ***
AZURE_TENANT_ID: ***
AZURE_FEDERATED_TOKEN_FILE: /var/run/secrets/azure/tokens/azure-identity-token
AZURE_AUTHORITY_HOST: https://login.microsoftonline.com/
Mounts:
/etc/appgw/ from azure (ro)
/var/run/secrets/azure/tokens from azure-identity-token (ro)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-s584n (ro)
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
azure:
Type: HostPath (bare host directory volume)
Path: /etc/kubernetes/
HostPathType: Directory
kube-api-access-s584n:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
azure-identity-token:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3600
QoS Class: BestEffort
Node-Selectors: kubernetes.azure.com/mode=system
Tolerations: CriticalAddonsOnly op=Exists
node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 20m default-scheduler Successfully assigned agic/agic-7858bff8cf-tjffq to aks-system01-84170117-vmss000004
Normal Pulled 20m kubelet Successfully pulled image "acr.azurecr.io/mirror/mcr.microsoft.com/azure-application-gateway/kubernetes-ingress:1.7.0" in 162.963226ms (162.971926ms including waiting)
Normal Pulled 20m kubelet Successfully pulled image "acr.azurecr.io/mirror/mcr.microsoft.com/azure-application-gateway/kubernetes-ingress:1.7.0" in 148.280113ms (148.287113ms including waiting)
Normal Pulled 19m kubelet Successfully pulled image "acr.azurecr.io/mirror/mcr.microsoft.com/azure-application-gateway/kubernetes-ingress:1.7.0" in 800.836019ms (800.844419ms including waiting)
Normal Pulling 19m (x4 over 20m) kubelet Pulling image "ritappacr.azurecr.io/mirror/mcr.microsoft.com/azure-application-gateway/kubernetes-ingress:1.7.0"
Normal Created 19m (x4 over 20m) kubelet Created container agic
Normal Started 19m (x4 over 20m) kubelet Started container agic
Normal Pulled 19m kubelet Successfully pulled image "ritappacr.azurecr.io/mirror/mcr.microsoft.com/azure-application-gateway/kubernetes-ingress:1.7.0" in 1.101317507s (1.101328307s including waiting)
Warning BackOff 6s (x105 over 20m) kubelet Back-off restarting failed container
- Logs: see above.
- Any Azure support tickets associated with this issue: none.
About this issue
- State: closed
- Created a year ago
- Reactions: 1
- Comments: 15
Sorry, I haven't had this on my radar anymore… there was a new release three weeks ago which should fix all the issues related to the fix having been shipped as a new image under the same old tag… If there are any further issues, I suggest opening a new dedicated issue, because this one has been fixed.
When did they do that? I deployed 1.7.0 earlier today and the error was still present. I then created my own image from the commit in #1538 and it worked…
@cloebig the helm chart is still on 1.6.0, but you can bump the image to, for example, 1.7.0 by setting the image.tag value (you can see here that it is configurable in the values file).
They have overridden the old 1.7.0 tag… digests changed… re-pulling the image will pick it up. It's not best practice, but that is what happened ¯\_(ツ)_/¯
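Not from the thread, but a rough way to check whether a cluster is still running the pre-fix 1.7.0 image, since the republished image only differs by digest. The namespace, label, and deployment name below are taken from the describe output above; adjust them (and the registry, if you pull through a mirror) to your setup:

```sh
# Digest the AGIC pod is actually running.
kubectl -n agic get pod -l app=agic \
  -o jsonpath='{.items[0].status.containerStatuses[0].imageID}{"\n"}'

# Digest the 1.7.0 tag currently resolves to upstream; `docker pull` prints it.
docker pull mcr.microsoft.com/azure-application-gateway/kubernetes-ingress:1.7.0

# If the digests differ, restart the deployment so the tag is pulled again
# (this only re-pulls automatically with imagePullPolicy: Always; otherwise the
# cached image may need to be removed from the node, and a private mirror may
# need to be re-synced first).
kubectl -n agic rollout restart deployment agic
```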
The fix isn't present in any release yet, though. Is there an ETA on a 1.7.1 or 1.8.0 release that will include it?
We seem to have the same issue. When we use version 1.6 with service principal, the deployment works. Then when we upgrade to version 1.7 with managed identity, it also works. But when we use version 1.7 from scratch, we have the same error.