application-gateway-kubernetes-ingress: AGIC Ingress controller: Liveness probe failed

Describe the bug After installing a new release of application-gateway-kubernetes-ingress/ingress-azure --version 1.0.0, the created pod gets stuck with "Readiness probe failed" and "new-pod-ingress-azure" stays at 0/1 pods running. Running "kubectl describe pod" on that pod shows:

Warning  Unhealthy  13s (x10 over 113s)  kubelet, aks-agentpool-76661633-1  Readiness probe failed: Get http://10.2.0.15:8123/health/ready: dial tcp 10.2.0.15:8123: connect: connection refused
Warning  Unhealthy  7s (x6 over 107s)    kubelet, aks-agentpool-76661633-1  Liveness probe failed: Get http://10.2.0.15:8123/health/alive: dial tcp 10.2.0.15:8123: connect: connection refused
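A quick way to confirm that nothing is listening on the probe port (a sketch; the pod name is a placeholder, and the port-forward itself may fail while the container is restarting):

# Sketch: hit AGIC's health endpoint directly through a port-forward
kubectl port-forward pod/<agic-pod-name> 8123:8123
curl -v http://localhost:8123/health/ready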

To Reproduce We have already created 5 or 6 clusters using this procedure, and this is the first time we have hit this issue. Steps: 1- kubectl create -f https://raw.githubusercontent.com/Azure/aad-pod-identity/master/deploy/infra/deployment-rbac.yaml

2- Create an Azure identity in the RG where the nodes are running (MC_xxxxxx), and add role assignments for it: Reader on the RG and Contributor on the WAF (Application Gateway)

3- kubectl create -f aadpodidentity.yml

4- helm install -f dev-chart.yml application-gateway-kubernetes-ingress/ingress-azure --version 1.0.0

5- Create the identity binding. YAML values: AzureIdentity is the name of the identity created in step 3; selector is [nameofstep4]-ingress-azure (a sketch of steps 2 and 5 follows below)
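For reference, a minimal sketch of steps 2 and 5 under stated assumptions: the identity name agic-identity, the scope IDs, and the binding name are placeholders introduced here, and the capitalized spec fields (AzureIdentity, Selector) follow the aad-pod-identity v1.x CRDs of that time.

# Step 2 (sketch): create the identity in the node RG and grant the roles described above
az identity create -g MC_xxxxxx -n agic-identity
az role assignment create --role Reader --assignee <identity-client-id> --scope <MC-resource-group-id>
az role assignment create --role Contributor --assignee <identity-client-id> --scope <app-gateway-resource-id>

# Step 5 (sketch): AzureIdentityBinding tying the identity to the AGIC pods
apiVersion: "aadpodidentity.k8s.io/v1"
kind: AzureIdentityBinding
metadata:
  name: agic-azure-identity-binding
spec:
  AzureIdentity: <identity-name-from-step-3>
  Selector: [nameofstep4]-ingress-azure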

Ingress Controller details

Name:           bunking-lizard-ingress-azure-7b54978988-rs8h2
Namespace:      default
Priority:       0
Node:           aks-agentpool-76661633-1/10.2.0.4
Start Time:     Mon, 18 Nov 2019 20:42:53 -0300
Labels:         aadpodidbinding=bunking-lizard-ingress-azure
                app=ingress-azure
                pod-template-hash=7b54978988
                release=bunking-lizard
Annotations:    prometheus.io/port: 8123
                prometheus.io/scrape: true
Status:         Running
IP:             10.2.0.15
IPs:            <none>
Controlled By:  ReplicaSet/bunking-lizard-ingress-azure-7b54978988
Containers:
  ingress-azure:
    Container ID:   docker://bd46f4833b77b806132fd7cd7729aca5ffbf117bbee3a2e7c011ffb109894db0
    Image:          mcr.microsoft.com/azure-application-gateway/kubernetes-ingress:1.0.0
    Image ID:       docker-pullable://mcr.microsoft.com/azure-application-gateway/kubernetes-ingress@sha256:c295f99ae66443c5a392fd894620fcd1fc313b9efdec96d13f166fefb29780a9
    Port:           <none>
    Host Port:      <none>
    State:          Running
      Started:      Mon, 18 Nov 2019 20:44:58 -0300
    Last State:     Terminated
      Reason:       Error
      Exit Code:    2
      Started:      Mon, 18 Nov 2019 20:43:59 -0300
      Finished:     Mon, 18 Nov 2019 20:44:58 -0300
    Ready:          False
    Restart Count:  2
    Liveness:       http-get http://:8123/health/alive delay=15s timeout=1s period=20s #success=1 #failure=3
    Readiness:      http-get http://:8123/health/ready delay=5s timeout=1s period=10s #success=1 #failure=3
    Environment Variables from:
      bunking-lizard-cm-ingress-azure  ConfigMap  Optional: false
    Environment:
      AZURE_CONTEXT_LOCATION:  /etc/appgw/azure.json
      AGIC_POD_NAME:           bunking-lizard-ingress-azure-7b54978988-rs8h2 (v1:metadata.name)
      AGIC_POD_NAMESPACE:      default (v1:metadata.namespace)
    Mounts:
      /etc/appgw/azure.json from azure (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from bunking-lizard-sa-ingress-azure-token-z9hvt (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  azure:
    Type:          HostPath (bare host directory volume)
    Path:          /etc/kubernetes/azure.json
    HostPathType:  File
  bunking-lizard-sa-ingress-azure-token-z9hvt:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  bunking-lizard-sa-ingress-azure-token-z9hvt
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type     Reason     Age                  From                               Message
  ----     ------     ----                 ----                               -------
  Normal   Scheduled  2m12s                default-scheduler                  Successfully assigned default/bunking-lizard-ingress-azure-7b54978988-rs8h2 to aks-agentpool-76661633-1
  Normal   Pulling    67s (x2 over 2m11s)  kubelet, aks-agentpool-76661633-1  Pulling image "mcr.microsoft.com/azure-application-gateway/kubernetes-ingress:1.0.0"
  Normal   Pulled     67s (x2 over 2m9s)   kubelet, aks-agentpool-76661633-1  Successfully pulled image "mcr.microsoft.com/azure-application-gateway/kubernetes-ingress:1.0.0"
  Normal   Created    67s (x2 over 2m6s)   kubelet, aks-agentpool-76661633-1  Created container ingress-azure
  Normal   Killing    67s                  kubelet, aks-agentpool-76661633-1  Container ingress-azure failed liveness probe, will be restarted
  Normal   Started    66s (x2 over 2m6s)   kubelet, aks-agentpool-76661633-1  Started container ingress-azure
  Warning  Unhealthy  13s (x10 over 113s)  kubelet, aks-agentpool-76661633-1  Readiness probe failed: Get http://10.2.0.15:8123/health/ready: dial tcp 10.2.0.15:8123: connect: connection refused
  Warning  Unhealthy  7s (x6 over 107s)    kubelet, aks-agentpool-76661633-1  Liveness probe failed: Get http://10.2.0.15:8123/health/alive: dial tcp 10.2.0.15:8123: connect: connection refused
  • Output of `kubectl logs <ingress controller>`:
kubectl logs bunking-lizard-ingress-azure-7b54978988-rs8h2
ERROR: logging before flag.Parse: I1118 23:42:59.206842       1 main.go:302] Using verbosity level 5 from environment variable APPGW_VERBOSITY_LEVEL
I1118 23:42:59.256720       1 environment.go:168] KUBERNETES_WATCHNAMESPACE is not set. Watching all available namespaces.
I1118 23:42:59.256746       1 main.go:132] App Gateway Details: Subscription: mysubsxxxxxxxxxxxxxxxxxxxxxx, Resource Group: MYRESOURCEG, Name: MYAKSCLUSTER
I1118 23:42:59.256752       1 auth.go:90] Creating authorizer from Azure Managed Service Identity
Note: we are using Kubernetes version 1.15.5 (preview in Azure). Could this be related?
  • Azure support ticket associated with this issue: 119111924000415

About this issue

  • State: closed
  • Created 5 years ago
  • Reactions: 4
  • Comments: 19 (5 by maintainers)

Most upvoted comments

We are facing the same issue.

To Reproduce

We are installing the Azure App Gateway Ingress Controller as following:

# install azure ad pod identities
./kubernetes-d/linux-amd64/helm repo add aad-pod-identity https://raw.githubusercontent.com/Azure/aad-pod-identity/master/charts
./kubernetes-d/linux-amd64/helm repo update
./kubernetes-d/linux-amd64/helm upgrade --install aad-pod-identity aad-pod-identity/aad-pod-identity --atomic
# install azure app gateway ingress controller
./kubernetes-d/linux-amd64/helm repo add application-gateway-kubernetes-ingress https://appgwingress.blob.core.windows.net/ingress-azure-helm-package/
./kubernetes-d/linux-amd64/helm repo update
# install with the customized helm-config.yaml
./kubernetes-d/linux-amd64/helm upgrade -f=./app-gw-ingress/helm-config.yaml --install ingress-azure application-gateway-kubernetes-ingress/ingress-azure --version=$appgwChartVersion --namespace=default --atomic

The applied helm-chart values:

# Specify aks cluster related information. THIS IS BEING DEPRECATED.
# aksClusterConfiguration:
#   apiServerAddress: AKS_API_SERVER_ADDRESS

# The values.yaml file is important to templates.
# This file contains the default values for a chart.
# These values may be overridden during helm install or helm upgrade.
replicaCount: 1

# Verbosity level of the App Gateway Ingress Controller
verbosityLevel: 3

image:
  repository: mcr.microsoft.com/azure-application-gateway/kubernetes-ingress
  tag: 0.7.1
  pullPolicy: Always


# kubernetes:
#     # Namespace(s) AGIC watches; Leaving this blank watches all namespaces;
#     # Accepts one or many comma-separated values
#     watchNamespace:
#     # Port for AGIC's HTTP API endpoint
#     httpServicePort: 8124


################################################################################
# Specify which application gateway the ingress controller will manage
#
appgw:
  subscriptionId: SUBSCRIPTION_ID
  resourceGroup: RESOURCE_GROUP_NAME
  name: APP_GW_NAME

################################################################################
# Specify the authentication with Azure Resource Manager
#
# Two authentication methods are available:
# - Option 1: AAD-Pod-Identity (https://github.com/Azure/aad-pod-identity)
armAuth:
  type: aadPodIdentity
  identityResourceID: ARM_AUTH_RESOURCE_ID
  identityClientID: ARM_AUTH_CLIENT_ID
# - Option 2: ServicePrincipal as a kubernetes secret
# armAuth:
#   type: servicePrincipal
#
#   # Generate this value with:
#   #   az ad sp create-for-rbac --subscription <subscription-uuid> --sdk-auth | base64 -w0
#   secretJSON: <base64-encoded-JSON-blob>

################################################################################
# Specify if the cluster is RBAC enabled or not
rbac:
  enabled: true # true/false

nodeSelector: {}
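For reference, identityResourceID expects the full ARM resource ID of the user-assigned identity and identityClientID its client ID; the values below are placeholders:

armAuth:
  type: aadPodIdentity
  identityResourceID: /subscriptions/<subscription-id>/resourcegroups/<MC-resource-group>/providers/Microsoft.ManagedIdentity/userAssignedIdentities/<identity-name>
  identityClientID: <identity-client-id>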

Describe the bug

  • azure-app-gateway-ingress-controller pod stuck in CrashLoopBackOff
  • kubectl describe of the pod shows the Liveness and Readiness probe failing in the same manner as described by the author in the first post

Current workaround

We currently work around this by manually editing the deployment of the azure-app-gateway-ingress-controller and removing the readiness and liveness probes there. Afterwards the azure-app-gateway-ingress-controller pod is able to spin up correctly and stays stable in the Running status. kubectl edit deployment azure-app-gateway-ingress-controller -> remove the complete readiness and liveness probe sections, so that it looks something like this (a non-interactive kubectl patch equivalent is sketched after the YAML):

# Please edit the object below. Lines beginning with a '#' will be ignored,
# and an empty file will abort the edit. If an error occurs while saving this file will be
# reopened with the relevant failures.
#
apiVersion: apps/v1
kind: Deployment
metadata:
  annotations:
    deployment.kubernetes.io/revision: "3"
  creationTimestamp: "2019-11-18T16:34:42Z"
  generation: 3
  labels:
    app: ingress-azure
    chart: ingress-azure-1.0.0
    heritage: Tiller
    release: ingress-azure
  name: ingress-azure
  namespace: default
  resourceVersion: "425992"
  selfLink: /apis/apps/v1/namespaces/default/deployments/ingress-azure
  uid: f357ad89-67ac-49f4-8f12-3041b3d10081
spec:
  progressDeadlineSeconds: 600
  replicas: 1
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app: ingress-azure
      release: ingress-azure
  strategy:
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 25%
    type: RollingUpdate
  template:
    metadata:
      annotations:
        prometheus.io/port: "8123"
        prometheus.io/scrape: "true"
      creationTimestamp: null
      labels:
        aadpodidbinding: ingress-azure
        app: ingress-azure
        release: ingress-azure
    spec:
      containers:
      - env:
        - name: AZURE_CONTEXT_LOCATION
          value: /etc/appgw/azure.json
        - name: AGIC_POD_NAME
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: metadata.name
        - name: AGIC_POD_NAMESPACE
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: metadata.namespace
        envFrom:
        - configMapRef:
            name: ingress-azure
        image: mcr.microsoft.com/azure-application-gateway/kubernetes-ingress:0.7.1
        imagePullPolicy: Always
        name: ingress-azure
        resources: {}
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /etc/appgw/azure.json
          name: azure
      dnsPolicy: ClusterFirst
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      serviceAccount: ingress-azure
      serviceAccountName: ingress-azure
      terminationGracePeriodSeconds: 30
      volumes:
      - hostPath:
          path: /etc/kubernetes/azure.json
          type: File
        name: azure
status:
  availableReplicas: 1
  conditions:
  - lastTransitionTime: "2019-11-20T08:27:25Z"
    lastUpdateTime: "2019-11-20T08:27:25Z"
    message: Deployment has minimum availability.
    reason: MinimumReplicasAvailable
    status: "True"
    type: Available
  - lastTransitionTime: "2019-11-20T08:27:21Z"
    lastUpdateTime: "2019-11-20T08:27:25Z"
    message: ReplicaSet "ingress-azure-78d4c6c4d6" has successfully progressed.
    reason: NewReplicaSetAvailable
    status: "True"
    type: Progressing
  observedGeneration: 3
  readyReplicas: 1
  replicas: 1
  updatedReplicas: 1
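A non-interactive equivalent of the manual edit above (a sketch; it assumes the deployment name ingress-azure from the YAML and that the probes sit on the single container at index 0):

# Sketch: strip both probes with a JSON patch instead of kubectl edit
kubectl patch deployment ingress-azure --type=json -p='[
  {"op": "remove", "path": "/spec/template/spec/containers/0/livenessProbe"},
  {"op": "remove", "path": "/spec/template/spec/containers/0/readinessProbe"}
]'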

Further Question

Is it somehow possible to achieve a running azure-app-gateway-ingress-controller without the need to manually edit its deployment after it was deployed with the helm-chart?

The workaround of removing the liveness and readiness probes from the Deployment, as suggested by @gitflo1, seems to do the trick.

Blocks removed from AGIC’s Deployment:

(screenshot: the livenessProbe and readinessProbe blocks removed from the Deployment)

This causes the running AGIC pod to be replaced by a new one, as expected.

Checking the new pod logs:

$ kubectl log deadly-ibex-ingress-azure-c97797d69-btp8f

log is DEPRECATED and will be removed in a future version. Use logs instead.
ERROR: logging before flag.Parse: I1121 16:20:29.990455       1 main.go:302] Using verbosity level 3 from environment variable APPGW_VERBOSITY_LEVEL
I1121 16:20:30.028429       1 environment.go:168] KUBERNETES_WATCHNAMESPACE is not set. Watching all available namespaces.
I1121 16:20:30.028470       1 main.go:132] App Gateway Details: Subscription: <SubID>, Resource Group: <RG-Name>, Name: <AppGw-Name>
I1121 16:20:30.028494       1 auth.go:90] Creating authorizer from Azure Managed Service Identity
I1121 16:20:30.499006       1 main.go:179] Ingress Controller will observe all namespaces.
I1121 16:20:30.531956       1 context.go:129] k8s context run started
I1121 16:20:30.532008       1 context.go:168] Waiting for initial cache sync
I1121 16:20:30.632318       1 context.go:176] Initial cache sync done
I1121 16:20:30.632344       1 context.go:177] k8s context run finished
I1121 16:20:30.632487       1 worker.go:35] Worker started
I1121 16:20:30.632747       1 httpserver.go:57] Starting API Server on :8123
I1121 16:20:30.791415       1 mutate_app_gateway.go:154] BEGIN AppGateway deployment
I1121 16:20:31.198186       1 mutate_app_gateway.go:182] Applied App Gateway config in 406.740784ms
I1121 16:20:31.198218       1 mutate_app_gateway.go:198] cache: Updated with latest applied config.
I1121 16:20:31.198685       1 mutate_app_gateway.go:203] END AppGateway deployment

The AGIC pod is now able to reach the Application Gateway and update its properties…

Same issue here; the liveness probe failed.

@akshaysngupta thank you for resolving this issue with the RC 1.0.1-rc1. I just tested the RC helm-chart version on our prior failing AKS clusters and the issue was resolved.

@gitflo1 Thanks for asking.

  1. No, not yet. I am yet to test it myself before I can be fully confident.
  2. Yes, we will publish a release candidate after this PR is merged.

This is a new issue: over the last weeks we already deployed at least 4 or 5 AKS clusters with the latest AGIC version 1.0.0, and the pods started running successfully.