kubernetes: Health Checks failed outside of ingress controller AWS NLB

Hi there,

I am having the exact same issue as #74948 but cannot get it to work. I have exactly the same configuration in another cluster and there it works fine. I have been struggling with this for a day now and don’t know where to look anymore.

We installed with the official Helm chart:

Chart version: 1.6.18
Image: quay.io/kubernetes-ingress-controller/nginx-ingress-controller:0.24.1

Values:

controller:
  kind: "DaemonSet"
  replicaCount: 2
  ingressClass: nginx
  daemonset:
    useHostPort: true
  service:
    annotations:
      service.beta.kubernetes.io/aws-load-balancer-type: nlb
      kubernetes.io/ingress.class: nginx
    externalTrafficPolicy: "Local"
  publishService:
    enabled: true
  config:
    use-proxy-protocol: "false"
defaultBackend:
  replicaCount: 1

What happened:

On the NLB side, health checks are failing. We tried a simple curl from within the node where the pod runs and we get a 503 error:

[ec2-user@ip-10-30-10-164 ~]$ curl -I localhost:30080/healthz
HTTP/1.1 503 Service Unavailable
Content-Type: application/json
Date: Fri, 02 Aug 2019 10:19:49 GMT
Content-Length: 112

[ec2-user@ip-10-30-10-164 ~]$ curl localhost:30080/healthz
{
	"service": {
		"namespace": "ingress",
		"name": "nginx-ingress-public-controller"
	},
	"localEndpoints": 0

The same request works from within the pod:

www-data@nginx-ingress-public-controller-rnmf5:/etc/nginx$ curl -I localhost:10254/healthz
HTTP/1.1 200 OK
Content-Type: text/plain; charset=utf-8
X-Content-Type-Options: nosniff
Date: Fri, 02 Aug 2019 10:21:00 GMT
Content-Length: 2

www-data@nginx-ingress-public-controller-rnmf5:/etc/nginx$ curl localhost:10254/healthz
okwww-data@nginx-ingress-public-controller-rnmf5:/etc/nginx$

What you expected to happen:

We should get a 200 response on this endpoint.

How to reproduce it (as minimally and precisely as possible):

EKS cluster 1.13, Amazon Linux EKS-optimized nodes, deploy the nginx ingress controller with Helm using the values above.

Anything else we need to know?:

I checked the NLB that was created, its target groups and health check configuration; everything looks fine. The issue is only with the health check between the node and the svc/pod.
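
For completeness, the localEndpoints: 0 in the kube-proxy response above means kube-proxy sees no ready controller endpoint on that node. A quick way to cross-check (a sketch using the namespace and label selector from the Service described below):

# Pods backing the Service, with the node each one runs on
kubectl -n ingress get pods -l app=nginx-ingress,component=controller,release=nginx-ingress-public -o wide

# Endpoints registered for the Service (should list the pod IPs above)
kubectl -n ingress get endpoints nginx-ingress-public-controller

# From the node itself: the kube-proxy health check the NLB probes
curl -s localhost:30080/healthz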

Service:

$ k describe svc nginx-ingress-public-controller
Name:                     nginx-ingress-public-controller
Namespace:                ingress
Labels:                   app=nginx-ingress
                          chart=nginx-ingress-1.6.18
                          component=controller
                          heritage=Tiller
                          release=nginx-ingress-public
Annotations:              kubernetes.io/ingress.class: nginx-public
                          service.beta.kubernetes.io/aws-load-balancer-type: nlb
Selector:                 app=nginx-ingress,component=controller,release=nginx-ingress-public
Type:                     LoadBalancer
IP:                       172.20.130.89
LoadBalancer Ingress:     ****************************.elb.eu-west-1.amazonaws.com
Port:                     http  80/TCP
TargetPort:               http/TCP
NodePort:                 http  32030/TCP
Endpoints:                10.30.10.91:80,10.30.11.12:80
Port:                     https  443/TCP
TargetPort:               https/TCP
NodePort:                 https  30460/TCP
Endpoints:                10.30.10.91:443,10.30.11.12:443
Session Affinity:         None
External Traffic Policy:  Local
HealthCheck NodePort:     30080
Events:
  Type    Reason                Age   From                Message
  ----    ------                ----  ----                -------
  Normal  EnsuringLoadBalancer  34m   service-controller  Ensuring load balancer
  Normal  EnsuredLoadBalancer   34m   service-controller  Ensured load balancer

Pods:

k describe pod nginx-ingress-public-controller-rnmf5
Name:               nginx-ingress-public-controller-rnmf5
Namespace:          ingress
Priority:           0
PriorityClassName:  <none>
Node:               ip-10-30-10-164.eu-west-1.compute.internal/10.30.10.164
Start Time:         Fri, 02 Aug 2019 11:31:04 +0200
Labels:             app=nginx-ingress
                    component=controller
                    controller-revision-hash=56df99c5d9
                    pod-template-generation=1
                    release=nginx-ingress-public
Annotations:        kubernetes.io/psp: eks.privileged
Status:             Running
IP:                 10.30.10.91
Controlled By:      DaemonSet/nginx-ingress-public-controller
Containers:
  nginx-ingress-controller:
    Container ID:  docker://b6e37c0fc66c1fde5cb61ecf1a8e502e9f795197a7823523dc710e35ecc7c7cd
    Image:         quay.io/kubernetes-ingress-controller/nginx-ingress-controller:0.24.1
    Image ID:      docker-pullable://quay.io/kubernetes-ingress-controller/nginx-ingress-controller@sha256:76861d167e4e3db18f2672fd3435396aaa898ddf4d1128375d7c93b91c59f87f
    Ports:         80/TCP, 443/TCP
    Host Ports:    80/TCP, 443/TCP
    Args:
      /nginx-ingress-controller
      --default-backend-service=ingress/nginx-ingress-public-default-backend
      --publish-service=ingress/nginx-ingress-public-controller
      --election-id=ingress-controller-leader
      --ingress-class=nginx
      --configmap=ingress/nginx-ingress-public-controller
    State:          Running
      Started:      Fri, 02 Aug 2019 11:31:05 +0200
    Ready:          True
    Restart Count:  0
    Liveness:       http-get http://:10254/healthz delay=10s timeout=1s period=10s #success=1 #failure=3
    Readiness:      http-get http://:10254/healthz delay=10s timeout=1s period=10s #success=1 #failure=3
    Environment:
      POD_NAME:       nginx-ingress-public-controller-rnmf5 (v1:metadata.name)
      POD_NAMESPACE:  ingress (v1:metadata.namespace)
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from nginx-ingress-public-token-h4l88 (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             True
  ContainersReady   True
  PodScheduled      True
Volumes:
  nginx-ingress-public-token-h4l88:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  nginx-ingress-public-token-h4l88
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/disk-pressure:NoSchedule
                 node.kubernetes.io/memory-pressure:NoSchedule
                 node.kubernetes.io/not-ready:NoExecute
                 node.kubernetes.io/unreachable:NoExecute
                 node.kubernetes.io/unschedulable:NoSchedule
Events:
  Type    Reason     Age   From                                                 Message
  ----    ------     ----  ----                                                 -------
  Normal  Scheduled  38m   default-scheduler                                    Successfully assigned ingress/nginx-ingress-public-controller-rnmf5 to ip-10-30-10-164.eu-west-1.compute.internal
  Normal  Pulled     38m   kubelet, ip-10-30-10-164.eu-west-1.compute.internal  Container image "quay.io/kubernetes-ingress-controller/nginx-ingress-controller:0.24.1" already present on machine
  Normal  Created    38m   kubelet, ip-10-30-10-164.eu-west-1.compute.internal  Created container
  Normal  Started    38m   kubelet, ip-10-30-10-164.eu-west-1.compute.internal  Started container

Environment:

$ k version
Client Version: version.Info{Major:"1", Minor:"14", GitVersion:"v1.14.3", GitCommit:"5e53fd6bc17c0dec8434817e69b04a25d8ae0ff0", GitTreeState:"clean", BuildDate:"2019-06-07T09:55:27Z", GoVersion:"go1.12.5", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"13+", GitVersion:"v1.13.7-eks-c57ff8", GitCommit:"c57ff8e35590932c652433fab07988da79265d5b", GitTreeState:"clean", BuildDate:"2019-06-07T20:43:03Z", GoVersion:"go1.11.5", Compiler:"gc", Platform:"linux/amd64"}

Cloud provider or hardware configuration:

$ aws --version
aws-cli/1.16.170 Python/3.7.3 Darwin/18.6.0 botocore/1.12.160

OS (e.g: cat /etc/os-release):

[ec2-user@ip-10-30-10-164 ~]$ cat /etc/os-release
NAME="Amazon Linux"
VERSION="2"
ID="amzn"
ID_LIKE="centos rhel fedora"
VERSION_ID="2"
PRETTY_NAME="Amazon Linux 2"
ANSI_COLOR="0;33"
CPE_NAME="cpe:2.3:o:amazon:amazon_linux:2"
HOME_URL="https://amazonlinux.com/"

Kernel (e.g. uname -a):

[ec2-user@ip-10-30-10-164 ~]$ uname -a
Linux ip-10-30-10-164.pp.meero 4.14.128-112.105.amzn2.x86_64 #1 SMP Wed Jun 19 16:53:40 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux

About this issue

  • Original URL
  • State: closed
  • Created 5 years ago
  • Reactions: 13
  • Comments: 66 (15 by maintainers)

Most upvoted comments

I think I’ve found out why: externalTrafficPolicy: "Local". By default this sets up the HTTP health check on a random port unless you set:

controller.service.healthCheckNodePort

If controller.service.type is NodePort or LoadBalancer and controller.service.externalTrafficPolicy is set to Local, set this to the managed health-check port the kube-proxy will expose. If blank, a random port in the NodePort range will be assigned

Source: https://github.com/helm/charts/tree/master/stable/nginx-ingress

I changed this to externalTrafficPolicy: "Cluster" and had to redeploy the Service. Now the health check is configured with a TCP port and I get healthy targets.

So, if you don’t care about preserving the source IP address, then that ^ could be your workaround; if you do care, you must specify the healthCheckNodePort value.
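
For reference, a minimal values sketch of both options against the stable/nginx-ingress chart (the 30080 below is only an example; it has to fall inside the cluster’s NodePort range):

# Option A: simplest workaround, but the client source IP is lost
controller:
  service:
    externalTrafficPolicy: "Cluster"
---
# Option B: keep externalTrafficPolicy Local and pin the health-check port kube-proxy exposes
controller:
  service:
    externalTrafficPolicy: "Local"
    healthCheckNodePort: 30080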

After struggling with this issue for an inordinate amount of time, I think I finally found a solution that allows us to use externalTrafficPolicy=Cluster (and thereby avoid the problems in this issue and others) while still preserving source IP.

What works for us is enabling Proxy Protocol V2 (manually) for the NLB’s Target Group and then configuring nginx-ingress to use the real-ip-header from the proxy protocol.

nginx-configuration configmap

kind: ConfigMap
apiVersion: v1
metadata:
  name: nginx-configuration
  namespace: ingress-nginx
  labels:
    app.kubernetes.io/name: ingress-nginx
    app.kubernetes.io/part-of: ingress-nginx
data:
  use-proxy-protocol: "true"
  real-ip-header: "proxy_protocol"
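
Enabling Proxy Protocol v2 on the target group can be done in the console or, as a sketch, with the AWS CLI (the target group ARN below is a placeholder):

# Enable Proxy Protocol v2 on the NLB target group
aws elbv2 modify-target-group-attributes \
  --target-group-arn arn:aws:elasticloadbalancing:eu-west-1:123456789012:targetgroup/EXAMPLE/0123456789abcdef \
  --attributes Key=proxy_protocol_v2.enabled,Value=true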

Another thing to note: I had originally changed the externalTrafficPolicy to Local in order to preserve the source IP. Once you’ve set it to Local, you can’t simply change it back to Cluster without completely removing and re-creating the Service and NLB.

After making the changes above, all nodes now report healthy in the Target Group (even though we’re only running nginx on a subset of those nodes), and we are seeing the correct source IP in the nginx access logs.

I experienced the same problem (with CLB) and found the cause, thanks to #80579.

It happens if the name of the node the pod is residing on is different from the Linux hostname.
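
A quick way to check for that mismatch, comparing the registered node name with what the instance itself reports (nothing cluster-specific assumed):

# Node names as registered in Kubernetes
kubectl get nodes -o wide

# Hostname as reported on the EC2 instance
hostname -f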

Thank you @yashwanthkalva, those steps will make it work.

However, the current behaviour is a bug. It should be possible to have both the preserved source IP and working health checks. We had to manually configure NLBs pointing to NodePort services to make them work with health checks, so it is possible.

We just need it to also work when set up automatically.
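
For what it’s worth, a sketch of the kind of manual health-check override described above, pointing the target group at the kube-proxy health-check NodePort (the ARN and port are placeholders):

# Point the target group health check at the Service's healthCheckNodePort
aws elbv2 modify-target-group \
  --target-group-arn arn:aws:elasticloadbalancing:eu-west-1:123456789012:targetgroup/EXAMPLE/0123456789abcdef \
  --health-check-protocol HTTP \
  --health-check-port 30080 \
  --health-check-path /healthz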

Following the steps here, https://aws.amazon.com/blogs/opensource/network-load-balancer-nginx-ingress-controller-eks/: to make a public NLB work with the ingress controller, externalTrafficPolicy in the ingress controller manifest needs to be changed from “Local” to “Cluster”. Only then will the NLB target group health checks pass. It started working for me.

Yes, but you lose the client IP address, which is explicitly preserved via externalTrafficPolicy: "Local". For us this is not an option.

It’s important to recognize that ExternalTrafficPolicy is not a way to preserve source IP; it’s a change in networking policy that happens to preserve source IP.

Preserving Source IP with Kubernetes ingress - How else can you preserve source IP with Kubernetes? If your external load balancer is a Layer 7 load balancer, the X-Forwarded-For header will also propagate client IP. If you are using a Layer 4 load balancer, you can use the PROXY protocol.

https://blog.getambassador.io/externaltrafficpolicy-local-on-kubernetes-e66e498212f9

It is correct that setting the policy is meant to change the load balancing itself, but additionally it is officially documented that you set this policy to preserve the source IP: https://kubernetes.io/docs/tasks/access-application-cluster/create-external-load-balancer/#preserving-the-client-source-ip
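
As those docs describe, with externalTrafficPolicy: Local the service controller also allocates spec.healthCheckNodePort, which is what the NLB is supposed to probe. A quick sketch to inspect both fields on the Service from this issue:

# Show the traffic policy and the allocated health-check NodePort
kubectl -n ingress get svc nginx-ingress-public-controller \
  -o jsonpath='{.spec.externalTrafficPolicy}{" "}{.spec.healthCheckNodePort}{"\n"}'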

The initial posting describes a concrete configuration that should just work but doesn’t, as @mariusmarais just mentioned.

I was randomly checking out the EKS EC2 nodes and found this behaviour on a 1.14 cluster with nginx 0.26.1. If the controller is deployed to all nodes, then it is okay.

Spent hours on this yesterday - same issue as well - new EKS 1.14 cluster.

Same here with EKS 1.14 and Nginx 0.26.1.

Having the same issue, but with multiple services using the LoadBalancer type with Classic and NLB on AWS EKS.

@RiceBowlJr Would you share your EKS cluster ARN with me (yyyng@amazon.com)? That would help with debugging 🤣