kubernetes: Health checks failing outside of the ingress controller with AWS NLB
Hi there,
I am having the exact same issue as #74948 but cannot get it to work. I have exactly the same configuration in another cluster, and there it works fine. I have been struggling with this for a day now and don't know where to look anymore.
We installed with the official Helm chart:
Chart version: 1.6.18
Image: quay.io/kubernetes-ingress-controller/nginx-ingress-controller:0.24.1
Values:
controller:
  kind: "DaemonSet"
  replicaCount: 2
  ingressClass: nginx
  daemonset:
    useHostPort: true
  service:
    annotations:
      service.beta.kubernetes.io/aws-load-balancer-type: nlb
      kubernetes.io/ingress.class: nginx
    externalTrafficPolicy: "Local"
  publishService:
    enabled: true
  config:
    use-proxy-protocol: "false"
defaultBackend:
  replicaCount: 1
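For reference, an install along these lines (Helm 2 with Tiller, matching the heritage=Tiller label shown below; the release name and namespace are taken from the kubectl output further down, and the values file name is assumed) would look like:

helm install stable/nginx-ingress \
  --name nginx-ingress-public \
  --namespace ingress \
  --version 1.6.18 \
  -f values.yaml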
What happened:
On the NLB side, the health checks are failing. We tried a simple curl from within the node where the pod runs and we get a 503 error:
[ec2-user@ip-10-30-10-164 ~]$ curl -I localhost:30080/healthz
HTTP/1.1 503 Service Unavailable
Content-Type: application/json
Date: Fri, 02 Aug 2019 10:19:49 GMT
Content-Length: 112
[ec2-user@ip-10-30-10-164 ~]$ curl localhost:30080/healthz
{
  "service": {
    "namespace": "ingress",
    "name": "nginx-ingress-public-controller"
  },
  "localEndpoints": 0
}
The same request works from within the pod:
www-data@nginx-ingress-public-controller-rnmf5:/etc/nginx$ curl -I localhost:10254/healthz
HTTP/1.1 200 OK
Content-Type: text/plain; charset=utf-8
X-Content-Type-Options: nosniff
Date: Fri, 02 Aug 2019 10:21:00 GMT
Content-Length: 2
www-data@nginx-ingress-public-controller-rnmf5:/etc/nginx$ curl localhost:10254/healthz
okwww-data@nginx-ingress-public-controller-rnmf5:/etc/nginx$
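Note the two different endpoints: port 10254 is the controller's own health endpoint inside the pod, while port 30080 is the kube-proxy healthCheckNodePort created for externalTrafficPolicy: Local, which is only supposed to return 200 on nodes hosting a ready controller endpoint. A rough way to cross-check which nodes those are (labels taken from the Service selector shown further down):

kubectl -n ingress get pods -o wide \
  -l app=nginx-ingress,component=controller,release=nginx-ingress-public

# on a node from that list, kube-proxy's health check port (30080 per the Service below)
# should report "localEndpoints" greater than 0
curl -sS localhost:30080/healthz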
What you expected to happen:
We should get a 200 response on this endpoint.
How to reproduce it (as minimally and precisely as possible):
EKS cluster 1.13, Amazon Linux EKS optimized nodes, deploy nginx ingress with helm with the values above.
Anything else we need to know?:
I checked the NLB that was created, its target groups and health check configuration, and everything is fine; the issue is only with the health check between the node and the svc/pod.
Service:
$ k describe svc nginx-ingress-public-controller
Name: nginx-ingress-public-controller
Namespace: ingress
Labels: app=nginx-ingress
chart=nginx-ingress-1.6.18
component=controller
heritage=Tiller
release=nginx-ingress-public
Annotations: kubernetes.io/ingress.class: nginx-public
service.beta.kubernetes.io/aws-load-balancer-type: nlb
Selector: app=nginx-ingress,component=controller,release=nginx-ingress-public
Type: LoadBalancer
IP: 172.20.130.89
LoadBalancer Ingress: ****************************.elb.eu-west-1.amazonaws.com
Port: http 80/TCP
TargetPort: http/TCP
NodePort: http 32030/TCP
Endpoints: 10.30.10.91:80,10.30.11.12:80
Port: https 443/TCP
TargetPort: https/TCP
NodePort: https 30460/TCP
Endpoints: 10.30.10.91:443,10.30.11.12:443
Session Affinity: None
External Traffic Policy: Local
HealthCheck NodePort: 30080
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal EnsuringLoadBalancer 34m service-controller Ensuring load balancer
Normal EnsuredLoadBalancer 34m service-controller Ensured load balancer
Pods:
k describe pod nginx-ingress-public-controller-rnmf5
Name: nginx-ingress-public-controller-rnmf5
Namespace: ingress
Priority: 0
PriorityClassName: <none>
Node: ip-10-30-10-164.eu-west-1.compute.internal/10.30.10.164
Start Time: Fri, 02 Aug 2019 11:31:04 +0200
Labels: app=nginx-ingress
component=controller
controller-revision-hash=56df99c5d9
pod-template-generation=1
release=nginx-ingress-public
Annotations: kubernetes.io/psp: eks.privileged
Status: Running
IP: 10.30.10.91
Controlled By: DaemonSet/nginx-ingress-public-controller
Containers:
nginx-ingress-controller:
Container ID: docker://b6e37c0fc66c1fde5cb61ecf1a8e502e9f795197a7823523dc710e35ecc7c7cd
Image: quay.io/kubernetes-ingress-controller/nginx-ingress-controller:0.24.1
Image ID: docker-pullable://quay.io/kubernetes-ingress-controller/nginx-ingress-controller@sha256:76861d167e4e3db18f2672fd3435396aaa898ddf4d1128375d7c93b91c59f87f
Ports: 80/TCP, 443/TCP
Host Ports: 80/TCP, 443/TCP
Args:
/nginx-ingress-controller
--default-backend-service=ingress/nginx-ingress-public-default-backend
--publish-service=ingress/nginx-ingress-public-controller
--election-id=ingress-controller-leader
--ingress-class=nginx
--configmap=ingress/nginx-ingress-public-controller
State: Running
Started: Fri, 02 Aug 2019 11:31:05 +0200
Ready: True
Restart Count: 0
Liveness: http-get http://:10254/healthz delay=10s timeout=1s period=10s #success=1 #failure=3
Readiness: http-get http://:10254/healthz delay=10s timeout=1s period=10s #success=1 #failure=3
Environment:
POD_NAME: nginx-ingress-public-controller-rnmf5 (v1:metadata.name)
POD_NAMESPACE: ingress (v1:metadata.namespace)
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from nginx-ingress-public-token-h4l88 (ro)
Conditions:
Type Status
Initialized True
Ready True
ContainersReady True
PodScheduled True
Volumes:
nginx-ingress-public-token-h4l88:
Type: Secret (a volume populated by a Secret)
SecretName: nginx-ingress-public-token-h4l88
Optional: false
QoS Class: BestEffort
Node-Selectors: <none>
Tolerations: node.kubernetes.io/disk-pressure:NoSchedule
node.kubernetes.io/memory-pressure:NoSchedule
node.kubernetes.io/not-ready:NoExecute
node.kubernetes.io/unreachable:NoExecute
node.kubernetes.io/unschedulable:NoSchedule
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 38m default-scheduler Successfully assigned ingress/nginx-ingress-public-controller-rnmf5 to ip-10-30-10-164.eu-west-1.compute.internal
Normal Pulled 38m kubelet, ip-10-30-10-164.eu-west-1.compute.internal Container image "quay.io/kubernetes-ingress-controller/nginx-ingress-controller:0.24.1" already present on machine
Normal Created 38m kubelet, ip-10-30-10-164.eu-west-1.compute.internal Created container
Normal Started 38m kubelet, ip-10-30-10-164.eu-west-1.compute.internal Started container
Environment:
$ k version
Client Version: version.Info{Major:"1", Minor:"14", GitVersion:"v1.14.3", GitCommit:"5e53fd6bc17c0dec8434817e69b04a25d8ae0ff0", GitTreeState:"clean", BuildDate:"2019-06-07T09:55:27Z", GoVersion:"go1.12.5", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"13+", GitVersion:"v1.13.7-eks-c57ff8", GitCommit:"c57ff8e35590932c652433fab07988da79265d5b", GitTreeState:"clean", BuildDate:"2019-06-07T20:43:03Z", GoVersion:"go1.11.5", Compiler:"gc", Platform:"linux/amd64"}
Cloud provider or hardware configuration:
$ aws --version
aws-cli/1.16.170 Python/3.7.3 Darwin/18.6.0 botocore/1.12.160
OS (e.g: cat /etc/os-release):
[ec2-user@ip-10-30-10-164 ~]$ cat /etc/os-release
NAME="Amazon Linux"
VERSION="2"
ID="amzn"
ID_LIKE="centos rhel fedora"
VERSION_ID="2"
PRETTY_NAME="Amazon Linux 2"
ANSI_COLOR="0;33"
CPE_NAME="cpe:2.3:o:amazon:amazon_linux:2"
HOME_URL="https://amazonlinux.com/"
Kernel (e.g. uname -a):
[ec2-user@ip-10-30-10-164 ~]$ uname -a
Linux ip-10-30-10-164.pp.meero 4.14.128-112.105.amzn2.x86_64 #1 SMP Wed Jun 19 16:53:40 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
About this issue
- State: closed
- Created 5 years ago
- Reactions: 13
- Comments: 66 (15 by maintainers)
I think I've found out why:
externalTrafficPolicy: "Local" by default sets up the HTTP health check on a random port, unless controller.service.healthCheckNodePort is set.
Source: https://github.com/helm/charts/tree/master/stable/nginx-ingress
I changed this to externalTrafficPolicy: "Cluster" and had to redeploy the Service. Now the health check is configured with a TCP port and I get healthy targets. So, if you don't care about preserving the client source IP address, that could be your workaround; if you do, you must specify the healthCheckNodePort value.
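If you do want to keep Local, a minimal values sketch pinning the chart value mentioned above (30080 is just an example port in the NodePort range, not a required value):

controller:
  service:
    externalTrafficPolicy: "Local"
    healthCheckNodePort: 30080

The NLB target group's health check can then be pointed at that fixed port instead of a randomly allocated one.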
After struggling with this issue for an inordinate amount of time, I think I finally found a solution that allows us to use externalTrafficPolicy=Cluster (and thereby avoid the problems in this issue and others) while still preserving the source IP. What works for us is enabling Proxy Protocol v2 (manually) for the NLB's Target Group and then configuring nginx-ingress to use the real-ip-header from the proxy protocol.
nginx-configuration configmap
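(The ConfigMap contents were not captured here; a minimal sketch of the relevant nginx-ingress options, assuming the ConfigMap name and namespace that the controller's --configmap flag above points at, would be:)

apiVersion: v1
kind: ConfigMap
metadata:
  name: nginx-ingress-public-controller   # the --configmap target shown in the pod args above
  namespace: ingress
data:
  use-proxy-protocol: "true"
  real-ip-header: "proxy_protocol"

Enabling Proxy Protocol v2 on the NLB target group itself can be done with the AWS CLI (target group ARN omitted):

aws elbv2 modify-target-group-attributes \
  --target-group-arn <target-group-arn> \
  --attributes Key=proxy_protocol_v2.enabled,Value=true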
Another thing to note: I had originally changed the externalTrafficPolicy to Local in order to preserve the source IP. Once you've set it to Local, you can't simply change it back to Cluster without completely removing and re-creating the Service and NLB. After making the changes above, all nodes now report healthy in the Target Group (even though we're only running nginx on a subset of those nodes) and we are seeing the correct source IP in the nginx access logs.
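A possible sequence for that switch, sketched under the assumption that re-applying an edited copy of the Service manifest is acceptable (deleting the Service tears down the NLB and gives the replacement a new DNS name):

kubectl -n ingress get svc nginx-ingress-public-controller -o yaml > svc.yaml
# edit svc.yaml: set spec.externalTrafficPolicy to Cluster, remove spec.healthCheckNodePort,
# the status section and other server-populated fields
kubectl -n ingress delete svc nginx-ingress-public-controller
kubectl -n ingress apply -f svc.yaml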
I experienced the same problem (with CLB) and found the cause, thanks to #80579.
It happens if the name of the node the pod resides on is different from the Linux hostname.
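A quick check for that mismatch, since kube-proxy may only count an endpoint as "local" when the node name it uses matches the registered node name. Note that in the report above, uname shows the hostname ip-10-30-10-164.pp.meero while the pod's node is registered as ip-10-30-10-164.eu-west-1.compute.internal:

kubectl get nodes -o name
# on the node itself:
hostname
# if these differ, kube-proxy may not count the pod as a local endpoint,
# which would match the "localEndpoints": 0 seen earlier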
Thank you @yashwanthkalva, those steps will make it work.
However, the current behaviour is a bug: it should be possible to have both source IP preservation and working health checks. We had to manually configure NLBs pointing to NodePort services to make them work with health checks, so it is possible.
We just need it to also work when set up automatically.
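For reference, that manual configuration amounts to pointing the target group's health check at the Service's HealthCheck NodePort; roughly something like the following, depending on which health check settings the NLB target group allows to be modified (ARN omitted; 30080 is the port from the Service above):

aws elbv2 modify-target-group \
  --target-group-arn <target-group-arn> \
  --health-check-protocol HTTP \
  --health-check-port 30080 \
  --health-check-path /healthz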
related to https://github.com/kubernetes/kubernetes/issues/61486
It is correct that setting the policy is meant to change the load balancing itself, but additionally it is officially documented that this policy preserves the client source IP: https://kubernetes.io/docs/tasks/access-application-cluster/create-external-load-balancer/#preserving-the-client-source-ip
The initial posting describes a concrete configuration that should just work but doesn't, as @mariusmarais just mentioned.
I was randomly checking the EKS EC2 nodes and found this behaviour on a 1.14 cluster with Nginx 0.26.1. If the controller is deployed to all nodes, then it is okay.
Spent hours on this yesterday - same issue as well - new EKS 1.14 cluster.
Same here with EKS 1.14 and Nginx 0.26.1.
Having the same issue, but with multiple services using the LoadBalancer type, with both Classic and NLB on AWS EKS.
@RiceBowlJr Would you share your EKS cluster ARN with me (yyyng@amazon.com)? That would help with debugging 🤣