kubernetes-ingress-controller: Nil targets from balancer and service name resolution failure in DB-less mode when using a named port

Summary

After upgrading from 1.4.2 to 2.0.2, our service went down after a while with "resolution failed: dns server error: 3 name error". Tracing it down, I found several issues related to this problem:

  • get_balancer fails in our prod environment after the upgrade
  • Kong then falls back to DNS resolution and never recovers
  • resolution fails because Kong uses the wrong service name when the Ingress uses a named port (screenshots in the original issue)

Kong treats the named port as part of the service name when the configuration is synced by the ingress controller, and the lookup fails.

1.4.2 generates the same target service name, but it seems it was always able to get the target from the balancer, so the problem was never triggered before.

Steps To Reproduce

I can’t reproduce the get_balancer failure outside our prod environment (I tried bench tools…), but the resolution problem can be reproduced as follows:

  1. Create an Ingress with a named servicePort (e.g. http instead of 80).
  2. The resulting Kong service gets the problematic service name.
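
To make step 2 concrete, here is a minimal Python sketch of the naming pattern. It is an assumption inferred from the service-application-kong.apps.80.svc name quoted in the comments below, not the controller's actual code; the helper kong_service_host and the example service echo in namespace default are hypothetical.

```python
from typing import Union


def kong_service_host(service: str, namespace: str, port: Union[int, str]) -> str:
    """Hypothetical helper: compose a host as <service>.<namespace>.<port>.svc.

    The pattern is inferred from the names seen in the DNS error log; the
    real ingress controller may build the name differently.
    """
    return f"{service}.{namespace}.{port}.svc"


# Numeric servicePort (80) versus named servicePort ("http").
numeric = kong_service_host("echo", "default", 80)
named = kong_service_host("echo", "default", "http")

print(numeric)
print(named)
```

In both cases the port ends up inside the host name, so neither result has the <service>.<namespace>.svc shape. That would only matter once Kong stops getting targets from the balancer and tries to resolve the name through DNS.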

Additional Details & Logs

You can check the details of dns failure I posted in Kong/kong#5455

About this issue

  • State: closed
  • Created 4 years ago
  • Comments: 25 (14 by maintainers)

Most upvoted comments

@hbagdi What minimum versions of Kong and kong-ingress-controller fix this problem? We moved to 2.1.4 / 0.9.0 because https://github.com/Kong/kong/pull/5831 talks about fixing this in 2.1.0, and @hishamhm also mentioned 2.1.0-alpha1 above, but we still hit this exceptional case. Note: we are running Kong in DB mode.

Apart from the related issues that discuss why the balancer can sometimes be nil, I’m confused about why the fallback isn’t able to resolve to a valid name when Kong is running in k8s. Here’s an example of the "3 name error" that we get:

(short)service-application-kong.apps.80.svc:(na)
service-application-kong.apps.80.svc.kong.svc.cluster.local:33
service-application-kong.apps.80.svc.svc.cluster.local:33
service-application-kong.apps.80.svc.cluster.local:33
...

Two examples that would have worked are below, but the fallback never tries them because the target ends up with the port in its name. I’m having trouble finding where the ingress controller adds that port.

service-application-kong.apps.svc.cluster.local
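
The mismatch can be sketched in Python by mimicking search-domain expansion. This is only an illustration: the search domains are inferred from the names in the log above, and expand is a hypothetical helper, not Kong's DNS client.

```python
# Search domains inferred from the suffixes of the tried names in the log.
SEARCH_DOMAINS = ["kong.svc.cluster.local", "svc.cluster.local", "cluster.local"]


def expand(name, search_domains):
    """Hypothetical helper: append each search domain to a short name."""
    return [f"{name}.{domain}" for domain in search_domains]


# Name as synced by the controller, with the port embedded.
with_port = expand("service-application-kong.apps.80.svc", SEARCH_DOMAINS)

# Same name without the port segment.
without_port = expand("service-application-kong.apps.svc", SEARCH_DOMAINS)

wanted = "service-application-kong.apps.svc.cluster.local"

for candidate in with_port:
    print(candidate)
print(wanted in with_port, wanted in without_port)
```

Under these assumptions, the candidates built from the name with the port match the ones in the log and never include the working name, while the same expansion without the port segment reaches it through the cluster.local search domain.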

@carnei-ro That PR is not in because it caused other errors in the next branch, and the team later agreed it wasn’t an ideal solution for master either.

The balancer code in next (and by extension 2.1.0-alpha1) contains many changes, plus we pushed fixes to the DB-less configuration loading logic, which together may cause the issue to not happen in the 2.1 branch.

We haven’t had confirmation from any users that the issue persists in 2.1.0-alpha1 (so it may be fixed already!), but we continue to investigate this, and if you have any info, please let us know!