contour: Contour should not 503 traffic to a down service if there is another service available in the Route
What steps did you take and what happened:
On my K8s cluster I am using Contour ingress, running as part of a Gimbal cluster: 2 Contour pods run separately alongside 10 Envoy pods, and each Envoy pod load-balances across 2 services. The Envoy pods health-check the endpoint nodes of each service.
```yaml
apiVersion: contour.heptio.com/v1beta1
kind: IngressRoute
metadata:
  name: foo
  namespace: foobar
spec:
  virtualhost:
    fqdn: foobar.com
  routes:
    - match: /
      services:
        - name: service0
          port: 80
          healthCheck:
            path: /healthy
            intervalSeconds: 3
            timeoutSeconds: 1
            unhealthyThresholdCount: 3
            healthyThresholdCount: 5
          strategy: WeightedLeastRequest
        - name: service1
          port: 80
          strategy: WeightedLeastRequest
          healthCheck:
            path: /healthy
            intervalSeconds: 3
            timeoutSeconds: 1
            unhealthyThresholdCount: 3
            healthyThresholdCount: 5
```
After all endpoint nodes of service1 went down, the Envoy pods did not update their forwarding rules for service1 and returned only a 'no healthy upstream' response for the traffic routed to it.
What did you expect to happen:
If all endpoints of a service (service1) are down, the Envoy pods should update their routing rules to forward traffic only to service0 until at least one endpoint of service1 is healthy again.
Anything else you would like to add:
Environment:
- Contour version: v0.11.0
- Kubernetes version (use `kubectl version`): 1.14.0
- Kubernetes installer & version: 1.14.0
- Cloud provider or hardware configuration: Xeon E5-2683 v4 2.10GHz / 2 CPU / 512GB RAM / SATA SSD 240GB x 1 / 10GBase-T x 2
- OS (e.g. from /etc/os-release): Ubuntu 16.04.6 LTS
- CNI: Calico v3.4
About this issue
- Original URL
- State: open
- Created 5 years ago
- Comments: 33 (20 by maintainers)
Commits related to this issue
- api: move HTTPProxy health check to route level Updates #1119 This PR moves the HealthCheck stanza from the individual service level to the route. This is an insurance policy if we have to move from... — committed to davecheney/contour by davecheney 5 years ago
- api: move HTTPProxy health check to route level Updates #1119 This PR moves the HealthCheck stanza from the individual service level to the route. This is an insurance policy if we have to move from... — committed to projectcontour/contour by davecheney 5 years ago
- api: move route.service.strategy to route.loadBalancerPolicy Updates #1119 Fixes #1668 This PR moves the per service lb strategy to a per route lb policy. The lb policy type provides a place to hold... — committed to davecheney/contour by davecheney 5 years ago
- api: move route.service.strategy to route.loadBalancerPolicy Updates #1119 Fixes #1668 This PR moves the per service lb strategy to a per route lb policy. The lb policy type provides a place to hold... — committed to projectcontour/contour by davecheney 5 years ago
There is a use case where the health check does not work as expected. Dynamically setting the weight to 0 is one way to make it work. I agree it needs an explicit keyword in the configuration saying that the user wants a dynamic weight so their service does not fail.
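As an illustration of that workaround, here is a minimal sketch against the IngressRoute services from the report above (the weight values are assumptions, not part of any proposal):

```yaml
# Sketch: an external controller or the operator rewrites the route so the
# unhealthy service is drained. "weight" is the IngressRoute per-service field.
routes:
  - match: /
    services:
      - name: service0
        port: 80
        weight: 100
      - name: service1
        port: 80
        weight: 0   # held at 0 while service1 has no healthy endpoints
```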
Thanks for the detailed comment, Steve. I think the problem here comes from the fact that we are prioritising different users of Contour. Let me explain.
There are at least two relevant users of Contour for this discussion. They are the end user/web client/consumer of services inside the cluster, and the application owner who configures the HTTPProxy (or HTTPRoute).
This is of course the goal for the end user, but the application owner will currently expect the behavior that's codified in the Gateway API. That's why it's codified in there. No one in the Gateway API community wants implementations serving 503s unless they have to. In this case, to respect the intent of the person who configured the object, the 503s must flow.
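For reference, the weighted-backend shape being discussed looks roughly like this in the Gateway API (a minimal sketch; the Gateway name and the weights are illustrative):

```yaml
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: foo
spec:
  parentRefs:
    - name: example-gateway   # hypothetical Gateway
  rules:
    - backendRefs:
        - name: service0
          port: 80
          weight: 90
        - name: service1
          port: 80
          weight: 10   # per the intent above, this share errors rather than being silently re-routed
```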
To put it another way, your solution serves the end user at the expense of the application owner.
Again, canary deployments and testing are the key thing to consider here. If we do something different (as you say, exposing that something else has happened in Conditions), how do we make people aware of this? They're used to just checking the status code of requests to determine if everything is okay. How do they build a CI test that deploys the new service at 1 percent and then checks that everything is okay? In the Condition case, that CI check needs not only an HTTP client but a full Kubernetes client: it has to know how to contact the apiserver, have an identity in the cluster, and know where to look for the config for the service. That's a lot of extra moving parts for the exact same outcome.
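To make that contrast concrete, a status-code-only canary check might be a single self-contained CI step along these lines (the runner, hostname, request count, and threshold are all illustrative assumptions):

```yaml
# Hypothetical CI job: after deploying the canary at 1 percent, verify health
# using nothing but an HTTP client and response status codes.
on: push
jobs:
  canary-check:
    runs-on: ubuntu-latest
    steps:
      - name: Probe the canary through the edge proxy
        run: |
          # At a 1% canary weight, 500 requests should hit the new service ~5 times;
          # any 5xx response fails the check.
          for i in $(seq 1 500); do
            code=$(curl -s -o /dev/null -w '%{http_code}' https://foobar.com/)
            if [ "$code" -ge 500 ]; then
              echo "request $i returned $code" && exit 1
            fi
          done
```

The point of the argument above is that the Condition-based alternative would replace a step like this with one that needs cluster credentials and apiserver access.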
What I'm saying is that, by making end users never experience 503s in these cases, we are not serving the needs of the application owner, who is arguably more of a user of Contour (since they are actually configuring it).
And I also agree that we should try our hardest not to serve 503s - unless the config the application owner has requested expects them.
In all three cases that you listed, there is a significant enough error that a human will probably need to take an action to fix something. Right now, people’s monitoring and alerting expects that the error signal they need to check is the HTTP status code.
Unless and until we have a plan for how to educate people on the interaction with using weights for advanced deployment sequences like canary testing/blue green upgrades etc, then we can’t just magically solve this.
@youngnick moved this to Prioritized Backlog because we need to address this better in docs, even if it doesn't land immediately in a release.
What I'm saying is that, in the case where there are multiple weighted services defined on a route and one of those services has zero available endpoints, the weighting should be respected, and 503s served at a proportional rate for that service.
So, if we have:
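Something like this sketch, where `s1` holds 10 percent of the route's weight (the second service `s2`, the fqdn, and the exact split are assumptions for illustration):

```yaml
apiVersion: projectcontour.io/v1
kind: HTTPProxy
metadata:
  name: example
spec:
  virtualhost:
    fqdn: example.com
  routes:
    - services:
        - name: s1
          port: 80
          weight: 10
        - name: s2   # hypothetical second service taking the remaining traffic
          port: 80
          weight: 90
```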
And `s1` has zero available endpoints, then 10 percent of requests should receive a 503 response.

On its face, this seems weird. Surely we can know that that service is down, and make it so that users don't see any problem, right? Yes, we can, but then we are not respecting what the HTTPProxy owner has asked us to do.
When someone defines weights on a service, they are saying "please send a proportion of traffic to this service". If we decide not to do that because the service is down, how do they know? The canonical way for them to represent their request is the `spec` of the object. They've told us what they want, and it's our job to give them that.

Does giving them what they want in this instance produce a bad outcome for end users? Arguably yes. But can we get around the fact that someone has asked us to do this? No.
I’m in agreement that we should work to prevent inadvertent breakage and footguns as far as possible. But in this case, the actual user of the software - the person who is exposing their app with Contour - has asked us for something, and it is not right to believe that we know better than them and decide that that traffic should be transparently routed somewhere else.
Now, there are some other cases as well:
These cases are both extensions of a simpler case:
What you are proposing is for Contour to say “we know better than the person who is configuring their object, and you shouldn’t be wanting 503s ever, so we’re going to make it so you can’t”.