contour: Contour should not 503 traffic to a down service if there is another service available in the Route
What steps did you take and what happened:
On my K8s cluster I am using Contour ingress, running as part of a Gimbal cluster: 2 Contour pods run separately alongside 10 Envoy pods, and each Envoy pod load-balances across 2 services. The Envoy pods health-check the endpoint nodes of each service.
```yaml
apiVersion: contour.heptio.com/v1beta1
kind: IngressRoute
metadata:
  name: foo
  namespace: foobar
spec:
  virtualhost:
    fqdn: foobar.com
  routes:
    - match: /
      services:
        - name: service0
          port: 80
          healthCheck:
            path: /healthy
            intervalSeconds: 3
            timeoutSeconds: 1
            unhealthyThresholdCount: 3
            healthyThresholdCount: 5
          strategy: WeightedLeastRequest
        - name: service1
          port: 80
          strategy: WeightedLeastRequest
          healthCheck:
            path: /healthy
            intervalSeconds: 3
            timeoutSeconds: 1
            unhealthyThresholdCount: 3
            healthyThresholdCount: 5
```
After all endpoint nodes of service1 went down, the Envoy pods did not update their forwarding rules for service1 and returned only a 'no healthy upstream' response for the traffic routed to it.
What did you expect to happen:
If all endpoints of a service (service1) are down, the Envoy pods should update their routing rules to forward traffic only to service0 until at least one endpoint of service1 is healthy again.
Anything else you would like to add:
Environment:
- Contour version: v0.11.0
- Kubernetes version (use `kubectl version`): 1.14.0
- Kubernetes installer & version: 1.14.0
- Cloud provider or hardware configuration: Xeon E5-2683 v4 2.10GHz / 2 CPU / 512GB RAM / SATA SSD 240GB x 1 / 10GBase-T x 2
- OS (e.g. from /etc/os-release): Ubuntu 16.04.6 LTS
- CNI: Calico v3.4
About this issue
- Original URL
- State: open
- Created 5 years ago
- Comments: 33 (20 by maintainers)
Commits related to this issue
- api: move HTTPProxy health check to route level Updates #1119 This PR moves the HealthCheck stanza from the individual service level to the route. This is an insurance policy if we have to move from... — committed to davecheney/contour by davecheney 5 years ago
- api: move HTTPProxy health check to route level Updates #1119 This PR moves the HealthCheck stanza from the individual service level to the route. This is an insurance policy if we have to move from... — committed to projectcontour/contour by davecheney 5 years ago
- api: move route.service.strategy to route.loadBalancerPolicy Updates #1119 Fixes #1668 This PR moves the per service lb strategy to a per route lb policy. The lb policy type provides a place to hold... — committed to davecheney/contour by davecheney 5 years ago
- api: move route.service.strategy to route.loadBalancerPolicy Updates #1119 Fixes #1668 This PR moves the per service lb strategy to a per route lb policy. The lb policy type provides a place to hold... — committed to projectcontour/contour by davecheney 5 years ago
There is a use case where the health check does not work as expected. Dynamically setting the weight to 0 is one way to make it work. I agree it needs an explicit keyword in the configuration saying that the user wants a dynamic weight so their service does not fail.
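As an illustration of that workaround, here is a minimal sketch against the IngressRoute services from the report above (the weight values are assumptions, not part of any proposal):

```yaml
# Sketch: an external controller or the operator rewrites the route so the
# unhealthy service is drained. "weight" is the IngressRoute per-service field.
routes:
  - match: /
    services:
      - name: service0
        port: 80
        weight: 100
      - name: service1
        port: 80
        weight: 0   # held at 0 while service1 has no healthy endpoints
```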
Thanks for the detailed comment, Steve. I think the problem here comes from the fact that we are prioritising different users of Contour. Let me explain.
There are at least two relevant users of Contour for this discussion. They are the end user/web client/consumer of services inside the cluster, and the application owner who configures the HTTPProxy (or HTTPRoute).
This is of course the goal for the end user, but the application owner will currently expect the behavior that's codified in the Gateway API. That's why it's codified in there. No one in the Gateway API community wants implementations serving 503s unless they have to. In this case, to respect the intent of the person who configured the object, the 503s must flow.
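For reference, the weighted-backend shape being discussed looks roughly like this in the Gateway API (a minimal sketch; the Gateway name and the weights are illustrative):

```yaml
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: foo
spec:
  parentRefs:
    - name: example-gateway   # hypothetical Gateway
  rules:
    - backendRefs:
        - name: service0
          port: 80
          weight: 90
        - name: service1
          port: 80
          weight: 10   # per the intent above, this share errors rather than being silently re-routed
```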
To put it another way, your solution serves the end user at the expense of the application owner.
Again, canary deployments and testing are the key thing to consider here. If we do something different (as you say, exposing that something else has happened in Conditions), how do we make people aware of this? They're used to just checking the status code of requests to determine if everything is okay. How do they build a CI test that deploys the new service at 1 percent and then checks that everything is okay? In the Condition case, that CI check needs not only an HTTP client but a full Kubernetes client: it has to know how to contact the apiserver, have an identity in the cluster, and know where to look for the config for the service. That's a lot of extra moving parts for the exact same outcome.
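To make that contrast concrete, a status-code-only canary check might be a single self-contained CI step along these lines (the runner, hostname, request count, and threshold are all illustrative assumptions):

```yaml
# Hypothetical CI job: after deploying the canary at 1 percent, verify health
# using nothing but an HTTP client and response status codes.
on: push
jobs:
  canary-check:
    runs-on: ubuntu-latest
    steps:
      - name: Probe the canary through the edge proxy
        run: |
          # At a 1% canary weight, 500 requests should hit the new service ~5 times;
          # any 5xx response fails the check.
          for i in $(seq 1 500); do
            code=$(curl -s -o /dev/null -w '%{http_code}' https://foobar.com/)
            if [ "$code" -ge 500 ]; then
              echo "request $i returned $code" && exit 1
            fi
          done
```

The point of the argument above is that the Condition-based alternative would replace a step like this with one that needs cluster credentials and apiserver access.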
What I'm saying is that, by making end users never experience 503s in these cases, we are not serving the needs of the application owner, who is arguably more of a user of Contour (since they are actually configuring it).
And I also agree that we should try our hardest not to serve 503s - unless the config the application owner has requested expects them.
In all three cases that you listed, there is a significant enough error that a human will probably need to take an action to fix something. Right now, people’s monitoring and alerting expects that the error signal they need to check is the HTTP status code.
Unless and until we have a plan for how to educate people on the interaction with using weights for advanced deployment sequences like canary testing/blue green upgrades etc, then we can’t just magically solve this.
@youngnick moved this to Prioritized Backlog because we need to address this better in docs, even if it doesn't land immediately in a release.
What I'm saying is that, in the case where there are multiple weighted services defined on a route and one of those services has zero available endpoints, the weighting should be respected, and 503s served at a proportional rate for that service.
So, if we have:
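Something like this sketch, where `s1` holds 10 percent of the route's weight (the second service `s2`, the fqdn, and the exact split are assumptions for illustration):

```yaml
apiVersion: projectcontour.io/v1
kind: HTTPProxy
metadata:
  name: example
spec:
  virtualhost:
    fqdn: example.com
  routes:
    - services:
        - name: s1
          port: 80
          weight: 10
        - name: s2   # hypothetical second service taking the remaining traffic
          port: 80
          weight: 90
```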
And `s1` has zero available endpoints, then 10 percent of requests should receive a 503 response.

On its face, this seems weird. Surely we can know that that service is down, and make it so that users don't see any problem, right? Yes, we can, but then we are not respecting what the HTTPProxy owner has asked us to do.
When someone defines weights on a service, they are saying "please send a proportion of traffic to this service". If we decide not to do that because the service is down, how do they know? The canonical way for them to represent their request is the `spec` of the object. They've told us what they want, and it's our job to give them that.

Does giving them what they want in this instance produce a bad outcome for end users? Arguably yes. But can we get around the fact that someone has asked us to do this? No.
I’m in agreement that we should work to prevent inadvertent breakage and footguns as far as possible. But in this case, the actual user of the software - the person who is exposing their app with Contour - has asked us for something, and it is not right to believe that we know better than them and decide that that traffic should be transparently routed somewhere else.
Now, there are some other cases as well:
These cases are both extensions of a simpler case:
What you are proposing is for Contour to say “we know better than the person who is configuring their object, and you shouldn’t be wanting 503s ever, so we’re going to make it so you can’t”.