istio: Info: Everything we do on 1.0.6 to minimise 503s
I’ve had several chats with people about this recently, so I’m putting it here to capture everything we do on 1.0.6 to deal with 503s, and then people can tell me what shouldn’t be required as of 1.1.x.
With the combination of all these things we see very few 503s in the mesh, and basically none at the edge.
On the VirtualService
For every service, we configure the Envoy retry headers:

```yaml
spec:
  http:
  - appendHeaders:
      x-envoy-max-retries: "10"
      x-envoy-retry-on: gateway-error,connect-failure,refused-stream
```
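For context, here is a minimal sketch of how that fragment sits in a complete VirtualService (the service name `my-service` and the route are hypothetical, not from the original post):

```yaml
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: my-service   # hypothetical name for illustration
spec:
  hosts:
  - my-service
  http:
  - appendHeaders:
      # Envoy-level retries for transient upstream failures
      x-envoy-max-retries: "10"
      x-envoy-retry-on: gateway-error,connect-failure,refused-stream
    route:
    - destination:
        host: my-service
```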
On the DestinationRule for high-QPS applications (3k req/sec over 6 pods), we configure outlier detection; during a pod restart this reduced errors from 400-500 down to 2-5:

```yaml
spec:
  trafficPolicy:
    outlierDetection:
      consecutiveErrors: 5
      interval: 30s
      baseEjectionTime: 30s
      maxEjectionPercent: 50
```
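As a sketch, the full DestinationRule would look something like this (the name and `host` field are hypothetical placeholders, not from the original post):

```yaml
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: my-service   # hypothetical name for illustration
spec:
  host: my-service
  trafficPolicy:
    outlierDetection:
      # Eject an endpoint after 5 consecutive errors...
      consecutiveErrors: 5
      # ...checking every 30s, ejecting for at least 30s,
      # and never ejecting more than half the endpoints.
      interval: 30s
      baseEjectionTime: 30s
      maxEjectionPercent: 50
```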
On the pods, we configure the application container to have a preStop sleep, which gives time for the pod’s unready state (during termination) to propagate to the other Envoys and for traffic to drain:

```yaml
lifecycle:
  preStop:
    exec:
      command:
      - sleep
      - "10"
```
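A sketch of where that hook lives in a Deployment pod spec (names and image are hypothetical; note that `terminationGracePeriodSeconds` is my addition, shown here as a reminder that the grace period must cover the sleep plus the app’s own shutdown time, since Kubernetes defaults it to 30s):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app   # hypothetical
spec:
  template:
    spec:
      terminationGracePeriodSeconds: 60   # must exceed preStop sleep + shutdown
      containers:
      - name: app
        image: my-app:latest   # hypothetical
        lifecycle:
          preStop:
            exec:
              command: ["sleep", "10"]
```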
On envoy, we have a custom pre-stop hook that waits for the primary application to stop listening:
```bash
#!/bin/bash
set -e

# If envoy or pilot-agent aren't running, there's nothing to drain.
if ! pidof envoy &>/dev/null; then
  exit 0
fi
if ! pidof pilot-agent &>/dev/null; then
  exit 0
fi

# Block until nothing other than envoy is listening on any port,
# i.e. the primary application has stopped accepting traffic.
while [ "$(netstat -plunt | grep tcp | grep -v envoy | wc -l | xargs)" -ne 0 ]; do
  sleep 3
done
exit 0
```
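For illustration, the hook could be wired into the sidecar like this; a minimal sketch, assuming the script is baked into a custom proxy image at a path of your choosing (`/usr/local/bin/prestop.sh` is a hypothetical path, not from the original post):

```yaml
# Fragment added to the istio-proxy container in the injector template
lifecycle:
  preStop:
    exec:
      # Hypothetical path; wherever the custom image puts the script
      command: ["/usr/local/bin/prestop.sh"]
```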
In the Istio mesh config, we set `policyCheckFailOpen: true`.
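As a sketch, that setting lives in the `istio` ConfigMap’s mesh config (the ConfigMap name and namespace below are the Istio defaults; adjust for your install):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: istio
  namespace: istio-system
data:
  mesh: |
    # Let requests through when Mixer policy checks are unavailable,
    # rather than failing closed with a 5xx.
    policyCheckFailOpen: true
```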
About this issue
- State: closed
- Created 5 years ago
- Reactions: 11
- Comments: 25 (10 by maintainers)
For those following this thread, we’ve removed pretty much everything we’ve done apart from the lifecycle preStop (so no headers or outlier detection now), implemented `Sidecar` to massively limit the scope of pushes (our push durations have dropped from 30s to 2s), and in 1.1.1 we’re not seeing any 503s.

Yeah, I build a custom image for the sidecar with the pre-stop script in it, and then I modify the injector template to have the preStop hook.
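For anyone unfamiliar with the `Sidecar` resource mentioned above, here is a minimal sketch of one that limits each proxy’s config to its own namespace plus `istio-system` (the namespace name is a hypothetical placeholder; the original post does not show its actual Sidecar config):

```yaml
apiVersion: networking.istio.io/v1alpha3
kind: Sidecar
metadata:
  name: default
  namespace: my-namespace   # hypothetical
spec:
  egress:
  - hosts:
    # Only import services from this namespace and istio-system,
    # shrinking the config Pilot has to push to each proxy.
    - "./*"
    - "istio-system/*"
```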
On Sat, 2 Mar 2019, 9:33 am Stefan Prodan, notifications@github.com wrote:
@Stono can I ask for a bit of clarification? Sorry if this is clear to everyone else, but there are enough moving parts that I’d like to make sure I get this right. When you say you did `lifecycle preStop`, does that mean you added the preStop to all of your app containers? Or does that go somewhere else?

The issue we are discussing is actually entirely different from the main thread. You can read the details here: https://freecontent.manning.com/handling-client-requests-properly-with-kubernetes/
@mumoshu `sidecar` actually does not talk to `gateway`, since `gateway` only routes traffic into the cluster.

We are actually going to add the `preStop` hook to all the deployments in our cluster, because regardless of whether you use Istio or not, there is a chance of dropped requests during a rolling update of these deployments.