istio: Info: Everything we do on 1.0.6 to minimise 503s
I’ve had several chats with people about this recently, so I’m putting it here to capture everything we do on 1.0.6 to deal with 503s, and then people can tell me what shouldn’t be required as of 1.1.x.
With the combination of all these things we see very few 503s in the mesh, and basically none at the edge.
On the VirtualService
For every service, we configure the Envoy retry headers:

```yaml
spec:
  http:
  - appendHeaders:
      x-envoy-max-retries: "10"
      x-envoy-retry-on: gateway-error,connect-failure,refused-stream
```
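For context, here is a minimal sketch of how that fragment sits in a complete VirtualService (the service name `my-service` and the route are hypothetical, not from the original post):

```yaml
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: my-service   # hypothetical name for illustration
spec:
  hosts:
  - my-service
  http:
  - appendHeaders:
      # Envoy-level retries for transient upstream failures
      x-envoy-max-retries: "10"
      x-envoy-retry-on: gateway-error,connect-failure,refused-stream
    route:
    - destination:
        host: my-service
```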
On the DestinationRule for high-QPS applications (3k req/sec over 6 pods), we configure outlier detection; during a pod restart this reduced errors from 400-500 down to 2-5:

```yaml
spec:
  trafficPolicy:
    outlierDetection:
      consecutiveErrors: 5
      interval: 30s
      baseEjectionTime: 30s
      maxEjectionPercent: 50
```
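As a sketch, the full DestinationRule would look something like this (the name and `host` field are hypothetical placeholders, not from the original post):

```yaml
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: my-service   # hypothetical name for illustration
spec:
  host: my-service
  trafficPolicy:
    outlierDetection:
      # Eject an endpoint after 5 consecutive errors...
      consecutiveErrors: 5
      # ...checking every 30s, ejecting for at least 30s,
      # and never ejecting more than half the endpoints.
      interval: 30s
      baseEjectionTime: 30s
      maxEjectionPercent: 50
```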
On the pods, we configure the application container to have a preStop sleep, which gives time for the pod’s unready state (during termination) to propagate to the other Envoys and for traffic to drain:

```yaml
lifecycle:
  preStop:
    exec:
      command:
      - sleep
      - "10"
```
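A sketch of where that hook lives in a Deployment pod spec (names and image are hypothetical; note that `terminationGracePeriodSeconds` is my addition, shown here as a reminder that the grace period must cover the sleep plus the app’s own shutdown time, since Kubernetes defaults it to 30s):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app   # hypothetical
spec:
  template:
    spec:
      terminationGracePeriodSeconds: 60   # must exceed preStop sleep + shutdown
      containers:
      - name: app
        image: my-app:latest   # hypothetical
        lifecycle:
          preStop:
            exec:
              command: ["sleep", "10"]
```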
On envoy, we have a custom pre-stop hook that waits for the primary application to stop listening:
```bash
#!/bin/bash
set -e

# If envoy or pilot-agent aren't running, there's nothing to drain.
if ! pidof envoy &>/dev/null; then
  exit 0
fi
if ! pidof pilot-agent &>/dev/null; then
  exit 0
fi

# Block until nothing other than envoy is listening on any port,
# i.e. the primary application has stopped accepting traffic.
while [ "$(netstat -plunt | grep tcp | grep -v envoy | wc -l | xargs)" -ne 0 ]; do
  sleep 3
done
exit 0
```
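For illustration, the hook could be wired into the sidecar like this; a minimal sketch, assuming the script is baked into a custom proxy image at a path of your choosing (`/usr/local/bin/prestop.sh` is a hypothetical path, not from the original post):

```yaml
# Fragment added to the istio-proxy container in the injector template
lifecycle:
  preStop:
    exec:
      # Hypothetical path; wherever the custom image puts the script
      command: ["/usr/local/bin/prestop.sh"]
```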
In the Istio mesh config, we set `policyCheckFailOpen: true`.
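As a sketch, that setting lives in the `istio` ConfigMap’s mesh config (the ConfigMap name and namespace below are the Istio defaults; adjust for your install):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: istio
  namespace: istio-system
data:
  mesh: |
    # Let requests through when Mixer policy checks are unavailable,
    # rather than failing closed with a 5xx.
    policyCheckFailOpen: true
```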
About this issue
- State: closed
- Created 5 years ago
- Reactions: 11
- Comments: 25 (10 by maintainers)
For those following this thread, we’ve removed pretty much everything we’ve done apart from the lifecycle preStop (so no headers or outlier detection now), implemented `Sidecar` to massively limit the scope of pushes (our push durations have dropped from 30s to 2s), and in 1.1.1 we’re not seeing any 503s.

Yeah, I build a custom image for the sidecar with the pre-stop script in it, and then I modify the injector template to have the preStop hook.
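For anyone unfamiliar with the `Sidecar` resource mentioned above, here is a minimal sketch of one that limits each proxy’s config to its own namespace plus `istio-system` (the namespace name is a hypothetical placeholder; the original post does not show its actual Sidecar config):

```yaml
apiVersion: networking.istio.io/v1alpha3
kind: Sidecar
metadata:
  name: default
  namespace: my-namespace   # hypothetical
spec:
  egress:
  - hosts:
    # Only import services from this namespace and istio-system,
    # shrinking the config Pilot has to push to each proxy.
    - "./*"
    - "istio-system/*"
```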
On Sat, 2 Mar 2019, 9:33 am Stefan Prodan, notifications@github.com wrote:
@Stono can I ask for a bit of clarification? Sorry if this is clear to everyone else, but there are enough moving parts that I’d like to make sure I get this right. When you say you did `lifecycle preStop`, does that mean you added the preStop to all of your app containers? Or does that go somewhere else?

The issue we are discussing is actually entirely different from the main thread. You can read the details here: https://freecontent.manning.com/handling-client-requests-properly-with-kubernetes/
@mumoshu `sidecar` actually does not talk to `gateway`, since `gateway` only routes traffic into the cluster.

We are actually going to add the `preStop` hook to all the deployments in our cluster, because regardless of whether you use Istio or not, there is a chance of dropped requests during a rolling update of these deployments.