serving: Istio-ingressgateway crashes due to OOM
What version of Knative?
0.13.0
Expected Behavior
The istio-ingressgateway pods do not keep restarting due to OOM
Actual Behavior
Istio-ingressgateway pods have many restarts and the istio-proxy container keeps fluctuating between healthy and killed due to OOM.
kubectl get pods -n istio-system
NAME READY STATUS RESTARTS AGE
cluster-local-gateway-85ffc48576-db5wf 1/1 Running 0 5d11h
istio-citadel-59577cd9db-rdnc5 1/1 Running 0 20d
istio-galley-559f8b47bd-qw96p 1/1 Running 0 20d
istio-ingressgateway-687d9f5f6d-x5fdt 2/2 Running 14 4d7h
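The termination reason of the istio-proxy container can be pulled out of kubectl describe, e.g. (pod name taken from the listing above):
# show the container state section, including the last termination reason
kubectl -n istio-system describe pod istio-ingressgateway-687d9f5f6d-x5fdt | grep -B2 -A6 'Last State'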
The istio-ingressgateway pod shows the following status:
Last State: Terminated
Reason: OOMKilled
Exit Code: 0
Started: Thu, 07 May 2020 10:44:48 +0200
Finished: Thu, 07 May 2020 10:49:47 +0200
The requests and limits set on the istio-proxy container of the istio-ingressgateway pod:
Limits:
cpu: 2
memory: 1Gi
Requests:
cpu: 500m
memory: 256Mi
We even scaled up the number of ingressgateway pods:
istio-ingressgateway-7587bfb7d8-2wgkk 1/2 OOMKilled 2 59m
istio-ingressgateway-7587bfb7d8-hqlgf 1/2 Running 2 59m
istio-ingressgateway-7587bfb7d8-kzzs5 1/2 Running 3 4m32s
istio-ingressgateway-7587bfb7d8-l42zd 1/2 OOMKilled 3 59m
istio-ingressgateway-7587bfb7d8-pzgpm 1/2 Running 3 59m
istio-ingressgateway-7587bfb7d8-qzn7d 1/2 Running 3 4m32s
istio-ingressgateway-7587bfb7d8-sdf4w 1/2 CrashLoopBackOff 2 3m47s
istio-ingressgateway-7587bfb7d8-wtvst 1/2 OOMKilled 2 59m
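The scale-out itself is just a replica bump on the gateway deployment, something along these lines (deployment name assumed from the default install, replica count from the listing above):
# scale the ingress gateway out; an HPA or operator managing the deployment
# may scale it back on the next reconcile
kubectl -n istio-system scale deployment istio-ingressgateway --replicas=8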
Logs from the istio-proxy container:
2020-05-07T11:58:06.313588Z info Envoy proxy is NOT ready: failed to get server info: failed retrieving Envoy stats: Get http://127.0.0.1:15000/server_info: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
2020-05-07T11:58:08.314413Z info Envoy proxy is NOT ready: failed to get server info: failed retrieving Envoy stats: Get http://127.0.0.1:15000/server_info: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
2020-05-07T11:58:10.313409Z info Envoy proxy is NOT ready: failed to get server info: failed retrieving Envoy stats: Get http://127.0.0.1:15000/server_info: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
2020-05-07T11:58:12.313437Z info Envoy proxy is NOT ready: failed to get server info: failed retrieving Envoy stats: Get http://127.0.0.1:15000/server_info: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
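Those readiness messages come from the probe hitting the Envoy admin endpoint on port 15000. While a gateway pod is still responsive, its heap usage can be checked against the same admin interface, e.g. (assuming curl is present in the istio-proxy image and the default istio=ingressgateway label):
# pick one gateway pod and dump Envoy's memory stats from the admin endpoint
POD=$(kubectl -n istio-system get pod -l istio=ingressgateway -o jsonpath='{.items[0].metadata.name}')
kubectl -n istio-system exec "$POD" -c istio-proxy -- curl -s http://127.0.0.1:15000/memory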
We increased:
- requests.memory to 512Mi
- limits.memory to 1Gi
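A bump like this can be applied with a strategic merge patch against the gateway deployment; a rough sketch, assuming the deployment and container names from the default install:
# raise the istio-proxy memory request/limit on the ingress gateway;
# if the gateway is managed by the istio operator or helm, the change
# may be reverted on the next reconcile
kubectl -n istio-system patch deployment istio-ingressgateway --type strategic -p '
{
  "spec": {"template": {"spec": {"containers": [{
    "name": "istio-proxy",
    "resources": {
      "requests": {"memory": "512Mi"},
      "limits": {"memory": "1Gi"}
    }
  }]}}}
}'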
While all the pods were running, this was the reported usage of the istio-ingressgateway pods:
kubectl -n istio-system top pods | grep istio-ingressgateway
istio-ingressgateway-7cfcf45499-45kgv 170m 919Mi
istio-ingressgateway-7cfcf45499-nbxw9 371m 923Mi
istio-ingressgateway-7cfcf45499-szgrr 389m 918Mi
istio-ingressgateway-7cfcf45499-trp7g 132m 918Mi
istio-ingressgateway-7cfcf45499-xrg7d 271m 919Mi
I noticed restarts on the istio-ingressgateway pods numbering in the hundreds:
istio-ingressgateway-7cfcf45499-45kgv 2/2 Running 192 20h
istio-ingressgateway-7cfcf45499-nbxw9 1/2 Running 186 20h
istio-ingressgateway-7cfcf45499-szgrr 1/2 Running 200 20h
istio-ingressgateway-7cfcf45499-trp7g 1/2 CrashLoopBackOff 188 19h
istio-ingressgateway-7cfcf45499-xrg7d 1/2 Running 199 20h
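To confirm that these restarts really are OOM kills and not something else, the last termination reason of every gateway pod can be listed in one go (label assumed from the default install):
# pod name plus the istio-proxy container's last termination reason
kubectl -n istio-system get pods -l istio=ingressgateway \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[?(@.name=="istio-proxy")].lastState.terminated.reason}{"\n"}{end}'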
We increased the resources again:
resources:
  limits:
    cpu: "2"
    memory: 1536Mi
  requests:
    cpu: "1"
    memory: 1Gi
The memory consumed scaled up as well:
kubectl -n istio-system top pods | grep istio-ingressgateway
istio-ingressgateway-687d9f5f6d-54hc6 506m 1029Mi
istio-ingressgateway-687d9f5f6d-6mck2 573m 1261Mi
istio-ingressgateway-687d9f5f6d-bdpq8 474m 1263Mi
istio-ingressgateway-687d9f5f6d-d9b99 630m 1262Mi
istio-ingressgateway-687d9f5f6d-mdz99 602m 1262Mi
istio-ingressgateway-687d9f5f6d-sp4tj 630m 1260Mi
istio-ingressgateway-687d9f5f6d-x5fdt 576m 1258Mi
I followed the logs of one of the pods from startup until it got OOMKilled and found a number of bidirectional gRPC streams. The container had 12 streams open before it was killed. See proxy_container_logs.log.
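For anyone who wants to look at the same thing: the log of the killed container is still retrievable after the restart via --previous (pod name taken from the listings above):
# follow the running istio-proxy container...
kubectl -n istio-system logs -f istio-ingressgateway-687d9f5f6d-x5fdt -c istio-proxy
# ...or pull the log of the previously OOM-killed instance after the restart
kubectl -n istio-system logs --previous istio-ingressgateway-687d9f5f6d-x5fdt -c istio-proxy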
Steps to Reproduce the Problem
–
Additional Info
Found https://github.com/istio/istio/issues/14366, where they update the istio-custom-bootstrap-config with overload-manager options. Additionally, the annotation sidecar.istio.io/bootstrapOverride: "istio-custom-bootstrap-config" needs to be set, as in https://github.com/istio/istio/blob/master/samples/custom-bootstrap/example-app.yaml#L14.
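A minimal sketch of what that could look like, following the layout of the linked sample; the overload-manager numbers are illustrative, and I am not sure the injector even honours the bootstrapOverride annotation for the ingressgateway deployment, since it is not sidecar-injected:
apiVersion: v1
kind: ConfigMap
metadata:
  name: istio-custom-bootstrap-config
  namespace: istio-system
data:
  # merged over the generated Envoy bootstrap of any pod that references it
  custom_bootstrap.json: |
    {
      "overload_manager": {
        "refresh_interval": "0.25s",
        "resource_monitors": [{
          "name": "envoy.resource_monitors.fixed_heap",
          "typed_config": {
            "@type": "type.googleapis.com/envoy.config.resource_monitor.fixed_heap.v2alpha.FixedHeapConfig",
            "max_heap_size_bytes": 1073741824
          }
        }],
        "actions": [{
          "name": "envoy.overload_actions.shrink_heap",
          "triggers": [{"name": "envoy.resource_monitors.fixed_heap", "threshold": {"value": 0.95}}]
        }, {
          "name": "envoy.overload_actions.stop_accepting_requests",
          "triggers": [{"name": "envoy.resource_monitors.fixed_heap", "threshold": {"value": 0.98}}]
        }]
      }
    }
The workload that should pick this up then needs sidecar.istio.io/bootstrapOverride: "istio-custom-bootstrap-config" in its pod template annotations, as in the linked example-app.yaml.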
Any ideas on how to test out this possible solution?
Other solutions: In issue https://github.com/istio/istio/issues/22939, they are on 1.4.6 and they mention:
With gRPC there are two known problems:
1. Retry policy: Envoy caches the request even though gRPC is naturally not retriable. If you have a large request, or even worse a gRPC streaming request, it is a nightmare if the retry is not disabled.
2. Throttling: by default Envoy will cache as much as 256 MB per connection. If you have a slow app/peer, you will observe Envoy memory usage exploding. Use EnvoyFilters to set the per-connection buffer size to see if it helps.
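For point 2, a sketch of what such an EnvoyFilter could look like for the gateway, capping per_connection_buffer_limit_bytes on its listeners and clusters; the 32 KiB value and the workload selector are assumptions and this is untested:
apiVersion: networking.istio.io/v1alpha3
kind: EnvoyFilter
metadata:
  name: ingressgateway-connection-buffer-limit
  namespace: istio-system
spec:
  workloadSelector:
    labels:
      istio: ingressgateway
  configPatches:
  # cap the buffer Envoy keeps per downstream connection on gateway listeners
  - applyTo: LISTENER
    match:
      context: GATEWAY
    patch:
      operation: MERGE
      value:
        per_connection_buffer_limit_bytes: 32768
  # and per upstream connection towards the backing services
  - applyTo: CLUSTER
    match:
      context: GATEWAY
    patch:
      operation: MERGE
      value:
        per_connection_buffer_limit_bytes: 32768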
Stalebot kicked in since the last request for info. If this is still an issue, please reopen it in knative-sandbox/net-istio.