serving: Istio-ingressgateway crashes due to OOM

What version of Knative?

0.13.0

Expected Behavior

The istio-ingressgateway pods do not keep restarting due to OOM

Actual Behavior

The istio-ingressgateway pods have many restarts, and the istio-proxy container keeps fluctuating between healthy and OOM-killed.

kubectl get pods -n istio-system
NAME                                      READY   STATUS    RESTARTS   AGE
cluster-local-gateway-85ffc48576-db5wf    1/1     Running   0          5d11h
istio-citadel-59577cd9db-rdnc5            1/1     Running   0          20d
istio-galley-559f8b47bd-qw96p             1/1     Running   0          20d
istio-ingressgateway-687d9f5f6d-x5fdt     2/2     Running   14         4d7h

The istio-ingressgateway pod shows the following last state:

Last State:     Terminated
      Reason:       OOMKilled
      Exit Code:    0
      Started:      Thu, 07 May 2020 10:44:48 +0200
      Finished:     Thu, 07 May 2020 10:49:47 +0200
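
This is the container state as reported by kubectl describe on the pod; for reference, a minimal way to pull it out, using the pod name from the listing above:

kubectl -n istio-system describe pod istio-ingressgateway-687d9f5f6d-x5fdt | grep -A6 'Last State'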

The requests and limits set on the istio-proxy container of the istio-ingressgateway pod:

 Limits:
      cpu:     2
      memory:  1Gi
    Requests:
      cpu:      500m
      memory:   256Mi

We even scaled up the number of istio-ingressgateway pods:

istio-ingressgateway-7587bfb7d8-2wgkk     1/2     OOMKilled          2          59m
istio-ingressgateway-7587bfb7d8-hqlgf     1/2     Running            2          59m
istio-ingressgateway-7587bfb7d8-kzzs5     1/2     Running            3          4m32s
istio-ingressgateway-7587bfb7d8-l42zd     1/2     OOMKilled          3          59m
istio-ingressgateway-7587bfb7d8-pzgpm     1/2     Running            3          59m
istio-ingressgateway-7587bfb7d8-qzn7d     1/2     Running            3          4m32s
istio-ingressgateway-7587bfb7d8-sdf4w     1/2     CrashLoopBackOff   2          3m47s
istio-ingressgateway-7587bfb7d8-wtvst     1/2     OOMKilled          2          59m
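
For reference, a minimal sketch of how the replica count can be raised, assuming the replicas are managed directly on the deployment; note that a default Istio install may run an HPA for istio-ingressgateway, which would fight a manual scale:

# Scale the gateway deployment to 8 replicas (matching the listing above).
kubectl -n istio-system scale deployment istio-ingressgateway --replicas=8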

Logs from the istio-proxy container:

2020-05-07T11:58:06.313588Z	info	Envoy proxy is NOT ready: failed to get server info: failed retrieving Envoy stats: Get http://127.0.0.1:15000/server_info: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
2020-05-07T11:58:08.314413Z	info	Envoy proxy is NOT ready: failed to get server info: failed retrieving Envoy stats: Get http://127.0.0.1:15000/server_info: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
2020-05-07T11:58:10.313409Z	info	Envoy proxy is NOT ready: failed to get server info: failed retrieving Envoy stats: Get http://127.0.0.1:15000/server_info: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
2020-05-07T11:58:12.313437Z	info	Envoy proxy is NOT ready: failed to get server info: failed retrieving Envoy stats: Get http://127.0.0.1:15000/server_info: net/http: request canceled (Client.Timeout exceeded while awaiting headers)

Increased the istio-proxy resources (as sketched after this list):

  • memory request to 512Mi
  • memory limit to 1Gi
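
A minimal sketch of applying that change to the istio-proxy container, assuming it is done directly on the deployment rather than through the Istio installer:

# Bump the memory request/limit of the istio-proxy container in place.
kubectl -n istio-system set resources deployment istio-ingressgateway \
  -c istio-proxy --requests=memory=512Mi --limits=memory=1Gi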

While all the pods were running, this was the reported usage of the istio-ingressgateway pods:

kubectl -n istio-system top pods | grep istio-ingressgateway
istio-ingressgateway-7cfcf45499-45kgv     170m         919Mi
istio-ingressgateway-7cfcf45499-nbxw9     371m         923Mi
istio-ingressgateway-7cfcf45499-szgrr     389m         918Mi
istio-ingressgateway-7cfcf45499-trp7g     132m         918Mi
istio-ingressgateway-7cfcf45499-xrg7d     271m         919Mi

I noticed restart counts on the istio-ingressgateway pods in the hundreds:

istio-ingressgateway-7cfcf45499-45kgv     2/2     Running            192        20h
istio-ingressgateway-7cfcf45499-nbxw9     1/2     Running            186        20h
istio-ingressgateway-7cfcf45499-szgrr     1/2     Running            200        20h
istio-ingressgateway-7cfcf45499-trp7g     1/2     CrashLoopBackOff   188        19h
istio-ingressgateway-7cfcf45499-xrg7d     1/2     Running            199        20h

Increased the resources again:

resources:
  limits:
    cpu: "2"
    memory: 1536Mi
  requests:
    cpu: "1"
    memory: 1Gi
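
For reference, the same change expressed as a strategic-merge patch; this is only a sketch, and if the gateway deployment is managed by Helm or the Istio operator a direct patch may be reverted on the next upgrade:

# Strategic merge: the container list is merged by name, so only istio-proxy changes.
kubectl -n istio-system patch deployment istio-ingressgateway -p '
spec:
  template:
    spec:
      containers:
      - name: istio-proxy
        resources:
          limits:
            cpu: "2"
            memory: 1536Mi
          requests:
            cpu: "1"
            memory: 1Gi
'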

The memory consumption scaled up as well:

kubectl -n istio-system top pods | grep istio-ingressgateway
istio-ingressgateway-687d9f5f6d-54hc6     506m         1029Mi
istio-ingressgateway-687d9f5f6d-6mck2     573m         1261Mi
istio-ingressgateway-687d9f5f6d-bdpq8     474m         1263Mi
istio-ingressgateway-687d9f5f6d-d9b99     630m         1262Mi
istio-ingressgateway-687d9f5f6d-mdz99     602m         1262Mi
istio-ingressgateway-687d9f5f6d-sp4tj     630m         1260Mi
istio-ingressgateway-687d9f5f6d-x5fdt     576m         1258Mi

I followed the logs of one of the pods from start until it was OOM-killed and found a bunch of bidirectional gRPC streams; the container had 12 open streams before it was killed. See proxy_container_logs.log.
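
For reference, a sketch of how the logs can be followed and how Envoy's heap and connection stats can be checked directly; this assumes curl is available inside the istio-proxy image (distroless proxy images do not ship it), and exact stat names can vary by Envoy version:

# Follow the proxy logs of one gateway pod until it is OOM-killed.
kubectl -n istio-system logs -f istio-ingressgateway-687d9f5f6d-x5fdt -c istio-proxy

# Query Envoy's admin endpoint (port 15000, as seen in the readiness logs above)
# for heap usage and active downstream connections.
kubectl -n istio-system exec istio-ingressgateway-687d9f5f6d-x5fdt -c istio-proxy -- \
  curl -s localhost:15000/stats | grep -E 'server\.memory|downstream_cx_active'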

Steps to Reproduce the Problem

Additional Info

Found https://github.com/istio/istio/issues/14366, where they update istio-custom-bootstrap-config with overload-manager options. Additionally, the sidecar.istio.io/bootstrapOverride: "istio-custom-bootstrap-config" annotation needs to be set, as in https://github.com/istio/istio/blob/master/samples/custom-bootstrap/example-app.yaml#L14.
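
A rough sketch, purely as an assumption based on the linked sample, of what that wiring could look like: a ConfigMap carrying an Envoy overload-manager snippet, referenced from the pod template via the bootstrapOverride annotation. The bootstrap field names and type URL below follow the Envoy v2 docs of that era and have not been verified, and it is unclear whether the annotation is honored on the gateway deployment (which is not sidecar-injected by default):

apiVersion: v1
kind: ConfigMap
metadata:
  name: istio-custom-bootstrap-config
  namespace: istio-system
data:
  # Key name follows the linked custom-bootstrap sample; the overload-manager
  # body below is an illustrative sketch, not a verified config.
  custom_bootstrap.json: |
    {
      "overload_manager": {
        "refresh_interval": "0.25s",
        "resource_monitors": [{
          "name": "envoy.resource_monitors.fixed_heap",
          "typed_config": {
            "@type": "type.googleapis.com/envoy.config.resource_monitor.fixed_heap.v2alpha.FixedHeapConfig",
            "max_heap_size_bytes": 1073741824
          }
        }],
        "actions": [{
          "name": "envoy.overload_actions.shrink_heap",
          "triggers": [{ "name": "envoy.resource_monitors.fixed_heap", "threshold": { "value": 0.95 } }]
        }, {
          "name": "envoy.overload_actions.stop_accepting_requests",
          "triggers": [{ "name": "envoy.resource_monitors.fixed_heap", "threshold": { "value": 0.98 } }]
        }]
      }
    }
---
# Fragment of the workload's pod template (per the linked example-app.yaml, line 14):
#   template:
#     metadata:
#       annotations:
#         sidecar.istio.io/bootstrapOverride: "istio-custom-bootstrap-config"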

Any ideas on how to test out this possible solution?

Other solutions: in https://github.com/istio/istio/issues/22939 they are on Istio 1.4.6, and they mention:

For gRPC there are two known problems:

1. Retry policy: Envoy caches the request even though gRPC is naturally not retriable. With a large request, or even worse a gRPC streaming request, it's a nightmare if retries are not disabled.

2. Throttling: by default Envoy will cache as much as 256 MB per connection. With a slow app/peer, Envoy memory usage will explode. Use EnvoyFilters to set the per-connection buffer size to see if it helps.
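
A hedged sketch of that second (throttling) suggestion applied to the gateway via an EnvoyFilter; the schema shown is the networking.istio.io/v1alpha3 one used around Istio 1.4/1.5, and the 32 KiB value is only an example to experiment with:

apiVersion: networking.istio.io/v1alpha3
kind: EnvoyFilter
metadata:
  name: ingressgateway-connection-buffer-limit
  namespace: istio-system
spec:
  workloadSelector:
    labels:
      istio: ingressgateway
  configPatches:
  # Cap the buffer Envoy allocates per downstream connection on the gateway.
  - applyTo: LISTENER
    match:
      context: GATEWAY
    patch:
      operation: MERGE
      value:
        per_connection_buffer_limit_bytes: 32768
  # Cap the buffer per upstream connection as well.
  - applyTo: CLUSTER
    match:
      context: GATEWAY
    patch:
      operation: MERGE
      value:
        per_connection_buffer_limit_bytes: 32768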


About this issue

  • Original URL
  • State: closed
  • Created 4 years ago
  • Comments: 18 (7 by maintainers)

Most upvoted comments

Stalebot kicked in since the last request for info. If this is still an issue, please reopen in knative-sandbox/net-istio.