ingress-gce: ingress-gce-404-server-with-metrics causes OOM

We encountered a scenario where 404-server-with-metrics can run into an out-of-memory (OOM) condition. This is probably caused by logs being partially retained in memory: when someone sends a large volume of requests to the cluster (e.g. a botnet scanning for vulnerabilities), the number of log messages written surges. Example:

...
I0505 11:27:49.607462 1 server-with-metrics.go:243] response 404 (backend NotFound), service rules for [ /header.html ] non-existent
I0505 11:27:49.707176 1 server-with-metrics.go:243] response 404 (backend NotFound), service rules for [ /q79w_38jg__.shtml ] non-existent
I0505 11:27:49.707220 1 server-with-metrics.go:243] response 404 (backend NotFound), service rules for [ /gk/public_html/ ] non-existent
...

This in turn may cause the container to hit its memory limit.

/cc @mborsz

About this issue

  • Original URL
  • State: closed
  • Created 3 years ago
  • Reactions: 1
  • Comments: 18 (8 by maintainers)

Most upvoted comments

In fact it’s not logs being kept in memory; those look good.

I ran the following experiment:

  • Modified the code to add a /debug/pprof/heap endpoint (see the sketch after this list)
  • Ran the ‘curl’ test from the README.md
  • Checked docker stats (the container was at ~400MiB)
  • Fetched pprof for the memory and…
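
For the first step, the /debug/pprof/heap endpoint can be exposed with the standard net/http/pprof package. A minimal sketch, purely my own illustration and not the actual ingress-gce change (the side port 6060 and the placeholder handler are arbitrary choices):

package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers /debug/pprof/* handlers on http.DefaultServeMux
)

func main() {
	// Debug-only listener for pprof; fetch the heap profile with e.g.
	//   go tool pprof http://localhost:6060/debug/pprof/heap
	go func() {
		log.Println(http.ListenAndServe("localhost:6060", nil))
	}()

	// Placeholder for the real server (404-server-with-metrics serves on its
	// own mux and ports); kept here only so the sketch is self-contained.
	mux := http.NewServeMux()
	mux.HandleFunc("/", http.NotFound)
	log.Fatal(http.ListenAndServe(":8080", mux))
}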

[screenshot: pprof heap profile]

It looks like the vast majority of memory is being allocated in these lines: https://github.com/kubernetes/ingress-gce/blob/b1a745203c5465c6a59056acc2233da37b36402e/cmd/404-server-with-metrics/server-with-metrics.go#L99-L107

It looks like on each server.idleChannel update (which happens on every request) we allocate a new time.Timer, which then lives for the next *idleLogTimer (1h by default).
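
In other words, the idle-watching loop presumably looks roughly like this. This is only a minimal sketch of the pattern, not the actual ingress-gce source; the names idleChannel and idleLogTimer are borrowed from the description above, and only the standard log and time packages are used:

package main

import (
	"log"
	"time"
)

// watchIdle sketches the leaky pattern: every receive on idleChannel re-enters
// the select, and each evaluation of time.After allocates a fresh Timer which,
// per the time.After documentation quoted below, is only collected once it fires.
func watchIdle(idleChannel <-chan struct{}, idleLogTimer time.Duration) {
	for {
		select {
		case <-idleChannel:
			// A request arrived; loop around and start waiting again.
		case <-time.After(idleLogTimer): // a new Timer on every iteration
			log.Printf("no requests received in the last %v", idleLogTimer)
		}
	}
}

func main() {
	idle := make(chan struct{})
	go watchIdle(idle, time.Hour)

	// Each incoming request notifies the watcher; under a request flood this
	// send happens constantly, so pending timers accumulate.
	for i := 0; i < 100000; i++ {
		idle <- struct{}{}
	}
}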

This matches the documentation of time.After (src: https://golang.org/pkg/time/#After):

The underlying Timer is not recovered by the garbage collector until the timer fires. If efficiency is a concern, use NewTimer instead and call Timer.Stop if the timer is no longer needed.
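
Applied to the sketch above, the NewTimer-based variant would allocate a single Timer up front and Reset it on each request instead of allocating one per request. Again, this is an illustration rather than the actual patch:

// watchIdle, rewritten to reuse one Timer for the lifetime of the loop.
func watchIdle(idleChannel <-chan struct{}, idleLogTimer time.Duration) {
	timer := time.NewTimer(idleLogTimer)
	defer timer.Stop()
	for {
		select {
		case <-idleChannel:
			// Stop and drain the timer before Reset, as the time.Timer.Reset
			// documentation recommends, so the idle window starts over.
			if !timer.Stop() {
				<-timer.C
			}
			timer.Reset(idleLogTimer)
		case <-timer.C:
			log.Printf("no requests received in the last %v", idleLogTimer)
			timer.Reset(idleLogTimer)
		}
	}
}

With this shape there is only ever one outstanding Timer, so the memory spent on pending timers stays constant regardless of the request rate.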