opa: `http.send` possible memory leak?

Short description

Our policy issues many cached http.send requests with a low max-age of 10s. The OPA server is configured with a 100 MB maximum cache size:

    caching:
      inter_query_builtin_cache:
        max_size_bytes: 100000000

I noticed that the memory usage reported by OPA's go_memstats_alloc_bytes metric looks like this: [image: go_memstats_alloc_bytes graph]

This is similar to what Kubernetes reports to us. The GC reports from GODEBUG are similar as well.

The max cache size does not appear to have any effect, and eventually the OPA pod crashes with an OOM. We do not see any errors and everything works as expected, except that memory keeps growing until the pod gets killed.

The http.send call looks like:

    res = http.send({"method": "get",
                     "url": url,
                     "timeout": "10s",
                     "force_json_decode": true,
                     "cache": true,
                     "raise_error": false,
                     "headers": {"Authorization": sprintf("Bearer %s", [input.token])}})

Version:

FROM openpolicyagent/opa:0.45.0

Expected behavior

Memory usage should stay steady at around 100-150 MB.

Any ideas what could be wrong here? Thanks!

Possibly related: https://github.com/open-policy-agent/opa/issues/5320

About this issue

  • State: closed
  • Created 2 years ago
  • Comments: 18 (9 by maintainers)

Most upvoted comments

I’ve created this issue for including the cache key in the cache size calculation.

There are two parts to the problem:

  • Keys added to c.l are only ever removed in one place, which only ever runs when the cache is full
  • c.l allows for duplicate elements

This means c.l will grow and grow until either the cache becomes full or OPA reaches its memory limit and gets OOM-killed.
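To make the two points above concrete, here is a minimal toy model in Go. It is not OPA's actual cache code: `toyCache`, `newToyCache` and `insert` are hypothetical names, and the only things taken from the description above are that keys are appended to a list `l` on every insert, that duplicates are allowed, and that keys only leave `l` when the size limit (which counts values, not keys) is hit.

```go
package main

import (
	"container/list"
	"fmt"
)

// toyCache is a hypothetical, simplified model of a size-bounded cache.
// It is NOT OPA's implementation; all names here are illustrative.
type toyCache struct {
	items map[string]int // key -> value size in bytes
	l     *list.List     // keys in insertion order, used for eviction
	usage int            // sum of value sizes only; key sizes are not counted
	limit int            // maximum allowed usage in bytes
}

func newToyCache(limit int) *toyCache {
	return &toyCache{items: map[string]int{}, l: list.New(), limit: limit}
}

// insert appends the key to l on every call (duplicates allowed) and
// removes keys from l only while evicting, i.e. only when the cache is full.
func (c *toyCache) insert(key string, size int) {
	for c.usage+size > c.limit && c.l.Len() > 0 {
		k := c.l.Remove(c.l.Front()).(string)
		if s, ok := c.items[k]; ok {
			c.usage -= s
			delete(c.items, k)
		}
	}
	if old, ok := c.items[key]; ok {
		c.usage -= old // assumption: refreshing an entry replaces its value
	}
	c.items[key] = size
	c.usage += size
	c.l.PushBack(key) // the same key is appended again on every refresh
}

func main() {
	c := newToyCache(10_000_000)
	// Refresh one short-lived entry 1000 times, as a low max-age would.
	for i := 0; i < 1000; i++ {
		c.insert("GET /users/alice + bearer token", 2048)
	}
	fmt.Println("entries:", len(c.items), "usage:", c.usage, "keys in l:", c.l.Len())
}
```

In this model the tracked usage stays at the size of a single response, so eviction never triggers, while the list holds one key per refresh. With keys that each carry a ~1KB token, that list is where the untracked memory would go.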

It is easier than one might think to be affected by this issue. Consider the following setup:

  • Cache size limit of 10 MB
  • http.send has 10 second cache lifetime
  • http.send uses bearer authentication, size of JWT is approx 1KB, JWT is renewed every 5 minutes
  • http.send fetches data for individual users, each user will have a unique http.send request that is cached
  • Response size for http.send is approx 2KB

If OPA in this scenario processes 100 active users, it would insert 100KB worth of keys into c.l every 10 seconds, while the actual cache would insert 200KB of data every 5 minutes. 100KB every 10 seconds means the size of c.l increases by 36 MB every hour! Meanwhile, c.usage increases by only 2.4MB per hour. In this case the cache would become full after approx. 4 hours, at a c.l size of about 144 MB, where one would expect the cache to use only 10 MB.
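The arithmetic for this scenario can be reproduced with a few lines of Go. This is only a back-of-the-envelope sketch: `growthPerHour` is a hypothetical helper, and the sizes are the approximate figures from the list above (1KB taken as 1000 bytes, 10 MB as 10,000,000 bytes).

```go
package main

import "fmt"

// growthPerHour returns the bytes added per hour when each of `users`
// inserts `bytesPerEntry` once every `intervalSeconds`.
func growthPerHour(users, bytesPerEntry, intervalSeconds int) int {
	return users * bytesPerEntry * (3600 / intervalSeconds)
}

func main() {
	const (
		users      = 100
		keyBytes   = 1000       // approx. size of a key containing the JWT
		respBytes  = 2000       // approx. size of one cached response
		limitBytes = 10_000_000 // configured cache size limit
	)

	keyGrowth := growthPerHour(users, keyBytes, 10)     // keys appended every 10s
	valueGrowth := growthPerHour(users, respBytes, 300) // new entries every 5 min

	// Eviction only starts once the tracked value usage reaches the limit.
	hoursToFull := float64(limitBytes) / float64(valueGrowth)

	fmt.Printf("c.l growth per hour:     %d bytes\n", keyGrowth)
	fmt.Printf("c.usage growth per hour: %d bytes\n", valueGrowth)
	fmt.Printf("hours until cache full:  %.2f\n", hoursToFull)
	fmt.Printf("c.l size at that point:  %.0f bytes\n", hoursToFull*float64(keyGrowth))
}
```

The 144 MB figure above comes from rounding the time to fill the cache down to 4 hours; without rounding, the estimate is slightly higher.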

It would be great if OPA considered the size of cache keys 😄