caddy: Memory/CPU leak in the proxy directive when max_fails is 0 and the backend fails

1. What version of Caddy are you running (caddy -version)?

v0.9.1

2. What are you trying to do?

Serving traffic through the proxy directive with max_fails 0, since only one backend is available.

3. What is your entire Caddyfile?

:2015 {
  proxy / 127.0.0.1:8080 {
    transparent
    max_fails 0
  }
}

4. How did you run Caddy (give the full command and describe the execution environment)?

./caddy --agree --conf=caddyfile_test

5. What did you expect to see?

No memory leak and no 100% CPU usage on 4 cores.

6. What did you see instead (give full error messages and/or log)?

4 cores at 100% and 4 GB of memory filled in under 1 minute after the first backend failure occurred.

stdout is also flooded with goroutine dumps like:

goroutine 2826341 [sleep, 2 minutes]:
time.Sleep(0x2540be400)
        /usr/local/go/src/runtime/time.go:59 +0xe1
github.com/mholt/caddy/caddyhttp/proxy.Proxy.ServeHTTP.func1(0xc4202c5420, 0x2540be400)
        /tmp/custombuild_111_575206995/src/github.com/mholt/caddy/caddyhttp/proxy/proxy.go:157 +0x2b
created by github.com/mholt/caddy/caddyhttp/proxy.Proxy.ServeHTTP
        /tmp/custombuild_111_575206995/src/github.com/mholt/caddy/caddyhttp/proxy/proxy.go:159 +0x44e

goroutine 2831798 [sleep, 2 minutes]:
time.Sleep(0x2540be400)
        /usr/local/go/src/runtime/time.go:59 +0xe1
github.com/mholt/caddy/caddyhttp/proxy.Proxy.ServeHTTP.func1(0xc4202c5420, 0x2540be400)
        /tmp/custombuild_111_575206995/src/github.com/mholt/caddy/caddyhttp/proxy/proxy.go:157 +0x2b
created by github.com/mholt/caddy/caddyhttp/proxy.Proxy.ServeHTTP
        /tmp/custombuild_111_575206995/src/github.com/mholt/caddy/caddyhttp/proxy/proxy.go:159 +0x44e

7. How can someone who is starting from scratch reproduce this behavior as minimally as possible?

  • use proxy with max_fails 0
  • put the proxy under high traffic
  • force a backend unreachable error (e.g. nothing listening on the upstream port)
  • watch CPU and memory usage climb (a load-generation sketch follows this list)
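
A minimal load-generation sketch, assuming Caddy is running with the Caddyfile above and nothing is listening on 127.0.0.1:8080, so every proxied request hits an unreachable backend (the client count and timeout are arbitrary):

package main

import (
    "net/http"
    "sync"
    "time"
)

func main() {
    // Give up on requests that hang while the proxy spins on the dead backend.
    client := &http.Client{Timeout: 5 * time.Second}

    var wg sync.WaitGroup
    for i := 0; i < 64; i++ { // a handful of concurrent clients is enough
        wg.Add(1)
        go func() {
            defer wg.Done()
            for j := 0; j < 100; j++ {
                resp, err := client.Get("http://127.0.0.1:2015/")
                if err != nil {
                    continue // timeouts are expected once the proxy starts spinning
                }
                resp.Body.Close()
            }
        }()
    }
    wg.Wait()
}

While this runs, the caddy process should show the runaway CPU and memory growth described above within a minute or so.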

As always, edge cases \o/

The workaround for now was to set max_fails to 50, which isn’t reached under normal circumstances.

About this issue

  • State: closed
  • Created 8 years ago
  • Comments: 16 (6 by maintainers)

Most upvoted comments

I feel like people’s backend services are flakier than they think… just because you can connect once with telnet doesn’t mean that, amid the thousands of requests before that, one request didn’t fail.

I do agree we need better error reporting in this case.

@nemothekid After re-reading your “thinking out loud” post above, what if we make a few changes:

  • An upstream is marked as down only if there is a connection error/timeout or a 5xx error response, and only if there is at least one other healthy backend.
  • If there is an error, we log the actual error message rather than just “backend unreachable”.
  • max_fails 0 is not an acceptable value, ever – we just throw an error at load time (sketched after the diff below).

^ These are just the changes I’m leaning toward so far. What do you (anyone!) think?

Something conceptually like:

diff --git a/proxy.go b/proxy2.go
index 89fa21a..c5c05c3 100644
--- a/proxy.go
+++ b/proxy2.go
@@ -157,6 +157,9 @@ func (p Proxy) ServeHTTP(w http.ResponseWriter, r *http.Request) (int, error) {
                        time.Sleep(timeout)
                        atomic.AddInt32(&host.Fails, -1)
                }(host, timeout)
+               if su, ok := upstream.(*staticUpstream); ok && su.MaxFails == 0 {
+                       return http.StatusBadGateway, errUnreachable
+               }
        }

        return http.StatusBadGateway, errUnreachable
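
And for the third bullet, a rough sketch of the load-time check – the helper name and error wording here are illustrative, not the actual Caddyfile setup code:

package proxy

import (
    "fmt"
    "strconv"
)

// parseMaxFails sketches rejecting max_fails 0 (or anything below 1) while
// the directive is being parsed, so the bad value never reaches the request
// path at all.
func parseMaxFails(value string) (int32, error) {
    n, err := strconv.Atoi(value)
    if err != nil {
        return 0, fmt.Errorf("max_fails: invalid value %q: %v", value, err)
    }
    if n < 1 {
        return 0, fmt.Errorf("max_fails must be at least 1, got %d", n)
    }
    return int32(n), nil
}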

Yeah, the stdout errors are caused by it launching as many goroutines as possible in that infinite loop (which explains why it’s eating memory).
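
To make that concrete, here is a stripped-down model of the pattern – not Caddy’s actual code, just simplified types and names. With max_fails 0 the host never counts as down, so when the backend is unreachable the retry loop never exits and every failed attempt spawns a goroutine that sleeps for the fail timeout:

package main

import (
    "errors"
    "sync/atomic"
    "time"
)

var errBackendDown = errors.New("backend unreachable")

type host struct{ fails int32 }

// available mimics the upstream health check: with maxFails == 0 the fail
// counter can never reach the threshold, so the host always looks healthy.
func (h *host) available(maxFails int32) bool {
    return maxFails == 0 || atomic.LoadInt32(&h.fails) < maxFails
}

// serve mimics the proxy retry loop: each failed attempt bumps the fail
// counter and spawns a goroutine that sleeps for the fail timeout before
// decrementing it again. With maxFails == 0 and a dead backend the loop
// never exits, so goroutines pile up and the CPU spins.
func serve(h *host, maxFails int32, timeout time.Duration) error {
    for h.available(maxFails) {
        if err := dial(); err == nil {
            return nil
        }
        atomic.AddInt32(&h.fails, 1)
        go func() {
            time.Sleep(timeout)
            atomic.AddInt32(&h.fails, -1)
        }()
    }
    return errBackendDown
}

// dial stands in for the always-failing connection to the dead backend.
func dial() error { return errBackendDown }

func main() {
    // With maxFails == 0 this never returns; goroutine count and memory grow
    // until the process is killed.
    _ = serve(&host{}, 0, 10*time.Second)
}

The sleeping goroutines in the dump above (time.Sleep(0x2540be400) is 10 seconds) are exactly these per-failure timers.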

max_fails 0 likely shouldn’t be a valid setting; max_fails 1 should be the minimum.

Or we can change the meaning of max_fails 0. If the backend fails and max_fails is 0, we immediately return http.StatusBadGateway, but we don’t mark the backend down (so the next request can still try that backend).
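
Sketched against the toy model above (again, just a sketch, not actual proxy code):

package main

import (
    "errors"
    "fmt"
    "net/http"
)

var errUnreachable = errors.New("unreachable backend")

// tryBackend sketches that alternative meaning of max_fails 0: a failed
// attempt is reported to the client as 502 immediately, and the fail counter
// is never touched, so the host is never marked down.
func tryBackend(dial func() error, maxFails int32) (int, error) {
    if err := dial(); err == nil {
        return http.StatusOK, nil
    }
    if maxFails == 0 {
        // Fail fast: 502 to the client, no fail bookkeeping, and the next
        // request can still try this backend.
        return http.StatusBadGateway, errUnreachable
    }
    // (fail counting, retries, and mark-down for maxFails > 0 omitted)
    return http.StatusBadGateway, errUnreachable
}

func main() {
    deadBackend := func() error { return errors.New("connection refused") }
    status, err := tryBackend(deadBackend, 0)
    fmt.Println(status, err) // 502, but the host stays "up"
}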