go: encoding/json: incorrect usage of sync.Pool

https://golang.org/cl/84897 introduces a denial-of-service attack on json.Marshal via a live-lock situation with sync.Pool.

Consider this snippet:

package main

import (
	"encoding/json"
	"fmt"
	"runtime"
	"time"
)

type CustomMarshaler int

func (c CustomMarshaler) MarshalJSON() ([]byte, error) {
	time.Sleep(500 * time.Millisecond) // simulate processing time
	b := make([]byte, int(c))
	b[0] = '"'
	for i := 1; i < len(b)-1; i++ {
		b[i] = 'a'
	}
	b[len(b)-1] = '"'
	return b, nil
}

func main() {
	processRequest := func(size int) {
		json.Marshal(CustomMarshaler(size))
		time.Sleep(1 * time.Millisecond) // simulate idle time
	}

	// Simulate a steady stream of infrequent large requests.
	go func() {
		for {
			processRequest(1 << 28) // 256MiB
		}
	}()

	// Simulate a storm of small requests.
	for i := 0; i < 1000; i++ {
		go func() {
			for {
				processRequest(1 << 10) // 1KiB
			}
		}()
	}

	// Continually run a GC and track the allocated bytes.
	var stats runtime.MemStats
	for i := 0; ; i++ {
		runtime.ReadMemStats(&stats)
		fmt.Printf("Cycle %d: %dB\n", i, stats.Alloc)
		time.Sleep(time.Second)
		runtime.GC()
	}
}

This is a variation of a situation suggested by @bcmills in https://github.com/golang/go/issues/23199#issuecomment-353193866.

Essentially, we have a 1-to-1000 ratio of goroutines that use 256MiB or 1KiB buffers, respectively. Whenever a 256MiB buffer lands in the sync.Pool, the 1KiB goroutines keep it alive by continually checking it out and putting it back. On my machine, after 300 GC cycles, the program above occupies 6GiB of heap, when I would expect the worst case to be around 256MiB.
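
For context, the underlying pattern is a package-level sync.Pool of growable buffers that are recycled unconditionally. Below is a simplified sketch of that pattern (not the actual encoding/json source): once a single request grows the pooled buffer to 256MiB, every subsequent 1KiB request keeps that 256MiB backing array reachable simply by checking it out and putting it back.

// Simplified sketch of the problematic pattern (not the real encoding/json code):
// a package-level pool whose buffers are recycled no matter how large they grew.
// Assumes "bytes" and "sync" are imported.
var statePool = sync.Pool{
	New: func() interface{} { return new(bytes.Buffer) },
}

func marshalWithPool(encode func(*bytes.Buffer)) []byte {
	buf := statePool.Get().(*bytes.Buffer)
	buf.Reset()              // drops the contents, but keeps the grown capacity
	defer statePool.Put(buf) // recycled unconditionally, even at 256MiB
	encode(buf)
	return append([]byte(nil), buf.Bytes()...)
}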

\cc @jnjackins @bradfitz

Most upvoted comments

I’ve been thinking of ways to guarantee bounds on the worst-case behavior, and this is a prototype idea:

// bufferPool is a pool of buffers.
//
// Example usage:
//
//	p := getBuffer()
//	defer putBuffer(p)
//	p.buf = appendFoo(p.buf, foo)        // may resize buffer to arbitrarily large sizes
//	return append([]byte(nil), p.buf...) // single copy of the buffer contents
//
var bufferPool = sync.Pool{
	New: func() interface{} { return new(pooledBuffer) },
}

type pooledBuffer struct {
	buf     []byte
	strikes int // number of times the buffer was under-utilized
}

func getBuffer() *pooledBuffer {
	return bufferPool.Get().(*pooledBuffer)
}

func putBuffer(p *pooledBuffer) {
	// Recycle large buffers if sufficiently utilized.
	// If a buffer is under-utilized enough times sequentially,
	// then it is discarded, ensuring that a single large buffer
	// won't be kept alive by a continuous stream of small usages.
	switch {
	case cap(p.buf) <= 1<<16: // always recycle buffers smaller than 64KiB
		p.strikes = 0
	case cap(p.buf)/2 <= len(p.buf): // at least 50% utilization
		p.strikes = 0
	case p.strikes < 4:
		p.strikes++
	default:
		return // discard the buffer; too large and too often under-utilized
	}
	p.buf = p.buf[:0]
	bufferPool.Put(p)
}
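
As a quick illustration of how the strikes logic behaves on the pathological workload from the original report (hypothetical sizes; sync.Pool makes no strict reuse guarantees, but a single-goroutine sequence will typically hand back the same object):

// Single-goroutine walk-through of the strikes logic: one 256MiB use followed
// by a stream of 1KiB uses.
func simulate() {
	use := func(n int) {
		p := getBuffer()
		defer putBuffer(p)
		p.buf = append(p.buf, make([]byte, n)...)
	}

	use(1 << 28) // grows the pooled buffer to 256MiB at roughly full utilization
	for i := 0; i < 10; i++ {
		// Each 1KiB use sits far below 50% utilization of the 256MiB capacity,
		// so putBuffer adds a strike. Once strikes reaches 4, the next
		// under-utilized Put discards the buffer, and the following getBuffer
		// starts over with a fresh, small pooledBuffer.
		use(1 << 10)
	}
}

Under this policy the large buffer survives only a handful of small uses instead of being pinned for the lifetime of the process.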

In the worst-case scenario, we can ensure a bound on the minimal utilization based on the % threshold (e.g., 50% in the example above) and maximum number of sequential recycles below that threshold (e.g., 4x in the example above).

For the constants chosen above, the worst-case sequence of utilizations would be:

  • 50%, 0%, 0%, 0%, 0%, 50%, 0%, 0%, 0%, 0%, 50%, 0%, 0%, 0%, 0%, …

On average, that’s a worst-case utilization of (50% + 4 × 0%) / 5 = 10%, which is far better than the theoretical worst case of 0% if the code naively recycles buffers without any limits.

Advantages of the algorithm:

  • It’s simple; easy to analyze and reason about.
  • It’s fast; only requires basic integer arithmetic.
  • It doesn’t depend on any global state where statistics are concurrently gathered.
  • For usages of all the same approximate size (regardless of how big or small), this ensures buffers are always recycled.

Happy to hear suggestions for improvements.

Another approach would be to remove the package-level encodeStatePool and allow the caller to optionally supply one when creating an encoder. The reasoning here is that the caller has more knowledge about the distribution of its payloads and the edge cases that a package-level statistical heuristic would struggle with.
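
A purely hypothetical sketch of what that might look like (the BufferPool interface and the constructor are invented here for illustration; encoding/json exposes no such API):

// Invented for illustration only; encoding/json has no such hook today.
// The caller owns the pool, and with it the retention policy for scratch buffers.
type BufferPool interface {
	Get() *bytes.Buffer
	Put(*bytes.Buffer)
}

// Hypothetical constructor: an encoder created this way would draw its scratch
// space from pool rather than from a package-level sync.Pool shared by all callers.
//
//	func NewEncoderWithPool(w io.Writer, pool BufferPool) *Encoder

A service that only encodes small payloads could then supply a pool that never retains more than, say, 64KiB, while a bulk exporter could deliberately keep large buffers around.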

I’m still a bit saddened by the amount of code being added. I sent out https://go.dev/cl/471200 as a simpler alternative implementation.

Here’s the performance comparison:

name \ time/op             v0                v1             v2
LargeOutput                1.85s ± 3%        1.56s ± 3%     1.35s ± 9%
CodeEncoder                394µs ± 1%        408µs ± 2%     379µs ± 1%
CodeEncoderError           437µs ± 1%        450µs ± 1%     423µs ± 1%
AllocationWastage          396µs ± 2%        719µs ± 3%     482µs ± 4%
CodeMarshal                534µs ± 1%        722µs ± 7%     633µs ± 4%
CodeMarshalError           694µs ± 2%        905µs ± 4%     789µs ± 3%
MarshalBytes/32            170ns ± 3%        173ns ± 3%     162ns ± 2%
MarshalBytes/256           387ns ± 2%        456ns ± 1%     378ns ± 2%
MarshalBytes/4096         3.93µs ± 2%       3.98µs ± 1%    3.89µs ± 1%
MarshalBytesError/32       372µs ± 1%        361µs ± 1%     370µs ± 1%
MarshalBytesError/256      372µs ± 1%        363µs ± 1%     371µs ± 1%
MarshalBytesError/4096     379µs ± 1%        369µs ± 3%     377µs ± 2%
EncodeMarshaler           22.8ns ± 2%       25.1ns ±12%    24.7ns ± 5%
EncoderEncode             12.5ns ± 3%       13.2ns ± 1%    15.0ns ±19%

name \ wasted_b/op         v0                v1             v2
AllocationWastage          2.25M ± 0%        0.16M ± 0%     0.16M ± 0%

where:

  • v0 is the existing implementation (neither CL applied),
  • v1 is https://go.dev/cl/455776 (the segmented-buffers approach), and
  • v2 is https://go.dev/cl/471200 (the simpler alternative above).

We see that both v1 and v2 result in low allocation wastage compared to v0; that wastage is the problem we’re trying to solve.

Comparing performance of v2 relative to v1, we get:

name                         old time/op      new time/op   delta
LargeOutput                  1.56s ± 3%       1.35s ± 9%   -13.49%  (p=0.000 n=8+10)
CodeEncoder                  408µs ± 2%       379µs ± 1%    -7.08%  (p=0.000 n=10+9)
CodeEncoderError             450µs ± 1%       423µs ± 1%    -5.92%  (p=0.000 n=8+9)
AllocationWastage            719µs ± 3%       482µs ± 4%   -32.87%  (p=0.000 n=9+10)
CodeMarshal                  722µs ± 7%       633µs ± 4%   -12.37%  (p=0.000 n=9+10)
CodeMarshalError             905µs ± 4%       789µs ± 3%   -12.80%  (p=0.000 n=10+9)
MarshalBytes/32              173ns ± 3%       162ns ± 2%    -6.37%  (p=0.000 n=10+9)
MarshalBytes/256             456ns ± 1%       378ns ± 2%   -17.03%  (p=0.000 n=10+10)
MarshalBytes/4096           3.98µs ± 1%      3.89µs ± 1%    -2.34%  (p=0.000 n=9+10)
MarshalBytesError/32         361µs ± 1%       370µs ± 1%    +2.59%  (p=0.000 n=9+10)
MarshalBytesError/256        363µs ± 1%       371µs ± 1%    +2.22%  (p=0.000 n=10+9)
MarshalBytesError/4096       369µs ± 3%       377µs ± 2%    +1.92%  (p=0.002 n=10+10)
EncodeMarshaler             25.1ns ±12%      24.7ns ± 5%      ~     (p=0.343 n=10+10)
EncoderEncode               13.2ns ± 1%      15.0ns ±19%   +13.39%  (p=0.003 n=8+10)

There’s generally a performance improvement over the more complicated https://go.dev/cl/455776 implementation. The apparent performance degradation of EncoderEncode seems to be largely in the noise.

I feel bad sending out an alternative CL after you (@lavalamp) clearly spent a lot of time on your CL. I’m okay with you taking attribution credit for the CL if you want. After all, my CL would not have come about until you spent the grunt work to prove that segmented buffers were a viable solution.

(The benchmarks for v2 were not run with the optimization to bytes.Join in https://go.dev/cl/456336)

@dsnet I like this. It’s elegant and low cost. What is the reasoning behind dropping <64k buffers?

It’s not dropping buffers under 64KiB; it’s always choosing to recycle them in the pool. That specific code path isn’t strictly necessary, but for buffers below a certain size it probably doesn’t hurt to always recycle them regardless of utilization.

An alternative, which might be helpful if we kept more sophisticated statistics, would be to nil out buf and return the pooledBuffer object to the pool. Then on Get we could pre-allocate a new buf with a better initial size.

Interesting idea. I think a useful statistic to obtain would perhaps be the maximum length encountered over the last N usages. If we evict a buffer due to low utilization, we can allocate a new one based on a recently seen maximum length.
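
Putting those two ideas together, a rough sketch could look like this (the V2 names are made up, and tracking a single maximum over the under-utilized uses is a crude stand-in for a real sliding window over the last N usages):

// Hypothetical extension of the pooledBuffer prototype above: on eviction the
// oversized backing array is dropped, but the object and its statistics go back
// to the pool so the next Get can pre-size a replacement buffer.
type pooledBufferV2 struct {
	buf     []byte
	strikes int // consecutive under-utilized uses
	maxLen  int // largest len(buf) seen across those under-utilized uses
}

var bufferPoolV2 = sync.Pool{
	New: func() interface{} { return new(pooledBufferV2) },
}

func getBufferV2() *pooledBufferV2 {
	p := bufferPoolV2.Get().(*pooledBufferV2)
	if p.buf == nil && p.maxLen > 0 {
		// A previous Put evicted the buffer; start from the recently seen
		// maximum length instead of growing from zero again.
		p.buf = make([]byte, 0, p.maxLen)
		p.maxLen, p.strikes = 0, 0
	}
	return p
}

func putBufferV2(p *pooledBufferV2) {
	switch {
	case cap(p.buf) <= 1<<16: // always recycle buffers up to 64KiB
		p.strikes, p.maxLen = 0, 0
	case cap(p.buf)/2 <= len(p.buf): // at least 50% utilization
		p.strikes, p.maxLen = 0, 0
	case p.strikes < 4:
		p.strikes++
		if len(p.buf) > p.maxLen {
			p.maxLen = len(p.buf) // remember what the small uses actually needed
		}
	default:
		if len(p.buf) > p.maxLen {
			p.maxLen = len(p.buf)
		}
		p.buf = nil // evict the oversized backing array, keep the statistics
	}
	if p.buf != nil {
		p.buf = p.buf[:0]
	}
	bufferPoolV2.Put(p)
}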