jaeger: jaeger-agent reproducible memory leak in 1.21.0

Describe the bug I am observing very high and rapidly increasing memory usage of jaeger-agent, which may be a memory leak. Eventually the agent (container) may run out of memory and crash.

I am able to reproduce the behavior reliably. It happens at very low span rates of <= 30 or 50 spans/sec, according to jaeger_collector_spans_received_total.

I am using a dev setup running Demo ASP.NET Core Webservices, using opentelemetry-dotnet for instrumentation. Since these are dummy projects in a home lab environment, I am able to provide the full source code of the .NET solution if necessary.

Possible Cause & Fix

https://github.com/open-telemetry/opentelemetry-dotnet/issues/1372 It looks like this problem can be fixed by using MaxPayloadSizeInBytes = 65000; which was the default until mid-September.

Is this memory consumption by jaeger-agent expected behavior if a client library misbehaves? Or is this something the jaeger team would like to investigate?


I am observing this behavior running jaeger-all-in-one natively on Windows, in a Linux container on Docker Desktop with WSL 2, and in a Linux Hyper-V VM. At first I was using and blaming Badger local storage; I then switched to Elasticsearch storage. I have now split the deployment into separate agent, collector, and query containers on WSL 2, so I can pinpoint the memory usage to the agent.

The agent is currently not on the same host (localhost) as the instrumented client application, but I tried that as well and the issue happened there too. I will try it again now that I am no longer using the all-in-one image.

The issue does not seem to occur under very light load. I am curl’ing my services to generate spans. At first memory is stable and low; I then started curl’ing in 10 parallel loops to increase the span creation rate.

After a few minutes the agent’s memory jumps from < 50 MB to > 2 GB and then > 5 GB. The container currently has a hard memory limit of mem_limit: 8000m.

(screenshot) At the moment it sits “stable” at 4.6 GB, but I have seen it go beyond 8 GB as well.

A symptom, or maybe the cause, of this is that errors start to appear in the agent logs. With up to about 3 curl loops running there are no log messages; with a few more requests, these start trickling in:

jaeger-agent        | {"level":"error","ts":1605909716.7062023,"caller":"processors/thrift_processor.go:123","msg":"Processor failed","error":"*jaeger.Batch error reading struct: *jaeger.Span error reading struct: error reading list begin: don't know what type: 14","stacktrace":"github.com/jaegertracing/jaeger/cmd/agent/app/processors.(*ThriftProcessor).processBuffer\n\tgithub.com/jaegertracing/jaeger/cmd/agent/app/processors/thrift_processor.go:123\ngithub.com/jaegertracing/jaeger/cmd/agent/app/processors.NewThriftProcessor.func2\n\tgithub.com/jaegertracing/jaeger/cmd/agent/app/processors/thrift_processor.go:87"}
jaeger-agent        | {"level":"error","ts":1605909751.7633333,"caller":"processors/thrift_processor.go:123","msg":"Processor failed","error":"*jaeger.Batch error reading struct: *jaeger.Span error reading struct: *jaeger.Span field 2 read error: don't know what type: 15","stacktrace":"github.com/jaegertracing/jaeger/cmd/agent/app/processors.(*ThriftProcessor).processBuffer\n\tgithub.com/jaegertracing/jaeger/cmd/agent/app/processors/thrift_processor.go:123\ngithub.com/jaegertracing/jaeger/cmd/agent/app/processors.NewThriftProcessor.func2\n\tgithub.com/jaegertracing/jaeger/cmd/agent/app/processors/thrift_processor.go:87"}
jaeger-agent        | {"level":"error","ts":1605909761.783476,"caller":"processors/thrift_processor.go:123","msg":"Processor failed","error":"*jaeger.Batch error reading struct: *jaeger.Span error reading struct: Required field TraceIdLow is not set","stacktrace":"github.com/jaegertracing/jaeger/cmd/agent/app/processors.(*ThriftProcessor).processBuffer\n\tgithub.com/jaegertracing/jaeger/cmd/agent/app/processors/thrift_processor.go:123\ngithub.com/jaegertracing/jaeger/cmd/agent/app/processors.NewThriftProcessor.func2\n\tgithub.com/jaegertracing/jaeger/cmd/agent/app/processors/thrift_processor.go:87"}      
jaeger-agent        | {"level":"error","ts":1605909771.80936,"caller":"processors/thrift_processor.go:123","msg":"Processor failed","error":"*jaeger.Batch error reading struct: *jaeger.Span error reading 
struct: don't know what type: 14","stacktrace":"github.com/jaegertracing/jaeger/cmd/agent/app/processors.(*ThriftProcessor).processBuffer\n\tgithub.com/jaegertracing/jaeger/cmd/agent/app/processors/thrift_processor.go:123\ngithub.com/jaegertracing/jaeger/cmd/agent/app/processors.NewThriftProcessor.func2\n\tgithub.com/jaegertracing/jaeger/cmd/agent/app/processors/thrift_processor.go:87"}
jaeger-agent        | {"level":"error","ts":1605909791.8287015,"caller":"processors/thrift_processor.go:123","msg":"Processor failed","error":"*jaeger.Batch error reading struct: *jaeger.Span error reading struct: *jaeger.Span field 8 read error: don't know what type: 14","stacktrace":"github.com/jaegertracing/jaeger/cmd/agent/app/processors.(*ThriftProcessor).processBuffer\n\tgithub.com/jaegertracing/jaeger/cmd/agent/app/processors/thrift_processor.go:123\ngithub.com/jaegertracing/jaeger/cmd/agent/app/processors.NewThriftProcessor.func2\n\tgithub.com/jaegertracing/jaeger/cmd/agent/app/processors/thrift_processor.go:87"}

These 5 log errors correlate with metric

jaeger_agent_thrift_udp_t_processor_handler_errors_total{protocol="compact",model="jaeger"}

(screenshot)

I am not sure whether the instrumentation library is to blame and this is a concurrency issue there. If requests are reduced, no more log errors occur and memory is stable (16 MB).

When the request load is increased, the error rate rises again, and at some point memory jumps by a few gigabytes:

(screenshot)

I took 300 log lines and de-duplicated them a bit to these 50:

":"processors/thrift_processor.go:123","msg":"Processor failed","error":"*jaeger.Batch error reading struct: *jaeger.Span error reading struct: *jaeger.Span field 6 read error: don't know what type: 13","stacktrace":"github.com/jaegertracing/jaeger/cmd/agent/app/processors.(*ThriftProcessor).processBuffer\n\tgithub.com/jaegertracing/jaeger/cmd/agent/app/processors/thrift_processor.go:123\ngithub.com/jaegertracing/jaeger/cmd/agent/app/processors.NewThriftProcessor.func2\n\tgithub.com/jaegertracing/jaeger/cmd/agent/app/processors/thrift_processor.go:87"}
":"processors/thrift_processor.go:123","msg":"Processor failed","error":"*jaeger.Batch error reading struct: *jaeger.Span error reading struct: Required field TraceIdLow is not set","stacktrace":"github.com/jaegertracing/jaeger/cmd/agent/app/processors.(*ThriftProcessor).processBuffer\n\tgithub.com/jaegertracing/jaeger/cmd/agent/app/processors/thrift_processor.go:123\ngithub.com/jaegertracing/jaeger/cmd/agent/app/processors.NewThriftProcessor.func2\n\tgithub.com/jaegertracing/jaeger/cmd/agent/app/processors/thrift_processor.go:87"}
":"processors/thrift_processor.go:123","msg":"Processor failed","error":"*jaeger.Batch error reading struct: *jaeger.Span error reading struct: *jaeger.Log error reading struct: Required field Timestamp is not set","stacktrace":"github.com/jaegertracing/jaeger/cmd/agent/app/processors.(*ThriftProcessor).processBuffer\n\tgithub.com/jaegertracing/jaeger/cmd/agent/app/processors/thrift_processor.go:123\ngithub.com/jaegertracing/jaeger/cmd/agent/app/processors.NewThriftProcessor.func2\n\tgithub.com/jaegertracing/jaeger/cmd/agent/app/processors/thrift_processor.go:87"}
":"processors/thrift_processor.go:123","msg":"Processor failed","error":"*jaeger.Batch error reading struct: *jaeger.Span error reading struct: Invalid data length","stacktrace":"github.com/jaegertracing/jaeger/cmd/agent/app/processors.(*ThriftProcessor).processBuffer\n\tgithub.com/jaegertracing/jaeger/cmd/agent/app/processors/thrift_processor.go:123\ngithub.com/jaegertracing/jaeger/cmd/agent/app/processors.NewThriftProcessor.func2\n\tgithub.com/jaegertracing/jaeger/cmd/agent/app/processors/thrift_processor.go:87"}
":"processors/thrift_processor.go:123","msg":"Processor failed","error":"*jaeger.Batch error reading struct: *jaeger.Span error reading struct: error reading list begin: don't know what type: 15","stacktrace":"github.com/jaegertracing/jaeger/cmd/agent/app/processors.(*ThriftProcessor).processBuffer\n\tgithub.com/jaegertracing/jaeger/cmd/agent/app/processors/thrift_processor.go:123\ngithub.com/jaegertracing/jaeger/cmd/agent/app/processors.NewThriftProcessor.func2\n\tgithub.com/jaegertracing/jaeger/cmd/agent/app/processors/thrift_processor.go:87"}
":"processors/thrift_processor.go:123","msg":"Processor failed","error":"*jaeger.Batch error reading struct: *jaeger.Span error reading struct: *jaeger.Span field 8 read error: don't know what type: 13","stacktrace":"github.com/jaegertracing/jaeger/cmd/agent/app/processors.(*ThriftProcessor).processBuffer\n\tgithub.com/jaegertracing/jaeger/cmd/agent/app/processors/thrift_processor.go:123\ngithub.com/jaegertracing/jaeger/cmd/agent/app/processors.NewThriftProcessor.func2\n\tgithub.com/jaegertracing/jaeger/cmd/agent/app/processors/thrift_processor.go:87"}
":"processors/thrift_processor.go:123","msg":"Processor failed","error":"*jaeger.Batch error reading struct: *jaeger.Span error reading struct: *jaeger.Span field 9 read error: don't know what type: 14","stacktrace":"github.com/jaegertracing/jaeger/cmd/agent/app/processors.(*ThriftProcessor).processBuffer\n\tgithub.com/jaegertracing/jaeger/cmd/agent/app/processors/thrift_processor.go:123\ngithub.com/jaegertracing/jaeger/cmd/agent/app/processors.NewThriftProcessor.func2\n\tgithub.com/jaegertracing/jaeger/cmd/agent/app/processors/thrift_processor.go:87"}
":"processors/thrift_processor.go:123","msg":"Processor failed","error":"*jaeger.Batch error reading struct: *jaeger.Span error reading struct: *jaeger.Span field 8 read error: don't know what type: 15","stacktrace":"github.com/jaegertracing/jaeger/cmd/agent/app/processors.(*ThriftProcessor).processBuffer\n\tgithub.com/jaegertracing/jaeger/cmd/agent/app/processors/thrift_processor.go:123\ngithub.com/jaegertracing/jaeger/cmd/agent/app/processors.NewThriftProcessor.func2\n\tgithub.com/jaegertracing/jaeger/cmd/agent/app/processors/thrift_processor.go:87"}
":"processors/thrift_processor.go:123","msg":"Processor failed","error":"*jaeger.Batch error reading struct: *jaeger.Span error reading struct: *jaeger.Span field 11 read error: don't know what type: 13","stacktrace":"github.com/jaegertracing/jaeger/cmd/agent/app/processors.(*ThriftProcessor).processBuffer\n\tgithub.com/jaegertracing/jaeger/cmd/agent/app/processors/thrift_processor.go:123\ngithub.com/jaegertracing/jaeger/cmd/agent/app/processors.NewThriftProcessor.func2\n\tgithub.com/jaegertracing/jaeger/cmd/agent/app/processors/thrift_processor.go:87"}
":"processors/thrift_processor.go:123","msg":"Processor failed","error":"*jaeger.Batch error reading struct: *jaeger.Span error reading struct: Unknown data type 0","stacktrace":"github.com/jaegertracing/jaeger/cmd/agent/app/processors.(*ThriftProcessor).processBuffer\n\tgithub.com/jaegertracing/jaeger/cmd/agent/app/processors/thrift_processor.go:123\ngithub.com/jaegertracing/jaeger/cmd/agent/app/processors.NewThriftProcessor.func2\n\tgithub.com/jaegertracing/jaeger/cmd/agent/app/processors/thrift_processor.go:87"}
":"processors/thrift_processor.go:123","msg":"Processor failed","error":"*jaeger.Batch error reading struct: *jaeger.Span error reading struct: *jaeger.Span field 8 read error: don't know what type: 14","stacktrace":"github.com/jaegertracing/jaeger/cmd/agent/app/processors.(*ThriftProcessor).processBuffer\n\tgithub.com/jaegertracing/jaeger/cmd/agent/app/processors/thrift_processor.go:123\ngithub.com/jaegertracing/jaeger/cmd/agent/app/processors.NewThriftProcessor.func2\n\tgithub.com/jaegertracing/jaeger/cmd/agent/app/processors/thrift_processor.go:87"}
":"processors/thrift_processor.go:123","msg":"Processor failed","error":"*jaeger.Batch error reading struct: *jaeger.Span error reading struct: *jaeger.Span field 2 read error: don't know what type: 15","stacktrace":"github.com/jaegertracing/jaeger/cmd/agent/app/processors.(*ThriftProcessor).processBuffer\n\tgithub.com/jaegertracing/jaeger/cmd/agent/app/processors/thrift_processor.go:123\ngithub.com/jaegertracing/jaeger/cmd/agent/app/processors.NewThriftProcessor.func2\n\tgithub.com/jaegertracing/jaeger/cmd/agent/app/processors/thrift_processor.go:87"}
":"processors/thrift_processor.go:123","msg":"Processor failed","error":"*jaeger.Batch error reading struct: *jaeger.Span error reading struct: *jaeger.Span field 2 read error: don't know what type: 14","stacktrace":"github.com/jaegertracing/jaeger/cmd/agent/app/processors.(*ThriftProcessor).processBuffer\n\tgithub.com/jaegertracing/jaeger/cmd/agent/app/processors/thrift_processor.go:123\ngithub.com/jaegertracing/jaeger/cmd/agent/app/processors.NewThriftProcessor.func2\n\tgithub.com/jaegertracing/jaeger/cmd/agent/app/processors/thrift_processor.go:87"}
":"processors/thrift_processor.go:123","msg":"Processor failed","error":"*jaeger.Batch error reading struct: *jaeger.Span error reading struct: *jaeger.Span field 6 read error: don't know what type: 15","stacktrace":"github.com/jaegertracing/jaeger/cmd/agent/app/processors.(*ThriftProcessor).processBuffer\n\tgithub.com/jaegertracing/jaeger/cmd/agent/app/processors/thrift_processor.go:123\ngithub.com/jaegertracing/jaeger/cmd/agent/app/processors.NewThriftProcessor.func2\n\tgithub.com/jaegertracing/jaeger/cmd/agent/app/processors/thrift_processor.go:87"}
":"processors/thrift_processor.go:123","msg":"Processor failed","error":"*jaeger.Batch error reading struct: *jaeger.Span error reading struct: *jaeger.Span field 23 read error: don't know what type: 13","stacktrace":"github.com/jaegertracing/jaeger/cmd/agent/app/processors.(*ThriftProcessor).processBuffer\n\tgithub.com/jaegertracing/jaeger/cmd/agent/app/processors/thrift_processor.go:123\ngithub.com/jaegertracing/jaeger/cmd/agent/app/processors.NewThriftProcessor.func2\n\tgithub.com/jaegertracing/jaeger/cmd/agent/app/processors/thrift_processor.go:87"}
":"processors/thrift_processor.go:123","msg":"Processor failed","error":"*jaeger.Batch error reading struct: *jaeger.Span error reading struct: *jaeger.Span field 12 read error: don't know what type: 15","stacktrace":"github.com/jaegertracing/jaeger/cmd/agent/app/processors.(*ThriftProcessor).processBuffer\n\tgithub.com/jaegertracing/jaeger/cmd/agent/app/processors/thrift_processor.go:123\ngithub.com/jaegertracing/jaeger/cmd/agent/app/processors.NewThriftProcessor.func2\n\tgithub.com/jaegertracing/jaeger/cmd/agent/app/processors/thrift_processor.go:87"}
":"processors/thrift_processor.go:123","msg":"Processor failed","error":"*jaeger.Batch error reading struct: *jaeger.Span error reading struct: don't know what type: 15","stacktrace":"github.com/jaegertracing/jaeger/cmd/agent/app/processors.(*ThriftProcessor).processBuffer\n\tgithub.com/jaegertracing/jaeger/cmd/agent/app/processors/thrift_processor.go:123\ngithub.com/jaegertracing/jaeger/cmd/agent/app/processors.NewThriftProcessor.func2\n\tgithub.com/jaegertracing/jaeger/cmd/agent/app/processors/thrift_processor.go:87"}
":"processors/thrift_processor.go:123","msg":"Processor failed","error":"*jaeger.Batch error reading struct: *jaeger.Span error reading struct: don't know what type: 14","stacktrace":"github.com/jaegertracing/jaeger/cmd/agent/app/processors.(*ThriftProcessor).processBuffer\n\tgithub.com/jaegertracing/jaeger/cmd/agent/app/processors/thrift_processor.go:123\ngithub.com/jaegertracing/jaeger/cmd/agent/app/processors.NewThriftProcessor.func2\n\tgithub.com/jaegertracing/jaeger/cmd/agent/app/processors/thrift_processor.go:87"}
":"processors/thrift_processor.go:123","msg":"Processor failed","error":"*jaeger.Batch error reading struct: *jaeger.Span error reading struct: *jaeger.Span field 15 read error: don't know what type: 14","stacktrace":"github.com/jaegertracing/jaeger/cmd/agent/app/processors.(*ThriftProcessor).processBuffer\n\tgithub.com/jaegertracing/jaeger/cmd/agent/app/processors/thrift_processor.go:123\ngithub.com/jaegertracing/jaeger/cmd/agent/app/processors.NewThriftProcessor.func2\n\tgithub.com/jaegertracing/jaeger/cmd/agent/app/processors/thrift_processor.go:87"}
":"processors/thrift_processor.go:123","msg":"Processor failed","error":"*jaeger.Batch error reading struct: *jaeger.Span error reading struct: EOF","stacktrace":"github.com/jaegertracing/jaeger/cmd/agent/app/processors.(*ThriftProcessor).processBuffer\n\tgithub.com/jaegertracing/jaeger/cmd/agent/app/processors/thrift_processor.go:123\ngithub.com/jaegertracing/jaeger/cmd/agent/app/processors.NewThriftProcessor.func2\n\tgithub.com/jaegertracing/jaeger/cmd/agent/app/processors/thrift_processor.go:87"}
":"processors/thrift_processor.go:123","msg":"Processor failed","error":"*jaeger.Batch error reading struct: *jaeger.Span error reading struct: *jaeger.Span field 22 read error: don't know what type: 13","stacktrace":"github.com/jaegertracing/jaeger/cmd/agent/app/processors.(*ThriftProcessor).processBuffer\n\tgithub.com/jaegertracing/jaeger/cmd/agent/app/processors/thrift_processor.go:123\ngithub.com/jaegertracing/jaeger/cmd/agent/app/processors.NewThriftProcessor.func2\n\tgithub.com/jaegertracing/jaeger/cmd/agent/app/processors/thrift_processor.go:87"}
":"processors/thrift_processor.go:123","msg":"Processor failed","error":"*jaeger.Batch error reading struct: *jaeger.Span error reading struct: *jaeger.Span field 6 read error: don't know what type: 14","stacktrace":"github.com/jaegertracing/jaeger/cmd/agent/app/processors.(*ThriftProcessor).processBuffer\n\tgithub.com/jaegertracing/jaeger/cmd/agent/app/processors/thrift_processor.go:123\ngithub.com/jaegertracing/jaeger/cmd/agent/app/processors.NewThriftProcessor.func2\n\tgithub.com/jaegertracing/jaeger/cmd/agent/app/processors/thrift_processor.go:87"}
":"processors/thrift_processor.go:123","msg":"Processor failed","error":"*jaeger.Batch error reading struct: *jaeger.Span error reading struct: *jaeger.SpanRef error reading struct: don't know what type: 14","stacktrace":"github.com/jaegertracing/jaeger/cmd/agent/app/processors.(*ThriftProcessor).processBuffer\n\tgithub.com/jaegertracing/jaeger/cmd/agent/app/processors/thrift_processor.go:123\ngithub.com/jaegertracing/jaeger/cmd/agent/app/processors.NewThriftProcessor.func2\n\tgithub.com/jaegertracing/jaeger/cmd/agent/app/processors/thrift_processor.go:87"}
":"processors/thrift_processor.go:123","msg":"Processor failed","error":"*jaeger.Batch error reading struct: *jaeger.Span error reading struct: *jaeger.SpanRef error reading struct: *jaeger.SpanRef field 6 read error: don't know what type: 14","stacktrace":"github.com/jaegertracing/jaeger/cmd/agent/app/processors.(*ThriftProcessor).processBuffer\n\tgithub.com/jaegertracing/jaeger/cmd/agent/app/processors/thrift_processor.go:123\ngithub.com/jaegertracing/jaeger/cmd/agent/app/processors.NewThriftProcessor.func2\n\tgithub.com/jaegertracing/jaeger/cmd/agent/app/processors/thrift_processor.go:87"}
":"processors/thrift_processor.go:123","msg":"Processor failed","error":"*jaeger.Batch error reading struct: *jaeger.Span error reading struct: *jaeger.Span field -24 read error: don't know what type: 14","stacktrace":"github.com/jaegertracing/jaeger/cmd/agent/app/processors.(*ThriftProcessor).processBuffer\n\tgithub.com/jaegertracing/jaeger/cmd/agent/app/processors/thrift_processor.go:123\ngithub.com/jaegertracing/jaeger/cmd/agent/app/processors.NewThriftProcessor.func2\n\tgithub.com/jaegertracing/jaeger/cmd/agent/app/processors/thrift_processor.go:87"}
":"processors/thrift_processor.go:123","msg":"Processor failed","error":"*jaeger.Batch error reading struct: *jaeger.Span error reading struct: *jaeger.Span field 25 read error: don't know what type: 14","stacktrace":"github.com/jaegertracing/jaeger/cmd/agent/app/processors.(*ThriftProcessor).processBuffer\n\tgithub.com/jaegertracing/jaeger/cmd/agent/app/processors/thrift_processor.go:123\ngithub.com/jaegertracing/jaeger/cmd/agent/app/processors.NewThriftProcessor.func2\n\tgithub.com/jaegertracing/jaeger/cmd/agent/app/processors/thrift_processor.go:87"}
":"processors/thrift_processor.go:123","msg":"Processor failed","error":"*jaeger.Batch error reading struct: *jaeger.Span error reading struct: *jaeger.Span field 12 read error: don't know what type: 13","stacktrace":"github.com/jaegertracing/jaeger/cmd/agent/app/processors.(*ThriftProcessor).processBuffer\n\tgithub.com/jaegertracing/jaeger/cmd/agent/app/processors/thrift_processor.go:123\ngithub.com/jaegertracing/jaeger/cmd/agent/app/processors.NewThriftProcessor.func2\n\tgithub.com/jaegertracing/jaeger/cmd/agent/app/processors/thrift_processor.go:87"}
":"processors/thrift_processor.go:123","msg":"Processor failed","error":"*jaeger.Batch error reading struct: *jaeger.Span error reading struct: *jaeger.Span field 20 read error: don't know what type: 15","stacktrace":"github.com/jaegertracing/jaeger/cmd/agent/app/processors.(*ThriftProcessor).processBuffer\n\tgithub.com/jaegertracing/jaeger/cmd/agent/app/processors/thrift_processor.go:123\ngithub.com/jaegertracing/jaeger/cmd/agent/app/processors.NewThriftProcessor.func2\n\tgithub.com/jaegertracing/jaeger/cmd/agent/app/processors/thrift_processor.go:87"}
":"processors/thrift_processor.go:123","msg":"Processor failed","error":"*jaeger.Batch error reading struct: *jaeger.Span error reading struct: *jaeger.Span field 33 read error: don't know what type: 14","stacktrace":"github.com/jaegertracing/jaeger/cmd/agent/app/processors.(*ThriftProcessor).processBuffer\n\tgithub.com/jaegertracing/jaeger/cmd/agent/app/processors/thrift_processor.go:123\ngithub.com/jaegertracing/jaeger/cmd/agent/app/processors.NewThriftProcessor.func2\n\tgithub.com/jaegertracing/jaeger/cmd/agent/app/processors/thrift_processor.go:87"}
":"processors/thrift_processor.go:123","msg":"Processor failed","error":"*jaeger.Batch error reading struct: *jaeger.Span error reading struct: *jaeger.Span field 14 read error: don't know what type: 14","stacktrace":"github.com/jaegertracing/jaeger/cmd/agent/app/processors.(*ThriftProcessor).processBuffer\n\tgithub.com/jaegertracing/jaeger/cmd/agent/app/processors/thrift_processor.go:123\ngithub.com/jaegertracing/jaeger/cmd/agent/app/processors.NewThriftProcessor.func2\n\tgithub.com/jaegertracing/jaeger/cmd/agent/app/processors/thrift_processor.go:87"}
":"processors/thrift_processor.go:123","msg":"Processor failed","error":"*jaeger.Batch error reading struct: *jaeger.Span error reading struct: *jaeger.Span field 5 read error: don't know what type: 14","stacktrace":"github.com/jaegertracing/jaeger/cmd/agent/app/processors.(*ThriftProcessor).processBuffer\n\tgithub.com/jaegertracing/jaeger/cmd/agent/app/processors/thrift_processor.go:123\ngithub.com/jaegertracing/jaeger/cmd/agent/app/processors.NewThriftProcessor.func2\n\tgithub.com/jaegertracing/jaeger/cmd/agent/app/processors/thrift_processor.go:87"}
":"processors/thrift_processor.go:123","msg":"Processor failed","error":"*jaeger.Batch error reading struct: *jaeger.Span error reading struct: error reading list begin: don't know what type: 14","stacktrace":"github.com/jaegertracing/jaeger/cmd/agent/app/processors.(*ThriftProcessor).processBuffer\n\tgithub.com/jaegertracing/jaeger/cmd/agent/app/processors/thrift_processor.go:123\ngithub.com/jaegertracing/jaeger/cmd/agent/app/processors.NewThriftProcessor.func2\n\tgithub.com/jaegertracing/jaeger/cmd/agent/app/processors/thrift_processor.go:87"}
":"processors/thrift_processor.go:123","msg":"Processor failed","error":"*jaeger.Batch error reading struct: *jaeger.Span error reading struct: *jaeger.SpanRef error reading struct: *jaeger.SpanRef field 5 read error: don't know what type: 14","stacktrace":"github.com/jaegertracing/jaeger/cmd/agent/app/processors.(*ThriftProcessor).processBuffer\n\tgithub.com/jaegertracing/jaeger/cmd/agent/app/processors/thrift_processor.go:123\ngithub.com/jaegertracing/jaeger/cmd/agent/app/processors.NewThriftProcessor.func2\n\tgithub.com/jaegertracing/jaeger/cmd/agent/app/processors/thrift_processor.go:87"}
":"processors/thrift_processor.go:123","msg":"Processor failed","error":"*jaeger.Batch error reading struct: *jaeger.Span error reading struct: *jaeger.Span field 10 read error: don't know what type: 14","stacktrace":"github.com/jaegertracing/jaeger/cmd/agent/app/processors.(*ThriftProcessor).processBuffer\n\tgithub.com/jaegertracing/jaeger/cmd/agent/app/processors/thrift_processor.go:123\ngithub.com/jaegertracing/jaeger/cmd/agent/app/processors.NewThriftProcessor.func2\n\tgithub.com/jaegertracing/jaeger/cmd/agent/app/processors/thrift_processor.go:87"}
":"processors/thrift_processor.go:123","msg":"Processor failed","error":"*jaeger.Batch error reading struct: *jaeger.Span error reading struct: *jaeger.Span field 13 read error: don't know what type: 15","stacktrace":"github.com/jaegertracing/jaeger/cmd/agent/app/processors.(*ThriftProcessor).processBuffer\n\tgithub.com/jaegertracing/jaeger/cmd/agent/app/processors/thrift_processor.go:123\ngithub.com/jaegertracing/jaeger/cmd/agent/app/processors.NewThriftProcessor.func2\n\tgithub.com/jaegertracing/jaeger/cmd/agent/app/processors/thrift_processor.go:87"}
":"processors/thrift_processor.go:123","msg":"Processor failed","error":"*jaeger.Batch error reading struct: *jaeger.Span error reading struct: *jaeger.Span field 12 read error: don't know what type: 14","stacktrace":"github.com/jaegertracing/jaeger/cmd/agent/app/processors.(*ThriftProcessor).processBuffer\n\tgithub.com/jaegertracing/jaeger/cmd/agent/app/processors/thrift_processor.go:123\ngithub.com/jaegertracing/jaeger/cmd/agent/app/processors.NewThriftProcessor.func2\n\tgithub.com/jaegertracing/jaeger/cmd/agent/app/processors/thrift_processor.go:87"}
":"processors/thrift_processor.go:123","msg":"Processor failed","error":"*jaeger.Batch error reading struct: *jaeger.Span error reading struct: *jaeger.Span field 13 read error: don't know what type: 14","stacktrace":"github.com/jaegertracing/jaeger/cmd/agent/app/processors.(*ThriftProcessor).processBuffer\n\tgithub.com/jaegertracing/jaeger/cmd/agent/app/processors/thrift_processor.go:123\ngithub.com/jaegertracing/jaeger/cmd/agent/app/processors.NewThriftProcessor.func2\n\tgithub.com/jaegertracing/jaeger/cmd/agent/app/processors/thrift_processor.go:87"}
":"processors/thrift_processor.go:123","msg":"Processor failed","error":"*jaeger.Batch error reading struct: *jaeger.Span error reading struct: *jaeger.SpanRef error reading struct: Required field RefType is not set","stacktrace":"github.com/jaegertracing/jaeger/cmd/agent/app/processors.(*ThriftProcessor).processBuffer\n\tgithub.com/jaegertracing/jaeger/cmd/agent/app/processors/thrift_processor.go:123\ngithub.com/jaegertracing/jaeger/cmd/agent/app/processors.NewThriftProcessor.func2\n\tgithub.com/jaegertracing/jaeger/cmd/agent/app/processors/thrift_processor.go:87"}
":"processors/thrift_processor.go:123","msg":"Processor failed","error":"*jaeger.Batch error reading struct: *jaeger.Span error reading struct: *jaeger.Span field 4 read error: don't know what type: 14","stacktrace":"github.com/jaegertracing/jaeger/cmd/agent/app/processors.(*ThriftProcessor).processBuffer\n\tgithub.com/jaegertracing/jaeger/cmd/agent/app/processors/thrift_processor.go:123\ngithub.com/jaegertracing/jaeger/cmd/agent/app/processors.NewThriftProcessor.func2\n\tgithub.com/jaegertracing/jaeger/cmd/agent/app/processors/thrift_processor.go:87"}
":"processors/thrift_processor.go:123","msg":"Processor failed","error":"*jaeger.Batch error reading struct: *jaeger.Span error reading struct: *jaeger.Span field 10 read error: don't know what type: 13","stacktrace":"github.com/jaegertracing/jaeger/cmd/agent/app/processors.(*ThriftProcessor).processBuffer\n\tgithub.com/jaegertracing/jaeger/cmd/agent/app/processors/thrift_processor.go:123\ngithub.com/jaegertracing/jaeger/cmd/agent/app/processors.NewThriftProcessor.func2\n\tgithub.com/jaegertracing/jaeger/cmd/agent/app/processors/thrift_processor.go:87"}
":"processors/thrift_processor.go:123","msg":"Processor failed","error":"*jaeger.Batch error reading struct: *jaeger.Span error reading struct: *jaeger.Span field 25 read error: don't know what type: 15","stacktrace":"github.com/jaegertracing/jaeger/cmd/agent/app/processors.(*ThriftProcessor).processBuffer\n\tgithub.com/jaegertracing/jaeger/cmd/agent/app/processors/thrift_processor.go:123\ngithub.com/jaegertracing/jaeger/cmd/agent/app/processors.NewThriftProcessor.func2\n\tgithub.com/jaegertracing/jaeger/cmd/agent/app/processors/thrift_processor.go:87"}
":"processors/thrift_processor.go:123","msg":"Processor failed","error":"*jaeger.Batch error reading struct: *jaeger.Span error reading struct: *jaeger.Span field 16 read error: don't know what type: 15","stacktrace":"github.com/jaegertracing/jaeger/cmd/agent/app/processors.(*ThriftProcessor).processBuffer\n\tgithub.com/jaegertracing/jaeger/cmd/agent/app/processors/thrift_processor.go:123\ngithub.com/jaegertracing/jaeger/cmd/agent/app/processors.NewThriftProcessor.func2\n\tgithub.com/jaegertracing/jaeger/cmd/agent/app/processors/thrift_processor.go:87"}
":"processors/thrift_processor.go:123","msg":"Processor failed","error":"*jaeger.Batch error reading struct: *jaeger.Span error reading struct: *jaeger.Span field 13 read error: don't know what type: 13","stacktrace":"github.com/jaegertracing/jaeger/cmd/agent/app/processors.(*ThriftProcessor).processBuffer\n\tgithub.com/jaegertracing/jaeger/cmd/agent/app/processors/thrift_processor.go:123\ngithub.com/jaegertracing/jaeger/cmd/agent/app/processors.NewThriftProcessor.func2\n\tgithub.com/jaegertracing/jaeger/cmd/agent/app/processors/thrift_processor.go:87"}
":"processors/thrift_processor.go:123","msg":"Processor failed","error":"*jaeger.Batch error reading struct: *jaeger.Span error reading struct: *jaeger.Span field 5 read error: don't know what type: 15","stacktrace":"github.com/jaegertracing/jaeger/cmd/agent/app/processors.(*ThriftProcessor).processBuffer\n\tgithub.com/jaegertracing/jaeger/cmd/agent/app/processors/thrift_processor.go:123\ngithub.com/jaegertracing/jaeger/cmd/agent/app/processors.NewThriftProcessor.func2\n\tgithub.com/jaegertracing/jaeger/cmd/agent/app/processors/thrift_processor.go:87"}
":"processors/thrift_processor.go:123","msg":"Processor failed","error":"*jaeger.Batch error reading struct: *jaeger.Span error reading struct: *jaeger.Span field 18 read error: don't know what type: 13","stacktrace":"github.com/jaegertracing/jaeger/cmd/agent/app/processors.(*ThriftProcessor).processBuffer\n\tgithub.com/jaegertracing/jaeger/cmd/agent/app/processors/thrift_processor.go:123\ngithub.com/jaegertracing/jaeger/cmd/agent/app/processors.NewThriftProcessor.func2\n\tgithub.com/jaegertracing/jaeger/cmd/agent/app/processors/thrift_processor.go:87"}
":"processors/thrift_processor.go:123","msg":"Processor failed","error":"*jaeger.Batch error reading struct: *jaeger.Span error reading struct: *jaeger.Span field 11 read error: don't know what type: 15","stacktrace":"github.com/jaegertracing/jaeger/cmd/agent/app/processors.(*ThriftProcessor).processBuffer\n\tgithub.com/jaegertracing/jaeger/cmd/agent/app/processors/thrift_processor.go:123\ngithub.com/jaegertracing/jaeger/cmd/agent/app/processors.NewThriftProcessor.func2\n\tgithub.com/jaegertracing/jaeger/cmd/agent/app/processors/thrift_processor.go:87"}
":"processors/thrift_processor.go:123","msg":"Processor failed","error":"*jaeger.Batch error reading struct: *jaeger.Span error reading struct: *jaeger.Log error reading struct: *jaeger.Log field 9 read error: don't know what type: 13","stacktrace":"github.com/jaegertracing/jaeger/cmd/agent/app/processors.(*ThriftProcessor).processBuffer\n\tgithub.com/jaegertracing/jaeger/cmd/agent/app/processors/thrift_processor.go:123\ngithub.com/jaegertracing/jaeger/cmd/agent/app/processors.NewThriftProcessor.func2\n\tgithub.com/jaegertracing/jaeger/cmd/agent/app/processors/thrift_processor.go:87"}
":"processors/thrift_processor.go:123","msg":"Processor failed","error":"*jaeger.Batch error reading struct: *jaeger.Span error reading struct: *jaeger.Span field 19 read error: don't know what type: 14","stacktrace":"github.com/jaegertracing/jaeger/cmd/agent/app/processors.(*ThriftProcessor).processBuffer\n\tgithub.com/jaegertracing/jaeger/cmd/agent/app/processors/thrift_processor.go:123\ngithub.com/jaegertracing/jaeger/cmd/agent/app/processors.NewThriftProcessor.func2\n\tgithub.com/jaegertracing/jaeger/cmd/agent/app/processors/thrift_processor.go:87"}
":"processors/thrift_processor.go:123","msg":"Processor failed","error":"*jaeger.Batch error reading struct: *jaeger.Span error reading struct: *jaeger.Span field 25 read error: don't know what type: 13","stacktrace":"github.com/jaegertracing/jaeger/cmd/agent/app/processors.(*ThriftProcessor).processBuffer\n\tgithub.com/jaegertracing/jaeger/cmd/agent/app/processors/thrift_processor.go:123\ngithub.com/jaegertracing/jaeger/cmd/agent/app/processors.NewThriftProcessor.func2\n\tgithub.com/jaegertracing/jaeger/cmd/agent/app/processors/thrift_processor.go:87"}
":"processors/thrift_processor.go:123","msg":"Processor failed","error":"*jaeger.Batch error reading struct: *jaeger.Span error reading struct: *jaeger.Span field 22 read error: don't know what type: 14","stacktrace":"github.com/jaegertracing/jaeger/cmd/agent/app/processors.(*ThriftProcessor).processBuffer\n\tgithub.com/jaegertracing/jaeger/cmd/agent/app/processors/thrift_processor.go:123\ngithub.com/jaegertracing/jaeger/cmd/agent/app/processors.NewThriftProcessor.func2\n\tgithub.com/jaegertracing/jaeger/cmd/agent/app/processors/thrift_processor.go:87"}

To Reproduce Steps to reproduce the behavior:

  1. Run ASP.NET Core 3.1 Demo Solution using opentelemetry instrumentation
  2. Create load to generate spans
  3. jaeger-agent memory increases by several gigabytes and the agent may crash.

If the logs / other steps are not enough, I should be able to provide you with a ready-to-run docker-compose setup, but it will take me some time.

Expected behavior jaeger-agent should not consume this much memory under such a light load.

Screenshots See above.

Version (please complete the following information):

  • OS: Windows 10 2020H2 + Docker Desktop 2.5.0.1 on WSL 2, or natively on Windows.
  • Jaeger version: 1.20.0 and 1.20.1
  • Deployment: agent on bare-metal Windows, or in a WSL 2 Linux container, or in a Linux VM. The agent is currently not on the same host as the instrumented client application, but I tried that as well and the issue happened there too.
  • opentelemetry-dotnet version 1.0.0-RC1 or 0.8.0-beta

What troubleshooting steps did you try? Try to follow https://www.jaegertracing.io/docs/latest/troubleshooting/ and describe how far you were able to progress and/or which steps did not work.

I did enable debug logging for the agent and collector; files are attached. For this run, agent memory only increased up to 1.9 GB. jaeger-agent-debug.log jaeger-collector-debug.log

I am not sure which other steps would apply. I am not a Go dev, so using debug images would be of no use to me.

Additional context

About this issue

  • Original URL
  • State: closed
  • Created 4 years ago
  • Reactions: 6
  • Comments: 56 (32 by maintainers)

Most upvoted comments

Thanks @fishy, that helped. @jpkrohling I was able to create a build with the Thrift fix and had it run overnight to see whether it fixed the problem, and it most certainly did. The 1.22 collectors OOM multiple times an hour; the patched build hasn’t been killed and runs with stable memory usage, whereas the others would spike to 8-15 GB and be killed off. I’m happy to verify again with whatever you give me.

I agree, but this sounds like a bug(ish) in the Thrift library, if it just blindly allocates that much memory based on a malformed value from the payload.

Quick status update: I was able to reproduce this issue, and I’m convinced it’s the same as #2452. The code where this is mostly observable is here:

https://github.com/apache/thrift/blob/b75e88a33d67ae05ef9b5fa001d2a63a2effe377/lib/go/thrift/compact_protocol.go#L557-L576

I’m sure the root cause is elsewhere, and I’m still investigating, but wanted to share some findings in case someone has an idea on what’s going on:

  1. most of the time, this function runs fine, but given enough retries of the reproducing code, memory will just explode
  2. when the memory explodes, the usage at the beginning of this function is typically quite low. One example execution: the agent was using 3MiB of memory before the buffer allocation (lines 570-574 in the code above), and 1987MiB after that.
  3. the length (L558) was 2081129901 for that case
  4. the total UDP packet payload length was 4083 (https://github.com/jaegertracing/jaeger/blob/1b209472dc6a35f7de90c7a9048cb4e960cc2335/cmd/agent/app/processors/thrift_processor.go#L118). Here’s where I’m confused: as far as I understand, this payload is what will then be read by the protocol code, so the length (previous point) would have to be lower than the payload size
  5. I might need to read more about Go’s memory management, but the heap part of pprof shows a value close to the buffer size for the line 576 in compact_protocol. Does it mean that the buffer is still in memory, or will Go reclaim this space?

cc @yurishkuro, as you might be able to confirm that my expectation on item 4 is correct.

Oh, and thank you again @mberginbh! For the first time in about 15 years, I’m able to C# again!
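
To make that failure mode concrete, here is a minimal Go sketch of the allocation pattern described in the findings above. It is illustrative only, not Jaeger’s or Thrift’s actual code; the real compact protocol decodes string lengths as varints, while this sketch uses a fixed 4-byte prefix to keep it short.

package main

import (
    "encoding/binary"
    "fmt"
)

// readStringUnbounded mimics the problematic pattern: take a length declared
// in the (untrusted) payload and allocate a buffer of that size up front.
func readStringUnbounded(payload []byte) string {
    length := binary.BigEndian.Uint32(payload[:4])
    // No upper bound on length: a corrupted value means a huge allocation,
    // regardless of how small the actual packet is.
    buf := make([]byte, length)
    n := copy(buf, payload[4:])
    return string(buf[:n])
}

func main() {
    // Well-formed: the declared length matches the data that follows.
    good := append([]byte{0, 0, 0, 5}, []byte("hello")...)
    fmt.Println(readStringUnbounded(good))

    // Malformed: a ~4 KB datagram can still declare a ~2 GB string
    // (2081129901 bytes, as in finding 3 above); the make() call alone can
    // OOM the process. Left commented out so the sketch stays safe to run.
    // bad := make([]byte, 8)
    // binary.BigEndian.PutUint32(bad, 2081129901)
    // _ = readStringUnbounded(bad)
}

The point is finding 3: nothing ties the declared length to the 4083 bytes that actually arrived, so the allocation alone can consume gigabytes before any read fails.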

I believe @rubenvp8510 is working on backporting @fishy’s fix to the version of Thrift that Jaeger uses. After that, we’ll propose releasing 1.23, which would then fix this issue here once and for all (hopefully) 😃

I was also seeing these memory issues frequently with the 1.18 build of the agent. As per the discussion, updating to the 1.22 version went a long way to addressing the issue although there were still occasional spikes. I have made the modification recommended by @fishy and deployed, and we have not seen a spike for over 48 hours.

Thanks for the info. The detail in this discussion was really informative.

A release would be half a year away (we plan to do the 0.15.0 release around September, there’s no plan to do a 0.14.2 release at this time, and this bug alone is unlikely to change that).

We can probably do what we did before 0.14.0, when we forked the Thrift repository and tagged v1.13.0-patch1 with your first fix. References:

https://github.com/jaegertracing/thrift https://github.com/jaegertracing/jaeger/pull/2780

Also I’d still strongly suggest y’all to set a lower MaxMessageSize.

Will do, thanks! In theory, this could be the same as the max size of a UDP packet, right? I would set it to a couple of MiBs just to be on the safe side.

@jpkrohling no I can’t reproduce it myself. I run it as a service and somebody is probably sending something bad. But it is consistently getting oom killed. Any tips on how to track down the spans at a collector level? I can only think of building a fork with some more logging.

Or if you can point me to branch with the apache/thrift#2359 I can build and verify that it fixes it.

@forestsword You can run go mod edit --replace=github.com/apache/thrift=github.com/fishy/thrift@1bc6892 && go mod tidy to use the code from the PR. You do need to pass in a TConfiguration with a reasonable MaxMessageSize set to actually limit the memory usage.

The PR actually also contains compiler changes and go mod edit would only replace the library code, but the library code (with reasonable MaxMessageSize configuration) should be enough to improve the situation.
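
A rough sketch of what that looks like on the decoding side, assuming the TConfiguration API from apache/thrift’s Go library (introduced around 0.14; constructor and field names may differ in the version Jaeger ends up pinning):

package main

import (
    "fmt"

    "github.com/apache/thrift/lib/go/thrift"
)

func main() {
    // Cap the size the decoder will accept for any message, string, or
    // container, so a corrupted length field is rejected instead of allocated.
    cfg := &thrift.TConfiguration{
        MaxMessageSize: 4 * 1024 * 1024, // a few MiB, comfortably above any UDP datagram
        MaxFrameSize:   4 * 1024 * 1024,
    }

    // In the agent the transport wraps the UDP read buffer; an in-memory
    // buffer stands in for it here.
    trans := thrift.NewTMemoryBuffer()
    proto := thrift.NewTCompactProtocolConf(trans, cfg)

    fmt.Printf("compact protocol configured with bounded message size: %T\n", proto)
}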

I’m not sure if this is the same issue, but I hope so. I see very large allocations (using v1.22.0 of the jaeger-collector) as well but on the TraceIdHigh here:

(pprof screenshot)

Which is an int64:

https://github.com/jaegertracing/jaeger/blob/78019b6849133bdedebe35d2a28fe00cce875958/thrift-gen/jaeger/jaeger.go#L1060-L1070

Yes, we need a numbered thrift release.

The payload itself was within bounds (4083). While the payload was indeed malformed, we shouldn’t just explode in memory consumption when receiving bad data.
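
Sketched very roughly (again, not the actual Thrift or Jaeger code), the missing guard is simply to check the declared size against a configured cap and against the bytes actually remaining before allocating anything:

package main

import (
    "encoding/binary"
    "fmt"
)

// readStringBounded rejects a declared length that exceeds either a configured
// maximum or the data actually present in the packet.
func readStringBounded(payload []byte, maxSize int) (string, error) {
    if len(payload) < 4 {
        return "", fmt.Errorf("short payload")
    }
    length := int(binary.BigEndian.Uint32(payload[:4]))
    if length > maxSize || length > len(payload)-4 {
        return "", fmt.Errorf("declared length %d exceeds limit or remaining bytes", length)
    }
    return string(payload[4 : 4+length]), nil
}

func main() {
    good := append([]byte{0, 0, 0, 5}, []byte("hello")...)
    s, err := readStringBounded(good, 65000)
    fmt.Println(s, err)

    // The malformed case from the findings above is now rejected cheaply
    // instead of triggering a ~2 GB allocation.
    bad := make([]byte, 8)
    binary.BigEndian.PutUint32(bad, 2081129901)
    _, err = readStringBounded(bad, 65000)
    fmt.Println(err)
}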

How can I run this on Linux?

@jpkrohling If you don’t feel like installing .NET, I have pushed the sample app to https://github.com/Mario-Hofstaetter/jaegertracing-jaeger-issue-2638 including a Dockerfile and was able to reproduce on https://labs.play-with-docker.com . Thanks @mberginbh for the tiny example.

You can point the app to any agent address/port and change the loopCount to change the load.

Steps to reproduce

$ git clone https://github.com/Mario-Hofstaetter/jaegertracing-jaeger-issue-2638.git && cd jaegertracing-jaeger-issue-2638
$ docker-compose up -d
# takes some time for elasticsearch to spin up, so collector and query-ui containers will be restarting
# c# demo app image is built with the tag 'jaegerdemoclient'

Run demo app:

# Repeat a few times
$ docker-compose up jaegerdemoclient
Starting jaegerdemoclient ... done
Attaching to jaegerdemoclient
jaegerdemoclient    | Hello World!
jaegerdemoclient    | Using agenthost 'jaeger-agent:6831'
jaegerdemoclient    | Running  '500' loops
jaegerdemoclient    | Starting loop
jaegerdemoclient    | Loop done, wait 5 seconds because it seems not everything gets flushed
jaegerdemoclient exited with code 0

Memory usage increases:

$ docker stats --no-stream
CONTAINER ID        NAME                CPU %               MEM USAGE / LIMIT     MEM %               NET I/O             BLOCK I/O           PIDS
c178db9e972d        jaeger-agent        0.00%               2.095GiB / 31.42GiB   6.67%               445kB / 103kB       0B / 0B             11

Change agent address or loop count

using compose:

$ docker-compose run -e "DEMOCLIENT_LOOPCOUNT=5000" -e "JAEGER_AGENTHOST=my-dev-machine" jaegerdemoclient
Hello World!
Using agenthost 'my-dev-machine:6831'
Running  '1000' loops

without compose:

docker run --rm -it -e "DEMOCLIENT_LOOPCOUNT=5000" -e "JAEGER_AGENTHOST=my-dev-machine" jaegerdemoclient

A loop of 5000 is crashing the agent for me on labs.play-with-docker.com .

To run it natively, just run dotnet run in the repo (with .NET 5 installed); the app will default to localhost:6831 for the agent.

How can I run this on Linux?

Just copy these two files somewhere and do dotnet run, after installing .NET 5: https://dotnet.microsoft.com/download