opentelemetry-collector: Deadlock in groupbytrace processor

Describe the bug

When using the groupbytrace processor, the collector deadlocks: incoming requests hang (no response is sent back) and the rest of the pipeline stops doing work (in-memory traces aren't flushed to the exporters).

Steps to reproduce

Configure a pipeline with groupbytrace. A low wait_duration makes reproduction easier, but the issue also occurs at normal durations such as 1s.

Use the synthetic-load-generator to generate load

I modified the 100k/s topology file to emit at a much lower rate (~2k/s) and was able to reproduce the issue after 15-30s of running the load generator. You can tell the collector has deadlocked when its logging output stops; at that point you can send POST requests and observe that no response comes back (an example request is shown below, after the config change).

java -jar target/SyntheticLoadGenerator-1.0-SNAPSHOT-jar-with-dependencies.jar  --paramsFile ./topologies/100_000_spans_per_second.json --zipkinV2JsonUrl http://localhost:9411

The modification to the topology file was to change "tracesPerHour": 1800000 to "tracesPerHour": 36000.
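To confirm the deadlock, a POST with any valid Zipkin v2 payload should hang with no response once the collector has locked up. The Zipkin receiver serves spans on /api/v2/spans; the single span below is just a hand-written minimal example:

curl -v -X POST http://localhost:9411/api/v2/spans \
  -H 'Content-Type: application/json' \
  -d '[{"traceId":"1234567890abcdef1234567890abcdef","id":"1234567890abcdef","name":"curl-check","timestamp":1600000000000000,"duration":1000,"localEndpoint":{"serviceName":"curl-check"}}]'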

I performed the same test against the tail_sampling processor and it did not cause a deadlock

What version did you use?

go.opentelemetry.io/collector v0.9.1-0.20200911135115-e886a01ebe2e

What config did you use?

receivers:
  zipkin:
    endpoint: "0.0.0.0:9411"

exporters:
  logging:
    loglevel: info

processors:
  groupbytrace:
    wait_duration: 1s

service:
  pipelines:
    traces:
      receivers: [zipkin]
      processors: [groupbytrace]
      exporters: [logging]
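Assuming a locally built collector binary (the exact name and path depend on the build), the collector can be started with this config roughly like:

./bin/otelcol --config config.yaml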

Environment

Reproduced on my laptop, but encountered the issue originally on Amazon Linux.

OS: OSX 10.14
Compiler: go version go1.14.3 darwin/amd64

About this issue

  • State: closed
  • Created 4 years ago
  • Reactions: 1
  • Comments: 18 (18 by maintainers)

Most upvoted comments

Tuesday works for me, I'm in EST. We can coordinate the times on Gitter on Monday.

I replied on the PR; there is another deadlock besides the one in memoryStorage: https://github.com/open-telemetry/opentelemetry-collector/pull/1847#issuecomment-698452404

Yes I saw this deadlock in one of our environments.

I was able to reproduce the issue in a unit test and (I think) have a good understanding of what is occurring. The test is here https://github.com/zocdoc/opentelemetry-collector/commit/c00320cf78b75a70e2b798fc75aa36bfc8072993

I believe it occurs when the following conditions are met:

  • The number of incoming traces exceeds the ring buffer size
  • The number of incoming traces exceeds 200 (the channel buffer size)

When a new trace arrives and an existing trace must be evicted from the ring buffer, the processor tries to write another event to the same channel: https://github.com/open-telemetry/opentelemetry-collector/blob/master/processor/groupbytraceprocessor/processor.go#L153-L156

This is effectively a re-entrant send: the goroutine that is supposed to drain the events channel is itself trying to send to it. It only works sometimes because of the channel buffer; once the buffer is full, that send blocks forever and the whole pipeline stalls.
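As a standalone illustration of the pattern (a sketch only, not the collector's actual code; the names here are made up), the program below shows how a single consumer that also sends to its own buffered channel stalls once the buffer fills:

package main

import (
    "fmt"
    "time"
)

type event struct{ kind string }

func main() {
    // Same idea as the processor's 200-slot event buffer.
    events := make(chan event, 200)

    // Single consumer, analogous to eventMachine.start.
    go func() {
        for e := range events {
            if e.kind == "traceReceived" {
                // Evicting from the ring buffer is modeled as another event
                // sent to the same channel. Once the buffer is full this send
                // blocks forever, because the only reader is this goroutine.
                events <- event{kind: "traceRemoved"}
            }
        }
    }()

    // Concurrent producers, analogous to ConsumeTraces being called under load.
    for i := 0; i < 10; i++ {
        go func() {
            for {
                events <- event{kind: "traceReceived"}
            }
        }()
    }

    // Progress stops as soon as the buffer fills.
    for {
        time.Sleep(time.Second)
        fmt.Println("queued events:", len(events))
    }
}

Once the buffer reaches capacity the consumer blocks on its own send, every producer blocks behind it, and the printed queue length stays pinned at 200, which mirrors the stalled collector.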

The deadlocked goroutine stacks

goroutine 74361 [chan send, 2 minutes]:
go.opentelemetry.io/collector/processor/groupbytraceprocessor.(*eventMachine).fire(0xc0000a40f0, 0xc01c287fa0, 0x1, 0x1)
	~/src/opentelemetry-collector/processor/groupbytraceprocessor/event.go:162 +0xcd
go.opentelemetry.io/collector/processor/groupbytraceprocessor.(*groupByTraceProcessor).ConsumeTraces(...)
	~/src/opentelemetry-collector/processor/groupbytraceprocessor/processor.go:86
go.opentelemetry.io/collector/processor/groupbytraceprocessor.TestHighConcurrency.func3(0xc010b9fd50, 0xc010ba21c0, 0xc00a1f75e0)
	~/src/opentelemetry-collector/processor/groupbytraceprocessor/processor_test.go:800 +0x7f
created by go.opentelemetry.io/collector/processor/groupbytraceprocessor.TestHighConcurrency
	~/src/opentelemetry-collector/processor/groupbytraceprocessor/processor_test.go:799 +0x415


goroutine 7 [chan send, 2 minutes]:
go.opentelemetry.io/collector/processor/groupbytraceprocessor.(*eventMachine).fire(0xc0000a40f0, 0xc04bebdb18, 0x1, 0x1)
	~/src/opentelemetry-collector/processor/groupbytraceprocessor/event.go:162 +0xcd
go.opentelemetry.io/collector/processor/groupbytraceprocessor.(*groupByTraceProcessor).processBatch(0xc010b9fd50, 0xc010457a80, 0x1, 0x1)
	~/src/opentelemetry-collector/processor/groupbytraceprocessor/processor.go:153 +0x556
go.opentelemetry.io/collector/processor/groupbytraceprocessor.(*groupByTraceProcessor).processResourceSpans(0xc010b9fd50, 0xc0063b8940, 0x1038adc, 0xc01042fdc8)
	~/src/opentelemetry-collector/processor/groupbytraceprocessor/processor.go:126 +0xa5
go.opentelemetry.io/collector/processor/groupbytraceprocessor.(*groupByTraceProcessor).onTraceReceived(0xc010b9fd50, 0xc0063bb660, 0x1, 0x101)
	~/src/opentelemetry-collector/processor/groupbytraceprocessor/processor.go:111 +0x73
go.opentelemetry.io/collector/processor/groupbytraceprocessor.(*eventMachine).start(0xc0000a40f0)
	~/src/opentelemetry-collector/processor/groupbytraceprocessor/event.go:97 +0x138
created by go.opentelemetry.io/collector/processor/groupbytraceprocessor.(*eventMachine).startInBackground
	~/src/opentelemetry-collector/processor/groupbytraceprocessor/event.go:79 +0x3f

While writing the test (and trying out some fixes) I found some other bugs in the implementation that should also be addressed (there are some scattered notes in the test). I'd be happy to submit a PR to address these if that works for you, @jpkrohling.
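For reference, one generic way to break this class of self-send deadlock (a sketch only, not necessarily what the fix in #1847 ended up doing) is to guarantee that the consumer goroutine never blocks on a send to its own channel, for example by falling back to handling the event inline when the buffer is full:

package sketch

// event mirrors the illustrative type from the earlier sketch.
type event struct{ kind string }

// fireNonBlocking enqueues e if there is room in the buffer; otherwise it
// handles the event inline, so that the caller (which may be the consumer
// goroutine itself) never blocks on a send to its own channel.
func fireNonBlocking(events chan event, e event, handleInline func(event)) {
    select {
    case events <- e:
        // Enqueued; the consumer goroutine will process it later.
    default:
        // Buffer full: handle synchronously instead of blocking.
        handleInline(e)
    }
}

The trade-off is that inline handling can run on a producer goroutine, so the handler must be safe to call concurrently; alternatively, re-entrant events could go onto their own unbounded queue.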