opentelemetry-collector: Deadlock in groupbytrace processor

Describe the bug

When using the groupbytrace processor, the collector deadlocks: incoming requests hang (no response is sent back) and the rest of the pipeline stops doing work (in-memory traces aren't flushed to the exporters).

Steps to reproduce

Configure a pipeline with groupbytrace. A low wait_duration makes reproduction easier, but the issue also occurs at normal durations such as 1s.

Use the synthetic-load-generator to generate load

I modified the 100k/s topology file to emit at a much lower rate (~2k/s) and was able to reproduce the issue after 15-30s of running the load generator. You can tell the collector has deadlocked when its logging output stops; at that point you can send POST requests and observe that no response comes back (an example request is shown below, after the config change).

java -jar target/SyntheticLoadGenerator-1.0-SNAPSHOT-jar-with-dependencies.jar  --paramsFile ./topologies/100_000_spans_per_second.json --zipkinV2JsonUrl http://localhost:9411

The modification to the topology file was to change "tracesPerHour": 1800000 to "tracesPerHour": 36000.
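To confirm the deadlock, a POST with any valid Zipkin v2 payload should hang with no response once the collector has locked up. The Zipkin receiver serves spans on /api/v2/spans; the single span below is just a hand-written minimal example:

curl -v -X POST http://localhost:9411/api/v2/spans \
  -H 'Content-Type: application/json' \
  -d '[{"traceId":"1234567890abcdef1234567890abcdef","id":"1234567890abcdef","name":"curl-check","timestamp":1600000000000000,"duration":1000,"localEndpoint":{"serviceName":"curl-check"}}]'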

I performed the same test against the tail_sampling processor and it did not cause a deadlock

What version did you use?

go.opentelemetry.io/collector v0.9.1-0.20200911135115-e886a01ebe2e

What config did you use?

receivers:
  zipkin:
    endpoint: "0.0.0.0:9411"

exporters:
  logging:
    loglevel: info

processors:
  groupbytrace:
    wait_duration: 1s

service:
  pipelines:
    traces:
      receivers: [zipkin]
      processors: [groupbytrace]
      exporters: [logging]
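Assuming a locally built collector binary (the exact name and path depend on the build), the collector can be started with this config roughly like:

./bin/otelcol --config config.yaml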

Environment

Reproduced on my laptop, but encountered the issue originally on Amazon Linux.

OS: OSX 10.14
Compiler: go version go1.14.3 darwin/amd64

About this issue

  • State: closed
  • Created 4 years ago
  • Reactions: 1
  • Comments: 18 (18 by maintainers)

Most upvoted comments

Tuesday works for me, I'm in EST. We can coordinate the times on Gitter on Monday.

I replied on the PR; there is another deadlock besides the one in memoryStorage: https://github.com/open-telemetry/opentelemetry-collector/pull/1847#issuecomment-698452404

Yes I saw this deadlock in one of our environments.

I was able to reproduce the issue in a unit test and (I think) have a good understanding of what is occurring. The test is here https://github.com/zocdoc/opentelemetry-collector/commit/c00320cf78b75a70e2b798fc75aa36bfc8072993

I believe it occurs when the following conditions are met:

  • The number of incoming traces exceeds the ring buffer size
  • The number of incoming traces exceeds 200 (the channel buffer size)

When a new trace arrives and an existing trace must be evicted from the ring buffer, the processor tries to write another event to the same channel: https://github.com/open-telemetry/opentelemetry-collector/blob/master/processor/groupbytraceprocessor/processor.go#L153-L156

This is effectively a re-entrant send: the goroutine that is supposed to drain the events channel is itself trying to send to it. It only works sometimes because of the channel buffer; once the buffer is full, that send blocks forever and the whole pipeline stalls.
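As a standalone illustration of the pattern (a sketch only, not the collector's actual code; the names here are made up), the program below shows how a single consumer that also sends to its own buffered channel stalls once the buffer fills:

package main

import (
    "fmt"
    "time"
)

type event struct{ kind string }

func main() {
    // Same idea as the processor's 200-slot event buffer.
    events := make(chan event, 200)

    // Single consumer, analogous to eventMachine.start.
    go func() {
        for e := range events {
            if e.kind == "traceReceived" {
                // Evicting from the ring buffer is modeled as another event
                // sent to the same channel. Once the buffer is full this send
                // blocks forever, because the only reader is this goroutine.
                events <- event{kind: "traceRemoved"}
            }
        }
    }()

    // Concurrent producers, analogous to ConsumeTraces being called under load.
    for i := 0; i < 10; i++ {
        go func() {
            for {
                events <- event{kind: "traceReceived"}
            }
        }()
    }

    // Progress stops as soon as the buffer fills.
    for {
        time.Sleep(time.Second)
        fmt.Println("queued events:", len(events))
    }
}

Once the buffer reaches capacity the consumer blocks on its own send, every producer blocks behind it, and the printed queue length stays pinned at 200, which mirrors the stalled collector.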

The deadlocked goroutine stacks

goroutine 74361 [chan send, 2 minutes]:
go.opentelemetry.io/collector/processor/groupbytraceprocessor.(*eventMachine).fire(0xc0000a40f0, 0xc01c287fa0, 0x1, 0x1)
	~/src/opentelemetry-collector/processor/groupbytraceprocessor/event.go:162 +0xcd
go.opentelemetry.io/collector/processor/groupbytraceprocessor.(*groupByTraceProcessor).ConsumeTraces(...)
	~/src/opentelemetry-collector/processor/groupbytraceprocessor/processor.go:86
go.opentelemetry.io/collector/processor/groupbytraceprocessor.TestHighConcurrency.func3(0xc010b9fd50, 0xc010ba21c0, 0xc00a1f75e0)
	~/src/opentelemetry-collector/processor/groupbytraceprocessor/processor_test.go:800 +0x7f
created by go.opentelemetry.io/collector/processor/groupbytraceprocessor.TestHighConcurrency
	~/src/opentelemetry-collector/processor/groupbytraceprocessor/processor_test.go:799 +0x415


goroutine 7 [chan send, 2 minutes]:
go.opentelemetry.io/collector/processor/groupbytraceprocessor.(*eventMachine).fire(0xc0000a40f0, 0xc04bebdb18, 0x1, 0x1)
	~/src/opentelemetry-collector/processor/groupbytraceprocessor/event.go:162 +0xcd
go.opentelemetry.io/collector/processor/groupbytraceprocessor.(*groupByTraceProcessor).processBatch(0xc010b9fd50, 0xc010457a80, 0x1, 0x1)
	~/src/opentelemetry-collector/processor/groupbytraceprocessor/processor.go:153 +0x556
go.opentelemetry.io/collector/processor/groupbytraceprocessor.(*groupByTraceProcessor).processResourceSpans(0xc010b9fd50, 0xc0063b8940, 0x1038adc, 0xc01042fdc8)
	~/src/opentelemetry-collector/processor/groupbytraceprocessor/processor.go:126 +0xa5
go.opentelemetry.io/collector/processor/groupbytraceprocessor.(*groupByTraceProcessor).onTraceReceived(0xc010b9fd50, 0xc0063bb660, 0x1, 0x101)
	~/src/opentelemetry-collector/processor/groupbytraceprocessor/processor.go:111 +0x73
go.opentelemetry.io/collector/processor/groupbytraceprocessor.(*eventMachine).start(0xc0000a40f0)
	~/src/opentelemetry-collector/processor/groupbytraceprocessor/event.go:97 +0x138
created by go.opentelemetry.io/collector/processor/groupbytraceprocessor.(*eventMachine).startInBackground
	~/src/opentelemetry-collector/processor/groupbytraceprocessor/event.go:79 +0x3f

While writing the test (and trying out some fixes) I found some other bugs in the implementation that should also be addressed (there are some scattered notes in the test). I'd be happy to submit a PR to address these if that works for you, @jpkrohling.
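For reference, one generic way to break this class of self-send deadlock (a sketch only, not necessarily what the fix in #1847 ended up doing) is to guarantee that the consumer goroutine never blocks on a send to its own channel, for example by falling back to handling the event inline when the buffer is full:

package sketch

// event mirrors the illustrative type from the earlier sketch.
type event struct{ kind string }

// fireNonBlocking enqueues e if there is room in the buffer; otherwise it
// handles the event inline, so that the caller (which may be the consumer
// goroutine itself) never blocks on a send to its own channel.
func fireNonBlocking(events chan event, e event, handleInline func(event)) {
    select {
    case events <- e:
        // Enqueued; the consumer goroutine will process it later.
    default:
        // Buffer full: handle synchronously instead of blocking.
        handleInline(e)
    }
}

The trade-off is that inline handling can run on a producer goroutine, so the handler must be safe to call concurrently; alternatively, re-entrant events could go onto their own unbounded queue.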