opentelemetry-collector: Deadlock in groupbytrace processor
Describe the bug
When using the groupbytrace processor, the collector deadlocks: incoming requests hang (no response is sent back) and the rest of the pipeline stops doing work (in-memory traces aren't flushed to the exporters).
Steps to reproduce
Configure a pipeline with groupbytrace. Using a low wait_duration makes reproduction easier, but the issue exists at normal durations like 1s.
Use the synthetic-load-generator to generate load.
I modified the 100k spans/s topology file to emit a much lower rate (~2k spans/s) and was able to reproduce the issue after 15-30s of running the load generator. You can tell it deadlocked when the collector's logging output has stopped; at that point you can send POST requests and observe that no response is returned (a small probe for this is sketched below).
java -jar target/SyntheticLoadGenerator-1.0-SNAPSHOT-jar-with-dependencies.jar --paramsFile ./topologies/100_000_spans_per_second.json --zipkinV2JsonUrl http://localhost:9411
The modification to the topology file was changing "tracesPerHour" : 1800000 to "tracesPerHour" : 36000.
I performed the same test against the tail_sampling processor and it did not cause a deadlock.
What version did you use?
go.opentelemetry.io/collector v0.9.1-0.20200911135115-e886a01ebe2e
What config did you use?
receivers:
  zipkin:
    endpoint: "0.0.0.0:9411"
exporters:
  logging:
    loglevel: info
processors:
  groupbytrace:
    wait_duration: 1s
service:
  pipelines:
    traces:
      receivers: [zipkin]
      processors: [groupbytrace]
      exporters: [logging]
Environment
Reproduced on my laptop, but encountered the issue originally on Amazon Linux.
OS: OSX 10.14
Compiler: go version go1.14.3 darwin/amd64
About this issue
- Original URL
- State: closed
- Created 4 years ago
- Reactions: 1
- Comments: 18 (18 by maintainers)
Commits related to this issue
- Add test case to reproduce deadlock #1811 — committed to Zocdoc/opentelemetry-collector by chris-smith-zocdoc 4 years ago
- Fix deadlock #1811 by simplifying the event machine used by the groupbytrace processor There are only two events now, one for adding spans and one for releasing the trace. This removes the prior case... — committed to Zocdoc/opentelemetry-collector by chris-smith-zocdoc 4 years ago
- Fixed deadlock in groupbytrace processor. * Drain the queue upon shutdown, with a time limit. Fixes #1465. * Added metrics to the groupbyprocessor, making it easier to understand what's going on in c... — committed to jpkrohling/opentelemetry-collector by jpkrohling 4 years ago
- Add metrics to groupbytraceprocessor, wait for queue to be drained during shutdown. (#1842) Fixed deadlock in groupbytrace processor. * Drain the queue upon shutdown, with a time limit. Fixes #146... — committed to open-telemetry/opentelemetry-collector by jpkrohling 4 years ago
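The merged fix described in the commits above drains the pending event queue on shutdown with a time limit. Below is a minimal sketch of that pattern; the type and function names are illustrative, not the processor's actual API.

package main

import (
	"fmt"
	"time"
)

type event struct{ traceID string }

// drainOnShutdown keeps consuming queued events until the queue is closed or
// the time limit expires, mirroring the "drain the queue upon shutdown, with
// a time limit" approach from the fix. Names are illustrative only.
func drainOnShutdown(queue <-chan event, limit time.Duration) {
	deadline := time.After(limit)
	for {
		select {
		case ev, ok := <-queue:
			if !ok {
				return // queue closed: everything has been drained
			}
			fmt.Println("flushing trace", ev.traceID)
		case <-deadline:
			return // time limit reached: don't let shutdown hang forever
		}
	}
}

func main() {
	queue := make(chan event, 8)
	for i := 0; i < 3; i++ {
		queue <- event{traceID: fmt.Sprintf("trace-%d", i)}
	}
	close(queue)                          // no more producers
	drainOnShutdown(queue, 5*time.Second) // flush what's left, bounded in time
}

Bounding the drain with a deadline is what keeps shutdown from hanging if the queue can never empty.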
Tuesday works for me, I'm in EST. We can coordinate the times on Gitter on Monday.
I replied on the PR; there is another deadlock besides the one in memoryStorage: https://github.com/open-telemetry/opentelemetry-collector/pull/1847#issuecomment-698452404
Yes I saw this deadlock in one of our environments.
I was able to reproduce the issue in a unit test and (I think) have a good understanding of what is occurring. The test is here https://github.com/zocdoc/opentelemetry-collector/commit/c00320cf78b75a70e2b798fc75aa36bfc8072993
I believe it occurs when the following conditions are met:
When a new trace arrives and it must remove an item from the ring buffer, it tries to write to the channel: https://github.com/open-telemetry/opentelemetry-collector/blob/master/processor/groupbytraceprocessor/processor.go#L153-L156
This constitutes a “re-entry”: since this goroutine is already holding the channel lock, the write only works sometimes because of the channel buffer.
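To make the pattern concrete, here is a minimal, self-contained sketch of my own (not the processor's actual code) of a single event loop that both consumes from and produces into the same buffered channel; in the real processor the buffer is larger, so the hang only shows up under enough load to fill it:

package main

import "fmt"

// Sketch of the re-entry pattern described above: the single event-loop
// goroutine also sends new events into the channel it is supposed to drain.
// The send only succeeds while the buffer has spare capacity; once it is
// full, the handler blocks on the send and the loop never gets back to the
// receive, so everything upstream hangs.
func main() {
	events := make(chan string, 1) // small buffer to make the hang deterministic

	done := make(chan struct{})
	go func() {
		defer close(done)
		for ev := range events {
			fmt.Println("handling:", ev)
			if ev == "new-trace" {
				// "Re-entry": handling one event emits further events into
				// the same channel. The first send fits in the buffer; the
				// second blocks forever because the only consumer is here.
				events <- "evict-trace"
				events <- "release-trace"
			}
		}
	}()

	events <- "new-trace"
	<-done // never reached: the event loop is deadlocked on its own channel
}

Running this prints the first event and then trips Go's runtime deadlock detector ("all goroutines are asleep"), because the only goroutine that could drain the channel is blocked trying to send to it.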
The deadlocked goroutine stacks
While writing the test (and trying out some fixes) I found some other bugs in the implementation that should also be addressed (some scattered notes in the test). I'd be happy to submit a PR to address these if that works for you @jpkrohling.