opentelemetry-rust: `opentelemetry::global::shutdown_tracer_provider()` hangs indefinitely, and other troubles
Hello,
I've been trying to get trace exporting going to either Jaeger or Zipkin, and with both I'm running into issues actually getting much exported.
Following the README for either exporter, they say to call `global::shut_down_provider(); // sending remaining spans` on exit (the READMEs appear to be outdated; that exact function doesn't seem to exist anymore). Calling what looks like the current equivalent (the call in the title) just locks up entirely. I'm using opentelemetry with the `rt-tokio` feature, and both the Jaeger and Zipkin crates, with or without their `tokio` features, exhibit this behavior.
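For reference, here is roughly what I'm doing in a test. The pipeline method names are from the 0.13-era crates and may not match other versions exactly; note that the default `#[tokio::test]` runtime is the current-thread flavor.

```rust
use opentelemetry::trace::Tracer;

// Rough sketch of my test setup (0.13-era API; exact pipeline methods may
// differ between versions of opentelemetry-jaeger).
#[tokio::test]
async fn export_spans_to_jaeger() {
    let tracer = opentelemetry_jaeger::new_pipeline()
        .with_service_name("load-test")
        .install()
        .expect("failed to install jaeger pipeline");

    for _ in 0..10_000 {
        tracer.in_span("doing_work", |_cx| {
            // test logic that should be traced
        });
    }

    // This is the call that never returns for me.
    opentelemetry::global::shutdown_tracer_provider();
}
```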
When I don’t call this and instead do something silly like just sleep for a while at the end of a test, I’ll get output like:
```
OpenTelemetry trace error occurred ExportFailed(ThriftAgentError(ProtocolError { kind: SizeLimit, message: "jaeger exporter payload size of 248781 bytes over max UDP packet size of 65000 bytes. Try setting a smaller batch size." }))
OpenTelemetry trace error occurred ExportFailed(ThriftAgentError(ProtocolError { kind: SizeLimit, message: "jaeger exporter payload size of 112387 bytes over max UDP packet size of 65000 bytes. Try setting a smaller batch size." }))
OpenTelemetry trace error occurred Other(TrySendError { kind: Disconnected })
OpenTelemetry trace error occurred Other(TrySendError { kind: Disconnected })
```
Somehow I had Jaeger exporting working OK before (never with `opentelemetry::global::set_text_map_propagator(opentelemetry_jaeger::Propagator::new());` or the shutdown call), but when trying to run a much larger-scale load test I was never able to collect anything because of the packet size limits. The error helpfully suggests I try setting a smaller batch size, but I was totally unable to determine how to do that after reviewing the docs for a while.
Thanks for the hard work on this crate – my honest feedback is that the API is pretty challenging to figure out. Taking something like the above batch size problem, I can find https://docs.rs/opentelemetry/0.13.0/opentelemetry/sdk/trace/struct.BatchSpanProcessorBuilder.html#method.with_max_export_batch_size which looks like it might be what I want, but I can’t actually find any way to use that in combination with the Jaeger exporter.
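For what it's worth, here is the kind of thing I was trying to write based on those docs and couldn't complete. This is an untested sketch: `init_exporter()` is hypothetical (I found no real way to get a bare Jaeger exporter out of the pipeline builder), and the `BatchSpanProcessor::builder` arguments are my best guess from the 0.13 docs.

```rust
use opentelemetry::{global, sdk::trace as sdktrace, util::tokio_interval_stream};

// Untested sketch: assemble a TracerProvider by hand so the batch size can be
// lowered below the Jaeger agent's 65k UDP limit.
fn init_tracing_with_small_batches() -> Result<(), opentelemetry::trace::TraceError> {
    // Hypothetical method: I couldn't find a way to get just the exporter out
    // of opentelemetry_jaeger's pipeline builder, which is exactly my problem.
    let exporter = opentelemetry_jaeger::new_pipeline()
        .with_service_name("load-test")
        .init_exporter()?;

    // Builder arguments are my best guess from the 0.13 docs.
    let batch = sdktrace::BatchSpanProcessor::builder(exporter, tokio::spawn, tokio_interval_stream)
        .with_max_export_batch_size(64)
        .build();

    let provider = sdktrace::TracerProvider::builder()
        .with_span_processor(batch)
        .build();

    global::set_tracer_provider(provider);
    Ok(())
}
```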
Thanks again! Let me know if I can provide more info or feedback.
About this issue
- Original URL
- State: closed
- Created 3 years ago
- Comments: 15 (6 by maintainers)
Commits related to this issue
- use the multi_thread executor to run the telemetry hang test The test wasn't using the multi_thread executor, so eventually it would trip over the bug in opentelemetry which we've hit in the past whe... — committed to apollographql/router by garypen 2 years ago
- use the multi_thread executor to run the telemetry hang test (#922) The test wasn't using the multi_thread executor, so eventually it would trip over the bug in opentelemetry which we've hit in the ... — committed to apollographql/router by garypen 2 years ago
I see a similar bug in the Apollo router, running on the tokio multi-threaded runtime, in the main code, not in tests. Reproduction info is here.
The exporter is created here.
The fix proposed by @simon-an works, but I took some time to dig a bit more into it.
It happens when we try to reload the configuration and create a new subscriber. We need to replace the tracer provider because we might switch to a different exporter (I feel there's another way to do that without replacing the tracer provider, but I couldn't find it). It hangs when the `try_initialize_subscriber` function drops the result of `set_tracer_provider`, and I tracked it down to `BatchSpanProcessor::shutdown`: https://github.com/open-telemetry/opentelemetry-rust/blob/a81c6c6e6c1d9c5a0def5384b5c6291bb56f0f18/opentelemetry/src/sdk/trace/span_processor.rs#L266-L275
That function is called because the `TracerProvider` calls shutdown on all its span processors on drop. Through generous application of println, I see this timeline:
- `set_tracer_provider` is called
- the old `TracerProvider` is dropped
- `BatchSpanProcessor::shutdown` is called; at this point the batch task is still running fine, regularly receiving flush messages from the ticker

I suspect a bad interaction between the tokio runtime and the futures executor. I tried replacing it with a call to tokio's `block_on`, but then I get a panic because I am trying to start a new tokio runtime from inside an existing runtime.

This is not a blocking bug for us; as I said, the proposed fix works. But it would be nice to get a proper fix inside the opentelemetry crate, and I hope I gave enough context to help in fixing it 🙂
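Edit: to make the suspected bad interaction concrete, here is a tiny standalone program (nothing from opentelemetry, just the pattern I believe is at play) that hangs the same way on a current-thread runtime:

```rust
use tokio::sync::oneshot;

// A drop/shutdown path that uses a blocking futures executor from inside the
// runtime parks the only worker thread; the task that would answer never gets
// to run, so the program hangs forever.
#[tokio::main(flavor = "current_thread")]
async fn main() {
    let (tx, rx) = oneshot::channel::<()>();

    // Stand-in for the batch span processor's task: it will answer, but only
    // if the runtime thread is free to poll it.
    tokio::spawn(async move {
        let _ = tx.send(());
    });

    // Stand-in for BatchSpanProcessor::shutdown blocking on the reply.
    futures::executor::block_on(async move {
        let _ = rx.await;
    });

    println!("never reached on a current_thread runtime");
}
```

Switching the example to `flavor = "multi_thread"` lets the spawned task run on another worker and the program completes, which is consistent with the fix that worked for us.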
@TommyCpp I found a solution and updated the example project.
Thanks for pointing out that there might be a current-thread scheduler in a multithreaded runtime. That was not clear to me, but it makes sense in the context of a webserver, where `thread::spawn` should never be executed on another worker's thread.
@TommyCpp Ah, sure, this makes sense and fixes it. It would be really great if the examples (READMEs like the one here) mentioned that using a multithreaded runtime, or setting the current-thread batch executor, is necessary to avoid blocking. I think a lot of people are going to try using the library via tests like this when they first try it out, and (obviously, lol) this can be a confusing thing to run into.
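Concretely, the only change my test needed was the runtime flavor:

```rust
// With the multi-threaded runtime the batch span processor's task has a
// worker thread to run on while shutdown_tracer_provider() blocks the
// caller, so shutdown completes instead of hanging.
#[tokio::test(flavor = "multi_thread")]
async fn export_spans_to_jaeger() {
    // ... same body as the earlier sketch, ending with:
    opentelemetry::global::shutdown_tracer_provider();
}
```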
Thanks for the help! Feel free to close this issue, really appreciate it.