opentelemetry-rust: `opentelemetry::global::shutdown_tracer_provider()` Hangs indefinitely, and other troubles

Hello,

I've been trying to get trace exporting to either Jaeger or Zipkin working, and with both I'm running into issues actually getting much exported.

Following the README for either, they say to call `global::shut_down_provider(); // sending remaining spans on exit` (so the READMEs seem to be outdated). Calling what looks like the current equivalent (named in the title) just locks up entirely. I'm using opentelemetry with the rt-tokio feature, and both the Jaeger and Zipkin crates, with or without their tokio features, exhibit this behavior.
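
For reference, the shape of the test where I see the hang is roughly this (a sketch only; the pipeline setup is elided because the builder calls seem to differ between crate versions):

    // The default #[tokio::test] runtime is single-threaded ("current thread"),
    // which is where I see the hang.
    #[tokio::test]
    async fn export_then_shutdown() {
        // install the Jaeger or Zipkin batch pipeline here (builder calls
        // elided; they appear to differ between crate versions)

        // ... record some spans ...

        // Never returns for me.
        opentelemetry::global::shutdown_tracer_provider();
    }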

When I don’t call this and instead do something silly like just sleep for a while at the end of a test, I’ll get output like:

    OpenTelemetry trace error occurred ExportFailed(ThriftAgentError(ProtocolError { kind: SizeLimit, message: "jaeger exporter payload size of 248781 bytes over max UDP packet size of 65000 bytes. Try setting a smaller batch size." }))
    OpenTelemetry trace error occurred ExportFailed(ThriftAgentError(ProtocolError { kind: SizeLimit, message: "jaeger exporter payload size of 112387 bytes over max UDP packet size of 65000 bytes. Try setting a smaller batch size." }))
    OpenTelemetry trace error occurred Other(TrySendError { kind: Disconnected })
    OpenTelemetry trace error occurred Other(TrySendError { kind: Disconnected })

Somehow I had Jaeger exporting working OK before (never with `opentelemetry::global::set_text_map_propagator(opentelemetry_jaeger::Propagator::new());` or the shutdown call), but when trying to run a much larger-scale load test I was never able to collect anything because of the packet size limits. The error helpfully suggests I try setting a smaller batch size, but I was totally unable to determine how to do that after reviewing the docs for a while.

Thanks for the hard work on this crate – my honest feedback is that the API is pretty challenging to figure out. Taking something like the above batch size problem, I can find https://docs.rs/opentelemetry/0.13.0/opentelemetry/sdk/trace/struct.BatchSpanProcessorBuilder.html#method.with_max_export_batch_size which looks like it might be what I want, but I can’t actually find any way to use that in combination with the Jaeger exporter.
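
To make that concrete, the wiring I imagined looks something like the sketch below, building the provider by hand with a smaller max export batch size. The `BatchSpanProcessor::builder` arguments, the `opentelemetry::runtime::Tokio` token, and especially how to obtain a standalone Jaeger `SpanExporter` are guesses on my part (they seem to vary between versions), and the exporter part is exactly what I could not figure out:

    use opentelemetry::{global, sdk::export::trace::SpanExporter, sdk::trace as sdktrace};

    // `exporter` stands in for a Jaeger span exporter; how to get one without
    // the all-in-one pipeline installer is the missing piece, so it is left as
    // a parameter instead of being spelled out.
    fn install_with_small_batches<E>(exporter: E)
    where
        E: SpanExporter + Send + 'static,
    {
        // Cap the spans per export so each Thrift/UDP payload stays under the
        // agent's 65000-byte limit.
        let processor =
            sdktrace::BatchSpanProcessor::builder(exporter, opentelemetry::runtime::Tokio)
                .with_max_export_batch_size(64) // tune downward until payloads fit
                .build();

        let provider = sdktrace::TracerProvider::builder()
            .with_span_processor(processor)
            .build();

        global::set_tracer_provider(provider);
    }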

Thanks again! Let me know if I can provide more info or feedback.

About this issue

  • State: closed
  • Created 3 years ago
  • Comments: 15 (6 by maintainers)

Most upvoted comments

I see a similar bug in Apollo router, running on the tokio multi-threaded runtime, in the main code, not in tests. Reproduction info is here.

The exporter is created here.

The fix proposed by @simon-an works, but I took some time to dig a bit more into it.

It happens when we try to reload the configuration and create a new subscriber. We need to replace the tracer provider because we might switch to a different exporter (I feel there's another way to do that without replacing the tracer provider, but I couldn't find it). It hangs when the `try_initialize_subscriber` function drops the result of `set_tracer_provider`, and I tracked it down to `BatchSpanProcessor::shutdown`:

https://github.com/open-telemetry/opentelemetry-rust/blob/a81c6c6e6c1d9c5a0def5384b5c6291bb56f0f18/opentelemetry/src/sdk/trace/span_processor.rs#L266-L275

That function is called because the TracerProvider calls shutdown on all of its span processors on drop.

Through generous application of println, I see this timeline:

  • set_tracer_provider is called
  • TracerProvider is dropped
  • BatchSpanProcessor::shutdown is called. At this point the batch task is still running fine, regularly receiving flush messages from the ticker
  • the shutdown function sends the shutdown message
  • the shutdown function blocks on the receiver. It hangs at this point. The batch task does not receive any new message, which is very surprising (the ticker should continue to send)
  • then the new batch task from the new tracer provider spawns and starts receiving flush messages from the ticker

I suspect a bad interaction between the tokio runtime and the futures executor. I tried replacing it with a call to tokio’s block_on, but then I get a panic because I am trying to start a new tokio runtime from inside an existing runtime.
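
A minimal sketch of the suspected interaction, distilled to a single-threaded scheduler (this is not the opentelemetry code, just the shape of the deadlock): the acknowledgement can only be produced by a task scheduled on the very thread that the blocking wait is parking.

    use tokio::sync::oneshot;

    // Stand-in for the batch task / shutdown handshake described above.
    #[tokio::main(flavor = "current_thread")]
    async fn main() {
        let (tx, rx) = oneshot::channel::<()>();

        // The task that would send the acknowledgement...
        tokio::spawn(async move {
            let _ = tx.send(());
        });

        // ...never runs: futures' block_on parks the only worker thread without
        // yielding back to the tokio scheduler, so this wait never completes.
        futures::executor::block_on(rx).unwrap();
    }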

This is not a blocking bug for us; as I said, the proposed fix works. But it would be nice to get a proper fix inside the opentelemetry crate, and I hope I gave enough context to help in fixing it 🙂

@TommyCpp I found a solution and updated the example project.

    // Run the blocking shutdown on tokio's blocking thread pool so it cannot
    // park the async worker thread that the batch task needs to make progress.
    tokio::runtime::Handle::current().spawn_blocking(move || {
        opentelemetry::global::shutdown_tracer_provider();
    }).await.unwrap();

Thanks for pointing out that there might be a current-thread scheduler in a multi-threaded runtime. That was not clear to me, but it makes sense in the context of a web server, where thread::spawn should never be executed on another worker's thread.

@TommyCpp Ah, sure, this makes sense and fixes it. It would be really great if the examples (READMEs like the one here) mentioned that using a multi-threaded runtime, or setting the current-thread batch executor, is necessary to avoid blocking - I think a lot of people are going to try out the library via tests like this when they first pick it up, and (obviously, lol) this can be a confusing thing to run into.
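
For anyone else who hits this in tests, the change boils down to something like the following sketch (pipeline setup again elided and version-dependent):

    // The default #[tokio::test] runtime is current-thread; a multi-threaded
    // runtime leaves a worker free to drive the exporter's batch task while
    // the shutdown call blocks.
    #[tokio::test(flavor = "multi_thread", worker_threads = 2)]
    async fn spans_are_flushed_on_shutdown() {
        // install the Jaeger/Zipkin batch pipeline here (version-dependent)

        // ... exercise the code under test, recording spans ...

        opentelemetry::global::shutdown_tracer_provider();
    }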

Thanks for the help! Feel free to close this issue, really appreciate it.