opentelemetry-python: Implement timeout mechanism for several metrics components

collect doesn’t return anything, so returning True until we implement timeout/error handling mechanism seemed reasonable to me. What do you think?

_Originally posted by @lonewolf3739 in https://github.com/open-telemetry/opentelemetry-python/pull/2401#discussion_r790980110_

Also, the timeout mechanism is required in several parts of the metrics spec:

  1. MeterProvider.shutdown
  2. MeterProvider.forceflush
  3. Asynchronous callbacks
  4. MetricReader.collect
  5. MetricReader.shutdown
  6. PushMetricExporter.forceflush
  7. PushMetricExporter.shutdown

About this issue

  • Original URL
  • State: closed
  • Created 2 years ago
  • Comments: 21 (21 by maintainers)

Most upvoted comments

Based on the linked issues, the previous decision was to not use the signal based approach for implementing generic timeout mechanism because it won’t work outside of the main thread, and it can interfere with applications using the same signal handler already.

For some background,

  • The Java methods return a future like object which lets user join() with a certain timeout
  • JS doesn’t appear to do anything special, just returns a Promise or accepts a callback. There is no way to cancel
  • Go simply passes the ctx and tells the implementor to respect the cancellation/timeout
  • .NET accepts timeout for shutdown and force flush which the implementor should respect

Based on that, I think it’s reasonable to just pass down a timeout duration where possible and document that implementors should respect the timeout. This doesn’t actually implement “SHOULD complete or abort within some timeout” but is not terribly opinionated.

It would be great if the mechanism we add for metrics can also be used in the tracing SDK, where timeouts were never implemented previously. The problem is that adding a timeout optional parameter to the SpanExporter, SpanProcessor etc. interface signatures will break existing implementations of those interfaces when we go to call them. We could potentially defensively call with the timeout param and fallback to not passing timeout e.g. here

try:
  self._active_span_processor.shutdown(timeout=timeout)
except TypeError:
  self._active_span_processor.shutdown()

2. Do you mean that if we recommended users to use **kwargs they would misuse it? Are we trying to make sure mistakes won’t be made by forcing them to pass in a dataclass?

Yes trying to prevent them from making mistakes and messing up the signature. Im implementing this in https://github.com/open-telemetry/opentelemetry-python/pull/2664

So passing timeout_millis down the call stack till exporter is not the really the expected behaviour?

I guess what I’m proposing is all of these blocking calls respect the timeout they receive as the maximum for the entire blocking call e.g. Exporter.export(..., timeout) should never block longer than the timeout passed in. If the exporter is doing queueing or backoffs, it can implement that as it pleases, respecting the overall deadline for the export. That would work for the scenario where OTEL_METRIC_EXPORT_TIMEOUT < OTEL_EXPORTER_OTLP_TIMEOUT, or maybe you’re sending cumulatives every 5 seconds and you’d rather drop a slow request and send the next collection interval’s data instead of doing exponential backoff.

This is basically how Go contexts work out of the box. When you create a context with a timeout, it calculates the deadline as now + timeout. If you make a child contexts with a different timeout (e.g. a single request in a set of backoffs), it will respect the sooner deadline between parent and child contexts. That is the most intuitive behavior to me.

Our SynchronousMultiSpanProcessor.force_flush() already behaves this way for example.

I think potentially we should try to add *args, **kwargs or config objects to any user implementable interfaces going forward to avoid this kind of issue.