opentelemetry-python: Implement timeout mechanism for several metrics components
`collect` doesn’t return anything, so returning True until we implement a timeout/error handling mechanism seemed reasonable to me. What do you think?
_Originally posted by @lonewolf3739 in https://github.com/open-telemetry/opentelemetry-python/pull/2401#discussion_r790980110_
Also, the timeout mechanism is required in several parts of the metrics spec.
About this issue
- State: closed
- Created 2 years ago
- Comments: 21 (21 by maintainers)
Based on the linked issues, the previous decision was not to use the signal-based approach for implementing a generic timeout mechanism, because it won’t work outside of the main thread and it can interfere with applications that already use the same signal handler.
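For illustration, a signal-based timeout would look roughly like the sketch below (a hypothetical helper, not SDK code); `signal.signal` raises `ValueError` when called off the main thread and replaces any existing `SIGALRM` handler, which is exactly the problem described above:

```python
import signal


def run_with_alarm_timeout(func, timeout_seconds):
    """Hypothetical SIGALRM-based timeout; Unix-only and main-thread-only."""

    def _on_timeout(signum, frame):
        raise TimeoutError("operation timed out")

    # signal.signal() raises ValueError if called from any thread other than
    # the main thread, and it replaces whatever SIGALRM handler the host
    # application may already have installed.
    previous_handler = signal.signal(signal.SIGALRM, _on_timeout)
    signal.alarm(timeout_seconds)
    try:
        return func()
    finally:
        signal.alarm(0)  # cancel the pending alarm
        signal.signal(signal.SIGALRM, previous_handler)
```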
For some background, we already call `join()` with a certain timeout in places. Based on that, I think it’s reasonable to just pass down a timeout duration where possible and document that implementors should respect the timeout. This doesn’t actually implement “SHOULD complete or abort within some timeout” but is not terribly opinionated.
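As a rough sketch of that approach (names such as `timeout_millis` and `PeriodicReader` are illustrative and may not match the actual SDK signatures), the timeout is simply passed down and its meaning documented:

```python
import abc
from typing import Sequence


class MetricExporter(abc.ABC):
    """Illustrative interface only; the real SDK classes may differ."""

    @abc.abstractmethod
    def export(self, metrics: Sequence, timeout_millis: float = 10_000) -> bool:
        """Export a batch of metrics.

        Implementations SHOULD return or abort within ``timeout_millis``;
        nothing in the SDK forcibly interrupts a misbehaving exporter.
        """


class PeriodicReader:
    def __init__(self, exporter: MetricExporter, export_timeout_millis: float = 10_000):
        self._exporter = exporter
        self._export_timeout_millis = export_timeout_millis

    def _do_export(self, metrics: Sequence) -> bool:
        # Pass the timeout down; respecting it is documented, not enforced.
        return self._exporter.export(metrics, timeout_millis=self._export_timeout_millis)
```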
It would be great if the mechanism we add for metrics could also be used in the tracing SDK, where timeouts were never implemented previously. The problem is that adding an optional timeout parameter to the `SpanExporter`, `SpanProcessor`, etc. interface signatures will break existing implementations of those interfaces when we go to call them. We could potentially defensively call with the timeout param and fall back to not passing the timeout, e.g. here
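Concretely, the defensive call could be something like this sketch (assuming a hypothetical `timeout_millis` keyword that older exporter implementations do not accept):

```python
def _export_with_optional_timeout(exporter, batch, timeout_millis):
    # Prefer the new signature; fall back for implementations written against
    # the old interface that don't take a timeout argument.
    try:
        return exporter.export(batch, timeout_millis=timeout_millis)
    except TypeError:
        return exporter.export(batch)
```

A caveat is that a `TypeError` raised inside a conforming `export()` would also trigger the fallback, so checking the implementation’s signature up front (e.g. with `inspect.signature`) may be the safer variant.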
Yes, trying to prevent them from making mistakes and messing up the signature. I’m implementing this in https://github.com/open-telemetry/opentelemetry-python/pull/2664
I guess what I’m proposing is that all of these blocking calls respect the timeout they receive as the maximum for the entire blocking call, e.g. `Exporter.export(..., timeout)` should never block longer than the timeout passed in. If the exporter is doing queueing or backoffs, it can implement that as it pleases, respecting the overall deadline for the export. That would work for the scenario where `OTEL_METRIC_EXPORT_TIMEOUT < OTEL_EXPORTER_OTLP_TIMEOUT`, or maybe you’re sending cumulatives every 5 seconds and you’d rather drop a slow request and send the next collection interval’s data instead of doing exponential backoff.

This is basically how Go contexts work out of the box. When you create a context with a timeout, it calculates the deadline as `now + timeout`. If you make a child context with a different timeout (e.g. a single request in a set of backoffs), it will respect the sooner of the parent and child deadlines. That is the most intuitive behavior to me.
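A sketch of that deadline arithmetic (illustrative only; `send_request` stands in for whatever actually performs the network call):

```python
import time


def export_with_deadline(send_request, batch, timeout_s, per_request_timeout_s=10.0):
    """Never block longer than timeout_s overall, even across retries/backoff."""
    deadline = time.monotonic() + timeout_s  # "now + timeout"
    backoff_s = 1.0
    while True:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            return False  # overall deadline exceeded; drop this batch
        # Each attempt gets the sooner of its own timeout and what is left of the
        # overall deadline, like a child context inheriting its parent's deadline.
        if send_request(batch, timeout=min(per_request_timeout_s, remaining)):
            return True
        time.sleep(min(backoff_s, max(deadline - time.monotonic(), 0.0)))
        backoff_s *= 2
```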
Our `SynchronousMultiSpanProcessor.force_flush()` already behaves this way, for example.

I think potentially we should try to add `*args, **kwargs` or config objects to any user-implementable interfaces going forward to avoid this kind of issue.
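A tiny sketch of that idea (not the SDK’s actual interface definitions): if the base interface accepts `**kwargs`, the SDK can later start passing a timeout without breaking subclasses that predate it:

```python
class SpanExporter:
    def export(self, spans, **kwargs):
        """Subclasses may ignore keyword arguments they don't understand, so the
        SDK can later pass e.g. timeout_millis without breaking them."""
        raise NotImplementedError


class MyExporter(SpanExporter):
    def export(self, spans, **kwargs):
        timeout_millis = kwargs.get("timeout_millis")  # None if the SDK didn't pass one
        # ... send spans, honouring timeout_millis when it is given ...
        return True
```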