opentelemetry-python: Process never exits due to hang in shutdown
We have a process that sometimes fails to exit until we hit Ctrl-C. A traceback taken at that time shows the process is hung, attempting to "join" a worker thread:
```
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/dist-packages/opentelemetry/sdk/trace/__init__.py", line 1136, in shutdown
    self._active_span_processor.shutdown()
  File "/usr/local/lib/python3.7/dist-packages/opentelemetry/sdk/trace/__init__.py", line 166, in shutdown
    sp.shutdown()
  File "/usr/local/lib/python3.7/dist-packages/opentelemetry/sdk/trace/export/__init__.py", line 385, in shutdown
    self.worker_thread.join()
  File "/usr/lib/python3.7/threading.py", line 1044, in join
    self._wait_for_tstate_lock()
  File "/usr/lib/python3.7/threading.py", line 1060, in _wait_for_tstate_lock
    elif lock.acquire(block, timeout):
```
In this case, the process completed its work in about 500 ms and would have produced only a few spans. The process was configured to export to a remote OpenTelemetry Collector, which may or may not have been reachable at the time, but we expect the reachability of the collector not to interfere with process execution. We gave it roughly 30 seconds to complete before terminating the process.
This, of course, interferes with auto-scaling and our ability to manage processes.
Please advise.
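For reference, a minimal sketch of this kind of setup (assuming the gRPC OTLP exporter; the endpoint and span name are placeholders, not our real configuration):

```python
# Minimal sketch of the setup; the collector endpoint is a placeholder.
# When the collector is unreachable, the process can hang at interpreter
# exit while the SDK's atexit hook calls shutdown() and joins the
# BatchSpanProcessor worker thread.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

provider = TracerProvider()
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="collector.internal:4317", insecure=True))
)
trace.set_tracer_provider(provider)

tracer = trace.get_tracer(__name__)
with tracer.start_as_current_span("short-lived-work"):
    pass  # the real work takes ~500 ms and produces only a few spans

# The process should exit here without further delay.
```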
About this issue
- State: open
- Created 3 years ago
- Reactions: 6
- Comments: 30 (15 by maintainers)
Commits related to this issue
- export and shutdown timeouts for all OTLP exporters Fixes #2284, #2663, #3309 — committed to Arnatious/opentelemetry-python by Arnatious 4 months ago
@ocelotl @lonewolf3739 I think the retry mechanism with jitter should only apply during normal operation of a service; that should be enough to comply with the spec. Once an application starts to shut down, I think it would be more reasonable to make an immediate and final attempt (or maybe a few attempts, without exponential back-off) at exporting everything in memory and quit after that.
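A rough sketch of that idea, purely illustrative (this is not SDK code; `send`, the delay schedule, and the shutdown flag are all invented for the example):

```python
# Illustrative only: a retry loop that stops backing off once shutdown begins.
import threading
import time

_shutdown = threading.Event()  # set when the application starts shutting down


def export_with_retry(send, batch, delays=(1, 2, 4, 8)):
    """Call send(batch), backing off between attempts only while still running."""
    for delay in delays:
        if send(batch):
            return True
        if _shutdown.is_set():
            # Shutting down: one final attempt, no sleeping, then give up.
            return send(batch)
        time.sleep(delay)
    return False
```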
I am still seeing this problem. Is there a workaround?
Error:
It is mostly waiting. We have to do this: it is required by the spec, and the intention of the spec is to protect the destination from being overwhelmed. If we allowed the exporting to wait 0 seconds, we would make it possible for the destination to be overwhelmed, and that would be against the spec. In my opinion we should not allow the user to introduce another delay algorithm, as the spec requires this one.
If you have specific purposes, testing or otherwise, I would recommend making a custom exporter; it should not be very hard:
https://github.com/open-telemetry/opentelemetry-python/blob/main/exporter/opentelemetry-exporter-otlp-proto-grpc/src/opentelemetry/exporter/otlp/proto/grpc/exporter.py#L310
Changing `delay` to `0` in that line should do the trick. I'm sorry, I wish we could provide a solution for this directly in our published code, but I think for this particular situation the spec does not allow us to do so (@owais, @lonewolf3739 please comment if you disagree).
Please let me know if you need help implementing this, @flashgorman-chorus, you can contact me directly via Slack at the CNCF workspace.
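If it helps others in the meantime, here is one rough way to get a similar effect without copying SDK internals: instead of changing `delay` to `0` inside the exporter, wrap the stock exporter and cap how long a single export call may block. This is only a sketch, not part of the published SDK, and the class name and deadline are made up for the example:

```python
# Sketch: bound each export attempt so BatchSpanProcessor.shutdown() cannot
# block indefinitely on the exporter's retry/backoff loop.
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FuturesTimeout
from typing import Sequence

from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.trace import ReadableSpan
from opentelemetry.sdk.trace.export import SpanExporter, SpanExportResult


class DeadlineSpanExporter(SpanExporter):
    """Delegates to OTLPSpanExporter but gives up after `deadline_s` seconds."""

    def __init__(self, deadline_s: float = 5.0, **otlp_kwargs):
        self._inner = OTLPSpanExporter(**otlp_kwargs)
        self._deadline_s = deadline_s
        self._pool = ThreadPoolExecutor(max_workers=1)

    def export(self, spans: Sequence[ReadableSpan]) -> SpanExportResult:
        future = self._pool.submit(self._inner.export, spans)
        try:
            return future.result(timeout=self._deadline_s)
        except FuturesTimeout:
            # The inner exporter may still be retrying in its own thread,
            # but the span processor is no longer blocked waiting for it.
            return SpanExportResult.FAILURE

    def shutdown(self) -> None:
        self._pool.shutdown(wait=False)
        self._inner.shutdown()

    def force_flush(self, timeout_millis: int = 30_000) -> bool:
        return True
```

With something like this, `BatchSpanProcessor.shutdown()` still joins its worker thread, but every remaining export attempt is bounded by `deadline_s`, so shutdown completes in bounded time (at the cost of possibly dropping spans).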
Thanks. Will await your thoughtful decision. But, just to be clear, we are talking about a process exiting naturally, with no human or Ctrl+C involved. I don't know if SIGKILL is involved under a "natural exit", but if that only happens under Ctrl+C or `kill <pid>` scenarios, I wanted to make sure it is clear we are hoping for natural process exit without human intervention and without delay.

@flashgorman-chorus there is no special timeout during shutdown today. We just have a single retry timeout, which we could make configurable, but I'm not sure setting it to `0` would be a good solution to this problem. I think what we should do is detect SIGKILL and make one last attempt to flush everything in memory without retry.
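For anyone who needs the process to exit today, one application-side workaround (not the SDK-level fix being discussed here) is to opt out of the SDK's own atexit shutdown hook and register a bounded flush instead. A sketch, assuming the `shutdown_on_exit` constructor flag and the public `force_flush` API; you trade the guaranteed final export for a guaranteed exit:

```python
# Sketch: skip the SDK's atexit shutdown (which joins the worker thread with
# no timeout) and do a time-bounded flush ourselves on exit.
import atexit

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

provider = TracerProvider(shutdown_on_exit=False)  # no blocking atexit shutdown
provider.add_span_processor(BatchSpanProcessor(OTLPSpanExporter()))
trace.set_tracer_provider(provider)


def _flush_with_deadline() -> None:
    # Give queued spans a few seconds to drain, then let the process exit;
    # the BatchSpanProcessor worker is a daemon thread, so it will not keep
    # the interpreter alive.
    provider.force_flush(timeout_millis=5000)


atexit.register(_flush_with_deadline)
```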