opentelemetry-python: Process never exits due to hang in shutdown

We have a process that sometimes fails to exit until we hit Ctrl-C. The traceback at that point shows the process hung while trying to “join” a worker thread:

Traceback (most recent call last):
  File "/usr/local/lib/python3.7/dist-packages/opentelemetry/sdk/trace/__init__.py", line 1136, in shutdown
    self._active_span_processor.shutdown()
  File "/usr/local/lib/python3.7/dist-packages/opentelemetry/sdk/trace/__init__.py", line 166, in shutdown
    sp.shutdown()
  File "/usr/local/lib/python3.7/dist-packages/opentelemetry/sdk/trace/export/__init__.py", line 385, in shutdown
    self.worker_thread.join()
  File "/usr/lib/python3.7/threading.py", line 1044, in join
    self._wait_for_tstate_lock()
  File "/usr/lib/python3.7/threading.py", line 1060, in _wait_for_tstate_lock
    elif lock.acquire(block, timeout):

In this case, the process completed its work in about 500 ms and would have produced only a few spans. The process was configured to export to a remote OpenTelemetry collector, which may or may not have been reachable at the time - but we expect the collector’s reachability not to interfere with process execution. We gave the process about 30 seconds to exit before terminating it.

This, of course, interferes with auto-scaling and our ability to manage processes.

Please advise.

About this issue

  • State: open
  • Created 3 years ago
  • Reactions: 6
  • Comments: 30 (15 by maintainers)

Most upvoted comments

@ocelotl @lonewolf3739 I think the retry mechanism with jitter should only apply during the normal operation of a service. That should be enough to comply with the spec. Once an application starts to shut down, I think it would be more reasonable to make an immediate and final attempt (or maybe a few, without exponential back-off) at exporting everything in memory, and to quit after that.
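
A minimal sketch of that idea (hypothetical code, not the SDK’s actual retry loop; shutdown_event and export_once are placeholder names) could look like this:

import random
import threading
import time

shutdown_event = threading.Event()  # set by the application's shutdown path

def export_with_retry(export_once, max_attempts=5):
    # export_once() returns True on success. During normal operation we retry
    # with jittered exponential backoff; once shutdown has begun we make one
    # immediate final attempt and then give up.
    for attempt in range(max_attempts):
        if export_once():
            return True
        if shutdown_event.is_set():
            return export_once()  # final attempt, no backoff
        delay = min(2 ** attempt, 64) * random.uniform(0.5, 1.5)  # jittered backoff
        time.sleep(delay)
    return False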

I am still seeing this problem. Is there a workaround?

Error:

^CException ignored in atexit callback: <bound method MeterProvider.shutdown of <opentelemetry.sdk.metrics._internal.MeterProvider object at 0x7fffe6ca54e0>>
Traceback (most recent call last):
  File "/home/nv/utils/ci/nvci_create/1.2.1/lib/python3.10/site-packages/opentelemetry/sdk/metrics/_internal/__init__.py", line 440, in shutdown
    metric_reader.shutdown(
  File "/home/nv/utils/ci/nvci_create/1.2.1/lib/python3.10/site-packages/opentelemetry/sdk/metrics/_internal/export/__init__.py", line 543, in shutdown
    self._daemon_thread.join(
  File "/home/utils/Python/builds/3.10.10-20230601/lib/python3.10/threading.py", line 1100, in join
    self._wait_for_tstate_lock(timeout=max(timeout, 0))
  File "/home/utils/Python/builds/3.10.10-20230601/lib/python3.10/threading.py", line 1116, in _wait_for_tstate_lock
    if lock.acquire(block, timeout):

What I don’t understand is: what is it trying to do for 64 seconds?

It is mostly waiting. We have to do this; it is required by the spec, and the intention of the spec is to protect the destination from being overwhelmed. If we allowed the export to wait 0 seconds between retries, we would make that possible, and it would be against the spec. In my opinion we should not allow the user to substitute a different delay algorithm, since the spec requires this one.
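
To make the numbers concrete, here is a hypothetical illustration of a doubling backoff schedule capped at 64 seconds (the SDK’s exact delays and jitter may differ); the time is spent sleeping between retry attempts, and the final gap alone can account for the roughly 64 seconds observed:

delays = [min(2 ** attempt, 64) for attempt in range(7)]
print(delays)       # [1, 2, 4, 8, 16, 32, 64]
print(sum(delays))  # 127 seconds of waiting if every attempt fails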

If you have a specific need, for testing or otherwise, I would recommend writing a custom exporter; it should not be very hard:

https://github.com/open-telemetry/opentelemetry-python/blob/main/exporter/opentelemetry-exporter-otlp-proto-grpc/src/opentelemetry/exporter/otlp/proto/grpc/exporter.py#L310

Changing the delay to 0 in that line should do the trick.
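
If forking the exporter module is not appealing, a different way to illustrate the same idea is a thin wrapper that delegates to the stock OTLP exporter but enforces a hard per-batch deadline. This is a hedged sketch, not anything shipped by the SDK; the class name DeadlineSpanExporter and the 5-second default deadline are invented for illustration, and batches that exceed the deadline are simply dropped:

import threading

from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.trace.export import SpanExporter, SpanExportResult

class DeadlineSpanExporter(SpanExporter):
    # Delegates to OTLPSpanExporter but never blocks longer than `deadline`
    # seconds, so an unreachable collector cannot hold up process shutdown.
    def __init__(self, deadline=5.0, **otlp_kwargs):
        self._deadline = deadline
        self._inner = OTLPSpanExporter(**otlp_kwargs)

    def export(self, spans):
        result = [SpanExportResult.FAILURE]

        def _run():
            result[0] = self._inner.export(spans)

        worker = threading.Thread(target=_run, daemon=True)
        worker.start()
        worker.join(self._deadline)
        # If the inner exporter is still retrying, give up on this batch.
        return result[0]

    def shutdown(self):
        self._inner.shutdown()

    def force_flush(self, timeout_millis=30000):
        return True

Used as BatchSpanProcessor(DeadlineSpanExporter(endpoint="http://collector:4317")) (the endpoint here is a placeholder), this bounds how long shutdown can block, at the cost of possibly losing the final batch.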

I’m sorry, I wish we could provide a solution for this directly in our published code, but I think for this particular situation the spec does not allow us to do so (@owais, @lonewolf3739 please comment if you disagree).

Please let me know if you need help implementing this, @flashgorman-chorus, you can contact me directly via Slack at the CNCF workspace.

Thanks. We will await your thoughtful decision. But, just to be clear, we are talking about a process exiting naturally - no human or Ctrl+C involved. I don’t know whether SIGKILL is involved in a “natural exit”, but if that only happens under Ctrl+C or kill <pid> scenarios, I want to make it clear that we are hoping for natural process exit without human intervention and without delay.

@flashgorman-chorus there is no special timeout during shutdown today. We just have a single retry timeout, which we could make configurable, but I’m not sure setting it to 0 would be a good solution to this problem.

I think what we should do is detect shutdown (e.g. SIGTERM or interpreter exit; SIGKILL itself cannot be intercepted) and make one last attempt to flush everything in memory, without retry.
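
Until something like that exists in the SDK, one application-side workaround (a sketch under assumptions: your own code creates the provider, and the 5-second flush budget is an arbitrary choice, not an SDK default) is to disable the SDK’s atexit hook and install your own bounded shutdown path:

import atexit
import signal
import sys

from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

# Keep shutdown timing under our own control instead of the SDK's atexit handler.
provider = TracerProvider(shutdown_on_exit=False)
provider.add_span_processor(BatchSpanProcessor(OTLPSpanExporter()))
trace.set_tracer_provider(provider)

def _bounded_flush():
    # Wait at most 5 seconds for in-memory spans, then let the process exit.
    provider.force_flush(timeout_millis=5000)

atexit.register(_bounded_flush)  # covers natural process exit

def _on_sigterm(signum, frame):
    _bounded_flush()
    sys.exit(0)

# SIGKILL cannot be intercepted; SIGTERM is the signal a handler can actually see.
signal.signal(signal.SIGTERM, _on_sigterm)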