prefect: Occasional http2 connection errors (KeyError)

First check

  • I added a descriptive title to this issue.
  • I used the GitHub search to find a similar issue and didn’t find it.
  • I searched the Prefect documentation for this issue.
  • I checked that this issue is related to Prefect and not one of its dependencies.

Bug summary

Occasionally, flows crash with a connection-related exception that seems to originate from h2. So far we have only observed this in longer flow runs (>2h), and it does not seem to be tied to any specific workload.

Possibly related to https://github.com/PrefectHQ/prefect/issues/7442, https://github.com/PrefectHQ/prefect/pull/9429

Reproduction

No deterministic reproduction is known. The crash shows up when enough flows are left running for long enough.

Error

Crash detected! Execution was interrupted by an unexpected exception: KeyError: 789

prefect.flow_runs
Crash details:
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.8/site-packages/prefect/_internal/concurrency/calls.py", line 293, in aresult
    return await asyncio.wrap_future(self.future)
asyncio.exceptions.CancelledError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.8/contextlib.py", line 189, in __aexit__
    await self.gen.athrow(typ, value, traceback)
  File "/home/ray/anaconda3/lib/python3.8/site-packages/prefect/task_runners.py", line 187, in start
    yield self
  File "/home/ray/anaconda3/lib/python3.8/site-packages/prefect/engine.py", line 539, in begin_flow_run
    terminal_or_paused_state = await orchestrate_flow_run(
  File "/home/ray/anaconda3/lib/python3.8/site-packages/prefect/engine.py", line 849, in orchestrate_flow_run
    result = await flow_call.aresult()
  File "/home/ray/anaconda3/lib/python3.8/site-packages/prefect/_internal/concurrency/calls.py", line 295, in aresult
    raise CancelledError() from exc
prefect._internal.concurrency.cancellation.CancelledError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.8/site-packages/prefect/engine.py", line 2221, in report_flow_run_crashes
    yield
  File "/home/ray/anaconda3/lib/python3.8/contextlib.py", line 662, in __aexit__
    cb_suppress = await cb(*exc_details)
  File "/home/ray/anaconda3/lib/python3.8/site-packages/anyio/_backends/_asyncio.py", line 597, in __aexit__
    raise exceptions[0]
  File "/home/ray/anaconda3/lib/python3.8/site-packages/prefect/engine.py", line 1597, in create_task_run_then_submit
    task_run = await create_task_run(
  File "/home/ray/anaconda3/lib/python3.8/site-packages/prefect/engine.py", line 1642, in create_task_run
    task_run = await flow_run_context.client.create_task_run(
  File "/home/ray/anaconda3/lib/python3.8/site-packages/prefect/client/orchestration.py", line 1986, in create_task_run
    response = await self._client.post(
  File "/home/ray/anaconda3/lib/python3.8/site-packages/httpx/_client.py", line 1877, in post
    return await self.request(
  File "/home/ray/anaconda3/lib/python3.8/site-packages/httpx/_client.py", line 1559, in request
    return await self.send(request, auth=auth, follow_redirects=follow_redirects)
  File "/home/ray/anaconda3/lib/python3.8/site-packages/prefect/client/base.py", line 282, in send
    response = await self._send_with_retry(
  File "/home/ray/anaconda3/lib/python3.8/site-packages/prefect/client/base.py", line 216, in _send_with_retry
    response = await request()
  File "/home/ray/anaconda3/lib/python3.8/site-packages/httpx/_client.py", line 1646, in send
    response = await self._send_handling_auth(
  File "/home/ray/anaconda3/lib/python3.8/site-packages/httpx/_client.py", line 1674, in _send_handling_auth
    response = await self._send_handling_redirects(
  File "/home/ray/anaconda3/lib/python3.8/site-packages/httpx/_client.py", line 1711, in _send_handling_redirects
    response = await self._send_single_request(request)
  File "/home/ray/anaconda3/lib/python3.8/site-packages/httpx/_client.py", line 1748, in _send_single_request
    response = await transport.handle_async_request(request)
  File "/home/ray/anaconda3/lib/python3.8/site-packages/httpx/_transports/default.py", line 371, in handle_async_request
    resp = await self._pool.handle_async_request(req)
  File "/home/ray/anaconda3/lib/python3.8/site-packages/httpcore/_async/connection_pool.py", line 268, in handle_async_request
    raise exc
  File "/home/ray/anaconda3/lib/python3.8/site-packages/httpcore/_async/connection_pool.py", line 251, in handle_async_request
    response = await connection.handle_async_request(request)
  File "/home/ray/anaconda3/lib/python3.8/site-packages/httpcore/_async/connection.py", line 103, in handle_async_request
    return await self._connection.handle_async_request(request)
  File "/home/ray/anaconda3/lib/python3.8/site-packages/httpcore/_async/http2.py", line 185, in handle_async_request
    raise exc
  File "/home/ray/anaconda3/lib/python3.8/site-packages/httpcore/_async/http2.py", line 144, in handle_async_request
    await self._send_request_body(request=request, stream_id=stream_id)
  File "/home/ray/anaconda3/lib/python3.8/site-packages/httpcore/_async/http2.py", line 261, in _send_request_body
    await self._send_end_stream(request, stream_id)
  File "/home/ray/anaconda3/lib/python3.8/site-packages/httpcore/_async/http2.py", line 280, in _send_end_stream
    self._h2_state.end_stream(stream_id)
  File "/home/ray/anaconda3/lib/python3.8/site-packages/h2/connection.py", line 883, in end_stream
    frames = self.streams[stream_id].end_stream()
KeyError: 789

Versions

Version:             2.14.11
API version:         0.8.4
Python version:      3.8.15
Git commit:          e6d7d76d
Built:               Thu, Dec 14, 2023 5:45 PM
OS/Arch:             linux/x86_64
Server type:         cloud

Additional context

The stream_id from the final KeyError is different for each crash.
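
To make the failure shape concrete, here is a toy model of the last frame of the traceback: an unchecked dictionary lookup keyed by stream id. It is a sketch only and not h2's actual implementation. `ToyConnection`, `drop_stream`, and `simulate` are hypothetical names, and the idea that the stream entry is removed before `end_stream` is called is an assumption, not a confirmed root cause. The one fact borrowed from HTTP/2 itself is that client-initiated streams use odd ids.

```python
# Toy model only: this is NOT h2's real code, just the same lookup pattern.
class ToyConnection:
    """Tracks streams in a dict keyed by stream id, like the failing lookup."""

    def __init__(self):
        self.streams = {}
        self._next_stream_id = 1  # client-initiated HTTP/2 stream ids are odd

    def open_stream(self):
        stream_id = self._next_stream_id
        self._next_stream_id += 2
        self.streams[stream_id] = "open"
        return stream_id

    def drop_stream(self, stream_id):
        # Hypothetical: something removes the entry while a request
        # is still in the middle of sending its body.
        self.streams.pop(stream_id, None)

    def end_stream(self, stream_id):
        # Unchecked lookup: raises KeyError(stream_id) if the entry is gone.
        state = self.streams[stream_id]
        self.streams[stream_id] = "half-closed"
        return state


def simulate(requests_before_failure):
    """Return the key of the KeyError raised after N healthy requests."""
    conn = ToyConnection()
    for _ in range(requests_before_failure):
        conn.end_stream(conn.open_stream())
    stream_id = conn.open_stream()
    conn.drop_stream(stream_id)
    try:
        conn.end_stream(stream_id)
    except KeyError as exc:
        return exc.args[0]
    return None
```

In this model the number in the `KeyError` is simply the id of whichever stream was in flight, so it changes from crash to crash. Under the same assumptions, an id such as 789 would correspond to a connection that had already carried a few hundred requests, which would fit the observation that only long flow runs are affected.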

About this issue

  • State: open
  • Created 5 months ago
  • Comments: 16 (7 by maintainers)

Most upvoted comments

Hey, it’s really difficult to say what is going on without a reproduction.

Can you try pinning h2 < 4.0.0 and see if that helps? It looks like they released 4.0.0 two weeks ago, and that lines up with the timeline for your errors. Edit: disregard this suggestion, I clicked on the wrong h2 repo 🤦‍♂️