prefect: Network failures with self-hosted servers

This is a tracking issue for various reports of network failures when self-hosting Prefect Orion.

Notably, these reports seem to be concentrated in Prefect 2.6.6.

If adding a report to this issue, please include the following information:

  • If using Prefect official Docker images for the client or server, provide the image tags
  • On the server, we are interested in the Prefect version, the database, and server library versions:
        prefect version
        pip freeze | grep -E '(uvicorn|starlette)'
  • On the client, we are interested in Prefect versions and the client HTTP library versions:
        prefect version
        pip freeze | grep -E '(httpx|httpcore)'
  • Please include the full traceback for the error
  • Check for any related error logs on the server
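
The version details requested above can also be gathered in one step from Python. The following is a minimal sketch that uses only the standard library's importlib.metadata; the helper name collect_versions is hypothetical and not part of Prefect, and the database version still has to be reported separately.

```python
from importlib.metadata import PackageNotFoundError, version

# Packages named in the checklist above (server-side and client-side libraries).
PACKAGES = ("prefect", "uvicorn", "starlette", "httpx", "httpcore")


def collect_versions(packages=PACKAGES):
    """Return a mapping of package name to installed version, or a placeholder."""
    report = {}
    for name in packages:
        try:
            report[name] = version(name)
        except PackageNotFoundError:
            # The package is not installed in this environment.
            report[name] = "not installed"
    return report


if __name__ == "__main__":
    for name, ver in collect_versions().items():
        print(f"{name}=={ver}")
```

Run it once in the server environment and once in the client environment, then paste both outputs into the report.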

About this issue

  • State: closed
  • Created 2 years ago
  • Reactions: 1
  • Comments: 28 (14 by maintainers)

Most upvoted comments

We haven’t had this error since we upgraded to 2.7.0 earlier today 👍

@peytonrunyan: Haven’t tried the branch, but is retrying the right solution here? It sounds a bit like playing the lottery to me: just keep resending the same request and hope that it eventually goes through.

If the issue is caused by a high volume of requests, as @madkinsz suggests, shouldn’t a solution involve some way to throttle or queue requests on the agents? (Just a thought; I’m not that familiar with how Prefect is designed.)
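
To make the throttling idea concrete, here is a minimal sketch of client-side throttling with asyncio.Semaphore. It is not how Prefect agents work; throttled_gather and fake_request are hypothetical names, and fake_request merely stands in for an API call.

```python
import asyncio


async def throttled_gather(factories, limit=10):
    """Run coroutine factories with at most `limit` of them in flight at once."""
    semaphore = asyncio.Semaphore(limit)

    async def run_one(factory):
        # Each call waits here until one of the `limit` slots is free.
        async with semaphore:
            return await factory()

    # gather preserves the order of the inputs in its result list.
    return await asyncio.gather(*(run_one(f) for f in factories))


async def fake_request(i):
    # Stand-in for an API call; a real client call would go here.
    await asyncio.sleep(0.01)
    return i * 2


if __name__ == "__main__":
    jobs = [lambda i=i: fake_request(i) for i in range(50)]
    results = asyncio.run(throttled_gather(jobs, limit=5))
    print(len(results))
```

Passing factories rather than coroutine objects means no request starts until a slot is available, so the server sees at most limit concurrent requests from this client.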

I am getting the same errors. I am running a custom build of Prefect 2.6.5 (same for server and clients) with the agent-limit feature from this PR (but I do not think this is related). The logs match what others have reported: ConnectionResetError: [Errno 104] Connection reset by peer, anyio.BrokenResourceError, httpcore.ReadError, and httpx.ReadError.

It happens when calling run_deployment from inside a task in the parent flow.

This issue makes Prefect 2 pretty unstable for production environments.

@madkinsz it looks like we got it addressed, so I’m going to go ahead and close this issue. Feel free to reopen it if you think there’s something else that needs handling.

I can also confirm that flows that were regularly failing for me with this issue now seem to be working fine.

Howdy y’all! Any chance anyone here would be interested in giving this branch a shot to see if it resolves the problems? https://github.com/PrefectHQ/prefect/pull/7593

@MuFaheemkhan, @eudyptula, @carlo-catalyst, @andreas-ntonas

We are seeing BrokenPipeError: [Errno 32] Broken pipe in flows that start around 800 tasks in one go with map. The logs also include a RuntimeError: The connection pool was closed while 325 HTTP requests/responses were still in-flight.

Also, the reason behind the HTTP 500 was a database timeout, so we had to increase the query timeout from the default of 1 second.

So, from our perspective, it could very well be an issue with high volume.

I believe these issues are basically caused by a high volume of requests: we see them with the agent, which polls frequently, and now with run_deployment, which also polls frequently.
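
One common way to soften bursts from frequent polling is to retry transient connection errors with exponential backoff and jitter, so that clients do not all resend at the same moment. The sketch below is an illustration only and is not taken from Prefect or from the linked branch; call_with_backoff and its parameters are hypothetical, and it catches only the built-in ConnectionResetError and BrokenPipeError mentioned in this thread.

```python
import random
import time

# Built-in exception types matching errors reported in this thread.
TRANSIENT_ERRORS = (ConnectionResetError, BrokenPipeError)


def call_with_backoff(func, attempts=5, base_delay=0.5, max_delay=8.0, sleep=time.sleep):
    """Call `func`, retrying transient connection errors with growing delays."""
    for attempt in range(attempts):
        try:
            return func()
        except TRANSIENT_ERRORS:
            if attempt == attempts - 1:
                raise  # Out of attempts: surface the original error.
            # The delay ceiling doubles on each attempt, capped at max_delay.
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Full jitter: sleep a random amount up to the ceiling.
            sleep(random.uniform(0, delay))
```

The sleep parameter is injectable so the behavior can be exercised without real waiting. Retrying treats the symptom only; if the server is overloaded, reducing request volume is still needed.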

@MuFaheemkhan could you please edit your post to contain the full Docker image tag if you are using one of our official images or include the versions of the libraries as requested? A full traceback for the error would also be really helpful. Thanks!