openai-python: Connection reset from long-running or stale API connections

Describe the bug

While using openai.ChatCompletion.create (with gpt-3.5-turbo), we’ve seen intermittent

requests.exceptions.ConnectionError: ('Connection aborted.', ConnectionResetError(104, 'Connection reset by peer'))

without a clear reproduction. At first I thought it was https://github.com/openai/openai-python/issues/91 and due to too many open connections to the OpenAI servers. Now it looks more like https://github.com/openai/openai-python/issues/368, but I have some hypotheses about the cause, so I’m opening a new issue in case they’re different. If this turns out to be a duplicate, feel free to fold my details into that one.

My hypothesis is that if you have a long-running process (like a web server) that calls out to OpenAI, periods of inactivity cause the server side to terminate the connection while the client still tries to reuse it. I dug into related issues on the requests side (like https://github.com/psf/requests/issues/4937) that hint at the root cause. Essentially, what I think is happening is:

  • A first connection is made to OpenAI and returns a result; requests keeps the connection open under the hood with its default keep-alive behavior
  • Some time passes (in my experience, around 10 minutes is enough)
  • A new request is made to OpenAI, but the client throws a ConnectionResetError
    • A call made right after this one succeeds

I believe that the OpenAI servers are terminating the connection after a brief time (perhaps minutes) but the client still tries to keep it alive.

The reason I think this is worth reporting as a bug is that the client code could respond more gracefully to these server-side settings. Changing the keep-alive behavior from the requests defaults would help out a lot of the folks hitting this.
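
To make that concrete, here is a rough sketch (plain requests behavior, not anything the library exposes today) of one blunt way to avoid reusing an idle socket: ask the server to close the connection after each response.

import requests

# A session that sends "Connection: close", so no idle keep-alive socket
# is left in the pool to go stale between calls.
session = requests.Session()
session.headers["Connection"] = "close"

That gives up connection reuse entirely, of course; a friendlier fix inside the client would be to expire or rebuild pooled connections that have been idle longer than the server allows.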

To Reproduce

  1. Write a long-running program. In our case, we have a Python web server running FastAPI
  2. As part of a route for the server, call OpenAI to do some work. In our case, we’re calling openai.ChatCompletion.create with gpt-3.5-turbo to transform some input text and return the result (a minimal sketch of this setup follows the list)
  3. Run the server and call the endpoint once
  4. Wait 10 minutes
  5. Call the endpoint again
  6. You’ll likely get a Connection reset by peer error on the second call
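
For illustration, a minimal version of that setup looks roughly like this (the route, prompt, and response handling are made up, not our production code; it assumes OPENAI_API_KEY is set in the environment):

import openai
from fastapi import FastAPI

app = FastAPI()

@app.get("/rewrite")
def rewrite(text: str):
    # Goes through the requests session that openai caches per thread, so the
    # second call after a long idle period reuses a stale keep-alive connection.
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": f"Rewrite this politely: {text}"}],
    )
    return {"result": response["choices"][0]["message"]["content"]}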

Code snippets

No response

OS

Linux

Python version

Python v3.8

Library version

openai-python 0.27.2

About this issue

  • Original URL
  • State: closed
  • Created a year ago
  • Reactions: 21
  • Comments: 20

Most upvoted comments

@turnham It’ll do the job, but one thing you might miss out on is the potential speed improvement from reusing persistent connections. The requests docs on Sessions explain this briefly and link to the underlying idea. I still think that, since OpenAI knows their own server configuration, adjusting the client’s keep-alive settings to match would give the biggest improvement for the community.
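
(As a quick illustration with plain requests, reusing one Session skips the TCP/TLS handshake on later calls, which is exactly what a new-Session-per-request workaround gives up; the URL is just an example endpoint.)

import requests

session = requests.Session()
session.get("https://api.openai.com/v1/models", timeout=10)  # handshake happens here
session.get("https://api.openai.com/v1/models", timeout=10)  # pooled connection is reused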

In my own case, I switched over to using tenacity, since the OpenAI docs recommend it.

import openai
from tenacity import retry, retry_if_exception_type, stop_after_attempt

@retry(
    stop=stop_after_attempt(2),
    retry=retry_if_exception_type(openai.error.APIConnectionError),
)
def call_openai():
    # Retried once if the request fails with an APIConnectionError,
    # e.g. because the cached keep-alive connection has gone stale.
    ...

@mathcass +1. Yes, it would be great if the openai module took care of all of this.

I don’t think your approach with tenacity retries would have helped in our situation, though. Once we had a long-running thread get into this state, all retries would fail. So to get that use case working consistently, we had to force a reset of openai’s _thread_context.session, making sure a cached session was never present to be re-used: https://github.com/openai/openai-python/blob/fe3abd16b582ae784d8a73fd249bcdfebd5752c9/openai/api_requestor.py#L79
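
A rough sketch of that reset (it leans on openai’s private _thread_context, so it may break across versions, and it only clears the calling thread’s cached session):

from openai import api_requestor

def drop_cached_session():
    # Close and discard the session openai caches per thread, so the next
    # request builds a fresh connection instead of reusing a stale one.
    session = getattr(api_requestor._thread_context, "session", None)
    if session is not None:
        try:
            session.close()
        except Exception:
            pass
        del api_requestor._thread_context.session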

But also adding retries sounds like something we should be doing regardless of this issue, so thanks for the pointers to tenacity!

If anyone is looking for a workaround that does not require switching to async, the following is working for us. It’s the same idea as hc20k’s workaround above: https://github.com/openai/openai-python/issues/371#issuecomment-1537622984

Using the support added in v0.27.6 to pass in a session, we do the following:

import logging

import openai
import requests

# Pass a new session to the openai module
openai.requestssession = requests.Session()

# Existing code calling openai
response = openai.Completion.create(...)

# Close and reset the session
try:
    openai.requestssession.close()
except Exception as e:
    logging.exception(e)
openai.requestssession = None

or using the ‘with’ syntax:

with requests.Session() as session:
    openai.requestssession = session
    response = openai.Completion.create(...)
    openai.requestssession = None

We’re not sure whether setting openai.requestssession to None is required, but we weren’t sure what else might be done with that attribute inside the openai module. In our testing, we no longer see the errors on long-running (web app) threads that make openai calls.

Just to add, for anybody who comes looking: I have this problem when I’m using a VPN, not sure why. If I turn off the VPN the problem goes away. I am outside the US, by the way.

EDIT: It now works with VPN lol

This should be fixed in the beta of our upcoming v1.0.0; can you try it out and let us know whether or not it seems to be resolved?
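
For anyone trying it, usage against the beta looks roughly like this (a sketch; install with pip install --pre openai and check the beta README in case the surface shifts before release):

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "ping"}],
)
print(response.choices[0].message.content)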

@microsoftbuild See this. I already have 600s set as the timeout, and this issue also impacts retries on 502 from Cloudflare.

I then tried specifying a request_timeout parameter for the OpenAI API request, but that caused every request to time out, due to this issue: https://github.com/openai/openai-python/pull/387

Appreciate your help!