langchainjs: Retry logic for OpenAI timeouts

I’m seeing the following error in prod:

Error [TimeoutError]: Request timed out.
    at wrapOpenAIClientError (file:///app/node_modules/langchain/dist/util/openai.js:6:17)
    at file:///app/node_modules/langchain/dist/chat_models/openai.js:518:31
    at process.processTicksAndRejections (node:internal/process/task_queues:95:5)
    at async RetryOperation._fn (/app/node_modules/p-retry/index.js:50:12) {
  attemptNumber: 1,
  retriesLeft: 6
}

It’s getting captured in my catch block, so I’m fairly sure the retries aren’t happening, unless the first attempt is the one that gets re-thrown or something confusing like that. Is it possible this doesn’t meet the criteria for retryable? Could this be addressed using the FailedAttemptHandler interface?

I’ve not set a timeout for the LLM. I’m having a hard time figuring out the default value.

  /**
   * Custom handler to handle failed attempts. Takes the originally thrown
   * error object as input, and should itself throw an error if the input
   * error is not retryable.
   */
  onFailedAttempt?: FailedAttemptHandler;
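
For anyone puzzling over the same trace, here is a minimal sketch of the contract that doc comment describes. It is an imitation written for illustration, not the p-retry or langchainjs source; retryWithHandler, rejectTimeouts and allowTimeouts are hypothetical names. It suggests one reading of the trace above: if the handler throws on the very first failure, the caller sees an error decorated with attemptNumber: 1 and the full retriesLeft, and no retry ever runs.

```typescript
// Hypothetical imitation of the retry contract; not the p-retry or langchainjs source.
type AttemptError = Error & { attemptNumber?: number; retriesLeft?: number };
type FailedAttemptHandler = (error: AttemptError) => void | Promise<void>;

async function retryWithHandler<T>(
  fn: () => Promise<T>,
  options: { retries: number; onFailedAttempt?: FailedAttemptHandler }
): Promise<T> {
  for (let attempt = 1; ; attempt++) {
    try {
      return await fn();
    } catch (raw) {
      const error: AttemptError =
        raw instanceof Error ? raw : new Error(String(raw));
      const retriesLeft = options.retries - (attempt - 1);
      // Decorate the error before the handler sees it.
      error.attemptNumber = attempt;
      error.retriesLeft = retriesLeft;
      // If the handler throws, the loop ends here and the caller gets that error.
      await options.onFailedAttempt?.(error);
      // The handler returned normally: retry unless the budget is spent.
      if (retriesLeft <= 0) throw error;
    }
  }
}

// Throws on timeouts, so a timeout ends the loop on the first attempt.
const rejectTimeouts: FailedAttemptHandler = (error) => {
  if (error.name === "TimeoutError") throw error;
};

// Returns normally for timeouts, so they are retried; everything else is rethrown.
const allowTimeouts: FailedAttemptHandler = (error) => {
  if (error.name !== "TimeoutError") throw error;
};
```

With rejectTimeouts and retries: 6, a timed-out first call surfaces immediately with attemptNumber 1 and retriesLeft 6; with allowTimeouts the same call is attempted again.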

About this issue

  • State: closed
  • Created 9 months ago
  • Reactions: 1
  • Comments: 28 (7 by maintainers)

Most upvoted comments

Hello,

From the error message you’ve shared, it seems like the request to OpenAI is timing out. This could be due to a variety of reasons, such as network latency or the OpenAI server taking longer to respond.

Looking at the langchainjs codebase, it appears that the default timeout for OpenAI requests is set to 1000 milliseconds (or 1 second). If the OpenAI server takes longer than this to respond, a TimeoutError will be thrown. You might want to consider increasing this timeout value if your network conditions or the server’s response time require it.

As for the retry logic, the current implementation in langchainjs does not seem to include any retry mechanism for OpenAI requests. The embeddingWithRetry method, which is responsible for making requests to the OpenAI API, does not contain any retry logic. The OpenAIClient instance is created with maxRetries: 0, which means no retries will be attempted in case of a failure.

If you want to implement a retry mechanism, you could modify the embeddingWithRetry method to include a retry loop, or use a library that provides retry functionality. The onFailedAttempt option you mentioned could indeed be used to decide whether an error is retryable. You would implement this function to inspect the error and throw it if it is not retryable, which stops the retry loop; if the function returns without throwing, the retry mechanism retries the request.

Here’s a simple example of how you could implement this:

const retry = require('p-retry');

const run = async () => {
  try {
    await retry(async () => {
      // Your request logic here
    }, {
      onFailedAttempt: error => {
        if (error.name === 'TimeoutError') {
          console.log(`Attempt ${error.attemptNumber} failed. There are ${error.retriesLeft} retries left.`);
        } else {
          throw error;
        }
      },
      retries: 5
    });
  } catch (error) {
    console.error('All retries failed:', error);
  }
};

run();

In this example, the onFailedAttempt function checks if the error is a TimeoutError, and if it’s not, it throws the error, stopping the retry loop. If it is a TimeoutError, it logs the attempt number and the number of retries left, and allows the retry loop to continue.

I hope this helps! Let me know if you have any other questions.

I just found out that retries are not happening at all, despite setting maxRetries to a valid value in the langchain ChatOpenAI object.

This sort of bug remaining unsolved for so long makes me doubt whether anyone uses langchainjs at all.

Can the maintainers fix this?

I don’t understand why this is being closed. Seems to me that there is still a clear bug here even if there is a workaround.

@jacoblee93 I understand your point, and it’s valid. In my case, I understood the timeout as the maximum time to wait for the LLM API to respond, so I expected a retry when no response arrived in that time.

Sometimes, for who knows what reason, a request to OpenAI (for example) times out, but a second request is answered almost instantly. So I usually set short timeouts (5 seconds), because waiting for the retry to succeed is usually much faster than setting a longer timeout and waiting for the first request to answer.

If it is to stay as it is, I think it should be mentioned in the docs that a timed out request won’t be retried automatically.
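
The trade-off described above (a short per-attempt timeout plus a retry, versus one long timeout) can be sketched without any library. This is an illustration under assumptions: withTimeout and retryOnTimeout are hypothetical helpers, not langchainjs or OpenAI APIs, and the sketch does not cancel the underlying request when the timer fires.

```typescript
// Hypothetical helpers for illustration only; not part of langchainjs or the OpenAI SDK.

// Rejects with a TimeoutError if `task` has not settled within `ms` milliseconds.
// Note: the underlying work is not cancelled, only abandoned.
function withTimeout<T>(task: () => Promise<T>, ms: number): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(() => {
      const error = new Error("Request timed out.");
      error.name = "TimeoutError";
      reject(error);
    }, ms);
    task().then(
      (value) => {
        clearTimeout(timer);
        resolve(value);
      },
      (error) => {
        clearTimeout(timer);
        reject(error);
      }
    );
  });
}

// Retries only when an attempt times out; any other error is rethrown immediately.
async function retryOnTimeout<T>(
  task: () => Promise<T>,
  ms: number,
  retries: number
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await withTimeout(task, ms);
    } catch (error) {
      const isTimeout = error instanceof Error && error.name === "TimeoutError";
      if (!isTimeout || attempt >= retries) throw error;
    }
  }
}
```

If the first call stalls and the second answers right away, the total wait is roughly one timeout period plus the fast response, rather than however long the stalled call would have taken.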

Oh, you should also be able to pass onFailedAttempt wherever you pass maxRetries, and it would work as well. I don’t think you’d need to subclass.

If that were what they wanted, wouldn’t they just set retries to 0?

@codenameakshay what is the status of the PR that is supposed to fix the maxRetries param not working?

The PR doesn’t actually fix the bug as I discussed with @jacoblee93. It is still an open issue.

See https://github.com/langchain-ai/langchainjs/pull/3370#discussion_r1402660699

For posterity, this more or less solved my problem, though it’s probably not the best solution (see below):

const STATUS_NO_RETRY = [
  400, // Bad Request
  401, // Unauthorized
  402, // Payment Required
  403, // Forbidden
  404, // Not Found
  405, // Method Not Allowed
  406, // Not Acceptable
  407, // Proxy Authentication Required
  409, // Conflict
];

// Modifies the defaultFailedAttemptHandler to make request timeouts retryable
const FAILED_ATTEMPT_HANDLER = (error: any) => {
  if (
    error.message.startsWith("Cancel") ||
    error.message.startsWith("AbortError") ||
    error.name === "AbortError"
  ) {
    throw error;
  }

  if ((error as any)?.code === "ECONNABORTED") {
    throw error;
  }
  
  const status =
    (error as any)?.response?.status ?? (error as any)?.status;
  if (status && STATUS_NO_RETRY.includes(+status)) {
    throw error;
  }

  if ((error as any)?.error?.code === "insufficient_quota") {
    const err = new Error(error?.message);
    err.name = "InsufficientQuotaError";
    throw err;
  }
};

Some more details here: https://github.com/openai/openai-node/blob/master/README.md#retries

Certain errors will be automatically retried 2 times by default, with a short exponential backoff. Connection errors (for example, due to a network connectivity problem), 408 Request Timeout, 409 Conflict, 429 Rate Limit, and >=500 Internal errors will all be retried by default.

You can use the maxRetries option to configure or disable this:

Maybe this is all you need, to increase maxRetries on the OAI client? The documentation should state clearly that there are two non-overlapping retry strategies, one in OpenAI and one in LangChain, and what each of them controls.

The default failure handler looks like the culprit:

const STATUS_NO_RETRY = [
  400, // Bad Request
  401, // Unauthorized
  402, // Payment Required
  403, // Forbidden
  404, // Not Found
  405, // Method Not Allowed
  406, // Not Acceptable
  407, // Proxy Authentication Required
  408, // Request Timeout // <<<<<<<<<<<<<<<<<<<<
  409, // Conflict
];

const defaultFailedAttemptHandler = (error: any) => {
  if (
    error.message.startsWith("Cancel") ||
    error.message.startsWith("TimeoutError") || // <<<<<<<<<<
    error.name === "TimeoutError" ||
    error.message.startsWith("AbortError") ||
    error.name === "AbortError"
  ) {
    throw error;
  }
  // eslint-disable-next-line @typescript-eslint/no-explicit-any
  if ((error as any)?.code === "ECONNABORTED") {
    throw error;
  }
  const status =
    // eslint-disable-next-line @typescript-eslint/no-explicit-any
    (error as any)?.response?.status ?? (error as any)?.status;
  if (status && STATUS_NO_RETRY.includes(+status)) {
    throw error;
  }
  // eslint-disable-next-line @typescript-eslint/no-explicit-any
  if ((error as any)?.error?.code === "insufficient_quota") {
    const err = new Error(error?.message);
    err.name = "InsufficientQuotaError";
    throw err;
  }
};

However, reviewing OpenAI’s documentation:

A `Timeout` error indicates that your request took too long to complete and our server closed the connection. This could be due to a network issue, a heavy load on our services, or a complex request that requires more processing time.
If you encounter a Timeout error, please try the following steps:

>>>>>>>>>>>>>> Wait a few seconds and retry your request. <<<<<<<<<<<< Sometimes, the network congestion or the load on our services may be reduced and your request may succeed on the second attempt.
Check your network settings and make sure you have a stable and fast internet connection. You may need to switch to a different network, use a wired connection, or reduce the number of devices or applications using your bandwidth.
If the issue persists, check out our persistent errors next steps section.

It sounds like this should be retryable to me.