aws-sdk-js-v3: ECONNRESET exceptions when running in Lambda environment

Describe the bug

import { S3 } from '@aws-sdk/client-s3';
import { Handler, Context, S3Event } from 'aws-lambda';

const s3 = new S3({})

export const handler: Handler = async (event: S3Event, context: Context) => {
  await s3.getObject({
    Bucket: event.Records[0].s3.bucket.name,
    Key: event.Records[0].s3.object.key,
  });
}

We have this very basic Lambda function that reads the file from S3 when a new file is uploaded (we actually consume the Body stream too, but left that out for brevity). The function is called intermittently, meaning that sometimes we get a new Lambda container (i.e. a cold start) and sometimes the container is reused. When the container is reused, we sometimes see an ECONNRESET exception such as this one:

2020-05-20T16:50:28.107Z	d7a43394-afad-4267-a4a4-5ad3633a1db8	ERROR	Error: socket hang up
    at connResetException (internal/errors.js:608:14)
    at TLSSocket.socketOnEnd (_http_client.js:460:23)
    at TLSSocket.emit (events.js:322:22)
    at endReadableNT (_stream_readable.js:1187:12)
    at processTicksAndRejections (internal/process/task_queues.js:84:21) {
  code: 'ECONNRESET',
  '$metadata': { retries: 0, totalRetryDelay: 0 }
}

I’m pretty confident that this is due to the keep-alive nature of the https connection. Lambda processes are frozen after they execute and their host seems to terminate open sockets after ~10 minutes. The next time the S3 client tries to reuse the socket, the exception is thrown.
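
One way to sidestep the stale socket in the meantime is to disable keep-alive on the agent the client uses, so a thawed container never tries to write to a connection the host closed while the process was frozen. A minimal sketch (assuming the v3 NodeHttpHandler and its httpsAgent option; it trades the error for a fresh TLS handshake on every request):

import { S3 } from '@aws-sdk/client-s3';
import { NodeHttpHandler } from '@aws-sdk/node-http-handler';
import { Agent } from 'https';

// keepAlive: false means every request opens a new socket, so a reused
// (previously frozen) container cannot hit a half-closed keep-alive socket.
const s3 = new S3({
  requestHandler: new NodeHttpHandler({
    httpsAgent: new Agent({ keepAlive: false }),
  }),
});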

We are running into similar issues with connections to our Aurora database, which also terminate intermittently with the same error message (see https://github.com/brianc/node-postgres/issues/2112). It’s an error we can easily recover from by reopening the socket, but aws-sdk-js-v3 seems to prefer to throw instead.

Is the issue in the browser/Node.js?
Node.js 12.x on AWS Lambda

About this issue

  • Original URL
  • State: closed
  • Created 4 years ago
  • Reactions: 20
  • Comments: 35 (8 by maintainers)

Most upvoted comments

I’ve managed to work around this using this configuration (updated for gamma):

import {
    StandardRetryStrategy,
    defaultRetryDecider,
} from '@aws-sdk/middleware-retry';
import { SdkError } from '@aws-sdk/smithy-client';

const retryDecider = (err: SdkError & { code?: string }) => {
    if (
        'code' in err &&
        (err.code === 'ECONNRESET' ||
            err.code === 'EPIPE' ||
            err.code === 'ETIMEDOUT')
    ) {
        return true;
    } else {
        return defaultRetryDecider(err);
    }
};
// eslint-disable-next-line @typescript-eslint/require-await
const retryStrategy = new StandardRetryStrategy(async () => 3, {
    retryDecider,
});
export const defaultClientConfig = {
    maxRetries: 3,
    retryStrategy,
};

It would be nice if this were built into defaultRetryDecider. Then again, is there an argument for handling it in the NodeHttpHandler, since this is a Node-specific error and one where the handler should probably “just work”?
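
For reference, this is how the shared config gets applied to a client. A minimal usage sketch (the './aws-client-config' path is hypothetical, just wherever the defaultClientConfig above is exported from):

import { DynamoDBClient } from '@aws-sdk/client-dynamodb';
// hypothetical local module exporting the defaultClientConfig shown above
import { defaultClientConfig } from './aws-client-config';

// Spread the shared config so this client picks up the custom retry strategy.
const dynamo = new DynamoDBClient({ ...defaultClientConfig });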

Info: AWS Lambda, Node.js 12.x, “@aws-sdk/client-dynamodb”: “^1.0.0-gamma.1”

Lambda

import { DynamoDBClient, DescribeTableCommand } from "@aws-sdk/client-dynamodb"

const dynamo = new DynamoDBClient({})

export const tempDebug = async (): Promise<object> => {
  const res = await dynamo.send(new DescribeTableCommand({
    TableName: '<TableName>'
  }))

  return res.Table
}

Local

import { LambdaClient, InvokeCommand } from "@aws-sdk/client-lambda"
import { TextDecoder } from "util"

const lambda = new LambdaClient({})

;(async () => {
  let counter = 0
  // eslint-disable-next-line no-constant-condition
  while (true) {
    console.log(counter)
    counter++
    const res = await lambda.send(new InvokeCommand({
      FunctionName: '<FunctionName>'
    }))
    
    const obj = JSON.parse(new TextDecoder("utf-8").decode(res.Payload))
    if (obj.errorType === 'Error') {
      console.log(obj)
      break
    }

    //await new Promise(resolve => setTimeout(resolve, 5 * 60 * 1000))
    await new Promise(resolve => setTimeout(resolve, 90 * 1000))
  }
})()

Produces the following errors consistently when run with 90 second intervals. The first call works; the second call, 90 seconds later, produces one of the following two errors. Error logs are from CloudWatch.

{
    "errorType": "Error",
    "errorMessage": "write EPIPE",
    "code": "EPIPE",
    "errno": "EPIPE",
    "syscall": "write",
    "$metadata": {
        "retries": 0,
        "totalRetryDelay": 0
    },
    "stack": [
        "Error: write EPIPE",
        "    at WriteWrap.onWriteComplete [as oncomplete] (internal/stream_base_commons.js:92:16)",
        "    at writevGeneric (internal/stream_base_commons.js:132:26)",
        "    at TLSSocket.Socket._writeGeneric (net.js:782:11)",
        "    at TLSSocket.Socket._writev (net.js:791:8)",
        "    at doWrite (_stream_writable.js:401:12)",
        "    at clearBuffer (_stream_writable.js:519:5)",
        "    at TLSSocket.Writable.uncork (_stream_writable.js:338:7)",
        "    at ClientRequest._flushOutput (_http_outgoing.js:862:10)",
        "    at ClientRequest._flush (_http_outgoing.js:831:22)",
        "    at _http_client.js:315:47"
    ]
}
{
    "errorType": "Error",
    "errorMessage": "socket hang up",
    "code": "ECONNRESET",
    "$metadata": {
        "retries": 0,
        "totalRetryDelay": 0
    },
    "stack": [
        "Error: socket hang up",
        "    at connResetException (internal/errors.js:608:14)",
        "    at TLSSocket.socketOnEnd (_http_client.js:453:23)",
        "    at TLSSocket.emit (events.js:322:22)",
        "    at endReadableNT (_stream_readable.js:1187:12)",
        "    at processTicksAndRejections (internal/process/task_queues.js:84:21)"
    ]
}

Works as expected when run with 1 minute intervals.

This issue is fixed in https://github.com/aws/aws-sdk-js-v3/pull/1693, and will be published in rc.7 on Thursday 11/19

Hi @rraziel, I’m currently looking into how JS SDK v2 handles this and will provide a fix in v3 accordingly.

are you saying this is “just” an error that’s not properly handled?

The current behavior is undesirable, and the SDK should retry the error instead of asking the user to do it.

Using the fix from serverless-nextjs worked for me. It is not a permanent solution, though, as it simply keeps retrying whenever one of the matched error codes is returned.

TS implementation:

import type { SdkError } from '@aws-sdk/smithy-client'
import {
	defaultRetryDecider,
	StandardRetryStrategy,
} from '@aws-sdk/middleware-retry'

// fix error in SDK release candidate
// see: https://github.com/aws/aws-sdk-js-v3/issues/1196
// see: https://github.com/serverless-nextjs/serverless-next.js/pull/720/files
export const retryStrategy = new StandardRetryStrategy(async () => 3, {
	retryDecider: (err: SdkError & { code?: string }) => {
		if (
			'code' in err &&
			(err.code === 'ECONNRESET' ||
				err.code === 'EPIPE' ||
				err.code === 'ETIMEDOUT')
		) {
			return true
		} else {
			return defaultRetryDecider(err)
		}
	},
})

import { DynamoDB } from '@aws-sdk/client-dynamodb'
const dynamodbClient = new DynamoDB({ retryStrategy })

The retryStrategy prop is available in all clients, AFAIK.
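
The same approach works with the S3 client from the original report, for example (a sketch; './retry-strategy' is a hypothetical local module exporting the retryStrategy above):

import { S3 } from '@aws-sdk/client-s3'
// hypothetical local module exporting the retryStrategy shown above
import { retryStrategy } from './retry-strategy'

const s3 = new S3({ retryStrategy })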

Hoping for an actual fix in the next RC

So we are at release candidate 4 and this problem has not even been acknowledged 😞

Has anyone from AWS or a maintainer even commented on this issue? This should be a priority, given that it happens in most use cases unless you rarely call your Lambdas.

I am testing 1.0.0-gamma.10 in production with logging via a custom retry strategy.

Issues are still happening in 1.0.0-gamma.6 😕

Clients in 1.0.0-gamma.3 now retry in case of transient errors.

It doesn’t check for ECONNRESET, ETIMEDOUT or EPIPE, though.

@studds Thanks for the elegant solution. This is working perfectly for me now.