aws-sdk-js-v3: Very slow parallel S3 GetObjectCommand executions (maxSockets)
Describe the bug
Executing many parallel S3 GetObjectCommand requests is, at best, extremely slow in direct comparison to v2 of the SDK and, at worst, suspected of breaking Lambda executions.
Your environment
SDK version number
@aws-sdk/client-s3@3.76.0
Is the issue in the browser/Node.js/ReactNative?
Node.js
Details of the browser/Node.js/ReactNative version
node -v: v14.18.3
Steps to reproduce
I created a bucket containing 159 files (0 bytes each; size does not seem to be a factor). Then I implemented the same functionality, getting those files in parallel, in minimal scripts for both v3 and v2 of the AWS SDK.
TS Code for v3:
import {
  S3Client,
  GetObjectCommand,
  ListObjectsV2Command,
} from "@aws-sdk/client-s3";

const [nodePath, scriptPath, bucketName] = process.argv;

(async () => {
  try {
    const s3Client = new S3Client({
      region: "us-east-1",
    });
    console.log(new Date().toISOString());
    const files = await s3Client.send(
      new ListObjectsV2Command({
        Bucket: bucketName,
      })
    );
    console.log(new Date().toISOString());
    const getPromises = [];
    if (files.Contents) {
      for (const file of files.Contents) {
        if (file.Key) {
          getPromises.push(
            s3Client.send(
              new GetObjectCommand({
                Bucket: bucketName,
                Key: file.Key,
              })
            )
          );
        }
      }
    }
    // Fetch all listed objects in parallel.
    const result = await Promise.all(getPromises);
    console.log(result.length);
  } catch (e) {
    console.log(e);
  }
})();
TS Code for v2:
import S3 from "aws-sdk/clients/s3";

const [nodePath, scriptPath, bucketName] = process.argv;

(async () => {
  try {
    const s3Client = new S3({
      region: "us-east-1",
    });
    console.log(new Date().toISOString());
    const files = await s3Client
      .listObjectsV2({
        Bucket: bucketName,
      })
      .promise();
    console.log(new Date().toISOString());
    const getPromises = [];
    if (files.Contents) {
      for (const file of files.Contents) {
        if (file.Key) {
          getPromises.push(
            s3Client
              .getObject({
                Bucket: bucketName,
                Key: file.Key,
              })
              .promise()
          );
        }
      }
    }
    // Fetch all listed objects in parallel.
    const result = await Promise.all(getPromises);
    console.log(result.length);
  } catch (e) {
    console.log(e);
  }
})();
Observed behavior
After transpiling, I executed both versions multiple times via time node dist/index.js <bucket-name>. There is a huge gap in execution time between them. I added timestamp logs before and after the listObjects call to verify that the list command itself isn't the actual issue.
Representative outputs; I saw this difference in execution time consistently across all runs:
v2
aws-sdk-v2$ time node dist/index.js <bucket-name>
2022-04-25T13:04:08.423Z
2022-04-25T13:04:09.580Z
159
real 0m1,352s
user 0m0,314s
sys 0m0,009s
v3
aws-sdk-v3$ time node dist/index.js <bucket-name>
2022-04-25T13:03:18.831Z
2022-04-25T13:03:19.996Z
159
real 0m27,881s
user 0m1,456s
sys 0m0,176s
On my machine, it’s “just” 20 times slower. A Lambda function of mine that does a similar thing (albeit with more files, around 1100) now, after migrating from v2 to v3, simply returns null at that point in the execution, even though null is not a possible return value on any code path in the function. Unfortunately, no error message is logged that I could provide.
Expected behavior
Speed similar to v2 of the SDK in general, and in particular no silently terminated Lambda executions.
@rijkvanzanten
I was able to reproduce this issue recently and found a solution. As far as I can see, the problem comes from two areas:
Once the HTTP agent’s sockets are all in use, the S3 clients will hang. Your “new client per instance” approach will work, but you pay a big performance penalty: heavy resource allocation on every request, plus the loss of the TLS session cache and HTTP keep-alive.
Here’s a setting I find works in my environment. You can tune the maxSockets and socketTimeout values for your environment.
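The commenter’s snippet was not captured in this thread. The following is a minimal sketch of that kind of configuration, assuming the httpsAgent and socketTimeout options of @aws-sdk/node-http-handler from that SDK era; the specific values are illustrative, not the commenter’s exact code:

import { S3Client } from "@aws-sdk/client-s3";
import { NodeHttpHandler } from "@aws-sdk/node-http-handler";
import { Agent } from "https";

// Sketch: widen the socket pool and enforce a finite socket timeout so
// dead requests release their sockets. Tune both values per environment.
const s3Client = new S3Client({
  region: "us-east-1",
  requestHandler: new NodeHttpHandler({
    httpsAgent: new Agent({ keepAlive: true, maxSockets: 500 }),
    socketTimeout: 120_000, // ms; illustrative value
  }),
});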
Hi @robert-hanuschke,
I recognize that these kinds of edge cases can be frustrating; however, we prioritize fixes and features based on community engagement and feedback. At this time there is no plan to address this issue. If anything changes, I will certainly let you know.
Many thanks!
Hi everyone,
Thank you for continuing to comment and provide information. @samjarman Thank you for the repro code, and @askldjd thanks for the workaround. I was (finally) able to confirm the described behavior and noticed that performance increased with the suggested agent config. We are looking into this with priority.
Thank you all for your help! Ran~
👋
When downloading hundreds of files from S3 using the v3 SDK, we experience this issue on Lambda, which SILENTLY stops working after around 150 files most of the time. Sometimes it works; it depends on the gods of the network.
There are no warning or error logs, even inside the SDK when providing a logger. So we investigated for quite a long time before finding this issue and the possible workarounds.
Is it possible to have a log or an event or something to know when the SDK is stuck because of lack of available sockets?
I don’t think that downloading 200 files to transform them using streams is an edge case, and this issue deserves an improvement that would help people troubleshoot it without reading hundreds of web pages on SO or GitHub.
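One way to at least observe this condition from the application side, sketched below under the assumption that the application supplies its own https.Agent to the client, is to poll the agent’s documented sockets and requests maps; the interval, threshold, and values here are illustrative:

import { Agent } from "https";

// Pass this agent to the S3 client via NodeHttpHandler({ httpsAgent: agent }).
const agent = new Agent({ keepAlive: true, maxSockets: 50 });

// `agent.sockets` holds in-flight sockets per host; `agent.requests` holds
// requests queued while waiting for a free socket. A persistently non-empty
// queue suggests socket starvation.
setInterval(() => {
  const active = Object.values(agent.sockets).reduce((n, s) => n + (s?.length ?? 0), 0);
  const queued = Object.values(agent.requests).reduce((n, r) => n + (r?.length ?? 0), 0);
  if (queued > 0) {
    console.warn(`HTTP agent saturated: ${active} sockets busy, ${queued} requests queued`);
  }
}, 5_000).unref();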
I recently ran into a very similar situation, where doing a lot of GetObjectCommands in rapid succession would slow down tremendously and eventually lock up the process. Our current workaround is to make a new client instance for each GetObjectCommand and destroy it as soon as the read is done:
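The workaround code itself was not captured here; the following is a hypothetical sketch of that pattern, with the region, helper name, and buffering approach as illustrative assumptions:

import { S3Client, GetObjectCommand } from "@aws-sdk/client-s3";

// Sketch: one short-lived client per GetObjectCommand.
async function getObjectWithFreshClient(bucket: string, key: string): Promise<Buffer> {
  const client = new S3Client({ region: "us-east-1" });
  try {
    const { Body } = await client.send(
      new GetObjectCommand({ Bucket: bucket, Key: key })
    );
    // In Node.js the Body is a Readable stream, which is async-iterable.
    const chunks: Buffer[] = [];
    for await (const chunk of Body as AsyncIterable<Buffer>) {
      chunks.push(chunk);
    }
    return Buffer.concat(chunks);
  } finally {
    // Tear the client down so its agent's sockets are released immediately.
    client.destroy();
  }
}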
I thought that the issue before might’ve been caused by the default keepAlive behavior of the SDK, but explicitly disabling that and lowering the maxSockets didn’t seem to resolve the problem fully 🤔
We are seeing a similar problem and I don’t think this is much of an edge case.
We are pretty much just piping the files from S3 to the client, along the lines of the sketch below.
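A hypothetical sketch of that pattern; the bucket name, port, and routing are illustrative, not the commenter’s actual code:

import { createServer } from "http";
import type { Readable } from "stream";
import { S3Client, GetObjectCommand } from "@aws-sdk/client-s3";

const s3 = new S3Client({ region: "us-east-1" });

createServer((req, res) => {
  const key = decodeURIComponent((req.url ?? "/").slice(1));
  s3.send(new GetObjectCommand({ Bucket: "my-bucket", Key: key }))
    .then(({ Body }) => {
      // Each in-flight response pins one of the agent's sockets until the
      // client finishes reading, so slow clients can exhaust maxSockets.
      (Body as Readable).pipe(res);
    })
    .catch(() => {
      res.statusCode = 500;
      res.end();
    });
}).listen(8080);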
If there is a file that takes around a minute to download (slower devices/slower internet), and more and more downloads are started from the server by different clients, we see the downloads get slower and slower until they stop entirely, even with the workaround in place.
I may be misunderstanding how to do this, but a similar flow worked great in v2.
Summary of AWS Lambda Challenges and Optimizations
Issue Encountered:
I hit the same problem using GetObjectCommand in AWS Lambda: when a client sends a request, an error message may occur.
Implementation Details:
I employed Lambda to automate the process of handling image uploads to S3 bucket A. The process involves listening to events in bucket A and triggering Lambda functions. The main steps in the Lambda function implementation are as follows:
Challenges Faced:
One significant challenge was the need to avoid recursive invocations within the same bucket. When a user uploads an image to S3, triggering Lambda to process and store the compressed image back into the same bucket can lead to infinite recursion. To address this, I implemented a tagging mechanism to skip already-compressed objects (sketched below). However, this approach resulted in suboptimal function-call costs and performance.
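The tagging code itself is not shown in the thread. As a rough illustration of the mechanism, where the tag name, value, and helper are assumptions:

import { S3Client, GetObjectTaggingCommand } from "@aws-sdk/client-s3";

const s3 = new S3Client({});

// Returns true when the object carries the marker tag written after
// compression, so the Lambda can skip it and avoid infinite recursion.
async function isAlreadyCompressed(bucket: string, key: string): Promise<boolean> {
  const { TagSet } = await s3.send(
    new GetObjectTaggingCommand({ Bucket: bucket, Key: key })
  );
  return (TagSet ?? []).some((tag) => tag.Key === "compressed" && tag.Value === "true");
}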
Optimization Strategies Implemented:
To mitigate the challenges and enhance performance, I made the following optimizations:
- Set keepAlive to false to resolve certain issues (see the sketch at the end of this comment).
By implementing these optimizations, I resolved the recursion issue and improved the overall function duration and performance. Future considerations include separating the triggering and compression processes once the buckets are separated.
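A minimal sketch of the keepAlive-off configuration this comment describes; the NodeHttpHandler wiring is an assumption about how it was applied, not the commenter’s exact code:

import { S3Client } from "@aws-sdk/client-s3";
import { NodeHttpHandler } from "@aws-sdk/node-http-handler";
import { Agent } from "https";

const s3Client = new S3Client({
  requestHandler: new NodeHttpHandler({
    // Disable connection reuse entirely; each request gets a fresh socket.
    httpsAgent: new Agent({ keepAlive: false }),
  }),
});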
The solution provided in https://github.com/aws/aws-sdk-js-v3/issues/3560#issuecomment-1484140333 has worked great for us. I’m a little hazy on the details (it’s been a minute), but IIRC it effectively “reverts” the v3 HTTP agent settings back to aws-sdk v2’s settings, which solved the problems for us:
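The snippet from the linked comment is not reproduced in this thread. As a rough sketch of “reverting” to v2-like agent settings, using the same NodeHttpHandler options as the earlier sketch, with the exact values treated as assumptions:

import { S3Client } from "@aws-sdk/client-s3";
import { NodeHttpHandler } from "@aws-sdk/node-http-handler";
import { Agent } from "https";

// A finite socket timeout (v2 defaulted to 2 minutes) lets dead requests
// release their sockets instead of pinning the pool forever.
const s3Client = new S3Client({
  requestHandler: new NodeHttpHandler({
    httpsAgent: new Agent({ keepAlive: true, maxSockets: 50 }), // values illustrative
    connectionTimeout: 5_000,
    socketTimeout: 120_000,
  }),
});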
Our problem was that bad requests/operations would never time out, which, combined with the low default maxSockets, meant that at a certain point all sockets were in use by requests that had long since timed out or died, which in turn made our endpoints “hang” and become unresponsive.
Attached: Node profiler output comparing v2 and v3 (V3 Profile, V2 Profile) and a GitHub repo to reproduce the issue.
That’s a good question. If I am reading correctly, v2 defaults the timeout to 2 minutes, whereas v3 defaults the timeout to zero, i.e. no timeout at all.
That might be the root cause.