aiobotocore: aiobotocore or aiohttp stuck after many requests

Describe the bug: It’s a bug that happens rarely in production. We have not been able to reproduce it in an isolated environment yet.

The code in question is a background worker: it pulls a job from SQS and runs it. The job in question was to delete hundreds of objects, which means 800 concurrent coroutines, each of which issues 1…several concurrent delete_item calls.

gather( 800 × obj ), where each obj = gather( 1+ × delete_item() )

In this case, it got stuck on object 799.
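A rough sketch of that fan-out pattern, assuming a DynamoDB client created elsewhere with aiobotocore (delete_one_object, run_job, and the key layout are illustrative names, not the actual worker code):

import asyncio

async def delete_one_object(client, table, keys):
    # 1…several concurrent delete_item calls per object
    await asyncio.gather(*(
        client.delete_item(TableName=table, Key=key) for key in keys
    ))

async def run_job(client, table, objects):
    # ~800 concurrent coroutines, one per object in the job
    await asyncio.gather(*(
        delete_one_object(client, table, obj_keys) for obj_keys in objects
    ))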

Checklist

  • I have reproduced in an environment where pip check passes without errors
  • I have provided pip freeze results
  • I have provided sample code or detailed way to reproduce
  • I have tried the same code in botocore to ensure this is an aiobotocore-specific issue

pip freeze results

aiobotocore==0.10.3
aiodns==2.0.0
aiodynamo==19.9
aiohttp==3.6.2
boto3==1.9.189
botocore==1.12.189
requests==2.22.0
requests-toolbelt==0.9.1
urllib3==1.25.6

Environment:

  • Python Version: 3.7.5
  • OS name and version: 3.7.5-alpine10 (official Docker image + deps + our code)

Additional context: Thanks to https://github.com/dimaqq/awaitwhat I have a nice coro trace from when the process was stuck (see attached screenshot: aiobotocore-stuck).

Most upvoted comments

In case it helps anyone, we ran into this issue as well recently and were able to work around it with an asyncio Semaphore to ensure we never attempt to make an aiohttp call when there is not an available connection pool slot.

Our specific scenario was:

Uploading thousands of files to S3 from a realtime messaging system. We set max_pool_connections in AioConfig to a value N between 100 and 500, and then launched a coroutine for every incoming message that would await s3_client.put_object(), using the max_pool_connections limit to put backpressure on the file upload process and rate-limit everything back to the message broker queue (RabbitMQ).
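A minimal sketch of that setup, with handle_messages and N as placeholders (AioConfig is aiobotocore's config class):

import aiobotocore
from aiobotocore.config import AioConfig

N = 100  # max_pool_connections; we used values between 100 and 500

async def run(handle_messages):
    session = aiobotocore.get_session()
    # cap the underlying aiohttp connection pool at N; this client is shared
    # by every per-message coroutine
    async with session.create_client('s3', config=AioConfig(max_pool_connections=N)) as s3_client:
        # one coroutine is launched per incoming message, each awaiting
        # s3_client.put_object(...) on this shared client
        await handle_messages(s3_client)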

We found that after uploading ~10-30k files we would reliably find a small number that hung “forever” inside of put_object() and, based on reading this issue and the linked ones, likely ultimately inside of aiohttp.

For reference, this was tested on:

Windows 10
Selector event loop
aiobotocore == 1.0.4
aiodns  == 2.0.0
aiohttp == 3.6.2
aioitertools == 0.7.0

Note: We also tried downgrading to aiohttp == 3.3.2 and the problem remained unchanged.

While we haven’t yet found the underlying issue, we were able to work around it reliably by using an external semaphore, with its limit set to the same max_pool_connections we pass to aiobotocore, to block concurrent calls into put_object() when we know they would block on the connection pool.

For example:

import asyncio

max_connection_pool = 100  # must match max_pool_connections passed to AioConfig

# BEFORE WORKAROUND - THIS WOULD HANG EVENTUALLY
# Note that this is simplified for clarity; we have a loop launching coroutines
# that each call s3_client.put_object() once.
await s3_client.put_object(Bucket=bucket, Key=path, Body=body_contents)


# WORKAROUND - THIS DOES NOT HANG
# shared state between all coroutines
upload_semaphore = asyncio.BoundedSemaphore(max_connection_pool)

# inside each coroutine
async with upload_semaphore:
    await s3_client.put_object(Bucket=bucket, Key=path, Body=body_contents)

This doesn’t resolve the underlying problem but in case someone else is blocked on this issue urgently and needs a robust workaround, we found the above to work very reliably. It hasn’t been in production long but we’ve gone from 100% hitting this issue to 0% (so far) hitting it and it’s an easy change.

Note that this also lets you put an overall timeout on s3_client.put_object() if needed as a last resort, since you know it will never block waiting for a connection pool slot, only on actually connecting / sending data.
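For example, a last-resort timeout stacked on top of the semaphore guard might look like this (guarded_put and the 300-second limit are illustrative):

import asyncio

async def guarded_put(s3_client, upload_semaphore, bucket, path, body_contents):
    # the semaphore guarantees we never wait for a connection pool slot inside
    # put_object(), so this timeout only covers connecting and sending data
    async with upload_semaphore:
        await asyncio.wait_for(
            s3_client.put_object(Bucket=bucket, Key=path, Body=body_contents),
            timeout=300,  # arbitrary last-resort limit
        )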

I’ve run into the problem too while uploading 1000 objects to S3 with code similar to this:

import asyncio
import aiobotocore

# `instances`, `s3_bucket_name` and `event_loop` come from the surrounding application
session = aiobotocore.get_session(loop=event_loop)
async with session.create_client('s3') as s3_client:
    awaitables = []
    for instance in instances:
        awaitable = s3_client.put_object(
            Bucket=s3_bucket_name,
            Key=instance.key,
            Body=instance.body,
        )
        awaitables.append(awaitable)
    await asyncio.gather(*awaitables)

It happens in roughly two out of five runs: it uploads 999 files and then gets stuck.

Python 3.7 and aiobotocore 0.11.1
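One way to see what is left hanging in a run like this is to swap the bare gather for asyncio.wait with a timeout and dump the pending tasks; a debugging sketch, not a fix (the 600-second timeout is arbitrary):

import asyncio

async def put_all(awaitables):
    tasks = [asyncio.ensure_future(a) for a in awaitables]
    # give the whole batch a generous deadline, then inspect the leftovers
    done, pending = await asyncio.wait(tasks, timeout=600)
    print(f'{len(done)} done, {len(pending)} still pending')
    for task in pending:
        task.print_stack()  # shows where each stuck task is suspended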

Edit: 100 was a red herring.

Was: 99 done tasks + 1 stuck task == 100, which looked suspicious.

Upgrade to the latest aiohttp did not resolve the issue.

@pfreixes I’ll look into that asap; I’ve been working on a lot of stuff at work. However, related to my httpx comment above: I got to a breaking point with aiohttp when downloading weather data from NASA, so I swapped one of our prod ingests to httpx and have nothing but praise for it. So I think swapping over may be a viable way to solve all these hangs. I don’t think it’ll be too hard. I’ll have to investigate how to make an interface so people can swap between the two.

Hi @thehesiod, it’s sad to hear that aiohttp is not as maintained as it used to be, after all of the giant effort put in by the authors and the community.

The connector pool has received many changes to fix several issues; from what I can see, one of the last ones spotted was this one [1], but it seems it was fixed in the 3.6.0 version, so it should not be related to what you are presenting here.

Seems that @illia-v might have caught the bug, so it would be nice if someone from aiohttp could review the PR - if I have time I will do it.

In any case, something that is not really clear to me from what I can read in the code (correct me if I’m wrong): the connection is never released unless you use a streaming response; otherwise, the object returned to the user is a plain dictionary which does not provide access to the methods for closing or releasing the connection. Am I wrong?

[1] https://github.com/aio-libs/aiohttp/pull/3640
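For reference, the streaming case being referred to looks like this in aiobotocore (bucket and key are placeholders); for non-streaming calls such as put_object, the library is expected to release the connection internally before returning the plain response dict:

import aiobotocore

async def read_object(bucket, key):
    session = aiobotocore.get_session()
    async with session.create_client('s3') as s3_client:
        response = await s3_client.get_object(Bucket=bucket, Key=key)
        # the caller holds the connection until the streaming body is closed, so
        # reading it inside `async with` releases the connection deterministically
        async with response['Body'] as stream:
            return await stream.read()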

Our company has been using 3.3.2 for some time and has never run into this issue, and I’ve seen aiohttp bugs about newer versions hanging, so I want to see whether this is an aiohttp regression.

Actually, first I would come up with a reproducible test case. You can use moto for mocking the backend.
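A possible starting point, assuming a recent moto with its standalone server and pointing aiobotocore at it via endpoint_url (bucket name, port, and the 10k object count are arbitrary):

import asyncio
import aiobotocore
from moto.server import ThreadedMotoServer  # moto's in-process mock AWS server

async def hammer_s3(n=10_000):
    server = ThreadedMotoServer(port=5000)
    server.start()
    try:
        session = aiobotocore.get_session()
        async with session.create_client(
            's3',
            endpoint_url='http://127.0.0.1:5000',
            aws_access_key_id='test',
            aws_secret_access_key='test',
            region_name='us-east-1',
        ) as s3:
            await s3.create_bucket(Bucket='repro')
            # fire a large number of concurrent put_object calls and see if any hang
            await asyncio.gather(*(
                s3.put_object(Bucket='repro', Key=f'k{i}', Body=b'x') for i in range(n)
            ))
    finally:
        server.stop()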

I’ve poked the instance where container is running:

  • security-credentials are updated every hour
  • they are valid for 8h30min
  • process got stuck 40 minutes after the latest update, that’s 20 minutes before the next one 🤔
  • the process had been running (mostly idle) for 20h45min before it got stuck while executing a job

My hunch is that this is unrelated to credential rollover.

It’s something else, maybe the connection pool or DNS or whatnot…