sqlalchemy: Memory leak when connection pool is constantly overflowing? (asyncpg)

Discussed in https://github.com/sqlalchemy/sqlalchemy/discussions/7058


Originally posted by mvbrn on September 21, 2021:

We were using the default connection pool settings (pool_size=5 and max_overflow=10), and after several hours of higher load we ran into OOMs. The application was constantly processing more than 5 tasks concurrently, so the pool was continuously creating and removing the extra connections allowed by the overflow. While debugging, we found that the allocation at asyncpg/connect_utils.py:705 keeps consuming more and more memory, reaching hundreds of megabytes.

Below is a simple script that simulates a constant pool overflow and prints some tracemalloc stats.

import asyncio
import gc
import time
import tracemalloc
from asyncio import current_task

from sqlalchemy.ext.asyncio import async_scoped_session, AsyncSession, create_async_engine
from sqlalchemy.orm import sessionmaker


DB_URL = 'postgresql+asyncpg://<your-url-here>'


def create_async_scoped_db_session(
    url: str, statement_timeout_ms: int = 60 * 1000
) -> async_scoped_session:
    session = async_scoped_session(sessionmaker(class_=AsyncSession), scopefunc=current_task)

    # Note: asyncpg's command_timeout is specified in seconds, so the
    # millisecond value appended here is likely far larger than intended.
    url += f'?command_timeout={statement_timeout_ms}'
    engine = create_async_engine(url)

    session.configure(bind=engine)

    return session


async def job(id, dbs, wait_for=1):
    # Do some dummy work; with num_jobs > pool_size, every round forces the
    # pool to create and then discard overflow connections.
    print(f'Starting waiting job #{id} for {wait_for}s at {time.time()}. {dbs.bind.engine.pool.status()}')
    await dbs.execute('select 1;')
    await dbs.execute("SELECT table_name FROM information_schema.tables WHERE table_schema='public';")
    print(f'Waiting job #{id} ready at {time.time()}')
    await dbs.execute(f'select pg_sleep({wait_for});')
    print(f'Finished waiting job #{id} for {wait_for}s at {time.time()}')
    await dbs.close()
    await dbs.remove()
    print(f'Waiting job #{id} cleaned up at {time.time()}. {dbs.bind.engine.pool.status()}')


async def run():
    tracemalloc.start()
    dbs = create_async_scoped_db_session(DB_URL)
    snapshot1 = tracemalloc.take_snapshot()

    num_rounds = 300
    num_jobs = 10

    for _ in range(num_rounds):
        await asyncio.gather(*(job(id=i+1, dbs=dbs) for i in range(num_jobs)))
        snapshot2 = tracemalloc.take_snapshot()
        top_stats = snapshot2.compare_to(snapshot1, 'lineno')

        print("[ Top 10 differences ]")
        for stat in top_stats[:10]:
            print(stat)

        snapshot1 = snapshot2

    await dbs.bind.dispose()

    print('Stats after disposal')

    snapshot2 = tracemalloc.take_snapshot()
    top_stats = snapshot2.compare_to(snapshot1, 'lineno')

    print("[ Top 10 differences ]")
    for stat in top_stats[:10]:
        print(stat)

    print('Running garbage collection')

    gc.collect()

    print('Stats after collection')

    snapshot2 = tracemalloc.take_snapshot()
    top_stats = snapshot2.compare_to(snapshot1, 'lineno')

    print("[ Top 10 differences ]")
    for stat in top_stats[:10]:
        print(stat)

    top_stats = snapshot2.statistics('lineno')

    print("[ Top 10 - FINAL stats ]")
    for stat in top_stats[:10]:
        print(stat)


if __name__ == "__main__":
    asyncio.run(
        run()
    )

After running this dummy script locally, we end up with stats like:

[ Top 10 - FINAL stats ]
/usr/lib/python3.9/asyncio/sslproto.py:545: size=3373 KiB, count=36169, average=96 B
./python3.9/site-packages/asyncpg/connect_utils.py:705: size=2611 KiB, count=25109, average=106 B

...

asyncio/sslproto.py:545 and asyncpg/connect_utils.py:705 still hold roughly 3.3 MiB and 2.5 MiB respectively, even after engine disposal and manual garbage collection. If num_jobs is set to 5, the pool never overflows and these entries disappear from the top tracemalloc stats, so this roughly reproduces the issue we are seeing.

We could increase the pool settings, but we can't guarantee the load will never exceed them again, at which point the leak would return.
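
For reference, here is a minimal sketch of what raising those limits could look like. The keyword arguments are standard SQLAlchemy pool parameters passed through create_async_engine; the numbers are placeholders, not recommendations:

from sqlalchemy.ext.asyncio import create_async_engine

# Sketch only: raise the pool limits so that bursts of concurrent work stay
# within pool_size instead of repeatedly creating and discarding overflow
# connections. The numbers are placeholders, not a recommendation.
engine = create_async_engine(
    'postgresql+asyncpg://<your-url-here>',
    pool_size=20,      # connections kept open in the pool (default 5)
    max_overflow=10,   # extra connections allowed beyond pool_size (default 10)
    pool_timeout=30,   # seconds to wait for a free connection (default 30)
)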

I’m not sure whether this is a SQLAlchemy or an asyncpg issue, but since we don’t use asyncpg directly I’m posting it here.

Our versions:

  • OS: Ubuntu 20.04
  • Python: 3.9.6
  • SQLAlchemy: 1.4.23
  • Database: Postgresql 12.5
  • DBAPI: asyncpg

About this issue

  • State: closed
  • Created 3 years ago
  • Comments: 21 (20 by maintainers)

Most upvoted comments

Thank you very much for digging into this! 😃

Just want to confirm that on my machine, the leak also disappears when I remove command_timeout.

Yup, I’m watching top with the totally original test and it stabilizes when the command timeout isn’t present. Whew!

I think we can close. Raised in asyncpg at https://github.com/MagicStack/asyncpg/issues/830
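
For anyone who needs a statement timeout in the meantime, one possible workaround (a sketch, not verified against this reproduction) is to drop command_timeout from the URL and let PostgreSQL enforce the limit server-side instead, via asyncpg's server_settings connect argument:

from sqlalchemy.ext.asyncio import create_async_engine

# Sketch of a possible workaround: skip asyncpg's client-side command_timeout
# (which is what triggers the leak when the pool overflows) and have PostgreSQL
# enforce a server-side statement_timeout, passed through asyncpg's
# server_settings connect argument. The value is in milliseconds.
engine = create_async_engine(
    'postgresql+asyncpg://<your-url-here>',
    connect_args={'server_settings': {'statement_timeout': '60000'}},
)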