sqlalchemy: Memory leak when connection pool is constantly overflown? (asyncpg)
Discussed in https://github.com/sqlalchemy/sqlalchemy/discussions/7058
Originally posted by mvbrn, September 21, 2021
We were using the default connection pool settings (`pool_size=5` and `max_overflow=10`), and after several hours of higher load we ran into OOMs. The service was constantly processing more than 5 tasks concurrently, so the pool kept creating and removing the extra connections allowed by the overflow. While debugging, we found that `asyncpg/connect_utils.py:705` keeps eating more and more memory, eventually reaching hundreds of megabytes.
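For reference, those defaults are equivalent to spelling the pool parameters out explicitly on `create_async_engine` (a sketch for illustration only; our actual service just relies on the defaults):

```python
from sqlalchemy.ext.asyncio import create_async_engine

# pool_size=5 and max_overflow=10 are SQLAlchemy's QueuePool defaults:
# up to 5 connections stay pooled, and up to 10 extra "overflow" connections
# are opened on demand and closed again once checked back in.
engine = create_async_engine(
    'postgresql+asyncpg://<your-url-here>',
    pool_size=5,
    max_overflow=10,
)
```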
Here is a simple script that simulates a constant pool overflow and prints some memory stats:
```python
import asyncio
import gc
import time
import tracemalloc
from asyncio import current_task

from sqlalchemy.ext.asyncio import async_scoped_session, AsyncSession, create_async_engine
from sqlalchemy.orm import sessionmaker

DB_URL = 'postgresql+asyncpg://<your-url-here>'


def create_async_scoped_db_session(
    url: str, statement_timeout_ms: int = 60 * 1000
) -> async_scoped_session:
    session = async_scoped_session(sessionmaker(class_=AsyncSession), scopefunc=current_task)
    url += f'?command_timeout={statement_timeout_ms}'
    engine = create_async_engine(url)
    session.configure(bind=engine)
    return session


async def job(id, dbs, wait_for=1):
    # just do some dummy stuff
    print(f'Starting waiting job #{id} for {wait_for}s at {time.time()}. {dbs.bind.engine.pool.status()}')
    await dbs.execute(f'select 1;')
    await dbs.execute("SELECT table_name FROM information_schema.tables WHERE table_schema='public';")
    print(f'Waiting job #{id} ready at {time.time()}')
    await dbs.execute(f'select pg_sleep({wait_for});')
    print(f'Finished waiting job #{id} for {wait_for}s at {time.time()}')
    await dbs.close()
    await dbs.remove()
    print(f'Starting waiting job #{id} cleaned up {time.time()}. {dbs.bind.engine.pool.status()}')


async def run():
    tracemalloc.start()
    dbs = create_async_scoped_db_session(DB_URL)
    snapshot1 = tracemalloc.take_snapshot()

    num_rounds = 300
    num_jobs = 10
    for _ in range(num_rounds):
        await asyncio.gather(*(job(id=i+1, dbs=dbs) for i in range(num_jobs)))
        snapshot2 = tracemalloc.take_snapshot()
        top_stats = snapshot2.compare_to(snapshot1, 'lineno')
        print("[ Top 10 differences ]")
        for stat in top_stats[:10]:
            print(stat)
        snapshot1 = snapshot2

    await dbs.bind.dispose()
    print('Stats after disposal')
    snapshot2 = tracemalloc.take_snapshot()
    top_stats = snapshot2.compare_to(snapshot1, 'lineno')
    print("[ Top 10 differences ]")
    for stat in top_stats[:10]:
        print(stat)

    print('Running garbage collection')
    gc.collect()
    print('Stats after collection')
    snapshot2 = tracemalloc.take_snapshot()
    top_stats = snapshot2.compare_to(snapshot1, 'lineno')
    print("[ Top 10 differences ]")
    for stat in top_stats[:10]:
        print(stat)

    top_stats = snapshot2.statistics('lineno')
    print("[ Top 10 - FINAL stats ]")
    for stat in top_stats[:10]:
        print(stat)


if __name__ == "__main__":
    asyncio.run(
        run()
    )
```
After running this dummy script locally, we end up with stats like:
```
[ Top 10 - FINAL stats ]
/usr/lib/python3.9/asyncio/sslproto.py:545: size=3373 KiB, count=36169, average=96 B
./python3.9/site-packages/asyncpg/connect_utils.py:705: size=2611 KiB, count=25109, average=106 B
...
```
`asyncpg/connect_utils.py:705` alone still holds about 2.5 MiB, and `asyncio/sslproto.py:545` even more, after the engine disposal and a manual garbage collection. If `num_jobs` is set to 5, the pool is never overflowed and these entries disappear from the top tracemalloc stats, so this roughly reproduces the issue we are seeing.
We could increase the pool settings, but we can't be sure the load won't grow even further, at which point it would start leaking again.
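One way to make sure the pool is never overflowed regardless of load (which is what makes the leak disappear when `num_jobs` is 5) would be to cap application-side concurrency at `pool_size`, for example with an `asyncio.Semaphore`. This is only a sketch to illustrate the idea, not something we have verified; `job` is the coroutine from the script above:

```python
import asyncio

POOL_SIZE = 5  # keep in sync with the engine's pool_size
db_slots = asyncio.Semaphore(POOL_SIZE)

async def bounded_job(id, dbs, wait_for=1):
    # Hypothetical wrapper: at most POOL_SIZE jobs touch the database at once,
    # so the pool never has to open overflow connections.
    async with db_slots:
        await job(id=id, dbs=dbs, wait_for=wait_for)

# In run(), gather bounded_job instead of job:
#     await asyncio.gather(*(bounded_job(id=i+1, dbs=dbs) for i in range(num_jobs)))
```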
I'm not sure whether this is a SQLAlchemy or an asyncpg issue, but since we don't use asyncpg directly, I'm posting it here.
Our versions:
- OS: Ubuntu 20.04
- Python: 3.9.6
- SQLAlchemy: 1.4.23
- Database: PostgreSQL 12.5
- DBAPI: asyncpg (version unknown)
About this issue
- State: closed
- Created 3 years ago
- Comments: 21 (20 by maintainers)
Thank you very much for digging into this! 😃
Just want to confirm that on my machine the leak disappears too when I remove `command_timeout`.

Yup, I am watching `top` with the totally original test and it stabilizes when the command timeout isn't present. Whew!
I think we can close. Raised in asyncpg at https://github.com/MagicStack/asyncpg/issues/830
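For anyone who hits this before an asyncpg release with the fix: the confirmed mitigation in this thread is simply to stop passing `command_timeout` in the URL. If a timeout is still needed, a hedged alternative (not verified here) is to enforce it server-side via asyncpg's `server_settings`, which SQLAlchemy forwards through `connect_args`:

```python
from sqlalchemy.ext.asyncio import create_async_engine

# Confirmed in the thread: without command_timeout in the URL, memory stabilizes.
engine = create_async_engine('postgresql+asyncpg://<your-url-here>')

# Hedged alternative if a timeout is still required: a server-side
# statement_timeout (in milliseconds), applied per connection via asyncpg's
# server_settings. Illustration only; not tested in this thread.
engine_with_timeout = create_async_engine(
    'postgresql+asyncpg://<your-url-here>',
    connect_args={'server_settings': {'statement_timeout': '60000'}},
)
```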