sqlalchemy: pyodbc w SQL Server causes memory overflow when fast_executemany=True (keep this at its default of False if you're having this problem)
I have written a Python script that processes daily log files: it reads the files one by one out of a blob storage account, reformats each one into a DataFrame, and then writes it to a SQL Server database using SQLAlchemy.
Each file is 300K-900K records, and each day can have up to 75 files. The problem is that the VM I am running the code on runs out of memory right around file 50. Using tracemalloc I have found that the allocations attributed to
"C:\ProgramData\Anaconda3\lib\site-packages\sqlalchemy\engine\default.py:585"
grow with every loop iteration while nothing else does. gc.collect() is called at the end of every iteration.
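For context, a per-iteration tracemalloc comparison along these lines is what surfaces a single source line accumulating memory (a minimal sketch only; blobs and process_one_file are stand-ins, not the actual script):

    import tracemalloc

    tracemalloc.start()
    baseline = tracemalloc.take_snapshot()

    for blob in blobs:              # stand-in for the blob listing
        process_one_file(blob)      # stand-in for the read/clean/to_sql body below
        snapshot = tracemalloc.take_snapshot()
        # Top source lines by memory growth since the baseline; here
        # sqlalchemy/engine/default.py keeps climbing on every iteration.
        for stat in snapshot.compare_to(baseline, 'lineno')[:5]:
            print(stat)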
General pseudocode structure is:

    import gc
    from sqlalchemy import create_engine

    for blob in blobs:  # one pass per daily log file in the storage account
        # read the file in and clean it into a DataFrame
        log_df = clean_file(blob.name)  # placeholder for the read/reformat steps
        engine = create_engine(
            'mssql+pyodbc://creds/DB?driver=ODBC Driver 17 for SQL Server',
            fast_executemany=True)
        log_df.to_sql(table_name, engine, schema='dbo', if_exists='append',
                      chunksize=1000, method=None, index=False)
        engine.dispose()
        del engine
        del log_df
        gc.collect()
I have also tried declaring the engine only once, before the loop, as well as using:

    conn = engine.connect()
    log_df.to_sql(table_name, conn, schema='dbo', if_exists='append',
                  chunksize=1000, index=False)
    conn.close()
    conn.invalidate()
None of those impacted the memory buildup.
Thank you for your help.
About this issue
- Original URL
- State: closed
- Created 4 years ago
- Comments: 25 (20 by maintainers)
Commits related to this issue
- Add note that fast_executemany uses memory Ideally this would be a per-execution option, or Pyodbc could perhaps run the data in chunks. Fixes: #5334 Change-Id: If4a11b312346b8e4c2b8cd38840b3a2ba56d... — committed to sqlalchemy/sqlalchemy by zzzeek 4 years ago
@solleer - Okay, so to_sql defaults to varchar(max) columns for strings, and that appears to be the cause of the memory bloat (see the pyodbc issue above). You can limit the width of your varchar columns with dtype={"your_text_column": sa.String(255)} (or similar) in your to_sql call.

Not yet; I’d want to at least give them a manageable MCVE to work with. There’s also the question of whether pandas is giving pyodbc something that is “wrong” vs. something that pyodbc “doesn’t particularly like for some reason”.
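For reference, a minimal sketch of that workaround, assuming a hypothetical table daily_logs with a single text column message (neither name is from the thread):

    import pandas as pd
    import sqlalchemy as sa

    engine = sa.create_engine(
        'mssql+pyodbc://creds/DB?driver=ODBC Driver 17 for SQL Server',
        fast_executemany=True)

    log_df = pd.DataFrame({"message": ["example log line"]})  # hypothetical frame

    # Cap the string column at VARCHAR(255) so pyodbc does not size its
    # fast_executemany parameter buffers for VARCHAR(MAX) columns.
    log_df.to_sql(
        'daily_logs',                        # hypothetical table name
        engine,
        schema='dbo',
        if_exists='append',
        index=False,
        chunksize=1000,
        dtype={"message": sa.String(255)})

Alternatively, as the issue title notes, leaving fast_executemany at its default of False avoids the buildup entirely, at the cost of slower inserts.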