sqlalchemy: pyodbc with SQL Server causes memory overflow when fast_executemany=True (keep this at its default of False if you're having this problem)
I have written a Python script that, for daily log files, reads files one by one out of a blob storage account, reformats each one into a DataFrame, and then writes it to a SQL Server DB using SQLAlchemy.
Each file is 300K-900K records and each day can have up to 75 files. The problem is that the VM I am running the code on runs out of memory right around file 50. Using tracemalloc I have found that
"C:\ProgramData\Anaconda3\lib\site-packages\sqlalchemy\engine\default.py:585"
grows in memory with every loop iteration while nothing else does. gc.collect() is called at the end of every loop.
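(For anyone trying to reproduce the measurement: a per-iteration snapshot comparison along these lines is enough to surface which source line is accumulating; the loop body here is a placeholder, not the actual script.)

import tracemalloc

tracemalloc.start()
previous = tracemalloc.take_snapshot()
for blob in blob_list:                      # placeholder loop over the daily files
    process_one_file(blob)                  # placeholder for the read/clean/to_sql work
    current = tracemalloc.take_snapshot()
    # show the largest allocation growth since the last iteration, grouped by source line
    for stat in current.compare_to(previous, 'lineno')[:5]:
        print(stat)
    previous = current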
The general pseudocode structure is:

import gc
from sqlalchemy import create_engine

for blob in blob_list:                    # one blob per daily log file
    raw = read_blob(blob.name)            # placeholder: read the file out of blob storage
    log_df = clean_to_dataframe(raw)      # placeholder: clean the file contents into a DataFrame
    engine = create_engine('mssql+pyodbc://creds/DB?driver=ODBC+Driver+17+for+SQL+Server',
                           fast_executemany=True)
    log_df.to_sql(table_name, engine, schema='dbo', if_exists='append',
                  chunksize=1000, method=None, index=False)
    engine.dispose()
    del engine
    del log_df
    gc.collect()
I have also tried declaring the engine only once, before the loop, as well as using an explicit connection:

conn = engine.connect()
log_df.to_sql(table_name, conn, schema='dbo', if_exists='append',
              chunksize=1000, index=False)
conn.close()
conn.invalidate()

None of those changes affected the memory buildup.
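(As the issue title now notes, what did stop the buildup was leaving fast_executemany at its default of False; a minimal sketch of that engine setup, with everything else unchanged:)

from sqlalchemy import create_engine

engine = create_engine(
    'mssql+pyodbc://creds/DB?driver=ODBC+Driver+17+for+SQL+Server',
    # fast_executemany is simply not passed here, so it stays at its default of False
)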
Thank you for your help.
About this issue
- State: closed
- Created 4 years ago
- Comments: 25 (20 by maintainers)
Commits related to this issue
- Add note that fast_executemany uses memory Ideally this would be a per-execution option, or Pyodbc could perhaps run the data in chunks. Fixes: #5334 Change-Id: If4a11b312346b8e4c2b8cd38840b3a2ba56d... — committed to sqlalchemy/sqlalchemy by zzzeek 4 years ago
@solleer - Okay, so
to_sql defaults to varchar(max) columns for strings and that appears to be the cause of the memory bloat (see pyodbc issue above). You can limit the width of your varchar columns with dtype={"your_text_column": sa.String(255)} (or similar) in your to_sql call.

Not yet; I’d want to at least give them a manageable MCVE to work with. There’s also the question of whether pandas is giving pyodbc something that is “wrong” vs. something that pyodbc “doesn’t particularly like for some reason”.
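A rough illustration of that suggestion (the column name is a stand-in, and the width should match your actual data):

import sqlalchemy as sa

log_df.to_sql(
    table_name, engine, schema='dbo', if_exists='append',
    chunksize=1000, index=False,
    # cap the string column at VARCHAR(255) instead of the VARCHAR(MAX) default
    dtype={"your_text_column": sa.String(255)},
)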