sqlalchemy: pyodbc w SQL Server causes memory overflow when fast_executemany=True (keep this at its default of False if you're having this problem)
I have written a Python script that processes daily log files: it reads the files one by one out of a blob storage account, reformats each one into a DataFrame, and then writes it to a SQL Server database using SQLAlchemy.
Each file is 300K-900K records, and each day can have up to 75 files. The problem is that the VM I am running the code on runs out of memory right around file 50. Using tracemalloc I have found that the allocations attributed to
"C:\ProgramData\Anaconda3\lib\site-packages\sqlalchemy\engine\default.py:585"
grow with every loop iteration while nothing else does. gc.collect() is called at the end of every iteration.
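For context, a per-iteration tracemalloc comparison along these lines is what surfaces a single source line accumulating memory (a minimal sketch only; blobs and process_one_file are stand-ins, not the actual script):

    import tracemalloc

    tracemalloc.start()
    baseline = tracemalloc.take_snapshot()

    for blob in blobs:              # stand-in for the blob listing
        process_one_file(blob)      # stand-in for the read/clean/to_sql body below
        snapshot = tracemalloc.take_snapshot()
        # Top source lines by memory growth since the baseline; here
        # sqlalchemy/engine/default.py keeps climbing on every iteration.
        for stat in snapshot.compare_to(baseline, 'lineno')[:5]:
            print(stat)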
General pseudocode structure is:

    import gc
    from sqlalchemy import create_engine

    for blob in blobs:  # one pass per daily log file in the storage account
        # read the file in and clean it into a DataFrame
        log_df = clean_file(blob.name)  # placeholder for the read/reformat steps
        engine = create_engine(
            'mssql+pyodbc://creds/DB?driver=ODBC Driver 17 for SQL Server',
            fast_executemany=True)
        log_df.to_sql(table_name, engine, schema='dbo', if_exists='append',
                      chunksize=1000, method=None, index=False)
        engine.dispose()
        del engine
        del log_df
        gc.collect()
I have also tried declaring the engine only once, before the loop, as well as using:

    conn = engine.connect()
    log_df.to_sql(table_name, conn, schema='dbo', if_exists='append',
                  chunksize=1000, index=False)
    conn.close()
    conn.invalidate()
None of those impacted the memory buildup.
Thank you for your help.
About this issue
- Original URL
- State: closed
- Created 4 years ago
- Comments: 25 (20 by maintainers)
Commits related to this issue
- Add note that fast_executemany uses memory Ideally this would be a per-execution option, or Pyodbc could perhaps run the data in chunks. Fixes: #5334 Change-Id: If4a11b312346b8e4c2b8cd38840b3a2ba56d... — committed to sqlalchemy/sqlalchemy by zzzeek 4 years ago
@solleer - Okay, so to_sql defaults to varchar(max) columns for strings, and that appears to be the cause of the memory bloat (see the pyodbc issue above). You can limit the width of your varchar columns with dtype={"your_text_column": sa.String(255)} (or similar) in your to_sql call.

Not yet; I’d want to at least give them a manageable MCVE to work with. There’s also the question of whether pandas is giving pyodbc something that is “wrong” vs. something that pyodbc “doesn’t particularly like for some reason”.
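For reference, a minimal sketch of that workaround, assuming a hypothetical table daily_logs with a single text column message (neither name is from the thread):

    import pandas as pd
    import sqlalchemy as sa

    engine = sa.create_engine(
        'mssql+pyodbc://creds/DB?driver=ODBC Driver 17 for SQL Server',
        fast_executemany=True)

    log_df = pd.DataFrame({"message": ["example log line"]})  # hypothetical frame

    # Cap the string column at VARCHAR(255) so pyodbc does not size its
    # fast_executemany parameter buffers for VARCHAR(MAX) columns.
    log_df.to_sql(
        'daily_logs',                        # hypothetical table name
        engine,
        schema='dbo',
        if_exists='append',
        index=False,
        chunksize=1000,
        dtype={"message": sa.String(255)})

Alternatively, as the issue title notes, leaving fast_executemany at its default of False avoids the buildup entirely, at the cost of slower inserts.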