pandas: to_sql is too slow
Code Sample:

```python
df_name.to_sql('table_name',
               schema='public',
               con=engine,
               index=False,
               if_exists='replace')
```
Problem description
I'm writing a 500,000-row dataframe to a postgres AWS database and it takes a very, very long time to push the data through.
It is a fairly large SQL server and my internet connection is excellent, so I've ruled those out as contributing to the problem.
In comparison, csv2sql or using `cat` and piping into `psql` on the command line is much quicker.
I've attempted to run this fix, but run into an error message:

```
AttributeError: 'psycopg2.extensions.cursor' object has no attribute 'fast_executemany'
```

Anyone know what's going on?
Add this code below `engine = create_engine(connection_string)` (see the sketch just after this comment). In my code, the `to_sql` function was taking 7 min to execute, and now it takes only 5 seconds 😉
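Presumably the snippet in question is the pyodbc `fast_executemany` event listener (the same one covered in the StackOverflow link further down); a minimal sketch, with a placeholder connection string, would be roughly:

```python
from sqlalchemy import create_engine, event

# Placeholder DSN; substitute your own SQL Server connection string.
connection_string = "mssql+pyodbc://user:password@my_dsn"
engine = create_engine(connection_string)

@event.listens_for(engine, "before_cursor_execute")
def receive_before_cursor_execute(conn, cursor, statement, params, context, executemany):
    # pyodbc cursors expose fast_executemany; psycopg2 cursors do not,
    # which is exactly the AttributeError reported earlier in this thread.
    if executemany:
        cursor.fast_executemany = True
```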
A new version of pandas contains a `method` parameter which can be set to `'multi'`. This makes the code run much faster.

`fast_executemany` can be performed in a single step now (sqlalchemy >= 1.3.0). Maybe it's worth mentioning it somewhere in the docs or with an example? It's a particular case not related to pandas, but it's a small addition that could drastically improve the performance in many cases.
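That single-step form is presumably the `fast_executemany` flag that SQLAlchemy 1.3 added to `create_engine()` for the `mssql+pyodbc` dialect; a one-line sketch with a placeholder DSN:

```python
from sqlalchemy import create_engine

# Requires SQLAlchemy >= 1.3 and the mssql+pyodbc dialect.
engine = create_engine("mssql+pyodbc://user:password@my_dsn", fast_executemany=True)
```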
Thanks @llautert! That helped a lot!
see here: https://stackoverflow.com/questions/48006551/speeding-up-pandas-dataframe-to-sql-with-fast-executemany-of-pyodbc
For postgres users, I recommend setting `method` to a callable (see the sketch below) and calling the function from the example code here: https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html#insertion-method
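For reference, a sketch of that callable, modeled on the `COPY FROM STDIN` insertion-method example in the pandas docs linked above (the name `psql_insert_copy` comes from that example):

```python
import csv
from io import StringIO

def psql_insert_copy(table, conn, keys, data_iter):
    # Callable for DataFrame.to_sql(method=...) that uses PostgreSQL COPY.
    dbapi_conn = conn.connection          # raw psycopg2 connection
    with dbapi_conn.cursor() as cur:
        # Serialize this batch of rows to an in-memory CSV buffer.
        buf = StringIO()
        writer = csv.writer(buf)
        writer.writerows(data_iter)
        buf.seek(0)

        columns = ', '.join('"{}"'.format(k) for k in keys)
        table_name = '{}.{}'.format(table.schema, table.name) if table.schema else table.name

        cur.copy_expert('COPY {} ({}) FROM STDIN WITH CSV'.format(table_name, columns), buf)

# Usage with the call from the original post:
# df_name.to_sql('table_name', con=engine, schema='public',
#                index=False, if_exists='replace', method=psql_insert_copy)
```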
Using `COPY FROM` is really a lot faster 🚀

You'd think that setting the `chunksize` parameter would be enough to make `to_sql` batch insert, but nope.

I tried it running sqlalchemy 1.2.4-py35h14c3975_0 and 1.2.11-py35h7b6447c_0
but I am getting
see here: http://stackoverflow.com/questions/33816918/write-large-pandas-dataframes-to-sql-server-database
With SQL Server you need to import via CSV with a bulk upload for efficiency.
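A rough sketch of that route, assuming SQL Server 2017+ (for `BULK INSERT ... FORMAT = 'CSV'`) and a CSV path that the database server itself can read; the DSN, path, and table name are placeholders, not details from this thread:

```python
from sqlalchemy import create_engine, text

engine = create_engine("mssql+pyodbc://user:password@my_dsn")

# The CSV must be readable by the SQL Server instance (e.g. on a share it can reach).
csv_path = r"\\fileshare\staging\table_name.csv"

# df_name is the dataframe from the original post.
df_name.to_csv(csv_path, index=False)

with engine.begin() as conn:
    conn.execute(text(
        f"BULK INSERT table_name FROM '{csv_path}' "
        "WITH (FORMAT = 'CSV', FIRSTROW = 2)"  # FIRSTROW = 2 skips the header row
    ))
```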
This is the only solution that worked for me for Postgres, thank you @feluelle! This took my query from 20 min to < 1 min.
I had tried `method='multi'` and various chunksizes, but it was slower than without it. Also tried `executemany_mode='batch'` in the sqlalchemy engine, but it threw an “Invalid value” error and I couldn't find a solution to that.

You are using psycopg2, which is a postgresql driver. This issue and fix pertain to Microsoft SQL Server using the pyodbc driver.
Works for me by referring to `self.engine`. Example:
For future readers on this, there are two options to use a 'batch_mode' for `to_sql`. The following are the two combinations:

```python
create_engine(connection_string, executemany_mode='batch', executemany_batch_page_size=x)
```

or

```python
create_engine(connection_string, executemany_mode='values', executemany_values_page_size=x)
```

Details on these arguments can be found here: https://docs.sqlalchemy.org/en/13/dialects/postgresql.html#psycopg2-fast-execution-helpers
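Putting one of those engines together with the call from the original post might look like the following sketch (connection string and page size are placeholders):

```python
from sqlalchemy import create_engine

# SQLAlchemy 1.3.x with the psycopg2 driver.
engine = create_engine(
    "postgresql+psycopg2://user:password@host:5432/dbname",
    executemany_mode='values',
    executemany_values_page_size=10000,
)

# df_name is the 500,000-row dataframe from the original post.
df_name.to_sql('table_name', con=engine, schema='public',
               index=False, if_exists='replace')
```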