pandas: to_sql is too slow

Code Sample

df_name.to_sql('table_name',
               schema='public',
               con=engine,
               index=False,
               if_exists='replace')

Problem description

I'm writing a 500,000-row dataframe to a Postgres database on AWS, and it takes a very, very long time to push the data through.

The database server is fairly large and my internet connection is excellent, so I've ruled those out as contributing factors.

In comparison, csv2sql or using cat and piping into psql on the command line is much quicker.

About this issue

  • Original URL
  • State: closed
  • Created 7 years ago
  • Comments: 26 (3 by maintainers)

Most upvoted comments

I’ve attempted to run this fix, but run into an error message:

AttributeError: 'psycopg2.extensions.cursor' object has no attribute 'fast_executemany'

Anyone know what’s going on?

Add this code below engine = create_engine(connection_string):

from sqlalchemy import event

@event.listens_for(engine, 'before_cursor_execute')
def receive_before_cursor_execute(conn, cursor, statement, params, context, executemany):
    if executemany:
        cursor.fast_executemany = True  # pyodbc-specific cursor attribute
        cursor.commit()

In my code, the to_sql call was taking 7 minutes to execute, and now it takes only 5 seconds 😉

Newer versions of pandas have a method parameter for to_sql which can be set to 'multi'. This makes the code run much faster.
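
As a rough sketch (the chunksize value here is just an illustration, not from this thread), that looks like:

df_name.to_sql('table_name',
               schema='public',
               con=engine,
               index=False,
               if_exists='replace',
               method='multi',   # batch several rows into each INSERT statement
               chunksize=1000)   # illustrative batch size; tune for your setup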

fast_executemany can now be enabled in a single step (SQLAlchemy >= 1.3.0):

engine = sqlalchemy.create_engine(connection_string, fast_executemany=True)

Maybe it's worth mentioning somewhere in the docs or with an example? It's a particular case not related to pandas itself, but it's a small addition that can drastically improve performance in many cases.
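
A minimal usage sketch of that single-step approach, assuming a SQL Server target via pyodbc (the connection string below is a placeholder):

import sqlalchemy

# fast_executemany is a pyodbc-specific flag; the URL below is a placeholder
engine = sqlalchemy.create_engine('mssql+pyodbc://user:password@my_dsn',
                                  fast_executemany=True)

df_name.to_sql('table_name', con=engine, index=False, if_exists='replace')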

Thanks @llautert! That helped a lot!

# don't forget to import event
from sqlalchemy import event, create_engine

engine = create_engine(connection_string)

@event.listens_for(engine, 'before_cursor_execute')
def receive_before_cursor_execute(conn, cursor, statement, params, context, executemany):
    if executemany:
        cursor.fast_executemany = True
        cursor.commit()

For Postgres users, I recommend setting method to a callable:

callable with signature (pd_table, conn, keys, data_iter): This can be used to implement a more performant insertion method based on specific backend dialect features.

and using the example implementation from the pandas docs: https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html#insertion-method

Using COPY FROM is really a lot faster 🚀
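
For reference, here is a sketch of such a callable, adapted from the insertion-method example in the pandas docs linked above; it assumes the psycopg2 driver, which provides cursor.copy_expert():

import csv
from io import StringIO

def psql_insert_copy(table, conn, keys, data_iter):
    # table: pandas.io.sql.SQLTable, conn: SQLAlchemy connection, keys: column names
    dbapi_conn = conn.connection              # raw DBAPI (psycopg2) connection
    with dbapi_conn.cursor() as cur:
        buf = StringIO()
        csv.writer(buf).writerows(data_iter)  # spool the rows as CSV in memory
        buf.seek(0)

        columns = ', '.join('"{}"'.format(k) for k in keys)
        table_name = '{}.{}'.format(table.schema, table.name) if table.schema else table.name
        cur.copy_expert('COPY {} ({}) FROM STDIN WITH CSV'.format(table_name, columns), buf)

df_name.to_sql('table_name', schema='public', con=engine,
               index=False, if_exists='replace', method=psql_insert_copy)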

Newer versions of pandas have a method parameter for to_sql which can be set to 'multi'. This makes the code run much faster.

You'd think that setting the chunksize parameter would be enough to make to_sql batch its inserts, but nope.

I tried it running sqlalchemy: 1.2.4-py35h14c3975_0 and 1.2.11-py35h7b6447c_0

but I am getting

AttributeError: 'psycopg2.extensions.cursor' object has no attribute 'fast_executemany'

see here: http://stackoverflow.com/questions/33816918/write-large-pandas-dataframes-to-sql-server-database

With SQL Server you need to import via CSV with a bulk upload for efficiency.
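
A rough sketch of that route (the file path and table name are placeholders, and the CSV must be on a path the SQL Server instance itself can read; quoted CSV fields would additionally need the FORMAT = 'CSV' option available from SQL Server 2017 onwards):

from sqlalchemy import text

# dump the frame to CSV, then bulk-load it server-side with T-SQL BULK INSERT
df_name.to_csv(r'C:\loads\table_name.csv', index=False)

with engine.begin() as conn:
    conn.execute(text(r"""
        BULK INSERT dbo.table_name
        FROM 'C:\loads\table_name.csv'
        WITH (FIELDTERMINATOR = ',', ROWTERMINATOR = '\n', FIRSTROW = 2)
    """))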

For Postgres users, I recommend setting method to a callable:

callable with signature (pd_table, conn, keys, data_iter): This can be used to implement a more performant insertion method based on specific backend dialect features.

and using the example implementation from the pandas docs: https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html#insertion-method

Using COPY FROM is really a lot faster 🚀

This is the only solution that worked for me for Postgres, thank you @feluelle! This took my query from 20 min to < 1 min.

I had tried method='multi' and various chunksizes, but it was slower than without it. Also tried executemany_mode='batch' in the sqlalchemy engine but it threw an “Invalid value” error and I couldn’t find a solution to that.

I tried it running sqlalchemy: 1.2.4-py35h14c3975_0 and 1.2.11-py35h7b6447c_0

but I am getting

AttributeError: 'psycopg2.extensions.cursor' object has no attribute 'fast_executemany'

You are using psycopg2, which is a PostgreSQL driver. This issue and fix pertain to Microsoft SQL Server using the pyodbc driver.

Does anyone know how I can implement this solution inside a class with a self.engine instance?

It works for me by referring to self.engine.

Example:

    # inside the class (e.g. in __init__), after the engine is created
    self.engine = sqlalchemy.create_engine(connectionString, echo=echo)
    self.connection = self.engine.connect()

    @event.listens_for(self.engine, 'before_cursor_execute')
    def receive_before_cursor_execute(conn, cursor, statement, params, context, executemany):
        print("Listen before_cursor_execute - executemany: %s" % str(executemany))
        if executemany:
            cursor.fast_executemany = True
            cursor.commit()

For future readers, there are two options for enabling a batch mode for to_sql with the psycopg2 dialect. The following are the two combinations:

create_engine(connection_string, executemany_mode='batch', executemany_batch_page_size=x)

or

create_engine(connection_string, executemany_mode='values', executemany_values_page_size=x)

Details on these arguments can be found here: https://docs.sqlalchemy.org/en/13/dialects/postgresql.html#psycopg2-fast-execution-helpers
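
For example, a sketch under SQLAlchemy 1.3 with the psycopg2 driver (the connection string and page size are placeholders):

from sqlalchemy import create_engine

engine = create_engine('postgresql+psycopg2://user:password@host:5432/dbname',
                       executemany_mode='values',
                       executemany_values_page_size=10000)

df_name.to_sql('table_name', schema='public', con=engine,
               index=False, if_exists='replace')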