pandas: When using to_sql(), continue if duplicate primary keys are detected?

Code Sample, a copy-pastable example if possible

df.to_sql('TableNameHere', engine, if_exists='append', chunksize=900, index=False)

Problem description

I am trying to append a large DataFrame to a SQL table. Some of the rows in the DataFrame are duplicates of those in the SQL table, some are not. But to_sql() completely stops executing if even one duplicate is detected.

It would make sense for to_sql(if_exists='append') to merely warn the user which rows had duplicate keys and just continue to add the new rows, not completely stop executing. For large datasets, you will likely have duplicates but want to ignore them.

Maybe add an argument to ignore duplicates and keep executing? Perhaps an additional if_exists option like 'append_skipdupes'?

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.6.0.final.0
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 60 Stepping 3, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: English_United States.1252

pandas: 0.19.2
nose: None
pip: 9.0.1
setuptools: 28.8.0
Cython: None
numpy: 1.12.0
scipy: None
statsmodels: None
xarray: None
IPython: 5.3.0
sphinx: None
patsy: None
dateutil: 2.6.0
pytz: 2016.10
blosc: None
bottleneck: None
tables: None
numexpr: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 0.999999999
httplib2: None
apiclient: None
sqlalchemy: 1.1.9
pymysql: None
psycopg2: None
jinja2: 2.9.5
boto: None
pandas_datareader: None

About this issue

  • Original URL
  • State: closed
  • Created 7 years ago
  • Reactions: 459
  • Comments: 41 (5 by maintainers)

Most upvoted comments

This should also support an “on duplicate update” mode.

append_skipdupes would be the perfect way to handle this.

A workaround would be to remove the unique index in the database:

sqlquery = "ALTER TABLE `DATABASE`.`TABLE` DROP INDEX `idx_name`"

Afterwards, df.to_sql('TableNameHere', engine, if_exists='append', chunksize=900, index=False) can be executed.

Then let your MySQL server add the index again and drop the duplicates:

sqlquery = "ALTER IGNORE TABLE `DATABASE`.`TABLE` ADD UNIQUE INDEX `idx_name` (`column_name1` ASC, `column_name2` ASC, `column_name3` [ASC | DESC])"

Depending on your specific application, this can be helpful. Still, an if_exists option like append_skipdupes would be much better.
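
A minimal sketch of that drop-index / append / re-index flow with SQLAlchemy; the connection string, table, index, and column names below are placeholders, and ALTER IGNORE only exists in MariaDB and older MySQL:

from sqlalchemy import create_engine, text

# Placeholder connection string -- adjust for your server.
engine = create_engine("mysql+pymysql://user:password@localhost/DATABASE")

# 1. Drop the unique index so duplicate keys no longer abort the insert.
with engine.begin() as conn:
    conn.execute(text("ALTER TABLE `DATABASE`.`TABLE` DROP INDEX `idx_name`"))

# 2. Append the DataFrame as usual (df is the DataFrame from the example above).
df.to_sql('TABLE', engine, if_exists='append', chunksize=900, index=False)

# 3. Re-create the unique index; ALTER IGNORE silently drops the duplicate rows,
#    but it was removed in MySQL 5.7, so on newer MySQL de-duplicate manually first.
with engine.begin() as conn:
    conn.execute(text(
        "ALTER IGNORE TABLE `DATABASE`.`TABLE` "
        "ADD UNIQUE INDEX `idx_name` (`column_name1` ASC, `column_name2` ASC)"
    ))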

Yes, please. ‘append_skipdupes’ should be added, and not only for the primary key column: if there are duplicates among other unique columns, it should skip appending those rows as well.

yes, append_skipdupes +1

append_skipdupes +1

Agreed that it would be good to be able to deal with this with options in df.to_sql().

Here’s the workaround I use in sqlite:

CREATE TABLE IF NOT EXISTS my_table_name (
    some_kind_of_id INT PRIMARY KEY ON CONFLICT IGNORE,
    ...

Then, when I insert duplicates, they get silently ignored, and the non-duplicates are processed correctly. In my case, the data are (i.e. should be) static, so I don’t need to update. It’s just that the form of the data feed is such that I’ll get duplicates that are ignorable.
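
For reference, a small runnable version of that pattern with the standard-library sqlite3 module (table and column names are made up):

import sqlite3
import pandas as pd

conn = sqlite3.connect("example.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS my_table_name (
        some_kind_of_id INT PRIMARY KEY ON CONFLICT IGNORE,
        value TEXT
    )
""")

df = pd.DataFrame({"some_kind_of_id": [1, 2, 3], "value": ["a", "b", "c"]})
df.to_sql("my_table_name", conn, if_exists="append", index=False)

# A second append with overlapping ids: 2 and 3 conflict and are silently
# skipped thanks to ON CONFLICT IGNORE, while 4 is inserted normally.
df2 = pd.DataFrame({"some_kind_of_id": [2, 3, 4], "value": ["b", "c", "d"]})
df2.to_sql("my_table_name", conn, if_exists="append", index=False)

print(conn.execute("SELECT * FROM my_table_name").fetchall())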

A problem not solved from 2017 to 2022?
Since if_exists operates at the table level, an extra keyword argument like ‘skip_duplicates’ would also be acceptable.

Agree ‘append_skipdupes’ should be added.

I have made small script for my use to allow INSERT IGNORE in mysql:

NOTE: This is copy-pasted from my Database class, please adjust for your use! It assumes import random, import string and import pandas as pd at module level, and that self.conn is a SQLAlchemy engine/connection.

    def save_dataframe(self, df: pd.DataFrame, table: str):
        '''
        Save a dataframe to the database.
        The index is saved only if it has a name; an unnamed index is dropped.
        Rows are inserted into the MySQL table with INSERT IGNORE.
        The table needs to exist beforehand.

        Arguments:
            df {pd.DataFrame} -- dataframe to save
            table {str} -- name of the db table
        '''
        # Only persist the index when it is a named (meaningful) column.
        save_index = df.index.name is not None

        self._insert_conflict_ignore(df=df, table=table, index=save_index)

   
    def _insert_conflict_ignore(self, df: pd.DataFrame, table: str, index: bool):
        """
        Save the dataframe to the MySQL database with an 'INSERT IGNORE' query.

        First it uses pandas.to_sql to save to a temporary table.
        Then it uses SQL to transfer the data to the destination table, matching the columns.
        The destination table needs to exist already.
        The final step is dropping the temporary table.

        Parameters
        ----------
        df : pd.DataFrame
            dataframe to save
        table : str
            destination table name
        """
        # generate random table name for concurrent writing
        temp_table = ''.join(random.choice(string.ascii_letters) for i in range(10))
        try:
            df.to_sql(temp_table, self.conn, index=index)
            columns = self._table_column_names(table=temp_table)
            insert_query = f'INSERT IGNORE INTO {table}({columns}) SELECT {columns} FROM `{temp_table}`'
            self.conn.execute(insert_query)
        except Exception as e:
            print(e)        

        # drop temp table
        drop_query = f'DROP TABLE IF EXISTS `{temp_table}`'
        self.conn.execute(drop_query)


    def _table_column_names(self, table: str) -> str:
        """
        Get column names from database table

        Parameters
        ----------
        table : str
            name of the table

        Returns
        -------
        str
            names of columns as a string so we can interpolate into the SQL queries
        """
        query = f"SELECT column_name FROM information_schema.columns WHERE table_name = '{table}'"
        rows = self.conn.execute(query)
        dirty_names = [i[0] for i in rows]
        clean_names = '`' + '`, `'.join(map(str, dirty_names)) + '`'
        return clean_names

https://gist.github.com/tombohub/0c666583c48c1686c736ae2eb76cb2ea
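
A hypothetical usage sketch, assuming the methods above live on a Database class that wraps a SQLAlchemy engine/connection as self.conn and exposes save_dataframe():

import pandas as pd

# Database is assumed to be the class the methods above were copied from.
db = Database("mysql+pymysql://user:password@localhost/mydb")
df = pd.DataFrame({"id": [1, 2, 3], "value": ["a", "b", "c"]})

# Rows whose primary key already exists in my_table are silently ignored.
db.save_dataframe(df, table="my_table")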

Is this issue resolved yet?

+1 for append_skipdupes

The official discussion in #49246 suggests that this issue is not a current focus point of pandas; I suggest closing.

https://github.com/pandas-dev/pandas/issues/15988#issuecomment-823602644 from tombohub seems to be a workaround for it.

For upsert, just swap the INSERT IGNORE for a REPLACE INTO (or an INSERT ... ON DUPLICATE KEY UPDATE).
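
A rough illustration of that swap, with placeholder names: build a REPLACE INTO query instead of the INSERT IGNORE used in the script above (INSERT ... ON DUPLICATE KEY UPDATE is the alternative if you want to update columns selectively rather than delete-and-reinsert):

# Placeholder names; `columns` mirrors what _table_column_names() would return.
table, temp_table = "my_table", "tmp_abcdefghij"
columns = "`id`, `value`"

# REPLACE deletes the conflicting row and inserts the new one (MySQL/MariaDB).
insert_query = f"REPLACE INTO {table} ({columns}) SELECT {columns} FROM `{temp_table}`"
print(insert_query)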

In the meantime you can use this: https://pypi.org/project/pangres/

Another workaround with MariaDB and MySQL: write the DataFrame with df.to_csv("test.csv"), then use LOAD DATA INFILE 'test.csv' IGNORE INTO TABLE mytable or LOAD DATA INFILE 'test.csv' REPLACE INTO TABLE mytable.

LOAD DATA is much faster than INSERT.

complete code:

from pathlib import Path

# application_path, db and table_name are defined elsewhere in the application.
csv_path = str(Path(application_path) / "tmp" / "tmp.csv").replace("\\", "\\\\")
df.to_csv(csv_path, index=False, sep='\t', quotechar="'", na_rep=r'\N')
rq = """LOAD DATA LOCAL INFILE '{file_path}' REPLACE INTO TABLE {db}.{db_table}
        LINES TERMINATED BY '\\r\\n'
        IGNORE 1 LINES
        ({col});
        """.format(db=db,
                   file_path=csv_path,
                   db_table=table_name,
                   col=",".join(df.columns.tolist()))
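
A hedged sketch of actually running that query, assuming pymysql and that LOCAL INFILE is enabled on both the client and the server (connection details are placeholders):

import pymysql

conn = pymysql.connect(host="localhost", user="user", password="password",
                       database=db, local_infile=True)
with conn.cursor() as cur:
    cur.execute(rq)   # rq is the LOAD DATA query built above
conn.commit()
conn.close()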

+1 for append_skipdupes. IMO, an option to update the duplicates would also be nice. Perhaps append_updatedupes.

Agreed that it would be good to be able to deal with this with options in df.to_sql().

Here’s the workaround I use in sqlite:

CREATE TABLE IF NOT EXISTS my_table_name (
    some_kind_of_id INT PRIMARY KEY ON CONFLICT IGNORE,
    ...

Then, when I insert duplicates, they get silently ignored, and the non-duplicates are processed correctly. In my case, the data are (i.e. should be) static, so I don’t need to update. It’s just that the form of the data feed is such that I’ll get duplicates that are ignorable.

Is there a postgresql equivalent for this?
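
(One possible PostgreSQL counterpart, sketched here with placeholder names: stage the rows in a throwaway table via to_sql, then copy them over with INSERT ... ON CONFLICT DO NOTHING, which requires PostgreSQL 9.5+.)

from sqlalchemy import create_engine, text

engine = create_engine("postgresql+psycopg2://user:password@localhost/mydb")  # placeholder DSN

# df holds the rows to append; stage them in a temporary table first.
df.to_sql("my_table_name_staging", engine, if_exists="replace", index=False)

with engine.begin() as conn:
    # Rows whose some_kind_of_id already exists in my_table_name are skipped.
    conn.execute(text("""
        INSERT INTO my_table_name
        SELECT * FROM my_table_name_staging
        ON CONFLICT (some_kind_of_id) DO NOTHING
    """))
    conn.execute(text("DROP TABLE my_table_name_staging"))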

+1 for append_skipdupes