pandas: When using to_sql(), continue if duplicate primary keys are detected?

Code Sample, a copy-pastable example if possible

df.to_sql('TableNameHere', engine, if_exists='append', chunksize=900, index=False)

Problem description

I am trying to append a large DataFrame to a SQL table. Some of the rows in the DataFrame are duplicates of those in the SQL table, some are not. But to_sql() completely stops executing if even one duplicate is detected.

It would make sense for to_sql(if_exists='append') to merely warn the user which rows had duplicate keys and just continue to add the new rows, not completely stop executing. For large datasets, you will likely have duplicates but want to ignore them.

Maybe add an argument to ignore duplicates and keep executing? Perhaps an additional if_exists option like 'append_skipdupes'?

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.6.0.final.0
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 60 Stepping 3, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: English_United States.1252

pandas: 0.19.2
nose: None
pip: 9.0.1
setuptools: 28.8.0
Cython: None
numpy: 1.12.0
scipy: None
statsmodels: None
xarray: None
IPython: 5.3.0
sphinx: None
patsy: None
dateutil: 2.6.0
pytz: 2016.10
blosc: None
bottleneck: None
tables: None
numexpr: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 0.999999999
httplib2: None
apiclient: None
sqlalchemy: 1.1.9
pymysql: None
psycopg2: None
jinja2: 2.9.5
boto: None
pandas_datareader: None

About this issue

  • Original URL
  • State: closed
  • Created 7 years ago
  • Reactions: 459
  • Comments: 41 (5 by maintainers)

Most upvoted comments

This should also support an “on duplicate update” mode.

append_skipdupes would be the perfect way to handle this.

A workaround would be to remove the unique index in the database:

sqlquery = "ALTER TABLE `DATABASE`.`TABLE` DROP INDEX `idx_name`"

Afterwards, df.to_sql('TableNameHere', engine, if_exists='append', chunksize=900, index=False) can be executed.

Then let your MySQL server add the index again and drop the duplicates:

sqlquery = "ALTER IGNORE TABLE `DATABASE`.`TABLE` ADD UNIQUE INDEX `idx_name` (`column_name1` ASC, `column_name2` ASC, `column_name3` [ASC | DESC])"

Depending on your specific application, this can be helpful. Still, an if_exists option like append_skipdupes would be much better.
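
A minimal sketch of that drop-index / append / re-index flow with SQLAlchemy; the connection string, table, index, and column names below are placeholders, and ALTER IGNORE only exists in MariaDB and older MySQL:

from sqlalchemy import create_engine, text

# Placeholder connection string -- adjust for your server.
engine = create_engine("mysql+pymysql://user:password@localhost/DATABASE")

# 1. Drop the unique index so duplicate keys no longer abort the insert.
with engine.begin() as conn:
    conn.execute(text("ALTER TABLE `DATABASE`.`TABLE` DROP INDEX `idx_name`"))

# 2. Append the DataFrame as usual (df is the DataFrame from the example above).
df.to_sql('TABLE', engine, if_exists='append', chunksize=900, index=False)

# 3. Re-create the unique index; ALTER IGNORE silently drops the duplicate rows,
#    but it was removed in MySQL 5.7, so on newer MySQL de-duplicate manually first.
with engine.begin() as conn:
    conn.execute(text(
        "ALTER IGNORE TABLE `DATABASE`.`TABLE` "
        "ADD UNIQUE INDEX `idx_name` (`column_name1` ASC, `column_name2` ASC)"
    ))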

Yes, please. ‘append_skipdupes’ should be added, and not only for the primary key column: if there are duplicates among other unique columns, it should skip appending those rows as well.

yes, append_skipdupes +1

append_skipdupes +1

Agreed that it would be good to be able to deal with this with options in df.to_sql().

Here’s the workaround I use in sqlite:

CREATE TABLE IF NOT EXISTS my_table_name (
    some_kind_of_id INT PRIMARY KEY ON CONFLICT IGNORE,
    ...

Then, when I insert duplicates, they get silently ignored, and the non-duplicates are processed correctly. In my case, the data are (i.e. should be) static, so I don’t need to update. It’s just that the form of the data feed is such that I’ll get duplicates that are ignorable.
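
For reference, a small runnable version of that pattern with the standard-library sqlite3 module (table and column names are made up):

import sqlite3
import pandas as pd

conn = sqlite3.connect("example.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS my_table_name (
        some_kind_of_id INT PRIMARY KEY ON CONFLICT IGNORE,
        value TEXT
    )
""")

df = pd.DataFrame({"some_kind_of_id": [1, 2, 3], "value": ["a", "b", "c"]})
df.to_sql("my_table_name", conn, if_exists="append", index=False)

# A second append with overlapping ids: 2 and 3 conflict and are silently
# skipped thanks to ON CONFLICT IGNORE, while 4 is inserted normally.
df2 = pd.DataFrame({"some_kind_of_id": [2, 3, 4], "value": ["b", "c", "d"]})
df2.to_sql("my_table_name", conn, if_exists="append", index=False)

print(conn.execute("SELECT * FROM my_table_name").fetchall())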

A problem not solved from 2017 to 2022?
Since if_exists operates at the table level, an extra keyword argument like ‘skip_duplicates’ would also be acceptable.

Agree ‘append_skipdupes’ should be added.

I have made small script for my use to allow INSERT IGNORE in mysql:

NOTE: This is copy-pasted from my Database class, please adjust for your use! It assumes import random, import string and import pandas as pd at module level, and that self.conn is a SQLAlchemy engine/connection.

    def save_dataframe(self, df: pd.DataFrame, table: str):
        '''
        Save a dataframe to the database.
        The index is saved only if it has a name; an unnamed index is dropped.
        Rows are inserted into the MySQL table with INSERT IGNORE.
        The table needs to exist beforehand.

        Arguments:
            df {pd.DataFrame} -- dataframe to save
            table {str} -- name of the db table
        '''
        # Only persist the index when it is a named (meaningful) column.
        save_index = df.index.name is not None

        self._insert_conflict_ignore(df=df, table=table, index=save_index)

   
    def _insert_conflict_ignore(self, df: pd.DataFrame, table: str, index: bool):
        """
        Save the dataframe to the MySQL database with an 'INSERT IGNORE' query.

        First it uses pandas.to_sql to save to a temporary table.
        Then it uses SQL to transfer the data to the destination table, matching the columns.
        The destination table needs to exist already.
        The final step is dropping the temporary table.

        Parameters
        ----------
        df : pd.DataFrame
            dataframe to save
        table : str
            destination table name
        """
        # generate random table name for concurrent writing
        temp_table = ''.join(random.choice(string.ascii_letters) for i in range(10))
        try:
            df.to_sql(temp_table, self.conn, index=index)
            columns = self._table_column_names(table=temp_table)
            insert_query = f'INSERT IGNORE INTO {table}({columns}) SELECT {columns} FROM `{temp_table}`'
            self.conn.execute(insert_query)
        except Exception as e:
            print(e)        

        # drop temp table
        drop_query = f'DROP TABLE IF EXISTS `{temp_table}`'
        self.conn.execute(drop_query)


    def _table_column_names(self, table: str) -> str:
        """
        Get column names from database table

        Parameters
        ----------
        table : str
            name of the table

        Returns
        -------
        str
            names of columns as a string so we can interpolate into the SQL queries
        """
        query = f"SELECT column_name FROM information_schema.columns WHERE table_name = '{table}'"
        rows = self.conn.execute(query)
        dirty_names = [i[0] for i in rows]
        clean_names = '`' + '`, `'.join(map(str, dirty_names)) + '`'
        return clean_names

https://gist.github.com/tombohub/0c666583c48c1686c736ae2eb76cb2ea
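
A hypothetical usage sketch, assuming the methods above live on a Database class that wraps a SQLAlchemy engine/connection as self.conn and exposes save_dataframe():

import pandas as pd

# Database is assumed to be the class the methods above were copied from.
db = Database("mysql+pymysql://user:password@localhost/mydb")
df = pd.DataFrame({"id": [1, 2, 3], "value": ["a", "b", "c"]})

# Rows whose primary key already exists in my_table are silently ignored.
db.save_dataframe(df, table="my_table")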

Is this issue resolved yet?

+1 for append_skipdupes

The official discussion in #49246 suggests that this issue is not a current focus point of pandas; I suggest closing.

https://github.com/pandas-dev/pandas/issues/15988#issuecomment-823602644 from tombohub seems to be a workaround for it.

For upsert, just swap the INSERT IGNORE for a REPLACE INTO (or an INSERT ... ON DUPLICATE KEY UPDATE).
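
A rough illustration of that swap, with placeholder names: build a REPLACE INTO query instead of the INSERT IGNORE used in the script above (INSERT ... ON DUPLICATE KEY UPDATE is the alternative if you want to update columns selectively rather than delete-and-reinsert):

# Placeholder names; `columns` mirrors what _table_column_names() would return.
table, temp_table = "my_table", "tmp_abcdefghij"
columns = "`id`, `value`"

# REPLACE deletes the conflicting row and inserts the new one (MySQL/MariaDB).
insert_query = f"REPLACE INTO {table} ({columns}) SELECT {columns} FROM `{temp_table}`"
print(insert_query)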

In the meantime you can use this: https://pypi.org/project/pangres/

Another workaround with MariaDB and MySQL: write the DataFrame with df.to_csv("test.csv"), then use LOAD DATA INFILE 'test.csv' IGNORE INTO TABLE mytable or LOAD DATA INFILE 'test.csv' REPLACE INTO TABLE mytable.

LOAD DATA is much faster than INSERT.

complete code:

from pathlib import Path

# application_path, db and table_name are defined elsewhere in the application.
csv_path = str(Path(application_path) / "tmp" / "tmp.csv").replace("\\", "\\\\")
df.to_csv(csv_path, index=False, sep='\t', quotechar="'", na_rep=r'\N')
rq = """LOAD DATA LOCAL INFILE '{file_path}' REPLACE INTO TABLE {db}.{db_table}
        LINES TERMINATED BY '\\r\\n'
        IGNORE 1 LINES
        ({col});
        """.format(db=db,
                   file_path=csv_path,
                   db_table=table_name,
                   col=",".join(df.columns.tolist()))
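
A hedged sketch of actually running that query, assuming pymysql and that LOCAL INFILE is enabled on both the client and the server (connection details are placeholders):

import pymysql

conn = pymysql.connect(host="localhost", user="user", password="password",
                       database=db, local_infile=True)
with conn.cursor() as cur:
    cur.execute(rq)   # rq is the LOAD DATA query built above
conn.commit()
conn.close()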

+1 for append_skipdupes. IMO, an option to update the duplicates would also be nice. Perhaps append_updatedupes.

Agreed that it would be good to be able to deal with this with options in df.to_sql().

Here’s the workaround I use in sqlite:

CREATE TABLE IF NOT EXISTS my_table_name (
    some_kind_of_id INT PRIMARY KEY ON CONFLICT IGNORE,
    ...

Then, when I insert duplicates, they get silently ignored, and the non-duplicates are processed correctly. In my case, the data are (i.e. should be) static, so I don’t need to update. It’s just that the form of the data feed is such that I’ll get duplicates that are ignorable.

Is there a postgresql equivalent for this?
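
(One possible PostgreSQL counterpart, sketched here with placeholder names: stage the rows in a throwaway table via to_sql, then copy them over with INSERT ... ON CONFLICT DO NOTHING, which requires PostgreSQL 9.5+.)

from sqlalchemy import create_engine, text

engine = create_engine("postgresql+psycopg2://user:password@localhost/mydb")  # placeholder DSN

# df holds the rows to append; stage them in a temporary table first.
df.to_sql("my_table_name_staging", engine, if_exists="replace", index=False)

with engine.begin() as conn:
    # Rows whose some_kind_of_id already exists in my_table_name are skipped.
    conn.execute(text("""
        INSERT INTO my_table_name
        SELECT * FROM my_table_name_staging
        ON CONFLICT (some_kind_of_id) DO NOTHING
    """))
    conn.execute(text("DROP TABLE my_table_name_staging"))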

+1 for append_skipdupes