pandas: When using to_sql(), continue if duplicate primary keys are detected?
Code Sample, a copy-pastable example if possible
df.to_sql('TableNameHere', engine, if_exists='append', chunksize=900, index=False)
Problem description
I am trying to append a large DataFrame to a SQL table. Some of the rows in the DataFrame are duplicates of those in the SQL table, some are not. But to_sql() completely stops executing if even one duplicate is detected.
It would make sense for to_sql(if_exists='append') to merely warn the user which rows had duplicate keys and continue adding the new rows, rather than stop executing entirely. For large datasets, you will likely have duplicates that you simply want to ignore.
Maybe add an argument to ignore duplicates and keep executing? Perhaps an additional if_exists option like 'append_skipdupes'?
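A possible workaround on the pandas side, for now, is to drop the rows whose keys already exist before appending (a rough sketch; 'id' stands in for whatever the primary key column actually is):

import pandas as pd

# Read the keys already present, then append only the genuinely new rows
existing = pd.read_sql('SELECT id FROM TableNameHere', engine)
new_rows = df[~df['id'].isin(existing['id'])]
new_rows.to_sql('TableNameHere', engine, if_exists='append', chunksize=900, index=False)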
Output of pd.show_versions()
pandas: 0.19.2
nose: None
pip: 9.0.1
setuptools: 28.8.0
Cython: None
numpy: 1.12.0
scipy: None
statsmodels: None
xarray: None
IPython: 5.3.0
sphinx: None
patsy: None
dateutil: 2.6.0
pytz: 2016.10
blosc: None
bottleneck: None
tables: None
numexpr: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 0.999999999
httplib2: None
apiclient: None
sqlalchemy: 1.1.9
pymysql: None
psycopg2: None
jinja2: 2.9.5
boto: None
pandas_datareader: None
About this issue
- Original URL
- State: closed
- Created 7 years ago
- Reactions: 459
- Comments: 41 (5 by maintainers)
This should support an "on duplicate key update" mode as well.
append_skipdupes would be the perfect way to handle this. A workaround would be to remove the unique index in the database:
sqlquery = "ALTER TABLE `DATABASE`.`TABLE` DROP INDEX `idx_name`"
Afterwards, df.to_sql('TableNameHere', engine, if_exists='append', chunksize=900, index=False) can be executed. Then just let your MySQL server add the index again and drop the duplicates:
sqlquery = "ALTER IGNORE TABLE `DATABASE`.`TABLE` ADD UNIQUE INDEX `idx_name` (`column_name1` ASC, `column_name2` ASC, `column_name3` [ASC | DESC])"
Depending on your specific application, this can be helpful. Anyway, an if_exists option like append_skipdupes would be much better.
Yes, please. append_skipdupes should be added, and not only for the primary key column. If there are duplicates among other unique columns, it should also skip appending those new duplicate rows.
yes, append_skipdupes +1
append_skipdupes +1
Agreed that it would be good to be able to deal with this with options in df.to_sql(). Here's the workaround I use in sqlite:
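A minimal sketch of one way to get that behaviour through to_sql's method hook (needs pandas >= 0.24; the function name and the table/engine objects here are illustrative):

import sqlalchemy

def insert_or_ignore(table, conn, keys, data_iter):
    # table.table is the underlying SQLAlchemy Table object;
    # prefix the statement so it becomes INSERT OR IGNORE and
    # rows that collide with an existing key are silently skipped
    rows = [dict(zip(keys, row)) for row in data_iter]
    stmt = sqlalchemy.insert(table.table).prefix_with("OR IGNORE")
    conn.execute(stmt, rows)

df.to_sql('TableNameHere', engine, if_exists='append', index=False,
          method=insert_or_ignore)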
Then, when I insert duplicates, they get silently ignored, and the non-duplicates are processed correctly. In my case, the data are (i.e. should be) static, so I don’t need to update. It’s just that the form of the data feed is such that I’ll get duplicates that are ignorable.
A problem not solved from 2017 to 2022?
if_exists operates at the table level, so an extra keyword arg like 'skip_duplicates' would also be acceptable.
Agree, 'append_skipdupes' should be added.
I have made a small script for my own use that allows INSERT IGNORE in MySQL:
NOTE: This is a copy-paste from my Database class, please adjust for your use!
https://gist.github.com/tombohub/0c666583c48c1686c736ae2eb76cb2ea
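In spirit, such a helper can be as small as a method callable passed to to_sql that renders an INSERT IGNORE statement (a rough sketch, not the gist's exact code; needs pandas >= 0.24 and a MySQL engine):

import sqlalchemy

def insert_ignore(table, conn, keys, data_iter):
    # Build "INSERT IGNORE INTO <table> (cols) VALUES (:col1, :col2, ...)"
    # and execute it for every row, so duplicate keys are skipped
    cols = ", ".join(keys)
    params = ", ".join(f":{k}" for k in keys)
    stmt = sqlalchemy.text(
        f"INSERT IGNORE INTO {table.name} ({cols}) VALUES ({params})"
    )
    conn.execute(stmt, [dict(zip(keys, row)) for row in data_iter])

df.to_sql('TableNameHere', engine, if_exists='append', index=False,
          method=insert_ignore)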
Is this issue resolved yet?
+1 for append_skipdupes
The official discussion in #49246 suggests that this issue is not currently a focus for pandas; I suggest closing.
https://github.com/pandas-dev/pandas/issues/15988#issuecomment-823602644 from tombohub seems to be a workaround for it.
For upsert, just swap out the insert ignore part. In the meantime you can use this: https://pypi.org/project/pangres/
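One way to do the upsert variant through the same method hook (a minimal sketch, assuming a MySQL engine, pandas >= 0.24, and SQLAlchemy >= 1.2, which added on_duplicate_key_update; the function name is illustrative):

from sqlalchemy.dialects.mysql import insert

def upsert(table, conn, keys, data_iter):
    # table.table is the underlying SQLAlchemy Table object
    rows = [dict(zip(keys, row)) for row in data_iter]
    stmt = insert(table.table).values(rows)
    # On a key collision, overwrite each column with the incoming value
    stmt = stmt.on_duplicate_key_update({k: stmt.inserted[k] for k in keys})
    conn.execute(stmt)

df.to_sql('TableNameHere', engine, if_exists='append', index=False, method=upsert)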
Another workaround with MariaDB and MySQL:
df.to_csv("test.csv"), then use LOAD DATA INFILE 'test.csv' IGNORE INTO TABLE mytable or LOAD DATA INFILE 'test.csv' REPLACE INTO TABLE mytable. LOAD DATA is much faster than INSERT.
complete code:
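A rough sketch of how the two steps can be combined (assuming a SQLAlchemy engine with local_infile enabled on both client and server; the table and file names are placeholders):

import sqlalchemy

# Dump the frame without the index, then bulk-load it, ignoring duplicate keys
df.to_csv("test.csv", index=False)
with engine.begin() as conn:
    conn.execute(sqlalchemy.text(
        "LOAD DATA LOCAL INFILE 'test.csv' "
        "IGNORE INTO TABLE mytable "
        "FIELDS TERMINATED BY ',' "
        "IGNORE 1 LINES"  # skip the CSV header row
    ))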
+1 for append_skipdupes. IMO, an option to update the duplicates would also be nice. Perhaps append_updatedupes.
Is there a postgresql equivalent for this?
+1 for append_skipdupes