pandas: BUG: ExcelWriter with mode='a' corrupts file

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • (optional) I have confirmed this bug exists on the master branch of pandas.


Note: Please read this guide detailing how to provide the necessary information for us to reproduce your bug.

Code Sample, a copy-pastable example

import pandas as pd

df = pd.DataFrame()

with pd.ExcelWriter('test.xlsx', mode='w', engine='openpyxl') as writer:
    df.to_excel(writer)

with pd.ExcelWriter('test.xlsx', mode='a', engine='openpyxl') as writer:
    df.to_excel(writer)

Problem description

Appending to excel file with pandas>=1.2 produces file which Excel reports as corrupted and opens only after repair, forcing you to first open and repair file before passing it to someone else. Looking at file with 7-Zip File Manager shows that after writing file with mode=‘a’, some .xml files (app.xml, core.xml, theme1.xml, sheet1.xml) inside archive are duplicated (somehow). After Excel rapars file duplicates are gone, everything else looks the same.

With pandas 1.1.5 and same versions of other packages everything works as expected.

Expected Output

File which could be opened by Excel without any warnings.

Output of pd.show_versions()

INSTALLED VERSIONS

commit : 9d598a5e1eee26df95b3910e3f2934890d062caa python : 3.8.6.final.0 python-bits : 64 OS : Windows OS-release : 10 Version : 10.0.19041 machine : AMD64 processor : Intel64 Family 6 Model 158 Stepping 10, GenuineIntel byteorder : little LC_ALL : None LANG : None LOCALE : Russian_Russia.1251 pandas : 1.2.1 numpy : 1.20.0 pytz : 2021.1 dateutil : 2.8.1 pip : 21.0.1 setuptools : 49.6.0.post20210108 Cython : None pytest : None hypothesis : None sphinx : None blosc : None feather : None xlsxwriter : None lxml.etree : 4.6.2 html5lib : None pymysql : None psycopg2 : None jinja2 : None IPython : None pandas_datareader: None bs4 : None bottleneck : None fsspec : None fastparquet : None gcsfs : None matplotlib : None numexpr : None odfpy : None openpyxl : 3.0.6 pandas_gbq : None pyarrow : None pyxlsb : None s3fs : None scipy : None sqlalchemy : None tables : None tabulate : None xarray : None xlrd : None xlwt : None numba : None

About this issue

  • Original URL
  • State: closed
  • Created 3 years ago
  • Comments: 19 (15 by maintainers)

Most upvoted comments

@twoertwein thanks for checking, should have looked at the code. Your propsed example works for me.

And I agree that the previous example is not expected to work.

Yep, works now as it should.

The root cause is slightly simpler then I first thought: 1.2 opens the file for reading and writing. But I forgot to call seek(0) after reading, which made openpyxl append a second zip file to the first zip file instead of replacing the content. A simple seek(0) fixes the issue 😃 I also added truncate() in case the combined workbook happens to be smaller than the existing workbook (not sure whether that is possible).

I’m also able to reproduce it on linux (debian) and unzip doesn’t show any duplicates, but p7zip package does.

File created with pandas 1.1:

$ 7z l test-pandas1.1.xlsx
7-Zip [64] 16.02 : Copyright (c) 1999-2016 Igor Pavlov : 2016-05-21
p7zip Version 16.02

Scanning the drive for archives:
1 file, 5179 bytes (6 KiB)

Listing archive: test-pandas1.1.xlsx

--
Path = test-pandas1.1.xlsx
Type = zip
Physical Size = 5179

   Date      Time    Attr         Size   Compressed  Name
------------------- ----- ------------ ------------  ------------------------
2021-02-04 12:01:10 .....          177          129  docProps/app.xml
2021-02-04 12:01:10 .....          459          233  docProps/core.xml
2021-02-04 12:01:10 .....        10140         1552  xl/theme/theme1.xml
2021-02-04 12:01:10 .....          456          273  xl/worksheets/sheet1.xml
2021-02-04 12:01:10 .....          456          273  xl/worksheets/sheet2.xml
2021-02-04 12:01:12 .....         2631          594  xl/styles.xml
2021-02-04 12:01:12 .....          534          192  _rels/.rels
2021-02-04 12:01:12 .....          617          316  xl/workbook.xml
2021-02-04 12:01:12 .....          653          180  xl/_rels/workbook.xml.rels
2021-02-04 12:01:12 .....         1120          287  [Content_Types].xml
------------------- ----- ------------ ------------  ------------------------
2021-02-04 12:01:12              17243         4029  10 files

File created with pandas 1.2:

$ 7z l test-pandas1.2.xlsx
7-Zip [64] 16.02 : Copyright (c) 1999-2016 Igor Pavlov : 2016-05-21
p7zip Version 16.02

Scanning the drive for archives:
1 file, 7562 bytes (8 KiB)

Listing archive: test-pandas1.2.xlsx

--
Path = test-pandas1.2.xlsx
Type = zip
Physical Size = 7562
Embedded Stub Size = 2383

   Date      Time    Attr         Size   Compressed  Name
------------------- ----- ------------ ------------  ------------------------
2021-02-04 12:02:24 .....          177          129  docProps/app.xml
2021-02-04 12:02:24 .....          459          233  docProps/core.xml
2021-02-04 12:02:24 .....        10140         1552  xl/theme/theme1.xml
2021-02-04 12:02:24 .....          456          273  xl/worksheets/sheet1.xml
2021-02-04 12:02:24 .....          177          129  docProps/app.xml
2021-02-04 12:02:24 .....          459          233  docProps/core.xml
2021-02-04 12:02:24 .....        10140         1552  xl/theme/theme1.xml
2021-02-04 12:02:24 .....          456          273  xl/worksheets/sheet1.xml
2021-02-04 12:02:24 .....          456          273  xl/worksheets/sheet2.xml
2021-02-04 12:02:24 .....         2631          594  xl/styles.xml
2021-02-04 12:02:24 .....          534          192  _rels/.rels
2021-02-04 12:02:24 .....          617          316  xl/workbook.xml
2021-02-04 12:02:24 .....          653          180  xl/_rels/workbook.xml.rels
2021-02-04 12:02:24 .....         1120          287  [Content_Types].xml
------------------- ----- ------------ ------------  ------------------------
2021-02-04 12:02:24              28475         6216  14 files

File produced with pandas 1.2 is sugnificantly larger (Physical Size = 7562 vs Physical Size = 5179) and includes Embedded Stub. And it looks like file from pandas 1.2 larger by exactly the size of Embedded Stub 7562-2383=5179.

Looks like this explains why unzip shows no duplicates.