pandas: BUG: "with pd.ExcelWriter" produces a corrupt Excel file in case of .xlsm extension

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the master branch of pandas.

Reproducible Example

import pandas as pd

excel_path = r"D:\my_test.xlsm"
with pd.ExcelWriter(excel_path, mode='a') as writer:
    # df.to_excel(writer, sheet_name='Sheet1')
    pass

df_excel = pd.read_excel(excel_path, 'Sheet1')

Issue Description

The specified Excel file gets corrupted, despite the append mode. Changing its extension to .xlsx can retrieve the raw data from it, but the macro is forever gone.

In my example I used with and had only a pass statement inside it, yet the file became useless.

Excel_macro_corruption.zip

Q1: Can we know for sure that the default engine is used for .xlsm files? Q2: Should I contact OpenPyxl instead of you with this issue?

Expected Behavior

As I’ve read since my first encounter of this bug, ExcelWriter doesn’t support macros, but a warning of some kind would be nice before wiping out a carefully written VBA code from the Excel file.

Installed Versions

INSTALLED VERSIONS

commit : 945c9ed766a61c7d2c0a7cbb251b6edebf9cb7d5 python : 3.7.4.final.0 python-bits : 64 OS : Windows OS-release : 10 Version : 10.0.19041 machine : AMD64 processor : Intel64 Family 6 Model 158 Stepping 9, GenuineIntel byteorder : little LC_ALL : None LANG : None LOCALE : None.None pandas : 1.3.4 numpy : 1.17.3 pytz : 2021.3 dateutil : 2.8.2 pip : 19.3.1 setuptools : 40.8.0 Cython : None pytest : None hypothesis : None sphinx : None blosc : None feather : None xlsxwriter : None lxml.etree : None html5lib : None pymysql : None psycopg2 : None jinja2 : None IPython : None pandas_datareader: None bs4 : None bottleneck : None fsspec : None fastparquet : None gcsfs : None matplotlib : None numexpr : None odfpy : None openpyxl : 3.0.9 pandas_gbq : None pyarrow : None pyxlsb : None s3fs : None scipy : None sqlalchemy : None tables : None tabulate : None xarray : None xlrd : None xlwt : None numba : None

About this issue

  • Original URL
  • State: open
  • Created 3 years ago
  • Comments: 32 (13 by maintainers)

Most upvoted comments

I think I understand now. xlsm / xlsb files can contain macros, whereas xlsx cannot. Openpyxl allows the engine argument keep_vba. Running this on the test file here preserves the macro:

with pd.ExcelWriter(excel_path, mode='a', engine_kwargs={'keep_vba': True}) as writer:
    pass

What happens when you run this on Windows @maximrobi?

In order to have a warning, we’d need to inspect the file as well as the engine_kwargs to determine if macros will be lost, warning when they are. +1 on adding such a warning.

Do not save the writer, close it:

writer.close()

What I do to create a valid .xlsm file directly in python is also with openpyxl I do something like

from openpyxl import Workbook, load_workbook

path = './book.xlsm'
wb = Workbook()
wb.save(path)

wb = load_workbook(path, keep_vba=True)
wb.save(path)
wb.close()

The first part with Workbook() creates the file, but it is corrupted, however, when you open it with load_workbook and the keep_vba=True arg, the file is magically fixed and ready to be manipulated.

The most I could do with this was to modify pandas.io.excel._openpyxl.py.OpenpyxlWriter.__init__.book (line 65) to directly add "keep_vba = True" since this line does not include any motor_kwargs. And from what I tried this doesn’t affect .xlsx files

However to append data to existing sheet, you should do something like:

path = './file.xlsm'
sheet_name = 'sheet_name'
adf = pd.read_excel(path, sheet_name=sheet_name, engine='openpyxl')
wb = load_workbook(path, keep_vba=True)

with pd.ExcelWriter(path, engine = 'openpyxl') as writer:
    writer.book = wb
    writer.sheets = {n: wb[n] for n in wb.sheetnames}
    df.to_excel(writer, sheet_name = sheetname, index=False, header=False, startrow=len(adf)+1)

or modify again _openpyxl.py to add something like an "append" option to def write_cells()

We always write the workbook back, even if no changes are made. That’s how we lose features that are not supported by an engine.

I haven’t been able to discern what this truncate function is doing.

The truncate shouldn’t cause an issue here (it is actually needed). It is there in case the original file is larger than the new file (for example, if unsupported features are dropped by the engine). Otherwise, we would simply overwrite the first n-bits of the file and leave the remaining bits untouched, truncate removes those remaining bits.