pandas: BUG: read_excel with openpyxl results in empty data frame

  • [ X ] I have checked that this issue has not already been reported.

  • [ X ] I have confirmed this bug exists on the latest version of pandas.

  • (optional) I have confirmed this bug exists on the master branch of pandas.


Old behavior with xlrd 1.2.0 (last version with XLSX support)

>>> import pandas as pd
>>> df=pd.read_excel('infile.xlsx', engine='xlrd')
<stdin>:1: FutureWarning: Your version of xlrd is 1.2.0. In xlrd >= 2.0, only the xls format is supported. As a result, the openpyxl engine will be used if it is installed and the engine argument is not specified. Install openpyxl instead.
>>> df.shape
(843, 11)

File is read perfectly and dataframe is ok, with all 11 columns.

Problem description

When using openpyxl as engine for read_excel:

>>> import pandas as pd
>>> df=pd.read_excel('infile.xlsx', engine='openpyxl')
>>> print(df)
Empty DataFrame
Columns: [Column1, Column2, Column3]
Index: []
>>> df.shape
(0, 3)

Only the 3 first headers are read, and it stops.

However, if the file is opened directly with openpyxl, it works fine, I’m using it to open and save the file with a temporary name to make openpyxl work as engine on Pandas:

>>> wb = load_workbook('infile.xlsx')
>>> wb.save(filename = 'temp.xlsx')
>>> pd.read_excel('temp.xlsx', engine='openpyxl')
>>> df.shape
(843, 11)

INSTALLED VERSIONS

commit : 3e89b4c4b1580aa890023fc550774e63d499da25 python : 3.8.6.final.0 python-bits : 64 OS : Linux OS-release : 4.19.128-microsoft-standard Version : #1 SMP Tue Jun 23 12:58:10 UTC 2020 machine : x86_64 processor : x86_64 byteorder : little LC_ALL : None LANG : C.UTF-8 LOCALE : en_US.UTF-8

pandas : 1.2.0 numpy : 1.19.5 pytz : 2020.5 dateutil : 2.8.1 pip : 20.1.1 setuptools : 49.3.1 Cython : 0.29.21 pytest : None hypothesis : None sphinx : None blosc : None feather : None xlsxwriter : 1.3.7 lxml.etree : 4.6.2 html5lib : None pymysql : None psycopg2 : None jinja2 : 2.11.2 IPython : 7.19.0 pandas_datareader: None bs4 : None bottleneck : None fsspec : 0.8.5 fastparquet : None gcsfs : None matplotlib : 3.3.3 numexpr : None odfpy : None openpyxl : 3.0.5 pandas_gbq : None pyarrow : None pyxlsb : None s3fs : None scipy : 1.6.0 sqlalchemy : None tables : None tabulate : 0.8.7 xarray : None xlrd : 1.2.0 xlwt : 1.3.0 numba : None

About this issue

  • Original URL
  • State: closed
  • Created 3 years ago
  • Reactions: 10
  • Comments: 25 (12 by maintainers)

Most upvoted comments

Turning read-only mode off has other implications though, in particular higher memory consumption.

According to openpyxl’s docs, the recommended solution is to call sheet.reset_dimensions(). I. e. at the beginning of OpenpyxlReader.get_sheet_data():

    if sheet.calculate_dimension() == "A1:A1":
            sheet.reset_dimensions()

I monkeypatched this for our application and it seems to work fine.

All benchmarks below are on an excel file created via pd.DataFrame(np.random.rand(10000, 10)).to_excel("test.xlsx"), looped for 20 iterations. The code for each iteration is pd.read_excel('test.xlsx', engine='openpyxl'), where I am modifying the pandas internals to make sure there aren’t any gotchas.

Using read_only=False takes 24.532 seconds as opposed to 19.897 with read_only=True. Calling reset_dimension is just the line:

self._max_row = self._max_column = None

However, by resetting the dimension alone openpyxl will no longer pad the rows when reading, and this results in the behavior that is the cause of #38956. To get correct results, one must also call calculate_dimension which unfortunately iterates in python space. It then seems to me that the approach taken in #39486, along with calling reset_dimension is the best one. This takes 20.183 seconds.

Hmm sorry, hadn’t realized yours was actually set to A1:A3. That does indeed complicate matters 😕 (also you really need to post that now 😉

With both @xmatthias sample and ours (the files we received were produced by Apache POI by the way), they had the dimension element set to plain <dimension ref="A1"/>, i.e. they did just not bother to compute it. I’m attaching a (redacted) example.

The only ways I can think of then are either to always call reset_dimensions() (I wager this is still faster than switching read-only-mode off), or introduce a switch to force the call.

apache-poi-sample.xlsx

Per https://github.com/pandas-dev/pandas/issues/39001#issuecomment-759862350

  if sheet.calculate_dimension() == "A1:A1":
            sheet.reset_dimensions()

this check would fail for @luciodaou 's file (they get a dimension of A1:C1). Say even if the check gets changed to both the numbers being 1

There may then be cases where sheet.calculate_dimension() returns A1:B2 which could be “visually inaccurate”

Not sure if there’s a good way around this.

The maintainer off xlrd explicitly talks about security vulnerabilities in every version of xlrd, you can always use engine="xlrd"

Well it’s always possible to version-pin xlrd to the latest working version until the regression in pandas is fixed / a solution to this is found, so i don’t really see that as blocking argument.

I also downgraded pandas (and xlrd) again after running into this issue.

Honestly, i’d also rather like to see a proper fix, as otherwise this will keep comming up again and again - but i don’t think a fix that moves the responsibility to figure out the correct parameters to the user is going to get much love from the community.

@luciodaou could you try the snippet I posted above and see if you see a difference for your file?

To clarify - I don’t necessarily think what openpyxl is doing is wrong. Openpyxl’s read_only mode relies on whatever is generating the files to provide correct sheet info. The dimension being returned for the file that @xmatthias shared is A1 (if you unzip the excel file and parse the XML for that worksheet you get - <dimension ref="A1"/>). I created a random file via Excel and I can see it saves the dimension as A1:B10. The reason re-saving using openpyxl works is probably because they correctly re-compute the dimensions of the sheet and save it back.

Currently pandas uses read_only=True as a default and doesn’t provide an interface for the user to change the parameters being used. Maybe a good option here would be to add in a parameter to the read_excel API to pandas called engine_params that allows the user to be able to override the defaults used by pandas.

I do have the same exact thing with one file, while another file works just fine.

Unfortunately, in my case, i’m unable to share the broken file as it’s a file containing proprietary information.

The file was generated by an automated export system - Excel (and pandas with xlrd==1.2.0) open the file without error - however resaving the file via excel slightly reduces the filesize - and fixes the problem with openpyxl.

However, opening and resaving the file cannot be the correct solution, as it’s supposed to be an automated system without me (or excel) in the loop.

EDIT i’ve been able to find another file i am able to share where the same thing happens. xxx_test.xlsx

Hi, thanks for your report. Could you provide your file?