pandas: BUG: read_excel with openpyxl results in empty data frame
-
[ X ] I have checked that this issue has not already been reported.
-
[ X ] I have confirmed this bug exists on the latest version of pandas.
-
(optional) I have confirmed this bug exists on the master branch of pandas.
Old behavior with xlrd 1.2.0 (last version with XLSX support)
>>> import pandas as pd
>>> df=pd.read_excel('infile.xlsx', engine='xlrd')
<stdin>:1: FutureWarning: Your version of xlrd is 1.2.0. In xlrd >= 2.0, only the xls format is supported. As a result, the openpyxl engine will be used if it is installed and the engine argument is not specified. Install openpyxl instead.
>>> df.shape
(843, 11)
File is read perfectly and dataframe is ok, with all 11 columns.
Problem description
When using openpyxl as engine for read_excel:
>>> import pandas as pd
>>> df=pd.read_excel('infile.xlsx', engine='openpyxl')
>>> print(df)
Empty DataFrame
Columns: [Column1, Column2, Column3]
Index: []
>>> df.shape
(0, 3)
Only the 3 first headers are read, and it stops.
However, if the file is opened directly with openpyxl, it works fine, I’m using it to open and save the file with a temporary name to make openpyxl work as engine on Pandas:
>>> wb = load_workbook('infile.xlsx')
>>> wb.save(filename = 'temp.xlsx')
>>> pd.read_excel('temp.xlsx', engine='openpyxl')
>>> df.shape
(843, 11)
INSTALLED VERSIONS
commit : 3e89b4c4b1580aa890023fc550774e63d499da25 python : 3.8.6.final.0 python-bits : 64 OS : Linux OS-release : 4.19.128-microsoft-standard Version : #1 SMP Tue Jun 23 12:58:10 UTC 2020 machine : x86_64 processor : x86_64 byteorder : little LC_ALL : None LANG : C.UTF-8 LOCALE : en_US.UTF-8
pandas : 1.2.0 numpy : 1.19.5 pytz : 2020.5 dateutil : 2.8.1 pip : 20.1.1 setuptools : 49.3.1 Cython : 0.29.21 pytest : None hypothesis : None sphinx : None blosc : None feather : None xlsxwriter : 1.3.7 lxml.etree : 4.6.2 html5lib : None pymysql : None psycopg2 : None jinja2 : 2.11.2 IPython : 7.19.0 pandas_datareader: None bs4 : None bottleneck : None fsspec : 0.8.5 fastparquet : None gcsfs : None matplotlib : 3.3.3 numexpr : None odfpy : None openpyxl : 3.0.5 pandas_gbq : None pyarrow : None pyxlsb : None s3fs : None scipy : 1.6.0 sqlalchemy : None tables : None tabulate : 0.8.7 xarray : None xlrd : 1.2.0 xlwt : 1.3.0 numba : None
About this issue
- Original URL
- State: closed
- Created 3 years ago
- Reactions: 10
- Comments: 25 (12 by maintainers)
Turning read-only mode off has other implications though, in particular higher memory consumption.
According to openpyxl’s docs, the recommended solution is to call sheet.reset_dimensions(). I. e. at the beginning of
OpenpyxlReader.get_sheet_data()
:I monkeypatched this for our application and it seems to work fine.
All benchmarks below are on an excel file created via
pd.DataFrame(np.random.rand(10000, 10)).to_excel("test.xlsx")
, looped for 20 iterations. The code for each iteration ispd.read_excel('test.xlsx', engine='openpyxl')
, where I am modifying the pandas internals to make sure there aren’t any gotchas.Using
read_only=False
takes 24.532 seconds as opposed to 19.897 withread_only=True
. Callingreset_dimension
is just the line:However, by resetting the dimension alone openpyxl will no longer pad the rows when reading, and this results in the behavior that is the cause of #38956. To get correct results, one must also call
calculate_dimension
which unfortunately iterates in python space. It then seems to me that the approach taken in #39486, along with callingreset_dimension
is the best one. This takes 20.183 seconds.Hmm sorry, hadn’t realized yours was actually set to A1:A3. That does indeed complicate matters 😕 (also you really need to post that now 😉
With both @xmatthias sample and ours (the files we received were produced by Apache POI by the way), they had the dimension element set to plain
<dimension ref="A1"/>
, i.e. they did just not bother to compute it. I’m attaching a (redacted) example.The only ways I can think of then are either to always call
reset_dimensions()
(I wager this is still faster than switching read-only-mode off), or introduce a switch to force the call.apache-poi-sample.xlsx
Per https://github.com/pandas-dev/pandas/issues/39001#issuecomment-759862350
this check would fail for @luciodaou 's file (they get a dimension of A1:C1). Say even if the check gets changed to both the numbers being 1
There may then be cases where
sheet.calculate_dimension()
returnsA1:B2
which could be “visually inaccurate”Not sure if there’s a good way around this.
The maintainer off xlrd explicitly talks about security vulnerabilities in every version of xlrd, you can always use
engine="xlrd"
Well it’s always possible to version-pin xlrd to the latest working version until the regression in pandas is fixed / a solution to this is found, so i don’t really see that as blocking argument.
I also downgraded pandas (and xlrd) again after running into this issue.
Honestly, i’d also rather like to see a proper fix, as otherwise this will keep comming up again and again - but i don’t think a fix that moves the responsibility to figure out the correct parameters to the user is going to get much love from the community.
@luciodaou could you try the snippet I posted above and see if you see a difference for your file?
To clarify - I don’t necessarily think what
openpyxl
is doing is wrong. Openpyxl’s read_only mode relies on whatever is generating the files to provide correct sheet info. The dimension being returned for the file that @xmatthias shared isA1
(if you unzip the excel file and parse the XML for that worksheet you get -<dimension ref="A1"/>
). I created a random file via Excel and I can see it saves the dimension asA1:B10
. The reason re-saving using openpyxl works is probably because they correctly re-compute the dimensions of the sheet and save it back.Currently pandas uses
read_only=True
as a default and doesn’t provide an interface for the user to change the parameters being used. Maybe a good option here would be to add in a parameter to theread_excel
API to pandas calledengine_params
that allows the user to be able to override the defaults used by pandas.I do have the same exact thing with one file, while another file works just fine.
Unfortunately, in my case, i’m unable to share the broken file as it’s a file containing proprietary information.
The file was generated by an automated export system - Excel (and pandas with xlrd==1.2.0) open the file without error - however resaving the file via excel slightly reduces the filesize - and fixes the problem with openpyxl.
However, opening and resaving the file cannot be the correct solution, as it’s supposed to be an automated system without me (or excel) in the loop.
EDIT i’ve been able to find another file i am able to share where the same thing happens. xxx_test.xlsx
Hi, thanks for your report. Could you provide your file?