pandas: Pandas' read_excel, ExcelFile, failing to open some .xls files.
I am trying to read in .xls
files from http://www.eia.gov/coal/data.cfm#production – specifically the Historical detailed coal production data (1983-2013) coalpublic2012.xls
file that’s freely available via the dropdown. Pandas cannot read it.
In contrast, the file for the most recent year available, 2013, coalpublic2013.xls
file, works without a problem:
import pandas as pd
df1 = pd.read_excel("coalpublic2013.xls")
but the next decade of .xls
files (2004-2012) do not load. I have looked at these files with Excel, and they open, and are not corrupted.
The error that I get from pandas is:
XLRDError Traceback (most recent call last)
<ipython-input-28-0da33766e9d2> in <module>()
----> 1 df = pd.read_excel("coalpublic2012.xlsx")
/Users/jonathan/anaconda/lib/python2.7/site-packages/pandas/io/excel.pyc in read_excel(io, sheetname, header, skiprows, skip_footer, index_col, parse_cols, parse_dates, date_parser, na_values, thousands, convert_float, has_index_names, converters, engine, **kwds)
161
162 if not isinstance(io, ExcelFile):
--> 163 io = ExcelFile(io, engine=engine)
164
165 return io._parse_excel(
/Users/jonathan/anaconda/lib/python2.7/site-packages/pandas/io/excel.pyc in __init__(self, io, **kwds)
204 self.book = xlrd.open_workbook(file_contents=data)
205 else:
--> 206 self.book = xlrd.open_workbook(io)
207 elif engine == 'xlrd' and isinstance(io, xlrd.Book):
208 self.book = io
/Users/jonathan/anaconda/lib/python2.7/site-packages/xlrd/__init__.pyc in open_workbook(filename, logfile, verbosity, use_mmap, file_contents, encoding_override, formatting_info, on_demand, ragged_rows)
433 formatting_info=formatting_info,
434 on_demand=on_demand,
--> 435 ragged_rows=ragged_rows,
436 )
437 return bk
/Users/jonathan/anaconda/lib/python2.7/site-packages/xlrd/book.pyc in open_workbook_xls(filename, logfile, verbosity, use_mmap, file_contents, encoding_override, formatting_info, on_demand, ragged_rows)
89 t1 = time.clock()
90 bk.load_time_stage_1 = t1 - t0
---> 91 biff_version = bk.getbof(XL_WORKBOOK_GLOBALS)
92 if not biff_version:
93 raise XLRDError("Can't determine file's BIFF version")
/Users/jonathan/anaconda/lib/python2.7/site-packages/xlrd/book.pyc in getbof(self, rqd_stream)
1228 bof_error('Expected BOF record; met end of file')
1229 if opcode not in bofcodes:
-> 1230 bof_error('Expected BOF record; found %r' % self.mem[savpos:savpos+8])
1231 length = self.get2bytes()
1232 if length == MY_EOF:
/Users/jonathan/anaconda/lib/python2.7/site-packages/xlrd/book.pyc in bof_error(msg)
1222 if DEBUG: print("reqd: 0x%04x" % rqd_stream, file=self.logfile)
1223 def bof_error(msg):
-> 1224 raise XLRDError('Unsupported format, or corrupt file: ' + msg)
1225 savpos = self._position
1226 opcode = self.get2bytes()
XLRDError: Unsupported format, or corrupt file: Expected BOF record; found '<?xml ve'```
And I have tried various other things:
```df = pd.ExcelFile("coalpublic2012.xls", encoding_override='cp1252')
import xlrd
wb = xlrd.open_workbook("coalpublic2012.xls")
to no avail. My pandas version: 0.17.0
About this issue
- Original URL
- State: closed
- Created 9 years ago
- Comments: 28 (9 by maintainers)
Hi, I had faced the same
xlrd.biffh.XLRDError: Unsupported format, or corrupt file: Expected BOF record;
error and solved it by writing an XML to XLSX converter.@jbwhit I have run the following code:
This reads the file successfully without giving any error. But, it gives all the data in the exact format as mentioned. So, you may have to do extra efforts in order to process the data after reading it successfully.
I was receiving “XLRDError: Unsupported format, or corrupt file: Expected BOF record; found b’<head><M’”
The function pandas.read_html function worked for me, it returns a list of dataframes:
@darshanlol If you follow the various threads, you’ll find that there are valid Excel files that cannot be read by Pandas, and that no one thinks this is a bug.
xlrd
problem, not a pandas problem, and will close (this) threadxlrd
here will say, “the file has been saved as “XML Spreadsheet (*.xml)” i.e. NOT in XLS or XLSX format, not supported by xlrd”, and will close the threadAs far as I can tell it’s a “won’t fix” situation by everyone involved.
Although a non-Python solution is not very satisfying, this is an issue your students will face “in the wild”. If one has access to a Windows 10 OS, PowerShell can be used to iterate over each of the 1997-2003 excel XML formatted files and convert them to the current .xlsx format. This is a quick and dirty solution I resort to in my day-to-day work. An example is attached convertXls2Xlsx.txt
So far the “fix” is to open in Excel/Windows, re-save it, then open with pandas/xlrd. Not sure of other options at this point.
@jbwhit this is c). we use the
xlrd
library to read these files. So this is an upstream to pandas issue. You can file a report there.I found a solution here https://stackoverflow.com/a/24476655/12692651
To clarify, there is no code in pandas itself for actually reading the excel file, we only rely on third-party libraries, and I don’t think we are going the change that. But it’s not that we don’t want to be able to read such format. If somebody writes a library (with similar api as xlrd) that can read it, we might consider relying upon it.
@vermastuti you seem to misunderstand my report. The key part is starting with my example Excel spreadsheet, namely, go to this link: http://www.eia.gov/coal/data.cfm#production then click the year 2012, and download the file named
coalpublic2012.xls
.Pandas cannot read it.
Your code does not read that file. Here’s the relevant error when using your code (today):
Thank you man. You are my hero.