pandas: Pandas' read_excel, ExcelFile, failing to open some .xls files.

I am trying to read in .xls files from http://www.eia.gov/coal/data.cfm#production – specifically the Historical detailed coal production data (1983-2013) coalpublic2012.xls file that’s freely available via the dropdown. Pandas cannot read it.

In contrast, the file for the most recent year available, 2013, coalpublic2013.xls file, works without a problem:

import pandas as pd
df1 = pd.read_excel("coalpublic2013.xls")

but the next decade of .xls files (2004-2012) do not load. I have looked at these files with Excel, and they open, and are not corrupted.

The error that I get from pandas is:

XLRDError                                 Traceback (most recent call last)
<ipython-input-28-0da33766e9d2> in <module>()
----> 1 df = pd.read_excel("coalpublic2012.xlsx")

/Users/jonathan/anaconda/lib/python2.7/site-packages/pandas/io/excel.pyc in read_excel(io, sheetname, header, skiprows, skip_footer, index_col, parse_cols, parse_dates, date_parser, na_values, thousands, convert_float, has_index_names, converters, engine, **kwds)
    161 
    162     if not isinstance(io, ExcelFile):
--> 163         io = ExcelFile(io, engine=engine)
    164 
    165     return io._parse_excel(

/Users/jonathan/anaconda/lib/python2.7/site-packages/pandas/io/excel.pyc in __init__(self, io, **kwds)
    204                 self.book = xlrd.open_workbook(file_contents=data)
    205             else:
--> 206                 self.book = xlrd.open_workbook(io)
    207         elif engine == 'xlrd' and isinstance(io, xlrd.Book):
    208             self.book = io

/Users/jonathan/anaconda/lib/python2.7/site-packages/xlrd/__init__.pyc in open_workbook(filename, logfile, verbosity, use_mmap, file_contents, encoding_override, formatting_info, on_demand, ragged_rows)
    433         formatting_info=formatting_info,
    434         on_demand=on_demand,
--> 435         ragged_rows=ragged_rows,
    436         )
    437     return bk

/Users/jonathan/anaconda/lib/python2.7/site-packages/xlrd/book.pyc in open_workbook_xls(filename, logfile, verbosity, use_mmap, file_contents, encoding_override, formatting_info, on_demand, ragged_rows)
     89         t1 = time.clock()
     90         bk.load_time_stage_1 = t1 - t0
---> 91         biff_version = bk.getbof(XL_WORKBOOK_GLOBALS)
     92         if not biff_version:
     93             raise XLRDError("Can't determine file's BIFF version")

/Users/jonathan/anaconda/lib/python2.7/site-packages/xlrd/book.pyc in getbof(self, rqd_stream)
   1228             bof_error('Expected BOF record; met end of file')
   1229         if opcode not in bofcodes:
-> 1230             bof_error('Expected BOF record; found %r' % self.mem[savpos:savpos+8])
   1231         length = self.get2bytes()
   1232         if length == MY_EOF:

/Users/jonathan/anaconda/lib/python2.7/site-packages/xlrd/book.pyc in bof_error(msg)
   1222         if DEBUG: print("reqd: 0x%04x" % rqd_stream, file=self.logfile)
   1223         def bof_error(msg):
-> 1224             raise XLRDError('Unsupported format, or corrupt file: ' + msg)
   1225         savpos = self._position
   1226         opcode = self.get2bytes()

XLRDError: Unsupported format, or corrupt file: Expected BOF record; found '<?xml ve'```

And I have tried various other things: 

```df = pd.ExcelFile("coalpublic2012.xls", encoding_override='cp1252')
import xlrd
wb = xlrd.open_workbook("coalpublic2012.xls")

to no avail. My pandas version: 0.17.0

About this issue

  • Original URL
  • State: closed
  • Created 9 years ago
  • Comments: 28 (9 by maintainers)

Most upvoted comments

Hi, I had faced the same xlrd.biffh.XLRDError: Unsupported format, or corrupt file: Expected BOF record; error and solved it by writing an XML to XLSX converter.


import pandas as pd
from bs4 import BeautifulSoup

def convert_to_xlsx():
    with open('sample.xls') as xml_file:
        soup = BeautifulSoup(xml_file.read(), 'xml')
        writer = pd.ExcelWriter('sample.xlsx')
        for sheet in soup.findAll('Worksheet'):
            sheet_as_list = []
            for row in sheet.findAll('Row'):
                sheet_as_list.append([cell.Data.text if cell.Data else '' for cell in row.findAll('Cell')])
            pd.DataFrame(sheet_as_list).to_excel(writer, sheet_name=sheet.attrs['ss:Name'], index=False, header=False)

        writer.save()

@jbwhit I have run the following code:

import pandas as pd
#Read and write to excel
dataFileUrl = r"/Users/stutiverma/Downloads/coalpublic2012.xls"
data = pd.read_table(dataFileUrl)

This reads the file successfully without giving any error. But, it gives all the data in the exact format as mentioned. So, you may have to do extra efforts in order to process the data after reading it successfully.

I was receiving “XLRDError: Unsupported format, or corrupt file: Expected BOF record; found b’<head><M’”

The function pandas.read_html function worked for me, it returns a list of dataframes:

(df, ) = pd.read_html('file.xls')

@darshanlol If you follow the various threads, you’ll find that there are valid Excel files that cannot be read by Pandas, and that no one thinks this is a bug.

  • Pandas support will say that it’s an xlrd problem, not a pandas problem, and will close (this) thread
  • xlrd here will say, “the file has been saved as “XML Spreadsheet (*.xml)” i.e. NOT in XLS or XLSX format, not supported by xlrd”, and will close the thread
  • And if you happen to email eia.gov and ask them to provide their data in a different Excel format (or csv), they will simply ignore your request

As far as I can tell it’s a “won’t fix” situation by everyone involved.

Although a non-Python solution is not very satisfying, this is an issue your students will face “in the wild”. If one has access to a Windows 10 OS, PowerShell can be used to iterate over each of the 1997-2003 excel XML formatted files and convert them to the current .xlsx format. This is a quick and dirty solution I resort to in my day-to-day work. An example is attached convertXls2Xlsx.txt

So far the “fix” is to open in Excel/Windows, re-save it, then open with pandas/xlrd. Not sure of other options at this point.

@jbwhit this is c). we use the xlrd library to read these files. So this is an upstream to pandas issue. You can file a report there.

Pandas support will say that it’s an xlrd problem, not a pandas problem, and will close (this) thread

To clarify, there is no code in pandas itself for actually reading the excel file, we only rely on third-party libraries, and I don’t think we are going the change that. But it’s not that we don’t want to be able to read such format. If somebody writes a library (with similar api as xlrd) that can read it, we might consider relying upon it.

@vermastuti you seem to misunderstand my report. The key part is starting with my example Excel spreadsheet, namely, go to this link: http://www.eia.gov/coal/data.cfm#production then click the year 2012, and download the file named coalpublic2012.xls.

Pandas cannot read it.

Your code does not read that file. Here’s the relevant error when using your code (today):

import pandas as pd
dataFileUrl = "coalpublic2012.xls"
data = pd.read_excel(dataFileUrl)

---------------------------------------------------------------------------
XLRDError                                 Traceback (most recent call last)
<ipython-input-6-e8da944daef2> in <module>()
      1 import pandas as pd
      2 dataFileUrl = "/Users/jonathan/junk/coalpublic2012.xls"
----> 3 data = pd.read_excel(dataFileUrl)

~/miniconda3/envs/dspy3/lib/python3.5/site-packages/pandas/util/_decorators.py in wrapper(*args, **kwargs)
    116                 else:
    117                     kwargs[new_arg_name] = new_arg_value
--> 118             return func(*args, **kwargs)
    119         return wrapper
    120     return _deprecate_kwarg

~/miniconda3/envs/dspy3/lib/python3.5/site-packages/pandas/io/excel.py in read_excel(io, sheet_name, header, skiprows, skip_footer, index_col, names, usecols, parse_dates, date_parser, na_values, thousands, convert_float, converters, dtype, true_values, false_values, engine, squeeze, **kwds)
    228 
    229     if not isinstance(io, ExcelFile):
--> 230         io = ExcelFile(io, engine=engine)
    231 
    232     return io._parse_excel(

~/miniconda3/envs/dspy3/lib/python3.5/site-packages/pandas/io/excel.py in __init__(self, io, **kwds)
    292             self.book = xlrd.open_workbook(file_contents=data)
    293         elif isinstance(self._io, compat.string_types):
--> 294             self.book = xlrd.open_workbook(self._io)
    295         else:
    296             raise ValueError('Must explicitly set engine if not passing in'

~/miniconda3/envs/dspy3/lib/python3.5/site-packages/xlrd/__init__.py in open_workbook(filename, logfile, verbosity, use_mmap, file_contents, encoding_override, formatting_info, on_demand, ragged_rows)
    160         formatting_info=formatting_info,
    161         on_demand=on_demand,
--> 162         ragged_rows=ragged_rows,
    163         )
    164     return bk

~/miniconda3/envs/dspy3/lib/python3.5/site-packages/xlrd/book.py in open_workbook_xls(filename, logfile, verbosity, use_mmap, file_contents, encoding_override, formatting_info, on_demand, ragged_rows)
     89         t1 = time.clock()
     90         bk.load_time_stage_1 = t1 - t0
---> 91         biff_version = bk.getbof(XL_WORKBOOK_GLOBALS)
     92         if not biff_version:
     93             raise XLRDError("Can't determine file's BIFF version")

~/miniconda3/envs/dspy3/lib/python3.5/site-packages/xlrd/book.py in getbof(self, rqd_stream)
   1269             bof_error('Expected BOF record; met end of file')
   1270         if opcode not in bofcodes:
-> 1271             bof_error('Expected BOF record; found %r' % self.mem[savpos:savpos+8])
   1272         length = self.get2bytes()
   1273         if length == MY_EOF:

~/miniconda3/envs/dspy3/lib/python3.5/site-packages/xlrd/book.py in bof_error(msg)
   1263         if DEBUG: print("reqd: 0x%04x" % rqd_stream, file=self.logfile)
   1264         def bof_error(msg):
-> 1265             raise XLRDError('Unsupported format, or corrupt file: ' + msg)
   1266         savpos = self._position
   1267         opcode = self.get2bytes()

XLRDError: Unsupported format, or corrupt file: Expected BOF record; found b'<?xml ve'

@jbwhit I have run the following code:

import pandas as pd
#Read and write to excel
dataFileUrl = r"/Users/stutiverma/Downloads/coalpublic2012.xls"
data = pd.read_table(dataFileUrl)

This reads the file successfully without giving any error. But, it gives all the data in the exact format as mentioned. So, you may have to do extra efforts in order to process the data after reading it successfully.

Thank you man. You are my hero.