pandas: Pandas read_sas error: 'ascii' codec can't decode byte 0xd8 in position 0: ordinal not in range(128)

Hello everybody,

I am using pandas 0.18 to open a sas7bdat dataset.

I simply use:

    import pandas as pd
    df = pd.read_sas('P:/myfile.sas7bdat')

and I get the following error:

    buf[0:text_block_size].rstrip(b"\x00 ").decode())

UnicodeDecodeError: 'ascii' codec can't decode byte 0xd8 in position 0: ordinal not in range(128)

If I use

    # Python 2 only: reload(sys) restores setdefaultencoding, which site.py removes at startup
    import sys
    reload(sys)
    sys.setdefaultencoding("utf-8")

I get

UnicodeDecodeError: 'utf8' codec can't decode byte 0xd8 in position 0: invalid continuation byte
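
For context, a minimal sketch of why both attempts fail on the same byte: 0xd8 is outside the ASCII range, and in UTF-8 it can only start a multi-byte sequence, so it is rejected unless a valid continuation byte follows. A single-byte codec such as Latin-1 accepts it. Whether this particular file is really Latin-1 encoded is an assumption; nothing in the error output confirms it.

```python
raw = b"\xd8"  # the byte reported in both tracebacks


def try_decode(data, codec):
    """Return the decoded text, or None if the codec rejects the bytes."""
    try:
        return data.decode(codec)
    except UnicodeDecodeError:
        return None


# Try the two codecs from the error messages plus one single-byte codec.
results = {codec: try_decode(raw, codec) for codec in ("ascii", "utf-8", "latin-1")}
print(results)
```

Only the Latin-1 attempt returns text here (the character U+00D8); the other two return None.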

Other sas7bdat files in my folder are handled just fine by Pandas.

When I open the file in SAS I see that the column names are very long and span several lines, but otherwise the files look just fine.

There are not so many possible options in read_sas… what should I do? Is this a bug in read_sas?

Many thanks!

About this issue

State: closed
Created 8 years ago
Comments: 22 (10 by maintainers)

Most upvoted comments

According to the docs below, depending on the setting of the VALIDVARNAME option, variable names may be either restricted to ASCII, or may be arbitrary bytes to be decoded somehow:

http://support.sas.com/documentation/cdl/en/lrcon/68089/HTML/default/viewer.htm#p18cdcs4v5wd2dn1q0x296d3qek6.htm

I’m not sure if this VALIDVARNAME setting (which I have never heard of before) is stored in the file somewhere, or is an option that you specify within the session. In any case, it appears that the column names may need to be decoded.

Also relevant:

http://support.sas.com/documentation/cdl/en/nlsref/61893/HTML/default/viewer.htm#a002601944.htm

On Wed, Apr 6, 2016 at 8:27 AM, Tom Augspurger wrote:

I believe the encoding parameter is just used to decode text data in the actual DataFrame itself, and not the metadata like column headers. Does that sound correct @kshedden?

View it on GitHub: https://github.com/pydata/pandas/issues/12809#issuecomment-206347584

kshedden on Apr 6, 2016

I believe the encoding parameter is just used to decode text data in the actual DataFrame itself, and not the metadata like column headers. Does that sound correct @kshedden ?

TomAugspurger on Apr 6, 2016

Ahh, that looks promising though. Does buf[0:text_block_size].rstrip(b"\x00 ").decode('latin1') work there?

Although, that might not go well with the bit stripping there…
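
A minimal sketch of that suggestion on a made-up buffer (the bytes and the name are hypothetical, not taken from the reporter's file). It applies the same expression as in the traceback with an explicit codec, and then shows one case where stripping raw bytes before decoding would go wrong.

```python
# Hypothetical text block: a Latin-1 encoded name padded with NULs and spaces.
buf = b"\xd8konomi" + b"\x00" * 4 + b"  "
text_block_size = len(buf)

# Same expression as in the traceback, but with an explicit codec.
name = buf[0:text_block_size].rstrip(b"\x00 ").decode("latin1")
print(name)

# Stripping raw bytes before decoding is only safe for single-byte codecs.
# In UTF-16-LE, for example, the trailing NUL is half of the character "A".
utf16 = "A".encode("utf-16-le")   # two bytes: 0x41 0x00
damaged = utf16.rstrip(b"\x00 ")  # one byte left, no longer valid UTF-16
print(utf16, damaged)
```

This may be the kind of problem the remark about stripping has in mind, though that is a guess.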

TomAugspurger on Apr 6, 2016

Can you drop into the debugger after it raises the error? Use %debug if you’re in IPython. Then you can see what’s going on.

The docstring says encoding is just for decoding string columns (the actual values), so perhaps it isn’t being applied to decoding the column names.
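
If an interactive debugger is not convenient, the exception object itself carries much of the same information. A minimal sketch, using a hypothetical byte string in place of the real text block; the helper name is made up for illustration:

```python
def describe_decode_error(data, codec="ascii"):
    """Decode data and, on failure, report where and why it failed."""
    try:
        data.decode(codec)
    except UnicodeDecodeError as exc:
        return {
            "codec": exc.encoding,
            "position": exc.start,
            "bad_bytes": exc.object[exc.start:exc.end],
            "reason": exc.reason,
        }
    return None  # decoding succeeded, nothing to report


# Hypothetical column-name bytes starting with the byte from the traceback.
info = describe_decode_error(b"\xd8konomi")
print(info)
```

Printing the offending bytes this way makes it easier to guess which encoding the file was written in.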

TomAugspurger on Apr 6, 2016

Are you able to share that file, or a similar file with non-sensitive data that raises the same error?

TomAugspurger on Apr 6, 2016