pandas: Pandas read_sas error: 'ascii' codec can't decode byte 0xd8 in position 0: ordinal not in range(128)
Hello everybody,
I am using Pandas 0.18 to open a sas7bdat dataset.
I simply use:
df=pd.read_sas('P:/myfile.sas7bdat')
and I get the following error
buf[0:text_block_size].rstrip(b"\x00 ").decode())
UnicodeDecodeError: 'ascii' codec can't decode byte 0xd8 in position 0: ordinal not in range(128)
If I use
import sys
reload(sys)
sys.setdefaultencoding("utf-8")
I get
UnicodeDecodeError: 'utf8' codec can't decode byte 0xd8 in position 0: invalid continuation byte
Other sas7bdat files in my folder are handled just fine by Pandas.
When I open the file in SAS I see that the column names are very long and span several lines, but otherwise the files look just fine.
There are not so many possible options in read_sas… what should I do? Is this a bug in read_sas?
Many thanks!
About this issue
- State: closed
- Created 8 years ago
- Comments: 22 (10 by maintainers)
According to the docs below, depending on the setting of the VALIDVARNAME option, variable names may be either restricted to ASCII, or may be arbitrary bytes to be decoded somehow:
http://support.sas.com/documentation/cdl/en/lrcon/68089/HTML/default/viewer.htm#p18cdcs4v5wd2dn1q0x296d3qek6.htm
I’m not sure if this VALIDVARNAME (which I have never heard of before) is in the file somewhere, or is an option that you specify within the session. In any case, it appears that the column names may need to be decoded.
Also relevant:
http://support.sas.com/documentation/cdl/en/nlsref/61893/HTML/default/viewer.htm#a002601944.htm
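To see why both error messages point at the same byte, here is a minimal sketch that decodes the reported byte 0xd8 under a few codecs. It assumes the file was written in a Latin-1 or Windows-1252 SAS session, which is only a guess based on the byte value; the helper name try_decode is made up for illustration.

```python
# The single byte reported in both tracebacks.
raw = b"\xd8"


def try_decode(data, encoding):
    """Return the decoded text, or a short failure note (hypothetical helper)."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError as exc:
        return "failed: " + exc.reason


# ascii rejects any byte >= 0x80; utf-8 treats 0xd8 as the start of a
# two-byte sequence, so on its own it is invalid; latin1 and cp1252 both
# map 0xd8 to the letter Ø (U+00D8).
results = {enc: try_decode(raw, enc) for enc in ("ascii", "utf-8", "latin1", "cp1252")}

for enc, outcome in results.items():
    print(enc, "->", outcome)
```

If the column names really contain characters such as Ø, switching the default encoding to utf-8 cannot help, which matches the second error in the report.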
On Wed, Apr 6, 2016 at 8:27 AM, Tom Augspurger notifications@github.com wrote:
I believe the encoding parameter is just used to decode text data in the actual DataFrame itself, and not the metadata like column headers. Does that sound correct @kshedden?
Ahh, that looks promising though. Does
buf[0:text_block_size].rstrip(b"\x00 ").decode('latin1')
work there?
Although, that might not go well with the bit stripping there…
Can you drop into the debugger after it raises the error? Run
%debug
if you’re in IPython. Then you can see what’s going on.
The docstring says encoding is just for decoding string columns (the actual values), so perhaps it isn’t being applied to decoding the column names.
Are you able to share that file, or a similar file with non-sensitive data that raises the same error?
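The decode call from the traceback can also be tried in isolation, outside pandas. The sketch below is not the pandas implementation: decode_header_text, the list of candidate encodings, and the sample bytes are all made up for illustration. Only the slice-and-rstrip expression is taken from the traceback.

```python
def decode_header_text(buf, text_block_size, encodings=("ascii", "utf-8", "latin1")):
    """Strip NUL/space padding, then try each candidate encoding in turn.

    Hypothetical helper; returns the decoded text and the encoding that worked.
    """
    # Same slice and padding strip as the line shown in the traceback.
    raw = buf[0:text_block_size].rstrip(b"\x00 ")
    for enc in encodings:
        try:
            return raw.decode(enc), enc
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings could decode the header text")


# Made-up header block: a Latin-1 encoded name, padding, then unrelated bytes.
sample = b"\xd8kt_v\xe6rdi\x00\x00  extra"
text, used = decode_header_text(sample, 12)
print(text, used)
```

Because latin1 maps every byte to a character, it never raises, so it only makes sense as the last candidate. A successful decode does not prove the guess is right, so the resulting names should be compared with what SAS displays.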