pandas: BUG: `read_stata` always uses 'utf8'

Code Sample, a copy-pastable example if possible

import pandas
data = pandas.read_stata(file_with_latin1_encoding, chunksize=1048576)
for chunk in data:
    pass # do something with chunk (never reached)

This raises `UnicodeDecodeError: 'utf8' codec can't decode byte 0x?? in position ?: invalid start byte`. OK, so the file isn't a UTF-8 one. That is surprising in itself, since the StataReader doesn't claim any Unicode support. I then try to open it with an explicit latin-1 encoding:

import pandas
data = pandas.read_stata(file_with_latin1_encoding, chunksize=1048576, encoding='latin-1')
for chunk in data:
    pass # do something with chunk (never reached)

This raises the same exception at exactly the same place (still utf-8).
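
The underlying failure can be illustrated without a Stata file. The snippet below is only a sketch: the byte string is made up for illustration and does not come from the file in this report. It shows that latin-1 bytes outside the ASCII range are generally not valid UTF-8, while decoding them as latin-1 succeeds.

```python
# Made-up sample text encoded as latin-1: the bytes are b'Caf\xe9 \xae'.
raw = u"Caf\u00e9 \u00ae".encode("latin-1")

# Decoding latin-1 bytes as UTF-8 fails, because 0xe9 starts a multi-byte
# sequence in UTF-8 and the following byte is not a continuation byte.
try:
    raw.decode("utf-8")
    utf8_failed = False
except UnicodeDecodeError:
    utf8_failed = True

# Decoding with the encoding the bytes were written in works.
text = raw.decode("latin-1")

print("utf-8 decode failed:", utf8_failed)
```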

Problem description

This is a problem because read_stata appears not to honour the encoding argument. I think this line introduced a bug. The StataReader doesn't handle any encoding other than ascii or latin-1.

Changing line 1338 of the pandas.io.stata module:

        return s.decode('utf-8')

to:

        return s.decode('latin-1')

seemed to make everything work, and I could read the data from the given file. Even better, changing it to the following:

        return s.decode(self._encoding or self._default_encoding)

also seems to have made it work.
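
As a rough illustration of that fallback, here is a minimal sketch. `TinyReader` is a hypothetical stand-in written for this example, not pandas' StataReader; only the attribute names `_encoding` and `_default_encoding` are taken from the line above, and the choice of latin-1 as the default is an assumption.

```python
class TinyReader(object):
    """Hypothetical stand-in for a reader; this is not pandas' StataReader."""

    # Assumed default for this sketch only.
    _default_encoding = "latin-1"

    def __init__(self, encoding=None):
        # Encoding requested by the caller, or None if not given.
        self._encoding = encoding

    def _decode(self, s):
        # Prefer the caller's encoding; fall back to the default otherwise.
        return s.decode(self._encoding or self._default_encoding)


default_reader = TinyReader()
utf8_reader = TinyReader(encoding="utf-8")

from_latin1 = default_reader._decode(b"\xe9")
from_utf8 = utf8_reader._decode(u"\u00e9".encode("utf-8"))
```

The point of the pattern is that a user-supplied encoding is never silently replaced by a hard-coded one.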

I believe, though, that to make this work with Unicode too, you'd have to add the following encodings to VALID_ENCODINGS: utf-8, utf8, iso10646.
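
An alternative to listing every spelling of an encoding name is to normalize aliases with the standard library. The sketch below is not how pandas validates encodings; `is_supported` and its `supported` tuple are hypothetical names chosen for illustration. It relies only on `codecs.lookup`, which resolves aliases such as `utf8` and `L1` to a canonical codec name.

```python
import codecs


def is_supported(encoding, supported=("ascii", "latin-1", "utf-8")):
    """Return True if `encoding` is an alias of one of `supported`.

    Hypothetical helper for illustration; the `supported` tuple is an
    assumption, not pandas' VALID_ENCODINGS.
    """
    try:
        canonical = codecs.lookup(encoding).name
    except LookupError:
        # Unknown codec name.
        return False
    allowed = set(codecs.lookup(name).name for name in supported)
    return canonical in allowed


checks = {
    name: is_supported(name)
    for name in ("utf8", "utf-8", "L1", "iso-8859-1", "cp1252", "no-such-codec")
}
```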

Expected Output

The file should be read and parsed correctly.

Output of `pd.show_versions()`

INSTALLED VERSIONS
------------------
commit: None
python: 2.7.12.final.0
python-bits: 64
OS: Linux
OS-release: 4.10.0-37-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: ro_RO.UTF-8
LANG: ro_RO.UTF-8
LOCALE: None.None

pandas: 0.24.0.dev0+41.gb2eec25
pytest: 3.2.3
pip: 9.0.3
setuptools: 36.6.0
Cython: 0.28.2
numpy: 1.13.3
scipy: 1.0.0
pyarrow: None
xarray: None
IPython: 5.1.0
sphinx: 1.6.3
patsy: None
dateutil: 2.7.3
pytz: 2017.3
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: None
openpyxl: 2.4.9
xlrd: 1.0.0
xlwt: 1.3.0
xlsxwriter: None
lxml: 3.8.0
bs4: None
html5lib: 0.999999999
sqlalchemy: 1.1.13
pymysql: None
psycopg2: None
jinja2: 2.9.6
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

About this issue

State: closed
Created 6 years ago
Reactions: 2
Comments: 25 (12 by maintainers)

Commits related to this issue

BUG: Fix handling of encoding for the StataReader #21244 — committed to adrian-castravete/pandas by deleted user 6 years ago

Most upvoted comments

I am still having issues with this. I'm using a Stata file in format 118, and I'm getting the same UnicodeDecodeError. When I edit the stata.py file to use latin-1, as suggested by @adrian-castravete, everything works.

hudcap on Feb 24, 2019