pandas: BUG: `read_stata` always uses 'utf8'
Code Sample, a copy-pastable example if possible
import pandas
data = pandas.read_stata(file_with_latin1_encoding, chunksize=1048576)
for chunk in data:
    pass  # do something with chunk (never reached)
This raises UnicodeDecodeError: 'utf8' codec can't decode byte 0x?? in position ?: invalid start byte.
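The same class of error can be reproduced without any Stata file. This is a minimal sketch using a hand-written byte string (an assumption for illustration, not data from the file in question): a byte that is a perfectly good latin-1 character is rejected by the UTF-8 decoder.

```python
# 0xA3 is the pound sign in latin-1. In UTF-8 it can only appear as a
# continuation byte, so it is not valid at the start of a character.
raw = b"\xa3100"

try:
    raw.decode("utf-8")
    utf8_ok = True
    error_reason = None
except UnicodeDecodeError as exc:
    utf8_ok = False
    error_reason = exc.reason  # human-readable cause reported by the codec

# latin-1 maps every byte value to a code point, so this cannot fail.
latin1_text = raw.decode("latin-1")

print(utf8_ok, error_reason, latin1_text)
```

Because latin-1 decoding never fails, a reader that hard-codes UTF-8 will raise on such files while one that uses latin-1 will not.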
OK, so the file isn't UTF-8. Although the StataReader doesn't claim any Unicode support, I then tried opening it with a latin-1 encoding:
import pandas
data = pandas.read_stata(file_with_latin1_encoding, chunksize=1048576, encoding='latin-1')
for chunk in data:
    pass  # do something with chunk (never reached)
This raises the same exception at exactly the same place (still utf-8).
Problem description
This is a problem because it appears that read_stata doesn't honour the encoding argument.
I think this line introduced a bug. The StataReader doesn't handle any encoding other than ascii or latin-1.
Changing line 1338 of the pandas.io.stata module:
return s.decode('utf-8')
to:
return s.decode('latin-1')
seemed to make everything work, and I could read the data from the given file. Even better, changing it to the following:
return s.decode(self._encoding or self._default_encoding)
also seems to have made it work.
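To make the intent of that one-line change concrete, here is a minimal sketch of the fallback logic in isolation. `TinyReader` and its `_decode` method are hypothetical stand-ins written for this example; only the attribute names `_encoding` and `_default_encoding` are taken from the line above, and the real StataReader does considerably more.

```python
class TinyReader:
    """Hypothetical stand-in for the reader; not the pandas class."""

    # Assumed default for this sketch, matching the latin-1 fix above.
    _default_encoding = "latin-1"

    def __init__(self, encoding=None):
        # Encoding requested by the caller, or None if not given.
        self._encoding = encoding

    def _decode(self, s):
        # Prefer the caller's encoding; fall back to the default otherwise.
        return s.decode(self._encoding or self._default_encoding)


# With no encoding given, bytes are read as latin-1.
default_result = TinyReader()._decode(b"\xa3")

# With an explicit encoding, that encoding is honoured.
utf8_result = TinyReader("utf-8")._decode(b"\xc2\xa3")

print(default_result == utf8_result)
```

Both calls yield the same pound sign, each from the byte sequence appropriate to its encoding, which is the behaviour the hard-coded 'utf-8' prevents.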
I believe, though, that if you want this to work with Unicode too, you'd have to add the following encodings to VALID_ENCODINGS: utf-8, utf8, iso10646.
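Listing every spelling of an encoding by hand is easy to get wrong. One possible alternative, sketched here as an assumption rather than as what pandas does, is to normalise names with the standard library's `codecs.lookup` before checking them against an allow-list. The `ALLOWED` tuple and both helper functions are hypothetical; the real VALID_ENCODINGS may differ.

```python
import codecs

# Hypothetical allow-list for this sketch only.
ALLOWED = ("ascii", "latin-1", "utf-8")


def normalise(name):
    # codecs.lookup resolves aliases and case to one canonical codec name.
    return codecs.lookup(name).name


ALLOWED_CANONICAL = {normalise(name) for name in ALLOWED}


def is_allowed(name):
    try:
        return normalise(name) in ALLOWED_CANONICAL
    except LookupError:
        # Python does not recognise this name as a codec at all.
        return False


print(is_allowed("utf8"), is_allowed("latin1"), is_allowed("cp1252"))
```

With this approach 'utf-8' and 'utf8' are treated as the same encoding, so only one of them needs to appear in the list.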
Expected Output
The file should be correctly read and parsed
Output of pd.show_versions()
pandas: 0.24.0.dev0+41.gb2eec25
pytest: 3.2.3
pip: 9.0.3
setuptools: 36.6.0
Cython: 0.28.2
numpy: 1.13.3
scipy: 1.0.0
pyarrow: None
xarray: None
IPython: 5.1.0
sphinx: 1.6.3
patsy: None
dateutil: 2.7.3
pytz: 2017.3
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: None
openpyxl: 2.4.9
xlrd: 1.0.0
xlwt: 1.3.0
xlsxwriter: None
lxml: 3.8.0
bs4: None
html5lib: 0.999999999
sqlalchemy: 1.1.13
pymysql: None
psycopg2: None
jinja2: 2.9.6
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
About this issue
- Original URL
- State: closed
- Created 6 years ago
- Reactions: 2
- Comments: 25 (12 by maintainers)
Commits related to this issue
- BUG: Fix handling of encoding for the StataReader #21244 — committed to adrian-castravete/pandas by deleted user 6 years ago
I am still having issues with this. I'm using a Stata file in format 118, and I'm getting the same UnicodeDecodeError. When I edit the stata.py file to use latin-1 as per @adrian-castravete, everything works.