pandas: read_csv C-engine CParserError: Error tokenizing data
Hi,
I have encountered a dataset that the C-engine read_csv cannot parse. I am unsure of the exact cause, but I have narrowed it down to a single row, which I have pickled and uploaded to Dropbox. If you obtain the pickle, try the following:
import pandas as pd
df = pd.read_pickle('faulty_row.pkl')
df.to_csv('faulty_row.csv', encoding='utf8', index=False)
pd.read_csv('faulty_row.csv', encoding='utf8')  # default C engine
I get the following exception:
CParserError: Error tokenizing data. C error: Buffer overflow caught - possible malformed input file.
If you try to read the CSV using the Python engine, no exception is thrown:
pd.read_csv('faulty_row.csv', encoding='utf8', engine='python')
This suggests that the issue is with read_csv and not to_csv. The versions I am using are:
INSTALLED VERSIONS
------------------
commit: None
python: 2.7.10.final.0
python-bits: 64
OS: Linux
OS-release: 3.19.0-28-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_GB.UTF-8
pandas: 0.16.2
nose: 1.3.7
Cython: 0.22.1
numpy: 1.9.2
scipy: 0.15.1
IPython: 3.2.1
patsy: 0.3.0
tables: 3.2.0
numexpr: 2.4.3
matplotlib: 1.4.3
openpyxl: 1.8.5
xlrd: 0.9.3
xlwt: 1.0.0
xlsxwriter: 0.7.3
lxml: 3.4.4
bs4: 4.3.2
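The engine comparison in the report can be wrapped in a small helper. This is a minimal sketch, not part of the original report: the function name read_csv_with_fallback is hypothetical, and it assumes a recent pandas release in which tokenizing failures are raised as pandas.errors.ParserError (the 0.16.2 traceback above names it CParserError).

```python
import pandas as pd
from pandas.errors import ParserError  # assumption: recent pandas; older releases used CParserError


def read_csv_with_fallback(path, **kwargs):
    """Try the default C engine first; retry with the Python engine if tokenizing fails.

    Hypothetical helper for illustration. Do not pass engine= in kwargs.
    """
    try:
        return pd.read_csv(path, **kwargs)
    except ParserError:
        # The Python engine is slower but was reported to read the faulty row.
        return pd.read_csv(path, engine="python", **kwargs)
```

If both engines fail, the second exception propagates unchanged, so a genuinely malformed file is still reported.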
About this issue
- State: closed
- Created 9 years ago
- Reactions: 22
- Comments: 19 (3 by maintainers)
I missed @alfonsomhc's answer because it just looked like a comment.
You need
Your second-to-last line includes a '\r' break. I think it's a bug, but one workaround is to open the file in universal-newline mode.
I have also found this issue when reading a large CSV file with the default engine. If I use engine='python' then it works fine.
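The '\r' workaround can be sketched with the standard library alone. This is an illustration under assumptions, not the commenter's code: both function names are hypothetical, and the sketch uses Python 3, where open(..., newline=None) enables universal newlines (Python 2 used the 'rU' mode instead).

```python
def find_bare_cr_lines(path):
    """Return 1-based numbers of LF-terminated lines that contain a lone carriage return."""
    hits = []
    with open(path, "rb") as fh:
        # Iterating a binary file splits on LF only, so a stray CR stays inside its line.
        for lineno, raw in enumerate(fh, start=1):
            body = raw.rstrip(b"\n")
            if body.endswith(b"\r"):
                body = body[:-1]  # a CR directly before LF is an ordinary CRLF ending
            if b"\r" in body:
                hits.append(lineno)
    return hits


def read_text_universal(path, encoding="utf8"):
    """Read the whole file with universal newlines, so CR and CRLF both become LF."""
    with open(path, newline=None, encoding=encoding) as fh:
        return fh.read()
```

The returned text can then be handed to pd.read_csv through io.StringIO. Note that translating a stray carriage return into a newline splits that line in two, so the resulting row count may differ from what the file's author intended.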
+1
I had the same issue when trying to read a folder rather than a CSV file.
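A guard against the folder case reported here might look like the following. It is only a sketch: csv_paths is a hypothetical helper, and the "*.csv" pattern is an assumption about how the files in the folder are named.

```python
import glob
import os


def csv_paths(path):
    """Return the CSV files to read: the file itself, or every *.csv inside a folder."""
    if os.path.isdir(path):
        # A folder was passed; expand it instead of handing it to read_csv.
        return sorted(glob.glob(os.path.join(path, "*.csv")))
    if os.path.isfile(path):
        return [path]
    raise FileNotFoundError(path)
```

Each returned path can then be passed to pd.read_csv individually, which makes it obvious which input is a directory before the parser is involved.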
@joshlk Did you use the names=range(2504) option as described in the comment above?
I tried this approach and was able to load large data files. But when I checked the dimensions of the DataFrame, I saw that the number of rows had increased. What could be the logical reasons for that?
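One way to investigate the row count is to count records independently of pandas. The sketch below uses the standard csv module; field_count_histogram is a hypothetical name, and the approach assumes the file uses the default comma delimiter and quoting rules.

```python
import csv
from collections import Counter


def field_count_histogram(path, encoding="utf8"):
    """Map number of fields -> number of records, as parsed by the csv module."""
    counts = Counter()
    # newline="" lets the csv module handle line endings itself.
    with open(path, newline="", encoding=encoding) as fh:
        for record in csv.reader(fh):
            counts[len(record)] += 1
    return counts
```

The largest key shows how many column names a names=range(N) call needs, and sum(counts.values()) gives a record count to compare with len(df). One possible reason for an extra row, offered as a guess: when names is passed, read_csv behaves as if header=None, so the file's header line is read as a data row unless header=0 is also given.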