pandas: BUG: read_csv, sep=None, wrong separator guessing in case of one column csv
Pandas version checks
-
I have checked that this issue has not already been reported.
-
I have confirmed this bug exists on the latest version of pandas.
-
I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
import csv
import pandas as pd
one_col_csv = """HEADER
hzhrzhjtrj
rzthyy
azr231
zrhtzrh
zrhtzrh"""
with open("dummy.csv", "w") as file:
file.write(one_col_csv)
print(pd.read_csv("dummy.csv", sep=None))
#Bad detection of the separator, but no error
# According to the doc, csv sniffer is used in backend
# So I also tested it:
with open("dummy.csv") as fp:
print(csv.Sniffer().sniff(fp.read(5000)).delimiter)
# Throw an error "Could not determine delimiter" (which is better)
Issue Description
When pd.reader is used with the parameter sep=None, it failed to automaticaly detect signle column csv.
I have already read this conversation (2016): https://github.com/pandas-dev/pandas/issues/13839 They had the same problem but come to the conclusion that the error comes csv sniffer python library.
But I think this is no longer relevant as csv.sniffer now raise an error when pandas still insist on parsing it wrong.
Expected Behavior
It should automatically detect it is a single column file, or at least raise an error.
Installed Versions
INSTALLED VERSIONS
commit : 37ea63d540fd27274cad6585082c91b1283f963d python : 3.8.16.final.0 python-bits : 64 OS : Linux OS-release : 5.19.0-41-generic Version : #42~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Tue Apr 18 17:40:00 UTC 2 machine : x86_64 processor : x86_64 byteorder : little LC_ALL : None LANG : en_US.UTF-8 LOCALE : en_US.UTF-8
pandas : 2.0.1 numpy : 1.23.0 pytz : 2023.3 dateutil : 2.8.2 setuptools : 67.6.1 pip : 23.1 Cython : 0.29.34 pytest : None hypothesis : None sphinx : None blosc : None feather : None xlsxwriter : None lxml.etree : 4.9.2 html5lib : None pymysql : None psycopg2 : None jinja2 : 3.1.2 IPython : 8.12.0 pandas_datareader: None bs4 : None bottleneck : None brotli : None fastparquet : None fsspec : None gcsfs : None matplotlib : 3.7.1 numba : 0.56.4 numexpr : None odfpy : None openpyxl : None pandas_gbq : None pyarrow : None pyreadstat : None pyxlsb : None s3fs : None scipy : 1.10.0 snappy : None sqlalchemy : 2.0.9 tables : None tabulate : None xarray : None xlrd : None zstandard : None tzdata : 2023.3 qtpy : None pyqt5 : None
About this issue
- Original URL
- State: open
- Created a year ago
- Comments: 22 (14 by maintainers)
Commits related to this issue
- docs: #53035 clarify the behavior of sep=None. — committed to Toroi0610/pandas by Toroi0610 a year ago
- DOC: #53035 clarify the behavior of sep=None. (#53153) docs: #53035 clarify the behavior of sep=None. — committed to pandas-dev/pandas by Toroi0610 a year ago
- DOC: #53035 clarify the behavior of sep=None. (#53153) docs: #53035 clarify the behavior of sep=None. — committed to Rylie-W/pandas by Toroi0610 a year ago
- DOC: #53035 clarify the behavior of sep=None. (#53153) docs: #53035 clarify the behavior of sep=None. — committed to Daquisu/pandas by Toroi0610 a year ago
I discovered this beautiful library “CleverCSV” that may solve our(my) problem: https://github.com/alan-turing-institute/CleverCSV
It may be interesting to integrate it in pandas directly or using it as backend to read csv instead of the build-in csv library…
I think the best compromise may be something like passing the 30 first line of the csv file to the sniffer. Something like this:
It will hugely increase the reliability of the sniffer, while still being fast as we didn’t read all the file… What do you think about this alternative?
(I really think that sep=None is a very convenient feature that should be improve instead of deleted…)
@DeaMariaLeon yes, I do.
Not a fan of the sniffing,
that said exploring passing the first header + 1 lines might be interesting. Should definitely clarify the docs
Passing in the entire csv is also not performant. Also it’s very easy to come up with a csv that would still cause the sniffer to give you the current results.
For example, for the below contrived example - it still outputs
'E'
as the delimiter:Hence my suggestion to either
csv.Sniffer
to determine the delimiter, orsep=None
as an option - but I don’t know if pandas uses this internally anywhere else.The docs for
read_csv
does mention that it uses the builtincsv.Sniffer
I would say this behaves as expected. A line stating that only the first valid line is passed on to
csv.Sniffer
could probably be added to the docs.