pandas: BUG: read_csv, sep=None, wrong separator guessing in case of one column csv

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import csv
import pandas as pd

one_col_csv = """HEADER
hzhrzhjtrj
rzthyy
azr231
zrhtzrh
zrhtzrh"""

with open("dummy.csv", "w") as file:
    file.write(one_col_csv)

print(pd.read_csv("dummy.csv", sep=None))
#Bad detection of the separator, but no error

# According to the doc, csv sniffer is used in backend
# So I also tested it:
with open("dummy.csv") as fp:
    print(csv.Sniffer().sniff(fp.read(5000)).delimiter)
    # Throw an error "Could not determine delimiter" (which is better)

Issue Description

When pd.reader is used with the parameter sep=None, it failed to automaticaly detect signle column csv.

I have already read this conversation (2016): https://github.com/pandas-dev/pandas/issues/13839 They had the same problem but come to the conclusion that the error comes csv sniffer python library.

But I think this is no longer relevant as csv.sniffer now raise an error when pandas still insist on parsing it wrong.

Expected Behavior

It should automatically detect it is a single column file, or at least raise an error.

Installed Versions

INSTALLED VERSIONS

commit : 37ea63d540fd27274cad6585082c91b1283f963d python : 3.8.16.final.0 python-bits : 64 OS : Linux OS-release : 5.19.0-41-generic Version : #42~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Tue Apr 18 17:40:00 UTC 2 machine : x86_64 processor : x86_64 byteorder : little LC_ALL : None LANG : en_US.UTF-8 LOCALE : en_US.UTF-8

pandas : 2.0.1 numpy : 1.23.0 pytz : 2023.3 dateutil : 2.8.2 setuptools : 67.6.1 pip : 23.1 Cython : 0.29.34 pytest : None hypothesis : None sphinx : None blosc : None feather : None xlsxwriter : None lxml.etree : 4.9.2 html5lib : None pymysql : None psycopg2 : None jinja2 : 3.1.2 IPython : 8.12.0 pandas_datareader: None bs4 : None bottleneck : None brotli : None fastparquet : None fsspec : None gcsfs : None matplotlib : 3.7.1 numba : 0.56.4 numexpr : None odfpy : None openpyxl : None pandas_gbq : None pyarrow : None pyreadstat : None pyxlsb : None s3fs : None scipy : 1.10.0 snappy : None sqlalchemy : 2.0.9 tables : None tabulate : None xarray : None xlrd : None zstandard : None tzdata : 2023.3 qtpy : None pyqt5 : None

About this issue

  • Original URL
  • State: open
  • Created a year ago
  • Comments: 22 (14 by maintainers)

Commits related to this issue

Most upvoted comments

I discovered this beautiful library “CleverCSV” that may solve our(my) problem: https://github.com/alan-turing-institute/CleverCSV

It may be interesting to integrate it in pandas directly or using it as backend to read csv instead of the build-in csv library…

I think the best compromise may be something like passing the 30 first line of the csv file to the sniffer. Something like this:

import csv
import pandas as pd


one_col_csv = """HEADER
hzhrzhjtrj
rzthyy
azr231
zrhtzrh
zrhtzrh"""


with open("dummy.csv", "w") as file:
    file.write(one_col_csv)

def read_csv(path: str, sep: str, **kwargs) -> pd.DataFrame:
    
    N_LINE_FOR_SNIFFER = 30 # We take the 3àyh first lines to sniff
    
    if sep is None:
        with open(path) as file:
            line_samples = "\n".join(file.readlines(N_LINE_FOR_SNIFFER))
            try:
                sep = csv.Sniffer().sniff(line_samples).delimiter
            except Exception as e:
                if str(e)=="Could not determine delimiter":
                    sep = "," # Presume it is a single column file, so define a random separator
                else:
                    raise Exception(e)
            finally:
                return pd.read_csv(path, sep=sep, **kwargs)
    else:
        return pd.read_csv(path, sep=sep, **kwargs)
    
#read_csv("dummy.csv", sep=None)

print(pd.read_csv("dummy.csv", sep=None))

It will hugely increase the reliability of the sniffer, while still being fast as we didn’t read all the file… What do you think about this alternative?

(I really think that sep=None is a very convenient feature that should be improve instead of deleted…)

@DeaMariaLeon yes, I do.

Not a fan of the sniffing,

that said exploring passing the first header + 1 lines might be interesting. Should definitely clarify the docs

Passing in the entire csv is also not performant. Also it’s very easy to come up with a csv that would still cause the sniffer to give you the current results.

For example, for the below contrived example - it still outputs 'E' as the delimiter:

In [1]: s = '''HEADER
   ...: THEREWAS'''

In [2]: import csv

In [3]: csv.Sniffer().sniff(s).delimiter
Out[3]: 'E'

Hence my suggestion to either

  1. explicitly document the fact that only the first valid line of the csv is passed to csv.Sniffer to determine the delimiter, or
  2. deprecate and eventually remove sep=None as an option - but I don’t know if pandas uses this internally anywhere else.

The docs for read_csv does mention that it uses the builtin csv.Sniffer

If sep is None, the C engine cannot automatically detect the separator, but the Python parsing engine can, meaning the latter will be used and automatically detect the separator by Python’s builtin sniffer tool, csv.Sniffer.

I would say this behaves as expected. A line stating that only the first valid line is passed on to csv.Sniffer could probably be added to the docs.