pandas: read_csv "CParserError: Error tokenizing data" with variable number of fields

I am having trouble with read_csv (Pandas 0.17.0) when trying to read a 380+ MB csv file. The file starts with 54 fields but some lines have 53 fields instead of 54. Running the below code gives me the following error:

parser = lambda x: datetime.strptime(x, '%y %m %d %H %M %S %f')
df = pd.read_csv(filename,
                         names=['YR', 'MO', 'DAY', 'HR', 'MIN', 'SEC', 'HUND',
                                'ERROR', 'RECTYPE', 'LANE', 'SPEED', 'CLASS',
                                'LENGTH', 'GVW', 'ESAL', 'W1', 'S1', 'W2', 'S2',
                                'W3', 'S3', 'W4', 'S4', 'W5', 'S5', 'W6', 'S6',
                                'W7', 'S7', 'W8', 'S8', 'W9', 'S9', 'W10', 'S10',
                                'W11', 'S11', 'W12', 'S12', 'W13', 'S13', 'W14'],
                        usecols=range(0, 42),
                        parse_dates={"TIMESTAMP": [0, 1, 2, 3, 4, 5, 6]},
                        date_parser=parser,
                        header=None)

Error:

CParserError: Error tokenizing data. C error: Expected 53 fields in line 1605634, saw 54

If I pass the error_bad_lines=False keyword, problematic lines are displayed similar to the example below:

Skipping line 1683401: expected 53 fields, saw 54

however I get the following error this time ( also the DataFrame does not get loaded):

CParserError: Too many columns specified: expected 54 and found 53

If I pass the engine='python' keyword, I do not get any errors, but it takes a really long time to parse the data. Please note that 53 and 54 are switched in the error messages depending on if error_bad_lines=False is used or not.

About this issue

  • Original URL
  • State: closed
  • Created 9 years ago
  • Comments: 21 (4 by maintainers)

Most upvoted comments

Try this:

  df = pd.read_csv(filename,header=None,error_bad_lines=False)

Per further investigation, following sequence of commands works (I lose the first line of the data -no header=None present-, but at least it loads):

df = pd.read_csv(filename, 
                 usecols=range(0, 42))
df.columns = ['YR', 'MO', 'DAY', 'HR', 'MIN', 'SEC', 'HUND',
                        'ERROR', 'RECTYPE', 'LANE', 'SPEED', 'CLASS',
                        'LENGTH', 'GVW', 'ESAL', 'W1', 'S1', 'W2', 'S2',
                        'W3', 'S3', 'W4', 'S4', 'W5', 'S5', 'W6', 'S6',
                        'W7', 'S7', 'W8', 'S8', 'W9', 'S9', 'W10', 'S10',
                        'W11', 'S11', 'W12', 'S12', 'W13', 'S13', 'W14']

Following does NOT work:

df = pd.read_csv(filename,
                 names=['YR', 'MO', 'DAY', 'HR', 'MIN', 'SEC', 'HUND',
                        'ERROR', 'RECTYPE', 'LANE', 'SPEED', 'CLASS',
                        'LENGTH', 'GVW', 'ESAL', 'W1', 'S1', 'W2', 'S2',
                        'W3', 'S3', 'W4', 'S4', 'W5', 'S5', 'W6', 'S6',
                        'W7', 'S7', 'W8', 'S8', 'W9', 'S9', 'W10', 'S10',
                        'W11', 'S11', 'W12', 'S12', 'W13', 'S13', 'W14'],
                 usecols=range(0, 42))

CParserError: Error tokenizing data. C error: Expected 53 fields in line 1605634, saw 54

Following does NOT work:

df = pd.read_csv(filename,
                 header=None)

CParserError: Error tokenizing data. C error: Expected 53 fields in line 1605634, saw 54

If engine='python' is used, curiously, it loads the DataFrame without any hiccups though. I used the following snippet to extract the first 3 lines in the file and 3 of the offending lines (got the line numbers from the error message).

from csv import reader
N = int(input('What line do you need? > '))
with open(filename) as f:
    print(next((x for i, x in enumerate(reader(f)) if i == N), None))

Lines 1-3:

['08', '8', '7', '5', '0', '12', '54', '0', '11', '1', '58', '9', '68', '48.2', '0.756', '11.6', '17.5', '13.3', '4.3', '11.3', '32.2', '6.4', '4.1', '5.6', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '', '', '', '', '', '', '', '', '', '', '', '32']
['08', '8', '7', '5', '0', '15', '80', '0', '11', '1', '62', '9', '69', '77.8', '3.267', '11.2', '17.7', '14.8', '4.2', '15.2', '29.1', '18.4', '10.0', '18.1', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '', '', '', '', '', '', '', '', '', '', '', '32']
['08', '8', '7', '5', '0', '21', '52', '0', '11', '1', '61', '11', '51', '29.4', '0.076', '4.1', '13.8', '8.3', '21.5', '5.3', '3.1', '5.7', '3.0', '6.1', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '', '', '', '', '', '', '', '', '', '', '', '32']

Offending lines:

['09', '9', '15', '22', '46', '9', '51', '0', '11', '1', '57', '9', '70', '36.3', '0.242', '11.8', '16.2', '6.4', '4.1', '5.8', '31.3', '5.5', '3.9', '6.8', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '', '', '', '', '', '', '', '', '', '', '', '32']
['09', '9', '15', '22', '46', '25', '31', '0', '11', '1', '70', '9', '73', '67.8', '2.196', '10.4', '17.0', '13.4', '4.4', '12.2', '31.8', '15.6', '4.2', '16.2', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '', '', '', '', '', '', '', '', '', '', '', '32']
['09', '9', '15', '22', '46', '28', '41', '0', '11', '1', '70', '5', '22', '7.4', '0.003', '4.0', '13.1', '3.4', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '', '', '', '', '', '', '', '', '', '', '', '32']

As you suggested, I will try to read the file, then modify the DataFrame (rename columns, delete unnecessary ones etc.) or simply use the python engine (long processing time).

If you will try to import .xlsx file using the command pd.read_csv, you will get this error.

Try using pd.read_excel instead of pd.read_csv

error_bad_lines=False it works

I got the same error

I just add the delimiter parameter in read_csv mode

and it worked

pd.read_csv(filename, header=None) gives the following error:

CParserError: Error tokenizing data. C error: Expected 53 fields in line 1605634, saw 54

is expected as the number of columns is inferred from the first line. If you pass names if will use this as a determining feature.

So keep trying various options. You are constraining it a bit much actually with names and usecols. You might be better off reading it in, then reindexing to what you need.

The easiest way to fix it is to transform your CSV file to Excel file and use pd.read_excel instead of pd.read_csv for read the data

pd.read_csv(filename, header=None) gives the following error:

CParserError: Error tokenizing data. C error: Expected 53 fields in line 1605634, saw 54

is expected as the number of columns is inferred from the first line. If you pass names if will use this as a determining feature.

So keep trying various options. You are constraining it a bit much actually with names and usecols. You might be better off reading it in, then reindexing to what you need.

This works! I write csv using R language and try to read it in python. The first line should have the maximum length for all lines. This way will fix the problem of bad lines and will not lose any lines.