duckdb: CSV file that parses with 0.7.1 fails with 0.8.0 and `master`

What happens?

A CSV file that parsed successfully in 0.7.1 fails to parse in 0.8.0 and on current `master`.

To Reproduce

master:

❯ build/release/duckdb --version
v0.8.1-dev253 0d946c04df
❯ curl -O https://gist.githubusercontent.com/pybokeh/281229dae21a5786a161394f181928e5/raw/12ea7f0ed4ed59a7d8b6c7eafacfd6bd76dcebde/CrashStatistics.csv
❯ build/release/duckdb <<< "select * from read_csv_auto('CrashStatistics.csv')"
Error: near line 1: Invalid Input Error: Error in file "/home/cloud/data/CrashStatistics.csv" on line 3219: quote should be followed by end of value, end of row or another quote. (  file=/home/cloud/data/CrashStatistics.csv
  delimiter=',' (auto detected)
  quote='"' (auto detected)
  escape='"' (auto detected)
  header=1 (auto detected)
  sample_size=20480
  ignore_errors=0
  all_varchar=0).

0.7.1:

❯ build/release/duckdb --version
v0.7.1 b00b93f0b1
❯ build/release/duckdb <<< "select * from read_csv_auto('CrashStatistics.csv')"
... a bunch of table output ...

OS:

Linux x86_64

DuckDB Version:

0.8.0

DuckDB Client:

CLI, Python

Full Name:

Phillip Cloud

Affiliation:

Voltron Data

Have you tried this on the latest `master` branch?

I agree

Have you tried the steps to reproduce? Do they include all relevant data and configuration? Does the issue you report still appear there?

I agree

About this issue

State: closed
Created a year ago
Comments: 15 (11 by maintainers)

Most upvoted comments

The problem here is that we detect the wrong options for quotes.

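Because detecting CSV options is a combinatorial problem, we don’t run the detection over the full configured sample size; we only run it on a single chunk. If the first chunk does not contain the rows that distinguish one quote setting from another, the detector can settle on options that fail later in the file.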
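
To illustrate the failure mode, here is a toy sketch in Python using only the standard `csv` module. It is not DuckDB's sniffer: the candidate lists, the `detect` and `parses_cleanly` helpers, and the sample rows are all made up for the example. It tries every delimiter/quote combination and keeps the first one that parses cleanly, once on a "first chunk" and once on the whole sample.

```python
import csv
from itertools import product

# Hypothetical candidate options; a real sniffer considers more.
DELIMITERS = [",", ";"]
QUOTES = ['"', None]  # None stands for "no quoting"


def parses_cleanly(lines, delimiter, quote):
    """True if every row parses strictly and all rows have the same width (> 1)."""
    kwargs = {"delimiter": delimiter, "strict": True}
    if quote is None:
        kwargs["quoting"] = csv.QUOTE_NONE
    else:
        kwargs["quotechar"] = quote
    try:
        widths = {len(row) for row in csv.reader(lines, **kwargs)}
    except csv.Error:
        # e.g. a closing quote followed by something other than a delimiter
        return False
    return len(widths) == 1 and widths.pop() > 1


def detect(lines):
    """Nested-loop style detection: return the first combination that works."""
    for delimiter, quote in product(DELIMITERS, QUOTES):
        if parses_cleanly(lines, delimiter, quote):
            return delimiter, quote
    return None


rows = [
    "id,description",
    "1,plain text",
    "2,more plain text",
    '3,"big" truck',  # only this late row rules out quote='"'
]
first_chunk = rows[:3]

print(detect(first_chunk))  # (',', '"')
print(detect(rows))         # (',', None)
```

Options chosen from the first chunk alone look fine until the last row, where strict parsing with `quote='"'` fails because the closing quote is followed by a space rather than a delimiter or end of row.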

The workaround here is to set the `quote` option manually instead of relying on auto-detection.

To properly fix this issue, we have to run the option detector on the full sample size; the reason we don’t do that currently is that it’s not very efficient. I think we can eliminate the combinatorial explosion (i.e., the nested loops over all the different combinations of options) by applying the state machine idea described in #7213 to the option detector. We would then have to maintain one state machine per option combination, but we can also prune candidates early, based on the same heuristics we currently use in the sniffer.

I’ll start working on that next Monday.
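
A minimal Python sketch of that idea, under stated assumptions: this is not the design from #7213 or DuckDB code, and the `Candidate` class, the `sniff` function, and the pruning rules (drop a candidate on a malformed quote or an inconsistent column count) are hypothetical stand-ins for the real heuristics. Each option combination gets its own small state machine, the sample is read once, and every character is fed to all surviving candidates.

```python
from dataclasses import dataclass
from itertools import product
from typing import Optional


@dataclass
class Candidate:
    """One state machine for one (delimiter, quote) combination."""
    delimiter: str
    quote: Optional[str]          # None means "no quoting"
    state: str = "field_start"    # field_start | unquoted | quoted | quote_seen
    columns: int = 1              # columns seen so far in the current row
    expected: Optional[int] = None  # column count of the first row
    alive: bool = True

    def feed(self, ch: str) -> None:
        if self.state == "quoted":
            if ch == self.quote:
                self.state = "quote_seen"
            return
        if self.state == "quote_seen":
            if ch == self.quote:  # doubled quote inside a quoted value
                self.state = "quoted"
                return
            if ch != self.delimiter and ch != "\n":
                # Same rule as the error in this issue: a quote should be
                # followed by end of value, end of row or another quote.
                self.alive = False
                return
        if ch == self.delimiter:
            self.columns += 1
            self.state = "field_start"
        elif ch == "\n":
            if self.expected is None:
                self.expected = self.columns
            elif self.columns != self.expected:
                self.alive = False  # inconsistent row width
            self.columns = 1
            self.state = "field_start"
        elif ch == self.quote and self.state == "field_start":
            self.state = "quoted"
        else:
            self.state = "unquoted"


def sniff(text, delimiters=(",", ";"), quotes=('"', None)):
    """Single pass over text; rows are assumed to end with a newline."""
    candidates = [Candidate(d, q) for d, q in product(delimiters, quotes)]
    for ch in text:
        for cand in candidates:
            cand.feed(ch)
        # Early prune: dead candidates are never fed again.
        candidates = [c for c in candidates if c.alive]
        if not candidates:
            break
    # Keep only candidates that split rows into more than one column.
    return [(c.delimiter, c.quote) for c in candidates
            if c.expected is not None and c.expected > 1]


sample = 'id,description\n1,plain text\n2,more plain text\n3,"big" truck\n'
print(sniff(sample))  # [(',', None)]
```

The sample is scanned once regardless of how many combinations are tried, and the per-character cost shrinks as candidates are pruned, which is what makes running detection over the full sample affordable in this sketch.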

pdet on Jun 2, 2023

Here’s the CSV in question: https://gist.github.com/pybokeh/281229dae21a5786a161394f181928e5

gforsyth on Jun 1, 2023