duckdb: InternalException: INTERNAL Error: Something went wrong when reading the CSV file, more positions than columns

What happens?

I can’t read this CSV file: its rows have differing numbers of columns (up to 130). I manually defined the schema with all 130 columns, but I could not make it work. It reads fine with pandas using the following options:

import pandas as pd

# Path and filename are assumed to be defined elsewhere in the workload.
df = pd.read_csv(Path + filename,
    skiprows=1,
    dtype=str,
    engine='python',    # the python engine is more tolerant of ragged rows
    names=range(130))   # force a fixed 130-column schema

Attached is a zipped example: PUBLIC_DAILY_202311170000_20231118040504.zip

To Reproduce

I am running this query:

import duckdb

# Build the 130-column schema programmatically; every column is VARCHAR.
columns = ",\n".join(f"'C{i}': 'VARCHAR'" for i in range(1, 131))

duckdb.sql(f"""
    FROM read_csv('PUBLIC_DAILY_202311170000_20231118040504.CSV',
        skip=1,
        header=false,
        sample_size=-1,
        all_varchar=true,
        columns={{{columns}}},
        ignore_errors=true,
        auto_detect=false)
""").show()

OS:

Linux

DuckDB Version:

duckdb-0.9.3.dev2938

DuckDB Client:

Python

Full Name:

mim

Affiliation:

Personal

Have you tried this on the latest main branch?

I have tested with a main build

Have you tried the steps to reproduce? Do they include all relevant data and configuration? Does the issue you report still appear there?

  • Yes, I have

About this issue

  • Original URL
  • State: closed
  • Created 5 months ago
  • Comments: 23 (7 by maintainers)

Most upvoted comments

@pedroerp thanks, it works now

different Pedro, I suppose 😃

Yeah, I thought this line would never be hit because it would be caught by the sniffer first. That’s why I made it an internal error, but of course you can hit it if you disable the sniffer. I will change it to a proper error message with suggestions such as using null_padding or ignore_errors.
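For anyone hitting the same error before that fix lands, the suggested workaround would look roughly like this (a minimal sketch against the attached file; null_padding pads short rows with NULLs instead of erroring):

import duckdb

# Sketch: pad rows that have fewer columns with NULLs, and skip any
# rows that still cannot be parsed.
duckdb.sql("""
    FROM read_csv('PUBLIC_DAILY_202311170000_20231118040504.CSV',
        all_varchar=true,
        null_padding=true,
        ignore_errors=true)
""").show()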

@pdet Thanks, it is back to normal, the latest nightly release fixes it.

@pdet Actually it works fine, the error was from a previous write. Thanks for fixing the issue.

@pdet Using the latest nightly on Windows (since the Linux build is broken), the original error is fixed, but I get a new error with the Parquet files it produces:

InvalidInputException: Invalid Input Error: No magic bytes found at end of file 'C:\Users\mim\Desktop\R2\export\scada\file_4312b18d-6d7c-41a9-b14f-fd026e169708.parquet'
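As a side note, a well-formed Parquet file both begins and ends with the 4-byte magic PAR1, so this error usually means the file was truncated or only partially written. A quick way to check (a minimal sketch; the path is the one from the error message):

from pathlib import Path

def has_parquet_magic(path):
    # A valid Parquet file starts and ends with the bytes b"PAR1".
    data = Path(path).read_bytes()
    return data[:4] == b"PAR1" and data[-4:] == b"PAR1"

print(has_parquet_magic(r'C:\Users\mim\Desktop\R2\export\scada\file_4312b18d-6d7c-41a9-b14f-fd026e169708.parquet'))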

I guess that’s when you read Parquet files generated by the COPY TO? If so, it’s potentially an issue with either the Parquet writer or reader; I’ll have a look in the coming days 😃

Thanks again for all the testing here Mimoune, highly appreciated!

@djouallah Thanks for giving it a try. This is an interesting case for sure.

To cut down on back-and-forth, could you provide me with your whole workload?

Hi @djouallah, thanks for reporting this. In your original code (which I’ve since edited) you referred to PUBLIC_DAILY_201804050000_20180406040501.CSV, but the attached file contains PUBLIC_DAILY_202311170000_20231118040504.CSV. The latter file does not reproduce the issue for me. Can you please double-check?