duckdb: InternalException: INTERNAL Error: Something went wrong when reading the CSV file, more positions than columns
What happens?
I can’t read this CSV file: its rows have a varying number of columns (130 at most). I manually defined the schema with all 130 columns but could not make it work. It reads fine with pandas using these options:
import pandas as pd

# Path and filename are defined earlier in my script.
df = pd.read_csv(Path + filename,
                 skiprows=1,
                 dtype=str,
                 engine='python',
                 names=range(130))
Attached is a zipped example: PUBLIC_DAILY_202311170000_20231118040504.zip
To Reproduce
I am running this query:
import duckdb
duckdb.sql("""from read_csv("PUBLIC_DAILY_202311170000_20231118040504.CSV",
Skip=1,
header =0,
sample_size=-1,
all_varchar=1,
columns={
'C1': 'VARCHAR',
'C2': 'VARCHAR',
'C3': 'VARCHAR',
'C4': 'VARCHAR',
'C5': 'VARCHAR',
'C6': 'VARCHAR',
'C7': 'VARCHAR',
'C8': 'VARCHAR',
'C9': 'VARCHAR',
'C10': 'VARCHAR',
'C11': 'VARCHAR',
'C12': 'VARCHAR',
'C13': 'VARCHAR',
'C14': 'VARCHAR',
'C15': 'VARCHAR',
'C16': 'VARCHAR',
'C17': 'VARCHAR',
'C18': 'VARCHAR',
'C19': 'VARCHAR',
'C20': 'VARCHAR',
'C21': 'VARCHAR',
'C22': 'VARCHAR',
'C23': 'VARCHAR',
'C24': 'VARCHAR',
'C25': 'VARCHAR',
'C26': 'VARCHAR',
'C27': 'VARCHAR',
'C28': 'VARCHAR',
'C29': 'VARCHAR',
'C30': 'VARCHAR',
'C31': 'VARCHAR',
'C32': 'VARCHAR',
'C33': 'VARCHAR',
'C34': 'VARCHAR',
'C35': 'VARCHAR',
'C36': 'VARCHAR',
'C37': 'VARCHAR',
'C38': 'VARCHAR',
'C39': 'VARCHAR',
'C40': 'VARCHAR',
'C41': 'VARCHAR',
'C42': 'VARCHAR',
'C43': 'VARCHAR',
'C44': 'VARCHAR',
'C45': 'VARCHAR',
'C46': 'VARCHAR',
'C47': 'VARCHAR',
'C48': 'VARCHAR',
'C49': 'VARCHAR',
'C50': 'VARCHAR',
'C51': 'VARCHAR',
'C52': 'VARCHAR',
'C53': 'VARCHAR',
'C54': 'VARCHAR',
'C55': 'VARCHAR',
'C56': 'VARCHAR',
'C57': 'VARCHAR',
'C58': 'VARCHAR',
'C59': 'VARCHAR',
'C60': 'VARCHAR',
'C61': 'VARCHAR',
'C62': 'VARCHAR',
'C63': 'VARCHAR',
'C64': 'VARCHAR',
'C65': 'VARCHAR',
'C66': 'VARCHAR',
'C67': 'VARCHAR',
'C68': 'VARCHAR',
'C69': 'VARCHAR',
'C70': 'VARCHAR',
'C71': 'VARCHAR',
'C72': 'VARCHAR',
'C73': 'VARCHAR',
'C74': 'VARCHAR',
'C75': 'VARCHAR',
'C76': 'VARCHAR',
'C77': 'VARCHAR',
'C78': 'VARCHAR',
'C79': 'VARCHAR',
'C80': 'VARCHAR',
'C81': 'VARCHAR',
'C82': 'VARCHAR',
'C83': 'VARCHAR',
'C84': 'VARCHAR',
'C85': 'VARCHAR',
'C86': 'VARCHAR',
'C87': 'VARCHAR',
'C88': 'VARCHAR',
'C89': 'VARCHAR',
'C90': 'VARCHAR',
'C91': 'VARCHAR',
'C92': 'VARCHAR',
'C93': 'VARCHAR',
'C94': 'VARCHAR',
'C95': 'VARCHAR',
'C96': 'VARCHAR',
'C97': 'VARCHAR',
'C98': 'VARCHAR',
'C99': 'VARCHAR',
'C100': 'VARCHAR',
'C101': 'VARCHAR',
'C102': 'VARCHAR',
'C103': 'VARCHAR',
'C104': 'VARCHAR',
'C105': 'VARCHAR',
'C106': 'VARCHAR',
'C107': 'VARCHAR',
'C108': 'VARCHAR',
'C109': 'VARCHAR',
'C110': 'VARCHAR',
'C111': 'VARCHAR',
'C112': 'VARCHAR',
'C113': 'VARCHAR',
'C114': 'VARCHAR',
'C115': 'VARCHAR',
'C116': 'VARCHAR',
'C117': 'VARCHAR',
'C118': 'VARCHAR',
'C119': 'VARCHAR',
'C120': 'VARCHAR',
'C121': 'VARCHAR',
'C122': 'VARCHAR',
'C123': 'VARCHAR',
'C124': 'VARCHAR',
'C125': 'VARCHAR',
'C126': 'VARCHAR',
'C127': 'VARCHAR',
'C128': 'VARCHAR',
'C129': 'VARCHAR',
'C130': 'VARCHAR'
},
ignore_errors=true,
auto_detect=false) """).show()
OS:
Linux
DuckDB Version:
duckdb-0.9.3.dev2938
DuckDB Client:
Python
Full Name:
mim
Affiliation:
Personal
Have you tried this on the latest main branch?
I have tested with a main build
Have you tried the steps to reproduce? Do they include all relevant data and configuration? Does the issue you report still appear there?
- Yes, I have
About this issue
- State: closed
- Created 5 months ago
- Comments: 23 (7 by maintainers)
different Pedro, I suppose 😃
Yeah, I thought this line would never be hit because it would be caught by the sniffer first. That’s why I set it as an internal error, but of course you can hit it if you disable the sniffer. I will change it to a proper error message with suggestions like using null_padding or ignore_errors.
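For reference, a minimal sketch of those workarounds, assuming the file attached above; skip, header, all_varchar, null_padding, and ignore_errors are existing read_csv options:

import duckdb

# null_padding=true pads rows that have fewer columns than expected with NULLs,
# while ignore_errors=true skips rows that still fail to parse.
duckdb.sql("""from read_csv('PUBLIC_DAILY_202311170000_20231118040504.CSV',
    skip=1,
    header=false,
    all_varchar=true,
    null_padding=true,
    ignore_errors=true)""").show()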
@pdet Thanks, it is back to normal; the latest nightly release fixes it.
@pdet Actually it works fine; the error was from a previous write. Thanks for fixing the issue.
I guess that’s when you read Parquet files generated by the COPY TO? If so, it’s potentially an issue with either the Parquet writer or reader; I’ll have a look in the coming days 😃
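A minimal sketch of the round-trip being discussed; the table and file names here are hypothetical placeholders, not taken from the report:

import duckdb

# Write a table out with COPY ... TO in Parquet format, then read it back.
duckdb.sql("CREATE TABLE t AS SELECT * FROM range(10)")
duckdb.sql("COPY t TO 'out.parquet' (FORMAT PARQUET)")
duckdb.sql("FROM 'out.parquet'").show()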
Thanks again for all the testing here Mimoune, highly appreciated!
@pedroerp thanks, it works now
@djouallah Thanks for giving it a try. This is an interesting case for sure.
To diminish our back-and-forth communication, could you provide me with your whole workload?
Hi @djouallah, thanks for reporting this. In your original code (which I’ve since edited) you referred to PUBLIC_DAILY_201804050000_20180406040501.CSV, but the attached file contains PUBLIC_DAILY_202311170000_20231118040504.CSV. The latter file does not reproduce the issue for me. Can you please double-check?