duckdb: invalid unicode in segments statistics
I've been working in R with the duckdb package with no issues so far, but while trying to load a big dataframe into the database I got an error I cannot understand:
Error: TransactionContext Error: Failed to commit: INTERNAL Error: Invalid unicode detected in segment statistics update!
My guess is that the dataframe is too wide (635 columns) or too long (500,000 rows).
Windows 10, duckdb JDBC connection (0.24 driver), RStudio.
About this issue
- Original URL
- State: closed
- Created 3 years ago
- Reactions: 1
- Comments: 62 (9 by maintainers)
Thanks for reporting - could you open a new issue in the new duckdb-r repo?
For those who read this issue: I too ran into it, and when working with R I could clean up a list of dataframes like so:
After this the error disappeared.
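The commenter's exact cleanup snippet isn't preserved above. In the same spirit, here is a hedged Python sketch (my own illustration, not the commenter's code) of scrubbing string values down to valid UTF-8 before handing rows to DuckDB; the helper name is hypothetical:

```python
def to_valid_utf8(value):
    """Round-trip a string through UTF-8, replacing anything that cannot
    be encoded (e.g. lone surrogates left over from mis-decoded bytes).
    Hypothetical helper for illustration only."""
    if not isinstance(value, str):
        return value
    return value.encode("utf-8", errors="replace").decode("utf-8")

# A lone surrogate such as '\udce9' is the kind of value Python produces
# when raw cp1252 bytes are decoded with errors="surrogateescape".
rows = [{"name": "café"}, {"name": "caf\udce9"}]
cleaned = [{k: to_valid_utf8(v) for k, v in row.items()} for row in rows]
print(cleaned)  # [{'name': 'café'}, {'name': 'caf?'}]
```

Applying something like this column by column to each dataframe in a list would match the cleanup described above, under the assumption that the offending values were strings that were not valid UTF-8.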
Or I should say…now I got a different error. =)
Same query I posted above now causes an OOM exception. Full report here: https://github.com/duckdb/duckdb/issues/5315
Could you retry this with the very latest master? I suspect this issue has been fixed by #5263.
Also maybe worth mentioning: the failure only happens if the query is inside a `CREATE TABLE` statement. The query by itself works. If anyone's interested I can share the full dataset, although it'll be slightly unwieldy (~10 GB, probably).
Thanks for the reproducible example! I will have a look at fixing this for the next release.
FWIW, I’ve tried to reproduce the error on a Windows 11 machine by creating an input csv populated with a variety of values from the big list of naughty strings.
I’ve been unable to reproduce the error. I think the ‘problem’ (i.e. the reason it is not causing an error) is probably that my csv file, whilst containing various unicode characters, is correctly encoded in utf-8. But I’ve experimented by incorrectly saving unicode characters in Windows-1252 encoding as well, with no luck in reproducing the problem either.
My best guess at the moment is that this is something to do with Microsoft SQL Server, which (I believe), usually stores data in Windows-1252 (cp1252) encoding, and how this data gets loaded into Pandas. But this is pure guesswork based on encoding errors I’ve run into in the past interacting with MS SQL Server, and the fact that all of our users who’ve seen this error so far have sourced the data from MS SQL Server.
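To illustrate the suspected mismatch: text stored in Windows-1252 is generally not valid UTF-8, so if such bytes reach DuckDB untranslated they could plausibly trip its unicode validation. A minimal Python demonstration (my own illustration, not from the thread):

```python
# 'é' is the single byte 0xE9 in cp1252, but in UTF-8 that byte is only
# legal as a continuation byte -- so the cp1252 bytes are invalid UTF-8.
raw = "café".encode("cp1252")
assert raw == b"caf\xe9"

try:
    raw.decode("utf-8")
    decoded_ok = True
except UnicodeDecodeError:
    decoded_ok = False
print("valid UTF-8?", decoded_ok)  # valid UTF-8? False

# Decoding with the right codec round-trips cleanly.
assert raw.decode("cp1252") == "café"
```

This would be consistent with the pattern reported here: data exported from a cp1252 source and passed along as raw bytes looks fine to the exporting tools but fails strict UTF-8 validation downstream.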
We’ve had reports of this problem from quite a few users of Splink (a Python library which uses DuckDB as a backend). Sadly I haven’t got a minimal reproducible example (being on Mac, I’ve never seen the error myself), but a common theme is that these errors occur on Windows, with data sourced from Microsoft SQL Server. I will do my best to obtain a reprex.
https://github.com/moj-analytical-services/splink/issues/664
@GitHunter0 Thanks for providing a reproducible example, I can confirm the issue at least in the latest CRAN release.
That is not a known bug. Could you please file a bug report? Thanks!