duckdb: invalid unicode in segments statistics

I have been working in R with the DuckDB package with no issues so far, but while trying to load a big dataframe into the database I got an error I cannot understand:

Error: TransactionContext Error: Failed to commit: INTERNAL Error: Invalid unicode detected in segment statistics update!

My guess is that the dataframe is too wide (635 columns) or too long (500,000 rows).

Windows 10, DuckDB JDBC driver 0.24, RStudio.

About this issue

  • State: closed
  • Created 3 years ago
  • Reactions: 1
  • Comments: 62 (9 by maintainers)

Most upvoted comments

Thanks for reporting - could you open a new issue in the new duckdb-r repo?

For those who read this issue: I ran into it too, and when working in R I could clean up a list of dataframes like so:

  library(purrr)
  library(dplyr)

  my_dataframes |>
    map(function(df) df |> mutate(across(
      where(is.character),
      # force every character column to valid UTF-8
      function(x) stringi::stri_encode(x, to = "UTF-8")
    )))

After this the error disappeared.
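For context on why re-encoding helps: DuckDB requires strings to be valid UTF-8, and even a single byte carried over from a legacy single-byte encoding such as cp1252 breaks that. A minimal stdlib Python sketch of the same cleanup idea (`to_valid_utf8` and the sample rows are illustrative, not part of any library):

```python
def to_valid_utf8(value, fallback="cp1252"):
    """Return value as a valid-UTF-8 str, re-decoding stray bytes if needed."""
    if isinstance(value, bytes):
        try:
            return value.decode("utf-8")
        except UnicodeDecodeError:
            # Bytes are not valid UTF-8: assume a single-byte legacy encoding.
            return value.decode(fallback)
    return value

# b"\xe9" is 'é' in cp1252 but an invalid byte sequence in UTF-8.
rows = [{"name": b"Caf\xe9"}, {"name": "plain"}]
clean = [{k: to_valid_utf8(v) for k, v in row.items()} for row in rows]
print(clean[0]["name"])  # Café
```

The R snippet above does the equivalent per character column via `stringi::stri_encode`.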

Or I should say…now I got a different error. =)

Same query I posted above now causes an OOM exception. Full report here: https://github.com/duckdb/duckdb/issues/5315

Yeah, in the spirit of what @BigBangData said, I am still seeing this error crop up on a recent dev build (v0.5.2-dev2082), in a different situation than the one I mentioned earlier in the thread.

For example, I’m now trying to create a table derived from an event log, using a query like this:

CREATE TABLE IF NOT EXISTS first_page_view as (
  SELECT
    user_id,
    page_id,
    min(viewed_at) as first_view
  FROM page_views
  GROUP BY user_id, page_id
);

This throws: INTERNAL Error: Invalid unicode (byte sequence mismatch) detected in segment statistics update.

Interestingly, the error does not crop up if I process the first half or second half of the source table separately.

This works:

CREATE TABLE IF NOT EXISTS first_page_view as (
  SELECT
    user_id,
    page_id,
    min(viewed_at) as first_view
  FROM page_views
  WHERE viewed_at < '2021-01-01'
  GROUP BY user_id, page_id
);

As does this:

CREATE TABLE IF NOT EXISTS first_page_view as (
  SELECT
    user_id,
    page_id,
    min(viewed_at) as first_view
  FROM page_views
  WHERE viewed_at >= '2021-01-01'
  GROUP BY user_id, page_id
);

So it would seem the error is not being triggered by any particular source rows, but may somehow be related to the volume of data being run through the GROUP BY.

Could you retry this with the very latest master? I suspect this issue has been fixed by #5263.

Also maybe worth mentioning…the failure only happens if the query is inside a CREATE TABLE statement. The query by itself works:

SELECT
  user_id,
  page_id,
  min(viewed_at) as first_view
FROM page_views
GROUP BY user_id, page_id;

If anyone’s interested I can share the full dataset, although it’ll be slightly unwieldy (~10 GB, probably).

Thanks for the reproducible example! I will have a look at fixing this for the next release.

FWIW, I’ve tried to reproduce the error on a Windows 11 machine by creating an input csv populated with a variety of values from the big list of naughty strings.

I’ve been unable to reproduce the error. I think the ‘problem’ (i.e. the reason it is not causing an error) is probably that my csv file, whilst containing various unicode characters, is correctly encoded in utf-8. But I’ve experimented by incorrectly saving unicode characters in Windows-1252 encoding as well, with no luck in reproducing the problem either.

My best guess at the moment is that this is something to do with Microsoft SQL Server, which (I believe) usually stores data in Windows-1252 (cp1252) encoding, and with how this data gets loaded into Pandas. But this is pure guesswork, based on encoding errors I’ve run into in the past when interacting with MS SQL Server, and on the fact that all of our users who’ve seen this error so far have sourced their data from MS SQL Server.
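That theory is easy to check at the byte level. In cp1252, characters like '£' or 'ç' are single bytes that form invalid UTF-8 sequences, which is the kind of input DuckDB's statistics check rejects. A stdlib sketch (the cp1252 source is the assumption from the comment above, not a confirmed cause):

```python
text = "£9 façade"
raw = text.encode("cp1252")  # single-byte encoding commonly seen with MS SQL Server

try:
    raw.decode("utf-8")
    recovered = text
except UnicodeDecodeError:
    # 0xA3 ('£') and 0xE7 ('ç') are not valid UTF-8 sequences here,
    # so the data must be re-decoded with its real encoding first.
    recovered = raw.decode("cp1252")

print(recovered == text)  # True
```

If bytes like these reach DuckDB while being labelled as UTF-8, the "invalid unicode" error is the expected outcome.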

We’ve had reports of this problem from quite a few users of Splink (a Python library which uses DuckDB as a backend). Although sadly I haven’t got a minimal reproducible example (because, being on Mac, I’ve never seen the error myself), a common theme seems to be that these errors occur on Windows, with data sourced from Microsoft SQL Server. I will do my best to try and obtain a reprex.

https://github.com/moj-analytical-services/splink/issues/664

@GitHunter0 Thanks for providing a reproducible example, I can confirm the issue at least in the latest CRAN release.

That is not a known bug. Could you please file a bug report? Thanks!