duckdb: rows_update() gives incorrect results

What happens?

When trying to update one table to another using rows_update(), the resulting changes to the database are incorrect.

To Reproduce

See reprex below. Generally, this has been difficult to reproduce in a synthetic setting, but by stripping out most of the contents of the database where it happens, I have been able to create the reprex below - alternatively, see the zipped file here.

Zipped_example.zip

Environment (please complete the following information):

  • OS: Ubuntu, Windows
  • DuckDB Version: 0.3.5-dev1410
  • DuckDB Client: R DuckDB package 0.4.0, installed from CRAN (on both Windows and Ubuntu)

Identity Disclosure:

  • Full Name: Mark Payne
  • Affiliation: Danish Meterological Institute

Before Submitting

  • [ x] Have you tried this on the latest master branch?
  • Python: pip install duckdb --upgrade --pre
  • R: install.packages("https://github.com/duckdb/duckdb/releases/download/master-builds/duckdb_r_src.tar.gz", repos = NULL)
  • Other Platforms: You can find binaries here or compile from source.

Checked on latest version on CRAN.

  • [x ] Have you tried the steps to reproduce? Do they include all relevant data and configuration? Does the issue you report still appear there?

Yes.

Reprex

library(duckdb)
#> Loading required package: DBI
library(tidyverse)
library(dbplyr)

#Connect to database
this.db <- dbConnect(duckdb(),"MPI-ESM.duckdb")
time.tb <- tbl(this.db,"time")
doy.tb <- tbl(this.db,"doy")

#Check contents of doy and time tables. Note in particular the doy column, which is what we're interested in
time.tb
#> # Source:   table<time> [?? x 8]
#> # Database: DuckDB 0.3.5-dev1410 [mapa@Linux 5.15.0-40-generic:R 4.1.2/MPI-ESM.duckdb]
#>    tpKey timeID file                               year month   day season   doy
#>    <dbl>  <dbl> <chr>                             <int> <int> <int> <fct>  <int>
#>  1     0    0.5 /dmidata/projects/klimaatlas/map…  1850     1     1 DJF       NA
#>  2     1    1.5 /dmidata/projects/klimaatlas/map…  1850     1     2 DJF       NA
#>  3     2    2.5 /dmidata/projects/klimaatlas/map…  1850     1     3 DJF       NA
#>  4     3    3.5 /dmidata/projects/klimaatlas/map…  1850     1     4 DJF       NA
#>  5     4    4.5 /dmidata/projects/klimaatlas/map…  1850     1     5 DJF       NA
#>  6     5    5.5 /dmidata/projects/klimaatlas/map…  1850     1     6 DJF       NA
#>  7     6    6.5 /dmidata/projects/klimaatlas/map…  1850     1     7 DJF       NA
#>  8     7    7.5 /dmidata/projects/klimaatlas/map…  1850     1     8 DJF       NA
#>  9     8    8.5 /dmidata/projects/klimaatlas/map…  1850     1     9 DJF       NA
#> 10     9    9.5 /dmidata/projects/klimaatlas/map…  1850     1    10 DJF       NA
#> # … with more rows
doy.tb
#> # Source:   table<doy> [?? x 3]
#> # Database: DuckDB 0.3.5-dev1410 [mapa@Linux 5.15.0-40-generic:R 4.1.2/MPI-ESM.duckdb]
#>    month   day   doy
#>    <int> <int> <dbl>
#>  1     1     1     1
#>  2     1     2     2
#>  3     1     3     3
#>  4     1     4     4
#>  5     1     5     5
#>  6     1     6     6
#>  7     1     7     7
#>  8     1     8     8
#>  9     1     9     9
#> 10     1    10    10
#> # … with more rows

#But doing the modification leads to problems
rows_update(time.tb,
            doy.tb,
            by=c("month","day"),
            unmatched="ignore",
            in_place = TRUE) #NB: in_place=FALSE gives the correct result, but doesn't modify the database

#Sometimes the problem is apparent at the head of the table
time.tb
#> # Source:   table<time> [?? x 8]
#> # Database: DuckDB 0.3.5-dev1410 [mapa@Linux 5.15.0-40-generic:R 4.1.2/MPI-ESM.duckdb]
#>    tpKey timeID file                             year month   day season     doy
#>    <dbl>  <dbl> <chr>                           <int> <int> <int> <fct>    <int>
#>  1     0    0.5 /dmidata/projects/klimaatlas/m…  1850     1     1 DJF    -3.38e8
#>  2     1    1.5 /dmidata/projects/klimaatlas/m…  1850     1     2 DJF     2.2 e4
#>  3     2    2.5 /dmidata/projects/klimaatlas/m…  1850     1     3 DJF    -3.37e8
#>  4     3    3.5 /dmidata/projects/klimaatlas/m…  1850     1     4 DJF     2.2 e4
#>  5     4    4.5 /dmidata/projects/klimaatlas/m…  1850     1     5 DJF     0     
#>  6     5    5.5 /dmidata/projects/klimaatlas/m…  1850     1     6 DJF     0     
#>  7     6    6.5 /dmidata/projects/klimaatlas/m…  1850     1     7 DJF     0     
#>  8     7    7.5 /dmidata/projects/klimaatlas/m…  1850     1     8 DJF     0     
#>  9     8    8.5 /dmidata/projects/klimaatlas/m…  1850     1     9 DJF     0     
#> 10     9    9.5 /dmidata/projects/klimaatlas/m…  1850     1    10 DJF     0     
#> # … with more rows

#But it really becomes apparent if we look across the entire table
time.tb%>% 
  summarise(min=min(doy,na.rm=TRUE),
            max=max(doy,na.rm=TRUE)) 
#> # Source:   SQL [1 x 2]
#> # Database: DuckDB 0.3.5-dev1410 [mapa@Linux 5.15.0-40-generic:R 4.1.2/MPI-ESM.duckdb]
#>           min        max
#>         <int>      <int>
#> 1 -2147221504 2063663104

Created on 2022-07-20 by the reprex package (v2.0.1)

Session info
sessionInfo()
#> R version 4.1.2 (2021-11-01)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Ubuntu 22.04 LTS
#> 
#> Matrix products: default
#> BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so
#> 
#> locale:
#>  [1] LC_CTYPE=C.UTF-8       LC_NUMERIC=C           LC_TIME=C.UTF-8       
#>  [4] LC_COLLATE=C.UTF-8     LC_MONETARY=C.UTF-8    LC_MESSAGES=C.UTF-8   
#>  [7] LC_PAPER=C.UTF-8       LC_NAME=C              LC_ADDRESS=C          
#> [10] LC_TELEPHONE=C         LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C   
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#>  [1] forcats_0.5.1   stringr_1.4.0   dplyr_1.0.9     purrr_0.3.4    
#>  [5] readr_2.1.2     tidyr_1.2.0     tibble_3.1.7    ggplot2_3.3.6  
#>  [9] tidyverse_1.3.1 duckdb_0.4.0    DBI_1.1.3      
#> 
#> loaded via a namespace (and not attached):
#>  [1] tidyselect_1.1.2 xfun_0.31        haven_2.5.0      colorspace_2.0-3
#>  [5] vctrs_0.4.1      generics_0.1.2   htmltools_0.5.2  yaml_2.3.5      
#>  [9] blob_1.2.3       utf8_1.2.2       rlang_1.0.3      pillar_1.7.0    
#> [13] glue_1.6.2       withr_2.5.0      dbplyr_2.2.1     readxl_1.4.0    
#> [17] modelr_0.1.8     lifecycle_1.0.1  cellranger_1.1.0 munsell_0.5.0   
#> [21] gtable_0.3.0     rvest_1.0.2      evaluate_0.15    knitr_1.39      
#> [25] tzdb_0.3.0       fastmap_1.1.0    fansi_1.0.3      highr_0.9       
#> [29] broom_1.0.0      backports_1.4.1  scales_1.2.0     jsonlite_1.8.0  
#> [33] fs_1.5.2         hms_1.1.1        digest_0.6.29    stringi_1.7.6   
#> [37] grid_4.1.2       cli_3.3.0        tools_4.1.2      magrittr_2.0.3  
#> [41] crayon_1.5.1     pkgconfig_2.0.3  ellipsis_0.3.2   xml2_1.3.3      
#> [45] reprex_2.0.1     lubridate_1.8.0  assertthat_0.2.1 rmarkdown_2.14  
#> [49] httr_1.4.3       rstudioapi_0.13  R6_2.5.1         compiler_4.1.2

About this issue

  • Original URL
  • State: closed
  • Created 2 years ago
  • Comments: 16 (8 by maintainers)

Commits related to this issue

Most upvoted comments

Thanks for the data generation script. I was able to reduce it further to the following one (that still results in the bug most of the time on the latest master (R-package installed from the R-universe)):

fname <- tempfile()
unlink(fname)

con <- DBI::dbConnect(duckdb::duckdb(), fname)

DBI::dbExecute(con, "CREATE TABLE time(month INTEGER, day INTEGER, doy INTEGER, UNIQUE (month, day))")
#> [1] 0
DBI::dbExecute(con, "INSERT INTO time SELECT i AS month, i AS day, NULL AS doy FROM range(1025) tbl(i)")
#> [1] 1025
DBI::dbExecute(con, "CREATE TABLE doy AS SELECT 1 AS month, 1 AS day, 1.0 AS doy")
#> [1] 1
DBI::dbExecute(con, "CHECKPOINT")
#> [1] 0
DBI::dbExecute(con, "UPDATE time SET doy = doy.doy FROM doy WHERE (doy.month = time.month) AND (doy.day = time.day)")
#> [1] 1
DBI::dbGetQuery(con, "SELECT * FROM time WHERE doy<1 OR doy>1")
#> [1] month day   doy  
#> <0 rows> (or 0-length row.names)
DBI::dbGetQuery(con, "SELECT MIN(doy), MAX(doy) FROM time")
#>   min(doy) max(doy)
#> 1      501      501

DBI::dbDisconnect(con,shutdown=TRUE)

Created on 2022-07-22 by the reprex package (v2.0.1)

So, @Mytherin here is the pure SQL script that reproduces the bug if database is in file instead of :memory::

CREATE TABLE time(month INTEGER, day INTEGER, doy INTEGER, UNIQUE (month, day));
INSERT INTO time SELECT i AS month, i AS day, NULL AS doy FROM range(1025) tbl(i);
CREATE TABLE doy AS SELECT 1 AS month, 1 AS day, 1.0 AS doy;
CHECKPOINT;
UPDATE time SET doy = doy.doy FROM doy WHERE (doy.month = time.month) AND (doy.day = time.day);
SELECT * FROM time WHERE doy<1 OR doy>1;
SELECT MIN(doy), MAX(doy) FROM time;

So in the end for those two SELECT-queries there should not be any other values for doy than 1 (and NULL), but with MIN/MAX invalid values (i.e. not equal to 1) can be seen.