duckdb: rows_update() gives incorrect results
What happens?
When trying to update one table to another using rows_update(), the resulting changes to the database are incorrect.
To Reproduce
See reprex below. Generally, this has been difficult to reproduce in a synthetic setting, but by stripping out most of the contents of the database where it happens, I have been able to create the reprex below - alternatively, see the zipped file here.
Environment (please complete the following information):
- OS: Ubuntu, Windows
- DuckDB Version: 0.3.5-dev1410
- DuckDB Client: R DuckDB package 0.4.0, installed from CRAN (on both Windows and Ubuntu)
Identity Disclosure:
- Full Name: Mark Payne
- Affiliation: Danish Meterological Institute
Before Submitting
- [ x] Have you tried this on the latest
masterbranch?
- Python:
pip install duckdb --upgrade --pre - R:
install.packages("https://github.com/duckdb/duckdb/releases/download/master-builds/duckdb_r_src.tar.gz", repos = NULL) - Other Platforms: You can find binaries here or compile from source.
Checked on latest version on CRAN.
- [x ] Have you tried the steps to reproduce? Do they include all relevant data and configuration? Does the issue you report still appear there?
Yes.
Reprex
library(duckdb)
#> Loading required package: DBI
library(tidyverse)
library(dbplyr)
#Connect to database
this.db <- dbConnect(duckdb(),"MPI-ESM.duckdb")
time.tb <- tbl(this.db,"time")
doy.tb <- tbl(this.db,"doy")
#Check contents of doy and time tables. Note in particular the doy column, which is what we're interested in
time.tb
#> # Source: table<time> [?? x 8]
#> # Database: DuckDB 0.3.5-dev1410 [mapa@Linux 5.15.0-40-generic:R 4.1.2/MPI-ESM.duckdb]
#> tpKey timeID file year month day season doy
#> <dbl> <dbl> <chr> <int> <int> <int> <fct> <int>
#> 1 0 0.5 /dmidata/projects/klimaatlas/map⦠1850 1 1 DJF NA
#> 2 1 1.5 /dmidata/projects/klimaatlas/map⦠1850 1 2 DJF NA
#> 3 2 2.5 /dmidata/projects/klimaatlas/map⦠1850 1 3 DJF NA
#> 4 3 3.5 /dmidata/projects/klimaatlas/map⦠1850 1 4 DJF NA
#> 5 4 4.5 /dmidata/projects/klimaatlas/map⦠1850 1 5 DJF NA
#> 6 5 5.5 /dmidata/projects/klimaatlas/map⦠1850 1 6 DJF NA
#> 7 6 6.5 /dmidata/projects/klimaatlas/map⦠1850 1 7 DJF NA
#> 8 7 7.5 /dmidata/projects/klimaatlas/map⦠1850 1 8 DJF NA
#> 9 8 8.5 /dmidata/projects/klimaatlas/map⦠1850 1 9 DJF NA
#> 10 9 9.5 /dmidata/projects/klimaatlas/map⦠1850 1 10 DJF NA
#> # ⦠with more rows
doy.tb
#> # Source: table<doy> [?? x 3]
#> # Database: DuckDB 0.3.5-dev1410 [mapa@Linux 5.15.0-40-generic:R 4.1.2/MPI-ESM.duckdb]
#> month day doy
#> <int> <int> <dbl>
#> 1 1 1 1
#> 2 1 2 2
#> 3 1 3 3
#> 4 1 4 4
#> 5 1 5 5
#> 6 1 6 6
#> 7 1 7 7
#> 8 1 8 8
#> 9 1 9 9
#> 10 1 10 10
#> # ⦠with more rows
#But doing the modification leads to problems
rows_update(time.tb,
doy.tb,
by=c("month","day"),
unmatched="ignore",
in_place = TRUE) #NB: in_place=FALSE gives the correct result, but doesn't modify the database
#Sometimes the problem is apparent at the head of the table
time.tb
#> # Source: table<time> [?? x 8]
#> # Database: DuckDB 0.3.5-dev1410 [mapa@Linux 5.15.0-40-generic:R 4.1.2/MPI-ESM.duckdb]
#> tpKey timeID file year month day season doy
#> <dbl> <dbl> <chr> <int> <int> <int> <fct> <int>
#> 1 0 0.5 /dmidata/projects/klimaatlas/m⦠1850 1 1 DJF -3.38e8
#> 2 1 1.5 /dmidata/projects/klimaatlas/m⦠1850 1 2 DJF 2.2 e4
#> 3 2 2.5 /dmidata/projects/klimaatlas/m⦠1850 1 3 DJF -3.37e8
#> 4 3 3.5 /dmidata/projects/klimaatlas/m⦠1850 1 4 DJF 2.2 e4
#> 5 4 4.5 /dmidata/projects/klimaatlas/m⦠1850 1 5 DJF 0
#> 6 5 5.5 /dmidata/projects/klimaatlas/m⦠1850 1 6 DJF 0
#> 7 6 6.5 /dmidata/projects/klimaatlas/m⦠1850 1 7 DJF 0
#> 8 7 7.5 /dmidata/projects/klimaatlas/m⦠1850 1 8 DJF 0
#> 9 8 8.5 /dmidata/projects/klimaatlas/m⦠1850 1 9 DJF 0
#> 10 9 9.5 /dmidata/projects/klimaatlas/m⦠1850 1 10 DJF 0
#> # ⦠with more rows
#But it really becomes apparent if we look across the entire table
time.tb%>%
summarise(min=min(doy,na.rm=TRUE),
max=max(doy,na.rm=TRUE))
#> # Source: SQL [1 x 2]
#> # Database: DuckDB 0.3.5-dev1410 [mapa@Linux 5.15.0-40-generic:R 4.1.2/MPI-ESM.duckdb]
#> min max
#> <int> <int>
#> 1 -2147221504 2063663104
Created on 2022-07-20 by the reprex package (v2.0.1)
Session info
sessionInfo()
#> R version 4.1.2 (2021-11-01)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Ubuntu 22.04 LTS
#>
#> Matrix products: default
#> BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so
#>
#> locale:
#> [1] LC_CTYPE=C.UTF-8 LC_NUMERIC=C LC_TIME=C.UTF-8
#> [4] LC_COLLATE=C.UTF-8 LC_MONETARY=C.UTF-8 LC_MESSAGES=C.UTF-8
#> [7] LC_PAPER=C.UTF-8 LC_NAME=C LC_ADDRESS=C
#> [10] LC_TELEPHONE=C LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] forcats_0.5.1 stringr_1.4.0 dplyr_1.0.9 purrr_0.3.4
#> [5] readr_2.1.2 tidyr_1.2.0 tibble_3.1.7 ggplot2_3.3.6
#> [9] tidyverse_1.3.1 duckdb_0.4.0 DBI_1.1.3
#>
#> loaded via a namespace (and not attached):
#> [1] tidyselect_1.1.2 xfun_0.31 haven_2.5.0 colorspace_2.0-3
#> [5] vctrs_0.4.1 generics_0.1.2 htmltools_0.5.2 yaml_2.3.5
#> [9] blob_1.2.3 utf8_1.2.2 rlang_1.0.3 pillar_1.7.0
#> [13] glue_1.6.2 withr_2.5.0 dbplyr_2.2.1 readxl_1.4.0
#> [17] modelr_0.1.8 lifecycle_1.0.1 cellranger_1.1.0 munsell_0.5.0
#> [21] gtable_0.3.0 rvest_1.0.2 evaluate_0.15 knitr_1.39
#> [25] tzdb_0.3.0 fastmap_1.1.0 fansi_1.0.3 highr_0.9
#> [29] broom_1.0.0 backports_1.4.1 scales_1.2.0 jsonlite_1.8.0
#> [33] fs_1.5.2 hms_1.1.1 digest_0.6.29 stringi_1.7.6
#> [37] grid_4.1.2 cli_3.3.0 tools_4.1.2 magrittr_2.0.3
#> [41] crayon_1.5.1 pkgconfig_2.0.3 ellipsis_0.3.2 xml2_1.3.3
#> [45] reprex_2.0.1 lubridate_1.8.0 assertthat_0.2.1 rmarkdown_2.14
#> [49] httr_1.4.3 rstudioapi_0.13 R6_2.5.1 compiler_4.1.2
About this issue
- Original URL
- State: closed
- Created 2 years ago
- Comments: 16 (8 by maintainers)
Commits related to this issue
- Fix #4173: constant validity segment can only output constant vectors if the input vector is a constant as well — committed to Mytherin/duckdb by Mytherin 2 years ago
Thanks for the data generation script. I was able to reduce it further to the following one (that still results in the bug most of the time on the latest master (R-package installed from the R-universe)):
Created on 2022-07-22 by the reprex package (v2.0.1)
So, @Mytherin here is the pure SQL script that reproduces the bug if database is in file instead of
:memory::So in the end for those two SELECT-queries there should not be any other values for doy than 1 (and NULL), but with MIN/MAX invalid values (i.e. not equal to 1) can be seen.