arrow: [R] Segmentation fault when using write_parquet()

Describe the bug, including details regarding any error messages, version, and platform.

I am randomly getting a segfault when using write_parquet() with the latest release (the same code works well with v10.0.1).

 *** caught segfault ***
address 0x18, cause 'memory not mapped'

Traceback:
 1: Table__from_dots(dots, schema, option_use_threads())
 2: Table$create(x, schema = schema)
 3: as_arrow_table.data.frame(x)
 4: as_arrow_table(x)
 5: doTryCatch(return(expr), name, parentenv, handler)
 6: tryCatchOne(expr, names, parentenv, handlers[[1L]])
 7: tryCatchList(expr, classes, parentenv, handlers)
 8: tryCatch(as_arrow_table(x), arrow_no_method_as_arrow_table = function(e) {    abort("Object must be coercible to an Arrow Table using `as_arrow_table()`",         parent = e, call = caller_env(2))})
 9: as_writable_table(x)
10: write_parquet(bioargo_dark_corrected, here("data", "raw", "bioargo",     "bioargo_correction_c.parquet"))
11: eval(ei, envir)
12: eval(ei, envir)
13: withVisible(eval(ei, envir))
14: source(here("R", "001c_bioargo_chla_dark_correction.R"))

Possible actions:
1: abort (with core dump, if enabled)
2: normal R exit
3: exit R without saving workspace
4: exit R saving workspace

Following the debugging guide (https://arrow.apache.org/docs/7.0/r/articles/developers/debugging.html), here is the exact line where the code crashes.

Thread 1 "R" received signal SIGSEGV, Segmentation fault.
0x00007fffdccb0956 in std::__shared_ptr<arrow::DataType, (__gnu_cxx::_Lock_policy)2>::__shared_ptr (this=0x7ffffffc4d50) at /usr/include/c++/12/bits/shared_ptr_base.h:1522
1522          __shared_ptr(const __shared_ptr&) noexcept = default;
$> sessionInfo()
R version 4.2.2 (2022-10-31)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 22.10

Matrix products: default
BLAS/LAPACK: /opt/intel/compilers_and_libraries_2020.4.304/linux/mkl/lib/intel64_lin/libmkl_gf_lp64.so

locale:
 [1] LC_CTYPE=en_CA.UTF-8       LC_NUMERIC=C               LC_TIME=en_CA.UTF-8       
 [4] LC_COLLATE=en_CA.UTF-8     LC_MONETARY=en_CA.UTF-8    LC_MESSAGES=en_CA.UTF-8   
 [7] LC_PAPER=en_CA.UTF-8       LC_NAME=C                  LC_ADDRESS=C              
[10] LC_TELEPHONE=C             LC_MEASUREMENT=en_CA.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] gsw_1.1-1              fishmethods_1.12-0     santoku_0.9.0          arrow_11.0.0.100000089
 [5] yardstick_1.1.0        workflowsets_1.0.0     workflows_1.1.2        tune_1.0.1            
 [9] rsample_1.1.1          recipes_1.0.4          parsnip_1.0.3          modeldata_1.1.0       
[13] infer_1.0.4            dials_1.1.0            scales_1.2.1           broom_1.0.3           
[17] tidymodels_1.0.0       data.table_1.14.6      furrr_0.3.1            future_1.31.0         
[21] pins_1.1.0             tidyterra_0.3.1        terra_1.7-3            sf_1.0-9              
[25] patchwork_1.1.2        tidync_0.3.0           here_1.0.1             glue_1.6.2            
[29] ggpmthemes_0.0.2       lubridate_1.9.2        forcats_1.0.0          stringr_1.5.0         
[33] dplyr_1.1.0            purrr_1.0.1            readr_2.1.4            tidyr_1.3.0           
[37] tibble_3.1.8           ggplot2_3.4.1          tidyverse_1.3.2.9000  

loaded via a namespace (and not attached):
 [1] minqa_1.2.5         colorspace_2.1-0    ellipsis_0.3.2      class_7.3-21       
 [5] rprojroot_2.0.3     fs_1.6.1            rstudioapi_0.14     proxy_0.4-27       
 [9] farver_2.1.1        listenv_0.9.0       bit64_4.0.5         prodlim_2019.11.13 
[13] fansi_1.0.4         codetools_0.2-19    splines_4.2.2       ncdf4_1.21         
[17] extrafont_0.19      jsonlite_1.8.4      nloptr_2.0.3        Rttf2pt1_1.3.12    
[21] compiler_4.2.2      backports_1.4.1     assertthat_0.2.1    Matrix_1.5-3       
[25] cli_3.6.0           tools_4.2.2         gtable_0.3.1        rappdirs_0.3.3     
[29] Rcpp_1.0.10         RNetCDF_2.6-2       DiceDesign_1.9      vctrs_0.5.2        
[33] nlme_3.1-162        extrafontdb_1.0     iterators_1.0.14    timeDate_4022.108  
[37] gower_1.0.1         globals_0.16.2      lme4_1.1-31         timechange_0.2.0   
[41] lifecycle_1.0.3     ncmeta_0.3.5        MASS_7.3-58.2       ipred_0.9-13       
[45] hms_1.1.2           parallel_4.2.2      TMB_1.9.2           rpart_4.1.19       
[49] stringi_1.7.12      foreach_1.5.2       e1071_1.7-13        lhs_1.1.6          
[53] boot_1.3-28.1       hardhat_1.2.0       lava_1.7.1          rlang_1.0.6        
[57] pkgconfig_2.0.3     lattice_0.20-45     labeling_0.4.2      bit_4.0.5          
[61] tidyselect_1.2.0    parallelly_1.34.0   magrittr_2.0.3      R6_2.5.1           
[65] generics_0.1.3      bootstrap_2019.6    DBI_1.1.3           pillar_1.8.1       
[69] withr_2.5.0         units_0.8-1         survival_3.5-3      nnet_7.3-18        
[73] future.apply_1.10.0 crayon_1.5.2        KernSmooth_2.23-20  utf8_1.2.3         
[77] tzdb_0.3.0          grid_4.2.2          digest_0.6.31       classInt_0.4-8     
[81] numDeriv_2016.8-1.1 GPfit_1.0-8         munsell_0.5.0     

Component(s)

R

About this issue

  • State: closed
  • Created a year ago
  • Comments: 17 (9 by maintainers)

Most upvoted comments

Brilliant! I had to download the file separately but this is fantastic.

library(tidyverse)
library(arrow)

# Download from:
# https://download849.mediafire.com/r4csstfcwwwgGquvCho4H6GtScoCJac108RL-q6X9MtoWuPDQvZOQAWhxQqlCjLj2RmsyzikhTZ0ijBElIAs5in5whbp-w/7dk60h8gnj4n1qj/bioargo_correction_b.parquet
file <- "~/Desktop/bioargo_correction_b.parquet"

bioargo <- read_parquet(file)

bioargo

pdf(tempfile())
bioargo |>
  group_by(takuse, date_time, n_prof) |>
  filter(pres == max(pres)) |>
  ggplot(aes(x = pres)) +
  geom_histogram(binwidth = 10, color = "white")
dev.off()

bioargo_dark_corrected <- bioargo |>
  group_by(takuse, date_time, n_prof) |>
  mutate(chla = chla - min(chla, na.rm = TRUE)) |>
  ungroup()

write_parquet(bioargo_dark_corrected, tempfile())

This reprex appears to crash R. See standard output and standard error for more details.

Standard output and error


 *** caught segfault ***
address 0x18, cause 'invalid permissions'

Traceback:
 1: Table__from_dots(dots, schema, option_use_threads())
 2: Table$create(x, schema = schema)
 3: as_arrow_table.data.frame(x)
 4: as_arrow_table(x)
 5: doTryCatch(return(expr), name, parentenv, handler)
 6: tryCatchOne(expr, names, parentenv, handlers[[1L]])
 7: tryCatchList(expr, classes, parentenv, handlers)
 8: tryCatch(as_arrow_table(x), arrow_no_method_as_arrow_table = function(e) {    abort("Object must be coercible to an Arrow Table using `as_arrow_table()`",         parent = e, call = caller_env(2))})
 9: as_writable_table(x)
10: write_parquet(bioargo_dark_corrected, tempfile())
11: eval(expr, envir, enclos)
12: eval(expr, envir, enclos)
13: eval_with_user_handlers(expr, envir, enclos, user_handlers)
14: withVisible(eval_with_user_handlers(expr, envir, enclos, user_handlers))
15: withCallingHandlers(withVisible(eval_with_user_handlers(expr,     envir, enclos, user_handlers)), warning = wHandler, error = eHandler,     message = mHandler)
16: doTryCatch(return(expr), name, parentenv, handler)
17: tryCatchOne(expr, names, parentenv, handlers[[1L]])
18: tryCatchList(expr, classes, parentenv, handlers)
19: tryCatch(expr, error = function(e) {    call <- conditionCall(e)    if (!is.null(call)) {        if (identical(call[[1L]], quote(doTryCatch)))             call <- sys.call(-4L)        dcall <- deparse(call, nlines = 1L)        prefix <- paste("Error in", dcall, ": ")        LONG <- 75L        sm <- strsplit(conditionMessage(e), "\n")[[1L]]        w <- 14L + nchar(dcall, type = "w") + nchar(sm[1L], type = "w")        if (is.na(w))             w <- 14L + nchar(dcall, type = "b") + nchar(sm[1L],                 type = "b")        if (w > LONG)             prefix <- paste0(prefix, "\n  ")    }    else prefix <- "Error : "    msg <- paste0(prefix, conditionMessage(e), "\n")    .Internal(seterrmessage(msg[1L]))    if (!silent && isTRUE(getOption("show.error.messages"))) {        cat(msg, file = outFile)        .Internal(printDeferredWarnings())    }    invisible(structure(msg, class = "try-error", condition = e))})
20: try(f, silent = TRUE)
21: handle(ev <- withCallingHandlers(withVisible(eval_with_user_handlers(expr,     envir, enclos, user_handlers)), warning = wHandler, error = eHandler,     message = mHandler))
22: timing_fn(handle(ev <- withCallingHandlers(withVisible(eval_with_user_handlers(expr,     envir, enclos, user_handlers)), warning = wHandler, error = eHandler,     message = mHandler)))
23: evaluate_call(expr, parsed$src[[i]], envir = envir, enclos = enclos,     debug = debug, last = i == length(out), use_try = stop_on_error !=         2L, keep_warning = keep_warning, keep_message = keep_message,     output_handler = output_handler, include_timing = include_timing)
24: evaluate::evaluate(...)
25: evaluate(code, envir = env, new_device = FALSE, keep_warning = if (is.numeric(options$warning)) TRUE else options$warning,     keep_message = if (is.numeric(options$message)) TRUE else options$message,     stop_on_error = if (is.numeric(options$error)) options$error else {        if (options$error && options$include)             0L        else 2L    }, output_handler = knit_handlers(options$render, options))
26: in_dir(input_dir(), expr)
27: in_input_dir(evaluate(code, envir = env, new_device = FALSE,     keep_warning = if (is.numeric(options$warning)) TRUE else options$warning,     keep_message = if (is.numeric(options$message)) TRUE else options$message,     stop_on_error = if (is.numeric(options$error)) options$error else {        if (options$error && options$include)             0L        else 2L    }, output_handler = knit_handlers(options$render, options)))
28: eng_r(options)
29: block_exec(params)
30: call_block(x)
31: process_group.block(group)
32: process_group(group)
33: withCallingHandlers(if (tangle) process_tangle(group) else process_group(group),     error = function(e) {        setwd(wd)        cat(res, sep = "\n", file = output %n% "")        message("Quitting from lines ", paste(current_lines(i),             collapse = "-"), " (", knit_concord$get("infile"),             ") ")    })
34: process_file(text, output)
35: knitr::knit(knit_input, knit_output, envir = envir, quiet = quiet)
36: rmarkdown::render(input, quiet = TRUE, envir = globalenv(), encoding = "UTF-8")
37: (function (input) {    rmarkdown::render(input, quiet = TRUE, envir = globalenv(),         encoding = "UTF-8")})(input = base::quote("loyal-rat_reprex.R"))
38: (function (what, args, quote = FALSE, envir = parent.frame()) {    if (!is.list(args))         stop("second argument must be a list")    if (quote)         args <- lapply(args, enquote)    .Internal(do.call(what, args, envir))})(base::quote(function (input) {    rmarkdown::render(input, quiet = TRUE, envir = globalenv(),         encoding = "UTF-8")}), base::quote(list(input = "loyal-rat_reprex.R")), envir = base::quote(<environment>),     quote = base::quote(TRUE))
39: do.call(do.call, c(readRDS("/var/folders/p5/sxv05ml96sd1n2p3ssfhzzth0000gn/T//Rtmpod6Iee/callr-fun-768e1cc53b50"),     list(envir = .GlobalEnv, quote = TRUE)), envir = .GlobalEnv,     quote = TRUE)
40: saveRDS(do.call(do.call, c(readRDS("/var/folders/p5/sxv05ml96sd1n2p3ssfhzzth0000gn/T//Rtmpod6Iee/callr-fun-768e1cc53b50"),     list(envir = .GlobalEnv, quote = TRUE)), envir = .GlobalEnv,     quote = TRUE), file = "/var/folders/p5/sxv05ml96sd1n2p3ssfhzzth0000gn/T//Rtmpod6Iee/callr-res-768e58b90ff1",     compress = FALSE)
41: withCallingHandlers({    NULL    saveRDS(do.call(do.call, c(readRDS("/var/folders/p5/sxv05ml96sd1n2p3ssfhzzth0000gn/T//Rtmpod6Iee/callr-fun-768e1cc53b50"),         list(envir = .GlobalEnv, quote = TRUE)), envir = .GlobalEnv,         quote = TRUE), file = "/var/folders/p5/sxv05ml96sd1n2p3ssfhzzth0000gn/T//Rtmpod6Iee/callr-res-768e58b90ff1",         compress = FALSE)    flush(stdout())    flush(stderr())    NULL    invisible()}, error = function(e) {    {        callr_data <- as.environment("tools:callr")$`__callr_data__`        err <- callr_data$err        if (FALSE) {            assign(".Traceback", .traceback(4), envir = callr_data)            dump.frames("__callr_dump__")            assign(".Last.dump", .GlobalEnv$`__callr_dump__`,                 envir = callr_data)            rm("__callr_dump__", envir = .GlobalEnv)        }        e <- err$process_call(e)        e2 <- err$new_error("error in callr subprocess")        class(e2) <- c("callr_remote_error", class(e2))        e2 <- err$add_trace_back(e2)        cut <- which(e2$trace$scope == "global")[1]        if (!is.na(cut)) {            e2$trace <- e2$trace[-(1:cut), ]        }        saveRDS(list("error", e2, e), file = paste0("/var/folders/p5/sxv05ml96sd1n2p3ssfhzzth0000gn/T//Rtmpod6Iee/callr-res-768e58b90ff1",             ".error"))    }}, interrupt = function(e) {    {        callr_data <- as.environment("tools:callr")$`__callr_data__`        err <- callr_data$err        if (FALSE) {            assign(".Traceback", .traceback(4), envir = callr_data)            dump.frames("__callr_dump__")            assign(".Last.dump", .GlobalEnv$`__callr_dump__`,                 envir = callr_data)            rm("__callr_dump__", envir = .GlobalEnv)        }        e <- err$process_call(e)        e2 <- err$new_error("error in callr subprocess")        class(e2) <- c("callr_remote_error", class(e2))        e2 <- err$add_trace_back(e2)        cut <- which(e2$trace$scope == "global")[1]        if (!is.na(cut)) {         
   e2$trace <- e2$trace[-(1:cut), ]        }        saveRDS(list("error", e2, e), file = paste0("/var/folders/p5/sxv05ml96sd1n2p3ssfhzzth0000gn/T//Rtmpod6Iee/callr-res-768e58b90ff1",             ".error"))    }}, callr_message = function(e) {    try(signalCondition(e))})
42: doTryCatch(return(expr), name, parentenv, handler)
43: tryCatchOne(expr, names, parentenv, handlers[[1L]])
44: tryCatchList(expr, names[-nh], parentenv, handlers[-nh])
45: doTryCatch(return(expr), name, parentenv, handler)
46: tryCatchOne(tryCatchList(expr, names[-nh], parentenv, handlers[-nh]),     names[nh], parentenv, handlers[[nh]])
47: tryCatchList(expr, classes, parentenv, handlers)
48: tryCatch(withCallingHandlers({    NULL    saveRDS(do.call(do.call, c(readRDS("/var/folders/p5/sxv05ml96sd1n2p3ssfhzzth0000gn/T//Rtmpod6Iee/callr-fun-768e1cc53b50"),         list(envir = .GlobalEnv, quote = TRUE)), envir = .GlobalEnv,         quote = TRUE), file = "/var/folders/p5/sxv05ml96sd1n2p3ssfhzzth0000gn/T//Rtmpod6Iee/callr-res-768e58b90ff1",         compress = FALSE)    flush(stdout())    flush(stderr())    NULL    invisible()}, error = function(e) {    {        callr_data <- as.environment("tools:callr")$`__callr_data__`        err <- callr_data$err        if (FALSE) {            assign(".Traceback", .traceback(4), envir = callr_data)            dump.frames("__callr_dump__")            assign(".Last.dump", .GlobalEnv$`__callr_dump__`,                 envir = callr_data)            rm("__callr_dump__", envir = .GlobalEnv)        }        e <- err$process_call(e)        e2 <- err$new_error("error in callr subprocess")        class(e2) <- c("callr_remote_error", class(e2))        e2 <- err$add_trace_back(e2)        cut <- which(e2$trace$scope == "global")[1]        if (!is.na(cut)) {            e2$trace <- e2$trace[-(1:cut), ]        }        saveRDS(list("error", e2, e), file = paste0("/var/folders/p5/sxv05ml96sd1n2p3ssfhzzth0000gn/T//Rtmpod6Iee/callr-res-768e58b90ff1",             ".error"))    }}, interrupt = function(e) {    {        callr_data <- as.environment("tools:callr")$`__callr_data__`        err <- callr_data$err        if (FALSE) {            assign(".Traceback", .traceback(4), envir = callr_data)            dump.frames("__callr_dump__")            assign(".Last.dump", .GlobalEnv$`__callr_dump__`,                 envir = callr_data)            rm("__callr_dump__", envir = .GlobalEnv)        }        e <- err$process_call(e)        e2 <- err$new_error("error in callr subprocess")        class(e2) <- c("callr_remote_error", class(e2))        e2 <- err$add_trace_back(e2)        cut <- which(e2$trace$scope == "global")[1]        if (!is.na(cut)) 
{            e2$trace <- e2$trace[-(1:cut), ]        }        saveRDS(list("error", e2, e), file = paste0("/var/folders/p5/sxv05ml96sd1n2p3ssfhzzth0000gn/T//Rtmpod6Iee/callr-res-768e58b90ff1",             ".error"))    }}, callr_message = function(e) {    try(signalCondition(e))}), error = function(e) {    NULL    if (TRUE) {        try(stop(e))    }    else {        invisible()    }}, interrupt = function(e) {    NULL    if (TRUE) {        e    }    else {        invisible()    }})
An irrecoverable exception occurred. R is aborting now ...

Biogeochemical Argo!!! (Cool to see it here…that’s what I did in my previous job!)

Since all these seem related to InferArrowType(), do you mind attempting to print out the column names and types of the data frame you’re passing to arrow::write_parquet()? You could do that maybe with str(the_data_frame_right_before_write_parquet[integer(0), ])?
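For anyone following along, the suggestion above can be sketched in base R as follows. The data frame here is a hypothetical stand-in (its column names are borrowed from the reprex, the values are made up); the point is only that a zero-row slice keeps the column names and classes, so its structure can be shared without sharing the data.

```r
# Hypothetical stand-in for the data frame passed to write_parquet();
# the values are invented for illustration only.
df <- data.frame(
  takuse = c("a", "b"),
  pres = c(1.5, 2.5),
  date_time = as.POSIXct(c("2023-01-01", "2023-01-02"), tz = "UTC"),
  stringsAsFactors = FALSE
)

# A zero-row slice keeps the column names and classes but drops all rows.
empty <- df[integer(0), ]
str(empty)

# Compact summary: one class string per column.
col_classes <- vapply(
  empty,
  function(col) paste(class(col), collapse = "/"),
  character(1)
)
print(col_classes)
```

Columns with unusual classes or attributes would be the first place to look, since type inference runs once per column.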

Just wanted to chime in here since I’m experiencing a very similar error.

It happens when I’m writing a data frame to an arrow/feather file. As for the OP, it works in arrow 10 but not in arrow 11, and the crash originates from the same line.

Here is the top of the stacktrace running R with debugger attached:

* thread #1, name = 'R', stop reason = signal SIGSEGV: invalid address (fault address: 0x18)
  * frame #0: 0x00007fff89fb0516 arrow.so`arrow::r::InferArrowType(SEXPREC*) at shared_ptr_base.h:1522:7
    frame #1: 0x00007fff89fab73e arrow.so`arrow::r::InferSchemaFromDots(SEXPREC*, SEXPREC*, int, std::shared_ptr<arrow::Schema>&)::'lambda'(int, SEXPREC*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>)::operator()(int, SEXPREC*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>) const at table.cpp:179:62
    frame #2: 0x00007fff89fac6c1 arrow.so`arrow::r::InferSchemaFromDots(SEXPREC*, SEXPREC*, int, std::shared_ptr<arrow::Schema>&) at arrow_types.h:211:15
    frame #3: 0x00007fff89f413aa arrow.so`Table__from_dots(SEXPREC*, SEXPREC*, bool) at r_to_arrow.cpp:1461:44
    frame #4: 0x00007fff89e9a690 arrow.so`_arrow_Table__from_dots at arrowExports.cpp:4321:40

It happens during unit testing in my package, so I can reproduce it as often as I want locally; it also happens when running GitHub Actions on Ubuntu, Windows, and macOS. Unfortunately, I have not been able to create a minimal reproducible example: calling the offending function on its own with the same inputs does not trigger the error. So it seems to depend on something that happens earlier during my unit tests.

I’ve tried turning off threading with options(arrow.use_threads = FALSE), but that doesn’t solve the issue.
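For reference, a minimal sketch of toggling that option for a single experiment and restoring the previous value afterwards. It uses only base R; where the write call goes is marked with a comment, and nothing here is claimed to fix the crash.

```r
# Save the previous value (NULL if the option was never set) while
# disabling multithreading for the duration of the experiment.
old <- options(arrow.use_threads = FALSE)

# Confirm the option is in effect before running the failing call.
threads_off <- getOption("arrow.use_threads")
print(threads_off)

# ... the call that crashes, e.g. write_parquet(), would go here ...

# Restore whatever was set before.
options(old)
```

Since the crash persists with threading disabled, a race between worker threads looks less likely as the cause, though this alone does not rule it out.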

Attached is my sessionInfo and arrowInfo. If you have any ideas on how to debug further please let me know.

SessionInfo

R version 4.2.2 (2022-10-31)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 22.10

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.10.1
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.1

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=nl_NL.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=nl_NL.UTF-8    LC_MESSAGES=en_US.UTF-8    LC_PAPER=nl_NL.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=nl_NL.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] arrow_11.0.0.2

loaded via a namespace (and not attached):
 [1] tidyselect_1.2.0 bit_4.0.4        compiler_4.2.2   magrittr_2.0.3   assertthat_0.2.1 R6_2.5.1
 [7] cli_3.6.0        tools_4.2.2      glue_1.6.2       rstudioapi_0.14  bit64_4.0.5      vctrs_0.5.1
[13] lifecycle_1.0.3  rlang_1.0.6      purrr_1.0.1

arrowInfo

Arrow package version: 11.0.0.2

Capabilities:

dataset    TRUE
substrait FALSE
parquet    TRUE
json       TRUE
s3         TRUE
gcs        TRUE
utf8proc   TRUE
re2        TRUE
snappy     TRUE
gzip       TRUE
brotli     TRUE
zstd       TRUE
lz4        TRUE
lz4_frame  TRUE
lzo       FALSE
bz2        TRUE
jemalloc   TRUE
mimalloc   TRUE

Memory:

Allocator jemalloc
Current    0 bytes
Max        0 bytes

Runtime:

SIMD Level          avx2
Detected SIMD Level avx2

Build:

C++ Library Version  11.0.0
C++ Compiler            GNU
C++ Compiler Version 11.3.0