arrow: [R][R Shiny] Deployed/published RShiny app using arrow to query an AWS-hosted dataset causes intermittent "stack imbalance", "segfault", and "memory not mapped" errors.

Describe the bug, including details regarding any error messages, version, and platform.

Issue description:

I have an RShiny app that pulls data from a hive-partitioned dataset of air quality data hosted in a private AWS S3 bucket using the R arrow package. The dataset is ~30 million rows in total and is partitioned by site and pollutant. One tab in the app lets a user choose a site and pollutant to display as a time series; the selection triggers a dplyr + arrow query against the dataset. Each site/pollutant combination requires only ~100k rows for visualization.

Currently, this app is hosted on RStudio Connect, though the issue also occurs on shinyapps.io. The error occurs only when the app is deployed/published to a server: there are no issues when the app runs locally, even if it is left idle for a long time.

Eventually, the app crashes with the errors “Warning: stack imbalance …”, then “caught segfault” and “memory not mapped”, when a user selects an option that kicks off a query against the AWS-hosted dataset. How long the app functions after a restart varies: sometimes it crashes on the first click after opening, other times only after a few minutes. If left open long enough, the deployed app always returns this error on an action that prompts a data pull from the AWS-hosted dataset.

Example error messages:

2022/12/20 20:48:33.816768370 Warning: stack imbalance in '$', 325 then 328
2022/12/20 20:48:33.816831632 Warning: stack imbalance in '<<-', 328 then 330
2022/12/20 20:48:33.816846663 
2022/12/20 20:48:33.816848493  *** caught segfault ***
2022/12/20 20:48:33.816938465 address 0x2c, cause 'memory not mapped'
2022/12/20 20:48:33.816944956 
2022/12/20 20:48:33.816963716  *** caught segfault ***
2022/12/20 20:48:33.816965356 address 0xe, cause 'memory not mapped'
2022/12/20 20:48:33.816975686 
2022/12/20 20:48:33.816976807  *** caught segfault ***
2022/12/20 20:48:33.816987197 address 0x30, cause 'memory not mapped'
2022/12/20 20:48:33.817012408 Warning: stack imbalance in '$', 244 then 259

Troubleshooting steps:

I suspected that the process maintaining the connection between AWS and the server was idling/disconnecting, so I used a reactivePoll to collect from the dataset every 30 seconds (see below) to keep the process from idling. This minimal collection in the reactivePoll always succeeds, even after up to 15 minutes of running the app. However, it does not prevent the error from occurring when attempting to access the time series tab.

  # reactivePoll inside the server
  log_db <- reactivePoll(30000, session,
                         # checkFunc: run a small query so the connection stays active
                         checkFunc = function() {
                           n <- ds %>%
                             filter(sitecode == 'abc',
                                    pollutant == 'def') %>%
                             collect() %>%
                             nrow()
                           print(n)
                           n
                         },
                         valueFunc = function() {
                           paste0("  ")
                         }
  )
  
  # With a corresponding textOutput('connstr') in the UI
  output$connstr <- renderText(log_db())
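(As an aside, a lighter-weight keep-alive could push the counting down to the Arrow engine instead of collecting ~100k rows every 30 seconds; a sketch, assuming the same `ds` object and column names as above:)

  # Hypothetical cheaper keep-alive query: only the aggregated count
  # crosses the network, not the full filtered table.
  keepalive_n <- function() {
    ds %>%
      filter(sitecode == 'abc', pollutant == 'def') %>%
      count() %>%      # evaluated by Arrow before collect()
      collect() %>%
      pull(n)
  }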

REPREX:

This work is part of a project that can’t yet be shared publicly, but a reproducible example of a similar, if not the same, issue is available in this stackoverflow post: https://stackoverflow.com/questions/73654587/how-can-i-use-r-arrow-and-aws-s3-in-a-shiny-app-deployed-on-ec2-with-shinyproxy

My basic app setup is below:

library(arrow)
library(tidyverse)
library(shiny)
library(shinycssloaders)
library(sonomaDashboardUI)
library(shinyWidgets)
library(aws.s3)
library(shinyjs)
library(shinybusy)

db_uri <- paste0('s3://','<bucketname>')
ds <- arrow::open_dataset(db_uri, format = 'arrow', unify_schemas = F)

ui <- fluidPage(
    tabsetPanel(id = 'tabs',
        tabPanel('tab that uses dataset',
            fluidRow(selectInput('sitecode', 'Site', choices = c(...))),
            fluidRow(selectInput('pollutant', 'Pollutant', choices = c(...))),
            fluidRow(plotOutput('plotThatUsesDS'))),
        tabPanel('tab that does not use dataset',
            fluidRow(plotOutput('plotThatDoesNotUseDS')))))

server <- function(session, input, output) {
    plot_ds_data <- reactive({
        ds %>%
            filter(sitecode == input$sitecode, pollutant == input$pollutant) %>%
            collect()
    })

    output$plotThatUsesDS <- renderPlot({ ggplot(plot_ds_data()) + geom_line(...) })
}

SessionInfo:

R version 4.2.2 (2022-10-31 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows Server x64 (build 17763)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices datasets  utils     methods   base     

other attached packages:
 [1] shinybusy_0.3.1              shinyjs_2.1.0                aws.s3_0.3.21               
 [4] shinyWidgets_0.7.2           sonomaDashboardUI_0.0.0.9000 forcats_0.5.1               
 [7] stringr_1.4.0                dplyr_1.0.9                  purrr_0.3.4                 
[10] readr_2.1.2                  tidyr_1.2.0                  tibble_3.1.8                
[13] ggplot2_3.4.0                tidyverse_1.3.2              shinycssloaders_1.0.0       
[16] shiny_1.7.3                  arrow_10.0.1                

loaded via a namespace (and not attached):
  [1] googledrive_2.0.0          colorspace_2.0-3           deldir_1.0-6               ellipsis_0.3.2            
  [5] gghighlight_0.4.0          leaflet_2.1.1              snakecase_0.11.0           base64enc_0.1-3           
  [9] fs_1.5.2                   rstudioapi_0.13            hexbin_1.28.2              DT_0.24                   
 [13] bit64_4.0.5                fansi_1.0.3                lubridate_1.8.0            xml2_1.3.3                
 [17] splines_4.2.2              R.methodsS3_1.8.2          cachem_1.0.6               jsonlite_1.8.0            
 [21] openair_2.10-0             broom_1.0.1                cluster_2.1.4              dbplyr_2.2.1              
 [25] png_0.1-7                  PerformanceAnalytics_2.0.4 R.oo_1.25.0                mapproj_1.2.8             
 [29] compiler_4.2.2             httr_1.4.3                 backports_1.4.1            assertthat_0.2.1          
 [33] Matrix_1.5-1               fastmap_1.1.0              gargle_1.2.1               cli_3.4.1                 
 [37] later_1.3.0                htmltools_0.5.3            tools_4.2.2                gtable_0.3.0              
 [41] glue_1.6.2                 maps_3.4.0                 Rcpp_1.0.9                 cellranger_1.1.0          
 [45] jquerylib_0.1.4            styler_1.7.0               vctrs_0.5.1                nlme_3.1-160              
 [49] crosstalk_1.2.0            rvest_1.0.3                mime_0.12                  lifecycle_1.0.3           
 [53] renv_0.16.0                googlesheets4_1.0.1        MASS_7.3-58.1              zoo_1.8-10                
 [57] scales_1.2.0               hms_1.1.1                  promises_1.2.0.1           RColorBrewer_1.1-3        
 [61] quantmod_0.4.20            curl_4.3.2                 aws.signature_0.6.0        gridExtra_2.3             
 [65] sass_0.4.2                 latticeExtra_0.6-30        stringi_1.7.8              TTR_0.24.3                
 [69] rlang_1.0.6                pkgconfig_2.0.3            lattice_0.20-45            htmlwidgets_1.5.4         
 [73] bit_4.0.4                  tidyselect_1.1.2           magrittr_2.0.3             R6_2.5.1                  
 [77] generics_0.1.3             DBI_1.1.3                  pillar_1.8.0               haven_2.5.1               
 [81] withr_2.5.0                mgcv_1.8-41                xts_0.12.1                 tidyquant_1.0.4           
 [85] janitor_2.1.0              modelr_0.1.9               crayon_1.5.1               Quandl_2.11.0             
 [89] interp_1.1-3               utf8_1.2.2                 tzdb_0.3.0                 viridis_0.6.2             
 [93] jpeg_0.1-9                 grid_4.2.2                 readxl_1.4.1               reprex_2.0.2              
 [97] digest_0.6.29              xtable_1.8-4               R.cache_0.16.0             httpuv_1.6.5              
[101] R.utils_2.12.0             munsell_0.5.0              viridisLite_0.4.0          bslib_0.4.0               
[105] quadprog_1.5-8            

Component(s)

R

About this issue

  • Original URL
  • State: closed
  • Created 2 years ago
  • Comments: 15 (5 by maintainers)

Most upvoted comments

That's kind of surprising to me that it takes this long to open the dataset; I suspect all the partitioning is causing issues. I was able to replicate the issue you faced using the same partitioning structure (I just faked 24x the data with different intervals). Then I tried saving all of the data in a single Parquet file (it's about 1 GB) and now it runs in <0.5 seconds instead of 8-9 seconds.

There is a good writeup about partitioning performance here: https://arrow.apache.org/docs/r/articles/dataset.html#partitioning-performance-considerations
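(A quick way to see how much a layout costs is to count the files the dataset resolves to; a sketch, assuming an already-opened FileSystemDataset in `ds`:)

# Each partition directory typically holds at least one file; thousands of
# small files mean thousands of S3 requests just to open and scan the dataset.
ds <- arrow::open_dataset(db_uri, format = 'arrow')
length(ds$files)   # number of fragment files behind the dataset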

Could you please try saving the data with less partitioning?

For example, I wrote the data like the code below so it partitions just on aqs_sitecode:

write_dataset(dplyr::group_by(ozone, aqs_sitecode), 
              path = db_uri, 
              format = 'parquet')

Then in the app:

arrow::open_dataset(db_uri, format = 'parquet') %>%
      filter(aqs_sitecode == s,
             parameter == 'Ozone',
             sample_duration == '1 HOUR',
             poc == 1) %>%
      select(date_time2, sample_measurement) %>%
      collect()

@cgostic Thank you for posting some code and data to try and replicate. I was able to run into similar segfault issues without including the reactivePoll() piece. Most of the time, when we see segfault memory-related issues it is not on the Connect side but on the application side, which makes me think it’s related to the arrow::open_dataset() function call being outside the server.

Could you please try moving it into the server in the reactive statement? When I made the switch and redeployed I have been unable to run into any of the memory errors. It seems having the dataset open and idle causes the issue. For reference the code I am using is below:

# See attached lockfile for package versions
library(shiny)
library(dplyr)
library(ggplot2)
library(htmltools)
library(arrow)
library(aws.s3)

aqs_site_code_unique <- c(51190007L, 60658001L, 60731022L, 100032004L, 110010043L, 120110034L, 
                          120573002L, 130890002L, 170314201L, 180970078L, 295100085L, 371190041L, 
                          371830014L, 420030008L, 440071010L, 510870014L, 20900034L, 40191028L, 
                          60270002L, 60850005L, 121290001L, 150030010L, 170191001L, 191630015L, 
                          230090103L, 300490004L, 310550019L, 340130003L, 360551007L, 380150003L, 
                          380171004L, 391351001L, 470090101L, 10730023L, 40139997L, 60371103L, 
                          60670006L, 202090021L, 260810020L, 270031002L, 320030540L, 330150018L, 
                          390350060L, 390610040L, 410510080L, 421010048L, 471570075L, 482011039L, 
                          490353006L, 530330080L, 60190011L, 80310026L, 90050005L, 90090027L, 
                          160010010L, 220330009L, 240230002L, 240330030L, 250250042L, 280490020L, 
                          330115001L, 350010023L, 360810124L, 361010003L, 401431127L, 450790007L, 
                          481410044L, 500070007L, 530090013L, 540390020L, 560210100L, 720210010L, 
                          320310031L, 400019009L)

ui <- fluidPage(
  fluidRow(column(3,
                  selectInput('sitecode',
                              label = 'Select Site',
                              choices = aqs_site_code_unique,
                              selected = NULL)),
           column(2,
                  div(style = 'padding-top:26px',
                      actionButton('go', 'Create Plot', width = '100%')))),
  fluidRow(plotOutput('TS'))
)

server <- function(input, output, session) {
  
  selected_site <- reactiveValues(sitecode = NULL)
  
  observeEvent(input$go, {
    selected_site$sitecode <- input$sitecode
  })
  
  plot_data <- reactive({
    req(selected_site$sitecode)
    
    s <- as.integer(selected_site$sitecode)
  
    bname <- 'BUCKETNAMEHERE'
    db_uri <- paste0('s3://', bname)
    ds <- arrow::open_dataset(db_uri, format = 'arrow', unify_schemas = F)
    ds %>%
      filter(parameter == 'Ozone',
             sample_duration == '1 HOUR',
             poc == 1) %>%
      select(aqs_sitecode, date_time2, sample_measurement) %>%
      collect()
  })
  
  output$TS <- renderPlot({
    req(selected_site$sitecode, 
        is.data.frame(plot_data()))
    
    # selected_site$sitecode arrives as character; convert before comparing
    ggplot(subset(plot_data(), aqs_sitecode == as.integer(selected_site$sitecode))) +
      geom_line(aes(date_time2, sample_measurement)) +
      scale_x_datetime() +
      labs(x = 'DateTime', y = 'Ozone in ppb', title = selected_site$sitecode)
  })
}

shinyApp(ui = ui, server = server)

EDIT: I had commented out the reactivePoll() that pings the data source every 30 seconds. With it included, the issue is resolved in the reprex! Great suggestion.

I will try this solution on a larger scale and let you know if it’s viable.