ipt: Slow validation for large dataset when publishing in IPT (67 M records - takes 10 days+)
This could be related to #1062 and is currently an issue in our production environment, which hasn’t published updates for the Artportalen data for a while.
We’re currently attempting a workaround that avoids using the IPT for publishing and instead does this from the command line, after speaking to people at GBIF Norway who have used this approach in the past for their larger datasets when running into similar performance issues.
The problem we have is that validation in the IPT when publishing the Artportalen dataset is slow: it runs for more than 10 days and then doesn’t complete.
Using a separate clean box with the latest (dockerized) stable IPT version, we compared and got the same slow validation of the occurrenceID column for this dataset. When publishing from that IPT instance, the sorting of the occurrenceID that calls out to GNU sort also gradually grinds to a halt, becoming slower and slower as the number of records increases. It does complete, however, and doesn’t crash like our production deployment of the IPT, but it is still too slow for practical use.
Some questions came up while trying to work on this issue.
The first question is whether, and why, the sorting on occurrenceID is required. Does the sorting of occurrenceID aim to make sure that the submitted archive is sorted? Is that a requirement for GBIF to process it? Or is it a step on the way towards verifying the presence and uniqueness of the occurrenceID identifiers?
Another question is related to disk space. When testing with gradually larger subsets, i.e. using a “select * from artdata limit N” approach with N = 1, 2, 5 and 67 million records, we get validation times that appear to grow exponentially, as well as a lot of files on disk eating considerable space.
The disk space used is big: 5.7 GB for the dwca.zip, 66 GB for the uncompressed archive’s TSV file, and on top of that a whole lot of space eaten by the various file segments created by GNU sort while sorting on occurrenceID. Publishing a 5.7 GB dwca.zip with the IPT currently seems to require several hundred GB of temporary files.
With the existing IPT approach of calling out to GNU sort, would it be possible to pull out just the occurrenceID column into a separate, much smaller file and sort it on its own (if needed), to save on temp files and disk space? Then perhaps all of the other columns wouldn’t need to be dragged around and serialized to disk. A later step calling GNU join to combine the sorted single-column data with the original file could perhaps be a faster way to create the new sorted archive, if the whole archive really needs to be sorted before submitting it?
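Purely to illustrate the idea (the file name occurrence.txt and the assumption that occurrenceID sits in the first column are made up here, not taken from the real archive), something along these lines could check the identifiers without rewriting the whole file:

# pull out only the occurrenceID column (assumed here to be column 1 of occurrence.txt)
cut -f1 occurrence.txt > ids.txt
# sort just that single column; far less data needs to be shuffled and spilled to temp files
LC_ALL=C sort -S 4G --parallel=4 -T /tmp ids.txt > ids_sorted.txt
# duplicated and empty occurrenceIDs, if any
uniq -d ids_sorted.txt | head
grep -c '^$' ids_sorted.txt
# if the whole archive really has to be sorted, GNU join on a line-number key
# could bring the remaining columns back in afterwards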
I’m not sure what is causing the slowness; when testing validation of very similar data, it looks like we should be able to expect significantly faster sorting. Some tests and benchmarks are attached.
This in-memory approach in R checks for uniqueness and presence in about 2 seconds for 70 M strings of 41 characters, and takes less than 2 minutes for a radix sort:
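(A minimal sketch of the kind of test behind those numbers; the 70 M simulated IDs and the made-up prefix below only stand in for the real occurrenceIDs, and generating them is not part of the timings.)

# simulate 70 M unique 41-character strings standing in for occurrenceIDs
# (invented prefix: 33 prefix characters + 8 digits = 41 characters)
n <- 70e6
ids <- paste0("urn:lsid:artportalen.se:Sighting:", sprintf("%08d", sample.int(n)))
# presence and uniqueness check
system.time({
  ok <- !anyNA(ids) && !any(ids == "") && anyDuplicated(ids) == 0
})
# radix sort of the full vector
system.time(ids_sorted <- sort(ids, method = "radix"))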
This bash approach, with only three columns for 67 M rows, works on files on disk and still only takes a few minutes:
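(Again only a sketch; the file name artdata_3col.tsv and the occurrenceID being in the first column are assumptions, not the exact benchmark files.)

# 67 M rows, three tab-separated columns, occurrenceID assumed in column 1
time LC_ALL=C sort -t$'\t' -k1,1 -S 4G --parallel=4 -T /tmp \
  artdata_3col.tsv > artdata_3col_sorted.tsv
# number of duplicated occurrenceIDs, if any
cut -f1 artdata_3col_sorted.tsv | uniq -d | wc -l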
We haven’t been able to figure out where the slowness comes from, but it seems to grow exponentially, and GNU sort would probably not be this slow if it didn’t have to shuffle around all of the columns that the Artportalen dataset has.
A test using Apache Spark, converting the Artportalen dwca.zip into a Parquet file and running some validation tests, also takes somewhere between a few seconds and at most five minutes for hopefully equivalent validation checks. This approach also uses less disk space (the Parquet file is around 3.7 GB, smaller than the dwca.zip), and it should scale well to any size of dataset, which the faster in-memory sort maybe wouldn’t.
Here is an attempt to provide what is needed to replicate this test using Apache Spark through the sparklyr library:
library(sparklyr)
library(dplyr)
# first install Spark 2.4.0 with Hadoop 2.7 using sparklyr::spark_install()
Sys.setenv("SPARK_MEM" = "12g")
config <- spark_config()
config$`sparklyr.shell.driver-memory` <- '12G'
config$`sparklyr.shell.executor-memory` <- '4G'
config$sparklyr.defaultPackages <- "com.datastax.spark:spark-cassandra-connector_2.11:2.0.0-M3"
config$spark.cassandra.cassandra.host <- "localhost"
config$spark.driver.maxResultSize <- "4G"
config$spark.executor.cores <- 3
# is pushdown option TRUE?
sc <- spark_connect(master = "local", config = config)
# for this connection, load all records
system.time(
  spark_read_csv(sc, memory = FALSE,
    name = "artdata", path = "file:///home/roger/artdata/artdata.tsv",
    delimiter = "\t")
)
#user system elapsed
#6.154 7.060 1559.874
# generate a parquet file based on the dataframe above
system.time(
  spark_write_parquet(
    tbl(sc, "artdata"),
    "file:///home/roger/artdata/artdata.parquet")
)
#user system elapsed
#14.634 16.586 3816.375
# the parquet-file is 3.8 GB on disk, smaller than the zip
spark_tbl_handle <- spark_read_parquet(sc, memory = FALSE,
  "artdata", "file:///home/roger/artdata/artdata.parquet")

has_valid_bor <- function() {
  bor <-
    spark_tbl_handle %>%
    count(basisOfRecord) %>%
    collect() %>%
    mutate(is_ok = basisOfRecord %in% c(
      "humanobservation",
      "machineobservation"
    ))
  bor %>% pull(is_ok) %>% all()
}
n_rowcount <- function() {
  spark_tbl_handle %>%
    summarise(n = n()) %>%
    pull(n)
}
has_valid_id <- function() {
  # an occurrenceID is invalid if it is duplicated or missing
  ids <-
    spark_tbl_handle %>%
    count(occurrenceID) %>%
    filter(n > 1 | is.na(occurrenceID)) %>%
    collect()
  nrow(ids) == 0
}
system.time(has_valid_bor())
system.time(has_valid_id())
system.time(n_rowcount())
sort_artdata <- function() {
  spark_tbl_handle %>%
    arrange(occurrenceID) %>%
    head(10) %>%
    collect()
}
system.time(sort_artdata())
# sorting in spark takes about 5 minutes...
#user system elapsed
#3.182 1.370 282.698
src_tbls(sc)
#tbl_cache(sc, "artdata")
spark_disconnect(sc)
Commits related to this issue
- Improve reporting/logging while generating DWCA. Useful for #1450. — committed to gbif/ipt by MattBlissett 5 years ago
Slow validation of occurrenceIDs on the Docker image may be related to this? https://github.com/gbif/docker-ipt/pull/8