fhir-data-pipes: OutOfMemoryError when running batch pipeline for a large dataset

@kimaina has reported that running the JDBC batch pipeline to extract all Encounters from the AMPATH DB caused:

```
ERROR - DirectTransformExecutor.run(134) |2021-03-07T12:35:58,306| Error occurred within [!!!org.apache.beam.runners.direct.DirectTransformExecutor@1802183f=>java.lang.OutOfMemoryError:Java heap space!!!]
java.lang.OutOfMemoryError: Java heap space
```

This is not expected, because a larger number of resources should not significantly increase the memory requirements of the pipeline. However, it is possible that Parquet file generation needs more memory because of the Parquet file structure: the writer buffers an entire row group in memory before flushing it to disk, so wide resources and large row groups add up on the heap. We may need to tweak some Parquet parameters, like the row group/page size or the compression method.
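
To make the knobs concrete, here is a minimal sketch using the standalone parquet-avro API, not the pipeline's actual writer code; the class name, output path, and Avro schema are placeholders. It shows where the row group size, page size, and compression codec are set; a smaller row group size directly bounds how much the writer buffers in memory.

```java
import java.io.IOException;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.hadoop.ParquetWriter;
import org.apache.parquet.hadoop.metadata.CompressionCodecName;

public class ParquetTuningSketch {

  // Creates a writer with smaller row groups/pages to bound the in-memory
  // buffer; Parquet holds a full row group in memory before flushing it.
  public static ParquetWriter<GenericRecord> createWriter(Path outputPath, Schema schema)
      throws IOException {
    return AvroParquetWriter.<GenericRecord>builder(outputPath)
        .withSchema(schema)
        // Smaller row groups flush more often and cap heap usage
        // (the library default is 128 MB).
        .withRowGroupSize(16 * 1024 * 1024)
        // Page size is the unit of encoding/compression within a column chunk.
        .withPageSize(1024 * 1024)
        .withCompressionCodec(CompressionCodecName.SNAPPY)
        .build();
  }
}
```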

This is hard to reproduce (without access to the data), but a similar OOM happens if we set the Java heap size too low with -Xmx; e.g., -Xmx2G fails when fetching all Observations from the large test DB. The failure seems to happen at the very end, after all Observations have been fetched from the server. I have tried the same run with JSON output and it succeeds, although it is significantly slower (>4x) than running with no memory limit.
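
As a side note, to confirm what heap limit a given -Xmx actually yields at runtime, the standard JDK management API can be queried from inside the JVM. A small self-contained sketch (the class name is hypothetical), complementary to the `-XX:+PrintFlagsFinal` check mentioned in the comments below:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;

public class HeapLimitCheck {

  // Prints the effective maximum heap so a -Xmx setting can be verified
  // from inside the same JVM that runs the pipeline.
  public static void main(String[] args) {
    MemoryMXBean memoryBean = ManagementFactory.getMemoryMXBean();
    long maxHeapBytes = memoryBean.getHeapMemoryUsage().getMax();
    System.out.printf("Max heap: %d MB%n", maxHeapBytes / (1024 * 1024));
    // Runtime.maxMemory() reports the same limit without the management imports.
    System.out.printf("Runtime.maxMemory(): %d MB%n",
        Runtime.getRuntime().maxMemory() / (1024 * 1024));
  }
}
```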

One temporary fix is to prioritize working on #128 and fetch resources for AMPATH in date chunks; we should probably do this anyway for a DB of AMPATH’s size if everything is to happen on a single machine. A sketch of such chunking follows below.
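
The exact design belongs to #128, but as an illustration, here is a hypothetical helper (the class and method names are my own, not from the repo) that splits a date range into fixed-size slices, each of which could parameterize one bounded pipeline run:

```java
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.List;

public class DateChunker {

  // Splits [start, end) into consecutive chunks of at most chunkDays days;
  // each chunk's [from, to) bounds can drive one pipeline run.
  public static List<LocalDate[]> chunk(LocalDate start, LocalDate end, int chunkDays) {
    List<LocalDate[]> chunks = new ArrayList<>();
    LocalDate from = start;
    while (from.isBefore(end)) {
      LocalDate to = from.plusDays(chunkDays);
      if (to.isAfter(end)) {
        to = end;
      }
      chunks.add(new LocalDate[] {from, to});
      from = to;
    }
    return chunks;
  }
}
```

For example, chunk(LocalDate.of(2010, 1, 1), LocalDate.of(2021, 1, 1), 30) yields roughly 134 month-long slices, each small enough to process within a fixed heap.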

About this issue

  • State: closed
  • Created 3 years ago
  • Comments: 21

Most upvoted comments

Thank you @bashir2, using `java -Xmx5024M` resolved this issue. I will also provide the output of `java -XX:+PrintFlagsFinal -version | grep HeapSize` once I get access to the server. For now, closing this issue!