fhir-data-pipes: OutOfMemoryError when running batch pipeline for a large dataset
@kimaina has reported that running the JDBC batch pipeline to extract all Encounters from the AMPATH DB caused:
ERROR - DirectTransformExecutor.run(134) |2021-03-07T12:35:58,306| Error occurred within [!!!org.apache.beam.runners.direct.DirectTransformExecutor@1802183f=>java.lang.OutOfMemoryError:Java heap space!!!]
java.lang.OutOfMemoryError: Java heap space
This is not expected because fetching a larger number of resources should not significantly increase the memory requirements of the pipeline. However, it is possible that Parquet file generation needs more memory because of the Parquet file structure. We may need to tweak some Parquet file parameters, like the page size or compression method.
This is hard to reproduce (without access to the data), but some sort of OOM happens if we set the Java heap size too low with -Xmx; e.g., -Xmx2G fails when fetching all Observations of the big test DB. This seems to happen at the very end, after all Observations are fetched from the server. I have tried the same thing with JSON output and it succeeds, although it is significantly slower (>4x) than running without a memory limit.
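When experimenting with different -Xmx values, it can help to confirm which heap limit the JVM actually applied. The following is a minimal sketch using only the standard `Runtime` API; the class name `HeapReport` and its helper methods are made up for illustration and are not part of fhir-data-pipes. Note that the reported maximum can be somewhat lower than the -Xmx value, depending on the garbage collector.

```java
// Minimal sketch: report the heap limit the running JVM actually applied.
// HeapReport is a hypothetical helper, not part of fhir-data-pipes.
final class HeapReport {
    private static final long MIB = 1024L * 1024L;

    /** Maximum heap the JVM will attempt to use, in bytes (reflects -Xmx or the default). */
    static long maxHeapBytes() {
        return Runtime.getRuntime().maxMemory();
    }

    /** Converts a byte count to whole mebibytes, rounding down. */
    static long toMiB(long bytes) {
        return bytes / MIB;
    }

    public static void main(String[] args) {
        Runtime rt = Runtime.getRuntime();
        // Heap currently in use = memory reserved so far minus the free part of it.
        long used = rt.totalMemory() - rt.freeMemory();
        System.out.println("Max heap (MiB):  " + toMiB(maxHeapBytes()));
        System.out.println("Used heap (MiB): " + toMiB(used));
    }
}
```

With Java 11 or later this can be run directly from source, e.g. `java -Xmx2G HeapReport.java`, to compare the reported limit against the flag that was passed.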
One temporary fix is to prioritize working on #128 and fetch resources for AMPATH in date chunks (we should probably do this anyway for a DB of AMPATH’s size if everything should happen on a single machine).
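To illustrate the date-chunk idea, here is a rough sketch of splitting a date range into fixed-size, non-overlapping intervals, each of which could then be fetched in a separate run. This is only an assumption about how chunking might look; the actual design in #128 is not described here, and `DateChunks`, `Range`, and `split` are hypothetical names. How each interval would be passed to the pipeline is deliberately left out.

```java
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of fhir-data-pipes: splits [start, end)
// into consecutive chunks of at most chunkDays days each.
final class DateChunks {

    /** Half-open date interval [start, end). */
    static final class Range {
        final LocalDate start;
        final LocalDate end;

        Range(LocalDate start, LocalDate end) {
            this.start = start;
            this.end = end;
        }

        @Override
        public String toString() {
            return "[" + start + ", " + end + ")";
        }
    }

    static List<Range> split(LocalDate start, LocalDate end, int chunkDays) {
        if (chunkDays <= 0) {
            throw new IllegalArgumentException("chunkDays must be positive");
        }
        List<Range> chunks = new ArrayList<>();
        LocalDate cursor = start;
        while (cursor.isBefore(end)) {
            LocalDate next = cursor.plusDays(chunkDays);
            // The last chunk is truncated so it never runs past the overall end date.
            if (next.isAfter(end)) {
                next = end;
            }
            chunks.add(new Range(cursor, next));
            cursor = next;
        }
        return chunks;
    }

    public static void main(String[] args) {
        // Example dates are arbitrary; each printed range would be one fetch.
        for (Range r : split(LocalDate.of(2020, 1, 1), LocalDate.of(2021, 1, 1), 90)) {
            System.out.println(r);
        }
    }
}
```

Half-open intervals are used so that a record dated exactly on a chunk boundary falls into one chunk only.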
About this issue
- State: closed
- Created 3 years ago
- Comments: 21
Commits related to this issue
- Added JSON output option with some refactoring. (#137) This is related to issue #136 — committed to google/fhir-data-pipes by bashir2 3 years ago
- Added an option to flush output files in chunks (#149) Improves #136 — committed to google/fhir-data-pipes by bashir2 3 years ago
- Fused Parquet file generation of resource fetching steps (#163) This should fix #160 and #136. — committed to google/fhir-data-pipes by bashir2 3 years ago
Thank you @bashir2, using java -Xmx5024M resolved this issue. I will also provide the result of java -XX:+PrintFlagsFinal -version | grep HeapSize once I get access to the server. For now, closing this issue!