opensearch-build: [Bug]: gradle check failing with java heap OutOfMemoryError
Describe the bug
The public Jenkins gradle check job fails with a Java heap OutOfMemoryError. I raised this bug to better understand how the existing gradle check works and to prevent the failure on different machines. The EC2 hosts were previously upgraded to c524xlarge instances. I am not sure whether the instances need some cleanup.
To reproduce
https://build.ci.opensearch.org/job/gradle-check/381
Expected behavior
Job should not fail with java.lang.OutOfMemoryError: Java heap space.
Host / Environment
Running on EC2 (Amazon_ec2_cloud) - jenkinsAgentNode-Jenkins-Agent-Ubuntu2004-X64-c524xlarge-Single-Host (i-093f212ad4f5e9583) in /var/jenkins/workspace/gradle-check
Additional context
No response
Relevant log output
1: Task failed with an exception.
-----------
* What went wrong:
Execution failed for task ':example-plugins:custom-settings:compileTestJava'.
> java.lang.OutOfMemoryError: Java heap space
* Try:
> Run with --stacktrace option to get the stack trace.
> Run with --info or --debug option to get more log output.
> Run with --scan to get full insights.
==============================================================================
2: Task failed with an exception.
-----------
* What went wrong:
Execution failed for task ':libs:opensearch-x-content:compileTestJava'.
> java.lang.OutOfMemoryError: Java heap space
* Try:
> Run with --stacktrace option to get the stack trace.
> Run with --info or --debug option to get more log output.
> Run with --scan to get full insights.
==============================================================================
3: Task failed with an exception.
-----------
* What went wrong:
Execution failed for task ':qa:repository-multi-version:compileTestJava'.
> java.lang.OutOfMemoryError: Java heap space
* Try:
> Run with --stacktrace option to get the stack trace.
> Run with --info or --debug option to get more log output.
> Run with --scan to get full insights.
==============================================================================
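All three failures in the log above are `compileTestJava` tasks with the same cause. For longer console output, a small script can pull out the failed task paths and their causes. The following is a minimal sketch, assuming the log follows the Gradle failure report layout shown above; the function name `summarize_failures` and the `SAMPLE_LOG` constant are illustrative and not part of any existing tooling.

```python
import re
from collections import Counter

# Patterns follow the Gradle failure report layout shown in the log above.
TASK_RE = re.compile(r"Execution failed for task '([^']+)'")
CAUSE_RE = re.compile(r"^>\s*([\w.]+(?:Error|Exception))\b")

# Excerpt copied from the report above (two of the three failures).
SAMPLE_LOG = """\
* What went wrong:
Execution failed for task ':example-plugins:custom-settings:compileTestJava'.
> java.lang.OutOfMemoryError: Java heap space
* Try:
> Run with --stacktrace option to get the stack trace.
==============================================================================
* What went wrong:
Execution failed for task ':libs:opensearch-x-content:compileTestJava'.
> java.lang.OutOfMemoryError: Java heap space
"""


def summarize_failures(log_text):
    """Return (task_path, cause_class) pairs found in a Gradle failure report."""
    failures = []
    current_task = None
    for raw_line in log_text.splitlines():
        line = raw_line.strip()
        task_match = TASK_RE.search(line)
        if task_match:
            # Remember the task until its cause line ("> ...Error") appears.
            current_task = task_match.group(1)
            continue
        cause_match = CAUSE_RE.match(line)
        if cause_match and current_task is not None:
            failures.append((current_task, cause_match.group(1)))
            current_task = None
    return failures


if __name__ == "__main__":
    pairs = summarize_failures(SAMPLE_LOG)
    for task, cause in pairs:
        print(f"{task}: {cause}")
    # Count how often each cause occurs across failed tasks.
    print(Counter(cause for _, cause in pairs))
```

Grouping by cause makes it easier to tell a single resource problem on the host apart from several unrelated task failures.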
About this issue
- State: closed
- Created 2 years ago
- Comments: 21 (21 by maintainers)
Hi @dblock, that is already done and is not related to this issue. We have been using this method since fork Jenkins.
Hi @dblock @dreamer-89 @bbarani
After this change, gradle check runs generally complete in 27-36 min, quicker than the original 45-60 min. There are also fewer flaky runs in general. https://build.ci.opensearch.org/job/gradle-check/853/console https://build.ci.opensearch.org/job/gradle-check/854/console https://build.ci.opensearch.org/job/gradle-check/855/console
Even the failures are legitimate most of the time: https://build.ci.opensearch.org/job/gradle-check/850/console
Though flaky tests still show up occasionally: https://build.ci.opensearch.org/job/gradle-check/856/console
It seems to me that gradle check leaves behind a zombie process or has a memory leak, which causes the continuous flaky runs on the same runner. Restricting each brand new runner to a single run temporarily resolves the issue and increases the success rate.
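One way to test the leftover-process theory is to snapshot the process table on a runner after a build finishes. The sketch below is only an illustration: it assumes a Linux agent where `ps -eo pid,ppid,stat,args` is available, it treats "zombie" both literally (a `STAT` value starting with `Z`) and loosely (a Gradle daemon that is still alive), and the marker string `GradleDaemon` is an assumption based on the daemon's main class name. The `SAMPLE_PS` text is made up for demonstration and is not taken from the failing host.

```python
import subprocess

# Made-up sample in the layout printed by `ps -eo pid,ppid,stat,args`.
SAMPLE_PS = """\
  PID  PPID STAT COMMAND
    1     0 Ss   /sbin/init
 4242     1 Sl   java org.gradle.launcher.daemon.bootstrap.GradleDaemon
 4310  4242 Z    [java] <defunct>
"""


def parse_ps(ps_output):
    """Parse `ps -eo pid,ppid,stat,args` output into a list of dicts."""
    rows = []
    for line in ps_output.strip().splitlines()[1:]:  # skip the header row
        parts = line.split(None, 3)
        if len(parts) < 4:
            continue
        pid, ppid, stat, args = parts
        rows.append({"pid": int(pid), "ppid": int(ppid), "stat": stat, "args": args})
    return rows


def find_leftovers(rows, marker="GradleDaemon"):
    """Split rows into defunct processes and still-running Gradle daemons."""
    zombies = [r for r in rows if r["stat"].startswith("Z")]
    daemons = [
        r for r in rows
        if marker in r["args"] and not r["stat"].startswith("Z")
    ]
    return zombies, daemons


def snapshot():
    """Take a live snapshot of the process table (Linux `ps` assumed)."""
    result = subprocess.run(
        ["ps", "-eo", "pid,ppid,stat,args"],
        capture_output=True, text=True, check=True,
    )
    return parse_ps(result.stdout)


if __name__ == "__main__":
    zombies, daemons = find_leftovers(parse_ps(SAMPLE_PS))
    print("defunct:", [r["pid"] for r in zombies])
    print("gradle daemons:", [r["pid"] for r in daemons])
```

Comparing a snapshot from a fresh runner with one taken after several builds would show whether processes accumulate between runs.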
The sample size is still small, but it already shows a different trend in success rate.
I am currently writing a setup to permanently recycle all the instances: run one build, delete the agent, provision a new one, and run again, while still applying all cleanups for now. If this runs well with a higher success rate, then the failures are probably caused by some zombie process left over between runs.
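The recycle policy described above (one build per agent, then delete and provision a new one) can be modelled as follows. This is a hedged sketch, not the actual Jenkins setup: `SingleUseAgentPool`, `provision`, and `delete` are hypothetical names, and the in-memory stand-ins below replace real calls to the cloud provider.

```python
import itertools


class SingleUseAgentPool:
    """Hypothetical model: every build gets a fresh agent that is deleted afterwards."""

    def __init__(self, provision, delete):
        self._provision = provision  # callable returning a new agent
        self._delete = delete        # callable that destroys an agent

    def run_build(self, build):
        agent = self._provision()
        try:
            return build(agent)
        finally:
            # Delete the agent even when the build fails, so no state
            # (processes, caches, leaked memory) carries over to the next run.
            self._delete(agent)


# In-memory stand-ins for provisioning and deleting agents (illustrative only).
_ids = itertools.count(1)
deleted = []


def provision():
    return f"agent-{next(_ids)}"


def delete(agent):
    deleted.append(agent)


pool = SingleUseAgentPool(provision, delete)
used = [pool.run_build(lambda agent: agent) for _ in range(2)]
print(used)     # each build ran on a different agent
print(deleted)  # every agent was deleted after its single build
```

The `try`/`finally` is the key design choice: a failed build must not leave its agent in the pool, otherwise the next run could inherit whatever state caused the failure.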