opensearch-build: [Bug]: gradle check failing with java heap OutOfMemoryError

Describe the bug

The public Jenkins gradle check job is failing with a Java heap OutOfMemoryError. Raising this bug to better understand the existing gradle check setup and to prevent the failure on other machines. The EC2 hosts were previously upgraded to c5.24xlarge instances. Not sure if the instances need some cleanup.

To reproduce

https://build.ci.opensearch.org/job/gradle-check/381

Expected behavior

Job should not fail with a Java heap OutOfMemoryError.

Host / Environment

Running on EC2 (Amazon_ec2_cloud) - jenkinsAgentNode-Jenkins-Agent-Ubuntu2004-X64-c524xlarge-Single-Host (i-093f212ad4f5e9583) in /var/jenkins/workspace/gradle-check

Additional context

No response

Relevant log output

1: Task failed with an exception.
-----------
* What went wrong:
Execution failed for task ':example-plugins:custom-settings:compileTestJava'.
> java.lang.OutOfMemoryError: Java heap space

* Try:
> Run with --stacktrace option to get the stack trace.
> Run with --info or --debug option to get more log output.
> Run with --scan to get full insights.
==============================================================================

2: Task failed with an exception.
-----------
* What went wrong:
Execution failed for task ':libs:opensearch-x-content:compileTestJava'.
> java.lang.OutOfMemoryError: Java heap space

* Try:
> Run with --stacktrace option to get the stack trace.
> Run with --info or --debug option to get more log output.
> Run with --scan to get full insights.
==============================================================================

3: Task failed with an exception.
-----------
* What went wrong:
Execution failed for task ':qa:repository-multi-version:compileTestJava'.
> java.lang.OutOfMemoryError: Java heap space

* Try:
> Run with --stacktrace option to get the stack trace.
> Run with --info or --debug option to get more log output.
> Run with --scan to get full insights.
==============================================================================
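
For context on where these limits come from: compile tasks like compileTestJava normally run inside the Gradle build JVM, whose heap is set by org.gradle.jvmargs. Below is a minimal sketch of the relevant knobs; the 4g value and flags are illustrative assumptions, not the actual OpenSearch build settings.

  # Illustrative only: the build JVM heap comes from org.gradle.jvmargs in
  # gradle.properties; the 4g value here is an assumption, not the real setting.
  echo 'org.gradle.jvmargs=-Xmx4g -XX:+HeapDumpOnOutOfMemoryError' >> gradle.properties

  # The failing job is roughly equivalent to running the aggregate check task:
  ./gradlew check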

About this issue

  • Original URL
  • State: closed
  • Created 2 years ago
  • Comments: 21 (21 by maintainers)

Most upvoted comments

@peterzhuamazon I think we need to increasingly switch gradle jobs to run with --no-daemon. If you want to hunt that down in the various build.sh scripts, the projects will gladly merge that change, I imagine. But also I don’t see a problem with terminating agents after every large job. It gives a completely clean machine to every job. What’s the cost of recycling an agent?
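
For illustration, a hedged sketch of what such a change might look like in a project's build.sh (hypothetical script contents, not an actual project change):

  #!/bin/bash
  # Hypothetical build.sh fragment: run the check without a long-lived daemon
  # so no Gradle process (and its heap) outlives the job on a shared agent.
  set -e
  ./gradlew check --no-daemon --stacktrace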

Hi @dblock, that is already done and it is not related here. We have been using this method since forking Jenkins.

Hi @dblock @dreamer-89 @bbarani

After this change, gradle check runs generally complete in 27-36 min, quicker than the original 45-60 min. They also have fewer flaky runs in general.
https://build.ci.opensearch.org/job/gradle-check/853/console
https://build.ci.opensearch.org/job/gradle-check/854/console
https://build.ci.opensearch.org/job/gradle-check/855/console

Even the failures are legitimate failures most of the time: https://build.ci.opensearch.org/job/gradle-check/850/console

Though a flaky test will occasionally show up: https://build.ci.opensearch.org/job/gradle-check/856/console

It seems to me that gradle check has some zombie process / memory leak that causes the continuous flaky runs on the same runner. By restricting the runs to 1 per brand new runner, this temporarily resolves the issue and increases the success rate.
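
As an illustration of the zombie-process theory, the kind of per-run cleanup that could be tried on a reused agent might look like the commands below; they are assumptions, not the actual agent configuration.

  # Illustrative cleanup between runs on a reused agent (assumption only):
  ./gradlew --stop                      # ask running Gradle daemons to exit
  pkill -f GradleDaemon || true         # kill daemons that did not stop cleanly
  pkill -f GradleWorkerMain || true     # kill orphaned test/compile workers
  rm -rf "$HOME/.gradle/daemon"         # drop stale daemon registry and logs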

Still a small sample size, but it already shows a different trend in success rate (see attached image).

I hear from @peterzhuamazon that the JDK may have been changed? I would double check that G1GC is enabled.
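
One way to verify the collector, assuming shell access to the agent (standard JDK commands, not taken from the job itself):

  # Show which GC the JDK on the agent's PATH enables by default:
  java -XX:+PrintFlagsFinal -version | grep -E 'Use(G1|Parallel|Serial|Z)GC'
  # Inspect the flags of an already-running Gradle daemon by PID:
  jcmd <pid> VM.flags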

Are these instances recycled every build?

I am currently writing a setup to permanently recycle all the instances: run 1 build, delete the agent, provision a new one, and run again, while applying all the cleanups for now. If this runs well with a higher success rate, then the issue was probably caused by some zombie process in the middle.