OpenSearch: Break down gradle check task into smaller, maintainable verification tasks to reduce flaky test failures

Today, every pull request triggers a gradle check verification task, which runs unit tests and integration tests along with other verification tasks. A gradle check run consistently takes a long time (~30 min on c5.18xlarge instances). The runtime increases sharply on smaller machines, which is the reason for moving away from GHA for checks. The check also fails quite often, for a variety of reasons. Keeping the existing gradle check as a single verification task is not sustainable, so it needs to be divided into smaller logical task groups.

Specifically, the following work is needed:

  • Break down gradle check into logical task groups. This is needed because:
    • Individual runs take a lot of time.
    • Failures are difficult for new developers to debug.
    • Runs cannot be reproduced on developer machines.
  • Move individual task groups into GitHub Actions, targeting <1hr runtime for each task. There is ongoing work in this direction, but because of the complexity and runtime hardware needs of a gradle check run, non-GHA options are being considered.
  • Evaluate macOS runners, which provide better compute (2-core 7GB Ubuntu vs. 3-core 14GB macOS virtual machines), for resource-intensive tests.
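As a rough illustration of the first work item, the sketch below groups verification tasks into logical groups and builds one `./gradlew` invocation per group. The task names and the group labels are assumptions chosen for illustration only; they are not taken from the OpenSearch build, and the real split may look quite different.

```python
from collections import defaultdict

# Hypothetical mapping of Gradle task name -> logical group.
# Both the task names and the group labels are illustrative assumptions.
TASK_GROUPS = {
    "precommit": "static-checks",
    "test": "unit-tests",
    "internalClusterTest": "integration-tests",
    "javaRestTest": "integration-tests",
    "yamlRestTest": "integration-tests",
}


def group_tasks(task_to_group):
    """Return {group: sorted list of task names} for the given mapping."""
    groups = defaultdict(list)
    for task, group in task_to_group.items():
        groups[group].append(task)
    return {group: sorted(tasks) for group, tasks in sorted(groups.items())}


def gradle_command(tasks):
    """Build a single ./gradlew invocation that runs only the given tasks."""
    return ["./gradlew", *tasks]


if __name__ == "__main__":
    for group, tasks in group_tasks(TASK_GROUPS).items():
        print(group, "->", " ".join(gradle_command(tasks)))
```

Each group could then become its own CI job, and a developer could reproduce a failing group locally by running the same command.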

Exit Criteria:

  1. The final set of jobs should take less time than the existing gradle check.
  2. The final set of jobs should provide the same or better test coverage.
  3. It should still be possible to run ./gradlew check with the same test coverage as today.
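The first two criteria could be checked mechanically along the following lines. This is a minimal sketch under stated assumptions: the jobs run in parallel (so wall-clock time is the slowest job), coverage is approximated by the set of task names each job runs, and the function name `check_exit_criteria` and all inputs are hypothetical. The third criterion, keeping `./gradlew check` itself working, is not covered here.

```python
def check_exit_criteria(baseline_tasks, baseline_minutes, jobs):
    """Compare a set of split jobs against the monolithic check.

    baseline_tasks:   task names run by the existing gradle check
    baseline_minutes: duration of the existing gradle check
    jobs:             {job_name: (set_of_task_names, minutes)}

    Assumes the jobs run in parallel, so the wall-clock time is the
    duration of the slowest job.
    """
    covered = set().union(*(tasks for tasks, _ in jobs.values()))
    missing = set(baseline_tasks) - covered
    wall_clock = max((minutes for _, minutes in jobs.values()), default=0)
    return {
        "faster": wall_clock < baseline_minutes,   # criterion 1
        "coverage_kept": not missing,              # criterion 2
        "missing_tasks": sorted(missing),
    }
```

If the jobs were run sequentially instead, the sum of the durations would be the relevant number rather than the maximum.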

Alternatives considered: https://github.com/opensearch-project/OpenSearch/issues/2496

References:

  1. https://docs.github.com/en/actions/using-github-hosted-runners/about-github-hosted-runners#supported-runners-and-hardware-resources
  2. https://github.com/opensearch-project/opensearch-infra/issues/136

About this issue

  • State: closed
  • Created 2 years ago
  • Comments: 21 (21 by maintainers)

Most upvoted comments

@dreamer-89 can we close this in favor of meta issue #4053?

Thanks for the additional information. @dreamer-89 Maybe let's organize these (if we haven't already) and link them back to one meta issue:

  1. Take care of flaky tests
  2. Proposal to split gradle check into multiple smaller tasks.
  3. Add support for new tests in CI (starting with Packaging tests)

@dreamer-89 Correction: it does not need to be multiple workflows, but multiple containers within each run.
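One way to picture "multiple containers within each run" is to spread the task groups across a fixed number of containers so that each finishes at roughly the same time. The sketch below uses a generic greedy longest-first heuristic; it is not something proposed in this issue, and the group names and durations passed to it are hypothetical estimates.

```python
import heapq


def assign_to_containers(durations, containers):
    """Greedily assign task groups to containers, longest group first.

    durations:  {group_name: estimated minutes} (hypothetical estimates)
    containers: number of containers available within one run
    Returns a list of (total_minutes, [group_names]), one entry per container.
    """
    if containers < 1:
        raise ValueError("need at least one container")
    # Heap entries are (current load, container index, assigned groups);
    # the unique index breaks ties so the lists are never compared.
    heap = [(0, index, []) for index in range(containers)]
    heapq.heapify(heap)
    ordered = sorted(durations.items(), key=lambda item: (-item[1], item[0]))
    for group, minutes in ordered:
        load, index, groups = heapq.heappop(heap)  # least-loaded container
        groups.append(group)
        heapq.heappush(heap, (load + minutes, index, groups))
    return [(load, groups) for load, _, groups in sorted(heap, key=lambda e: e[1])]
```

The heuristic does not guarantee an optimal split, but it is simple and usually keeps the slowest container close to the average load.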