kubernetes: gce-master-scale-correctness - test cases not showing up on testgrid tab

Which jobs are failing?

Since July 8th, the test cases don’t show up on testgrid: https://testgrid.k8s.io/sig-release-master-informing#gce-master-scale-correctness&show-stale-tests=

Which tests are failing?

ci-kubernetes-e2e-gce-scale-correctness

It’s not clear from testgrid which runs pass and which fail at the moment.

Since when has it been failing?

2022-07-08

Testgrid link

https://testgrid.k8s.io/sig-release-master-informing#gce-master-scale-correctness&show-stale-tests=

Reason for failure (if possible)

Since July 8th, the build-log and junit artifacts look different.

Previously, the junit output was split into multiple files (junit_{}.xml), each containing only the test case data (name, time, skipped, …). E.g.: https://gcsweb.k8s.io/gcs/kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-scale-correctness/1545015201953222656/artifacts/

Since 07-08, there is a single junit_01.xml file, which is 650MB+ and contains all the stdout and stderr information, duplicating the build logs. E.g.: https://gcsweb.k8s.io/gcs/kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-scale-correctness/1550813416132710400/artifacts/

Looking at the diff in the kubernetes/kubernetes source code, there were a number of changes to ginkgo, klog, and traces, all of which could be related: https://github.com/kubernetes/kubernetes/compare/2a017f94b...4569e646e

cc @chendave - I see you’ve made a number of the commits; would you be able to help?

Anything else we need to know?

No response

Relevant SIG(s)

No response

About this issue

  • State: closed
  • Created 2 years ago
  • Comments: 50 (50 by maintainers)

Most upvoted comments

My guess is that the 600MiB junit file exceeds some size limit used in testgrid, and for that reason it isn’t interpreted by testgrid.

@helayoty I don’t think this is a release blocker: the change needed in Kubernetes has been merged. Meanwhile, I pushed a PR in testgrid to close this: https://github.com/GoogleCloudPlatform/testgrid/pull/1055.

Checked https://testgrid.k8s.io/sig-release-master-informing#gce-master-scale-correctness&show-stale-tests=; all test case statuses should be back.

Per https://gcsweb.k8s.io/gcs/kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-scale-correctness/1555161967272923136/artifacts/, the size is now about 2.5M.

One thing to note, though: the test case names have changed slightly, which is caused by the new junit XML format.

Here is one example. In v1, the test case was named “Kubernetes e2e suite.[sig-storage] Volumes NFSv4 should be mountable for NFSv4”; it has now been updated to “Kubernetes e2e suite.[It] [sig-storage] Volumes NFSv4 should be mountable for NFSv4”.

Please note the string “[It]” that has been added.

So, you can only check the status of “Kubernetes e2e suite.[It] [sig-storage] Volumes NFSv4 should be mountable for NFSv4” now.

We can modify the report again to trim the string “[It]”; this should be easy, but I am not quite sure whether we need to make everything look the same as before.
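For illustration, the trim could be as small as this (a rough sketch, not actual k/k code; trimItMarker is a hypothetical helper):

```go
package main

import (
	"fmt"
	"strings"
)

// trimItMarker drops ginkgo v2's "[It]" node-type marker from a junit test
// case name, restoring the v1-style naming. Hypothetical helper, not k/k code.
func trimItMarker(name string) string {
	return strings.Replace(name, "[It] ", "", 1)
}

func main() {
	fmt.Println(trimItMarker("Kubernetes e2e suite.[It] [sig-storage] Volumes NFSv4 should be mountable for NFSv4"))
	// Prints: Kubernetes e2e suite.[sig-storage] Volumes NFSv4 should be mountable for NFSv4
}
```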

If you are okay with the new test name, I think we can close this issue, otherwise we can update the name as well.

@azylinski @pohly @aojea thoughts?

FYI, this is the CI job, run every day: https://prow.k8s.io/?job=ci-kubernetes-e2e-gce-scale-correctness It requires a 5k-node cluster, and we have one GCP project (shared with other scalability tests) that can handle it, so there is no easy way to trigger it manually.

hey all,

when running in parallel, ginkgo v2 now merges the individual reports generated by each parallel process into one composite report. there is no way to turn this off (v1’s behavior was a shortcut at the time and somewhat ugly and confusing).

v2 also merges junit reports from multiple suites into a single file. this can be turned off with --keep-separate-reports

in v2 i also took a closer look at what few official-seeming junit specs exist and updated the reporter to match. currently there is no mechanism to control that default behavior.

you can, however, build a custom reporter to do some filtering/processing in-suite and for a project of k8s scope that might make the most sense.

for example - a reporter that modifies/filters the report object before manually calling GenerateJUnitReport or that builds the JUnitTestCases manually. I’d be happy to share more detail/help if that approach is preferred.
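To make that suggestion concrete, here is a minimal sketch of what such a reporter could look like, assuming ginkgo v2’s ReportAfterSuite node and reporters.GenerateJUnitReport; the output file name is illustrative, and the real e2e suite may need more careful filtering:

```go
package e2e

import (
	"github.com/onsi/ginkgo/v2"
	"github.com/onsi/ginkgo/v2/reporters"
)

// ReportAfterSuite runs once, on process 1, with the aggregated report from
// all parallel processes, which makes it a natural place to filter before
// writing the junit file.
var _ = ginkgo.ReportAfterSuite("trimmed junit report", func(report ginkgo.Report) {
	for i := range report.SpecReports {
		// Drop captured output; it duplicates the build log and is what
		// ballooned junit_01.xml to 650MB+.
		report.SpecReports[i].CapturedGinkgoWriterOutput = ""
		report.SpecReports[i].CapturedStdOutErr = ""
	}
	// "junit_trimmed.xml" is a hypothetical destination path.
	if err := reporters.GenerateJUnitReport(report, "junit_trimmed.xml"); err != nil {
		ginkgo.GinkgoWriter.Printf("failed to write junit report: %v\n", err)
	}
})
```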

the solution seems to be to adapt kubetest

it seems kubetest dumps it entirely: https://github.com/kubernetes/test-infra/blob/444b10105ab0b6356e3327a2100342bb782f2a57/kubetest/process/process.go#L66-L125 Is someone up for adapting kubetest to either split the generated XML or prune the large messages in k/k?
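As a rough sketch of the pruning option, assuming a simplified single-testsuite schema (the real merged file nests suites under a testsuites root, carries more attributes, and at 650MB would deserve a streaming decoder rather than a full read):

```go
package main

import (
	"encoding/xml"
	"os"
)

// maxFieldLen is a hypothetical cap on captured output per test case.
const maxFieldLen = 4 * 1024

type testCase struct {
	Name      string `xml:"name,attr"`
	ClassName string `xml:"classname,attr"`
	Time      string `xml:"time,attr"`
	SystemOut string `xml:"system-out,omitempty"`
	SystemErr string `xml:"system-err,omitempty"`
}

type testSuite struct {
	XMLName xml.Name   `xml:"testsuite"`
	Name    string     `xml:"name,attr"`
	Tests   int        `xml:"tests,attr"`
	Cases   []testCase `xml:"testcase"`
}

// truncate caps a captured-output field so the junit file stays small.
func truncate(s string) string {
	if len(s) > maxFieldLen {
		return s[:maxFieldLen] + "\n[... truncated ...]"
	}
	return s
}

func main() {
	raw, err := os.ReadFile("junit_01.xml")
	if err != nil {
		panic(err)
	}
	var suite testSuite
	if err := xml.Unmarshal(raw, &suite); err != nil {
		panic(err)
	}
	for i := range suite.Cases {
		// Prune the bulky captured output that duplicates the build log.
		suite.Cases[i].SystemOut = truncate(suite.Cases[i].SystemOut)
		suite.Cases[i].SystemErr = truncate(suite.Cases[i].SystemErr)
	}
	out, err := xml.MarshalIndent(suite, "", "  ")
	if err != nil {
		panic(err)
	}
	if err := os.WriteFile("junit_01.xml", append([]byte(xml.Header), out...), 0o644); err != nil {
		panic(err)
	}
}
```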

I can try to take a shot after code freeze

In deck’s logs I see:

{"artifact":"https://storage.googleapis.com/kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-scale-correctness/1552625181665529856/artifacts/junit_01.xml", "component":"deck", "error":"file size over specified limit", "file":"k8s.io/test-infra/prow/spyglass/lenses/junit/lens.go:165", "func":"k8s.io/test-infra/prow/spyglass/lenses/junit.Lens.getJvd.func1", "level":"warning", "msg":"Error reading artifact"}

So, at least for deck, the reason for not showing the junit results is that the file is too large.

Testgrid probably implements similar logic.
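Purely as illustration, the guard that log line implies could look roughly like this (hypothetical names and limit throughout; the real check lives in prow/spyglass/lenses/junit/lens.go):

```go
package junitlens

import (
	"errors"
	"os"
)

// maxArtifactSize is a hypothetical limit; the real value comes from prow's
// spyglass configuration.
const maxArtifactSize = 100 * 1024 * 1024 // 100MiB

// readJunitArtifact sketches a size-guarded read like the one deck's log
// message suggests: oversized junit files are rejected before parsing, so a
// 650MB junit_01.xml would never be rendered.
func readJunitArtifact(path string) ([]byte, error) {
	info, err := os.Stat(path)
	if err != nil {
		return nil, err
	}
	if info.Size() > maxArtifactSize {
		return nil, errors.New("file size over specified limit")
	}
	return os.ReadFile(path)
}
```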

The test is defined here: https://github.com/kubernetes/test-infra/blob/afcd7bf377861006c31a98cffe673939243741fe/config/jobs/kubernetes/sig-scalability/sig-scalability-release-blocking-jobs.yaml#L4

We use kubekins.

My guess is that the old “junit shards” we saw were generated by different ginkgo processes/threads (the number of junit files matches the --ginkgo-parallel=40 set in our test), and https://github.com/kubernetes/kubernetes/pull/109111 mentions that the old “--parallel” parameter no longer works.

Probably we need to migrate kubekins to use ginkgo --procs=N instead of “--parallel”.
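For illustration, a wrapper like kubekins would then shell out along these lines (a sketch; the path and N are placeholders):

```go
package main

import (
	"os"
	"os/exec"
)

func main() {
	// ginkgo v2 renamed the parallelism flag: v1's --nodes (surfaced to us
	// as --ginkgo-parallel=40) becomes --procs=N in v2.
	cmd := exec.Command("ginkgo", "--procs=40", "./test/e2e")
	cmd.Stdout, cmd.Stderr = os.Stdout, os.Stderr
	if err := cmd.Run(); err != nil {
		os.Exit(1)
	}
}
```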