karmada: Flaking test: timeout in resourceinterpreter test

Which jobs are flaking:

e2e test

Which test(s) are flaking:

resourceinterpreter_test

Reason for failure:

timeout

Anything else we need to know:

------------------------------
• [FAILED] [3.050 seconds]
Resource interpreter customization testing when Apply single ResourceInterpreterCustomization without DependencyInterpretation operation [JustBeforeEach] InterpreterOperation AggregateStatus testing AggregateStatus testing
  [JustBeforeEach] /home/runner/work/karmada/karmada/test/e2e/resourceinterpreter_test.go:588
  [It] /home/runner/work/karmada/karmada/test/e2e/resourceinterpreter_test.go:782

  Timeline >>
  STEP: Creating ResourceInterpreterCustomization(interpreter-customizationmtmwv) @ 07/21/23 04:22:46.042
  [FAILED] in [JustBeforeEach] - /home/runner/work/karmada/karmada/test/e2e/framework/resourceinterpretercustomization.go:19 @ 07/21/23 04:22:49.09
  << Timeline

  [FAILED] Unexpected error:
      <*errors.StatusError | 0xc000eb4000>: 
      Internal error occurred: failed calling webhook "resourceinterpretercustomization.karmada.io": failed to call webhook: Post "[https://karmada-webhook.karmada-system.svc:443/validate-resourceinterpretercustomization?timeout=3s](https://karmada-webhook.karmada-system.svc/validate-resourceinterpretercustomization?timeout=3s)": context deadline exceeded
      {
          ErrStatus: {
              TypeMeta: {Kind: "", APIVersion: ""},
              ListMeta: {
                  SelfLink: "",
                  ResourceVersion: "",
                  Continue: "",
                  RemainingItemCount: nil,
              },
              Status: "Failure",
              Message: "Internal error occurred: failed calling webhook \"resourceinterpretercustomization.karmada.io\": failed to call webhook: Post \"[https://karmada-webhook.karmada-system.svc:443/validate-resourceinterpretercustomization?timeout=3s\](https://karmada-webhook.karmada-system.svc/validate-resourceinterpretercustomization?timeout=3s\)": context deadline exceeded",
              Reason: "InternalError",
              Details: {
                  Name: "",
                  Group: "",
                  Kind: "",
                  UID: "",
                  Causes: [
                      {
                          Type: "",
                          Message: "failed calling webhook \"resourceinterpretercustomization.karmada.io\": failed to call webhook: Post \"[https://karmada-webhook.karmada-system.svc:443/validate-resourceinterpretercustomization?timeout=3s\](https://karmada-webhook.karmada-system.svc/validate-resourceinterpretercustomization?timeout=3s\)": context deadline exceeded",
                          Field: "",
                      },
                  ],
                  RetryAfterSeconds: 0,
              },
              Code: 500,
          },
      }
  occurred
  In [JustBeforeEach] at: /home/runner/work/karmada/karmada/test/e2e/framework/resourceinterpretercustomization.go:19 @ 07/21/23 04:22:49.09

  Full Stack Trace
    github.com/karmada-io/karmada/test/e2e/framework.CreateResourceInterpreterCustomization.func1()
    	/home/runner/work/karmada/karmada/test/e2e/framework/resourceinterpretercustomization.go:19 +0x154
    github.com/karmada-io/karmada/test/e2e/framework.CreateResourceInterpreterCustomization({0x44afb78, 0xc000411ea0}, 0xc000eac2c0)
    	/home/runner/work/karmada/karmada/test/e2e/framework/resourceinterpretercustomization.go:17 +0xfe
    github.com/karmada-io/karmada/test/e2e.glob..func34.1.2()
    	/home/runner/work/karmada/karmada/test/e2e/resourceinterpreter_test.go:589 +0x85
------------------------------
SSSSSSSSSSSSSSSSSSSSSSSSSSSS
------------------------------
[SynchronizedAfterSuite] [INTERRUPTED] [0.187 seconds]
[SynchronizedAfterSuite] 
/home/runner/work/karmada/karmada/test/e2e/suite_test.go:140

  [INTERRUPTED] Interrupted by Other Ginkgo Process
  In [SynchronizedAfterSuite] at: /home/runner/work/karmada/karmada/test/e2e/suite_test.go:140 @ 07/21/23 04:22:49.336

  Full Stack Trace
------------------------------

Summarizing 2 Failures:
  [FAIL] Resource interpreter customization testing when Apply single ResourceInterpreterCustomization without DependencyInterpretation operation [JustBeforeEach] InterpreterOperation AggregateStatus testing AggregateStatus testing
  /home/runner/work/karmada/karmada/test/e2e/framework/resourceinterpretercustomization.go:19
  [INTERRUPTED] [SynchronizedAfterSuite] 
  /home/runner/work/karmada/karmada/test/e2e/suite_test.go:140

Ran 117 of 145 Specs in 899.996 seconds
FAIL! - Interrupted by Other Ginkgo Process -- 116 Passed | 1 Failed | 0 Pending | 28 Skipped

About this issue

  • Original URL
  • State: closed
  • Created a year ago
  • Reactions: 2
  • Comments: 15 (14 by maintainers)

Most upvoted comments

image

I did two things:

  • I tried to add logs to ValidatingAdmissionHandle as above, in order to record time cost of 【all function、Decode step、List step、Validate step】.
  • I modified the timeout of this webhook call from 3s to 1s.

Then, it can achieve a 50% probability of reproducing this problem in CI, and I got effiective findings.

First of all, I would like to show you the most typical log results

[...validating.go:34] [PrintTimeCost] Decode cost: 5.660963ms
[...validating.go:39] Validating ResourceInterpreterCustomization(interpreter-customizationfj8qt) for request: CREATE
[...validating.go:44] [PrintTimeCost] List failed, cost: 953.075206ms, err: Timeout: failed waiting for *v1alpha1.ResourceInterpreterCustomization Informer to sync
[...validating.go:48] List len: 0, result: &{TypeMeta:{Kind: APIVersion:} ListMeta:{SelfLink: ResourceVersion: Continue: RemainingItemCount:<nil>} Items:[]}, err2: <nil>, time cost: 104.914156ms
[...common.go:71] [PrintTimeCost] ValidatingAdmissionHandle cost 1.067382167s

As you see, the timeout issue is actually caused by ValidatingAdmissionHandle:

  • Decode step cost little time, can be ignore.
  • List step accidentally cost more than 90% time of full function.
  • Validate step didn’t execute since the function has returned.

Although the ResourceInterpreterCustomizationList is actually null, the List invocation of kube-apiserver is indeed sometimes too slow, which caused the timeout of this webhook,

Besides, does the Validate step is time-consuming?

The time cost of Validate step should be relatively stable, and I have collected some successful logs, which the time cost was:

I subjectively feel that the timeout of webhook has nothing to do with the Validate step, or rather mainly caused by List step.

So, how to solve the problem? I have verified that if we change v.List(ctx, configs) to v.List(context.TODO(), configs) can only prevent List step from timeout, however, the caller (kube-apiserver) indeed has timeout in client side, and it would close the connection and interrupt the call. You known I was saying, as long as the ValidatingAdmission truly execute more than 3s, the webhook shall failed with timeout.

Do you have some good solutions?

@XiShanYongYe-Chang @chaunceyjiang @liangyuanpeng @yike21

I’ve noticed that the similar issue has been around for a long time and we can see what others have discussed. Back here, one key difference is that it tries to access the apiserver in pkg/webhook/resourceinterpretercustomization, and other webhooks only do validation of fields. We can keep track of the deeper reasons.

@chaosi-zju Thanks for your excellent analysis.

From the analysis results, there are two ways to prevent timeout. One is to extend the waiting time, and the other is to remove the list operation.

The first method may be more direct, but users may also need to modify it in their own environment (which may depend on the container network access speed). In addition, we need to change it to an appropriate value. We have tried using 10 seconds before, and this time has been running steadily for a long time.

The second method is related to our verification logic and may be more difficult to modify, or even require modifying the API.

Therefore, from my personal perspective, I prefer directly extending the timeout period.