karmada: Flaking test: timeout in resourceinterpreter test
Which jobs are flaking:
e2e test
Which test(s) are flaking:
resourceinterpreter_test
- https://github.com/karmada-io/karmada/actions/runs/5618350294/job/15223908711
- https://github.com/karmada-io/karmada/actions/runs/5549370069/job/15032350534
- https://github.com/karmada-io/karmada/actions/runs/5689554705/job/15421456465
- https://github.com/karmada-io/karmada/actions/runs/5688333820/job/15418190276
Reason for failure:
timeout
Anything else we need to know:
------------------------------
• [FAILED] [3.050 seconds]
Resource interpreter customization testing when Apply single ResourceInterpreterCustomization without DependencyInterpretation operation [JustBeforeEach] InterpreterOperation AggregateStatus testing AggregateStatus testing
[JustBeforeEach] /home/runner/work/karmada/karmada/test/e2e/resourceinterpreter_test.go:588
[It] /home/runner/work/karmada/karmada/test/e2e/resourceinterpreter_test.go:782
Timeline >>
STEP: Creating ResourceInterpreterCustomization(interpreter-customizationmtmwv) @ 07/21/23 04:22:46.042
[FAILED] in [JustBeforeEach] - /home/runner/work/karmada/karmada/test/e2e/framework/resourceinterpretercustomization.go:19 @ 07/21/23 04:22:49.09
<< Timeline
[FAILED] Unexpected error:
<*errors.StatusError | 0xc000eb4000>:
Internal error occurred: failed calling webhook "resourceinterpretercustomization.karmada.io": failed to call webhook: Post "[https://karmada-webhook.karmada-system.svc:443/validate-resourceinterpretercustomization?timeout=3s](https://karmada-webhook.karmada-system.svc/validate-resourceinterpretercustomization?timeout=3s)": context deadline exceeded
{
ErrStatus: {
TypeMeta: {Kind: "", APIVersion: ""},
ListMeta: {
SelfLink: "",
ResourceVersion: "",
Continue: "",
RemainingItemCount: nil,
},
Status: "Failure",
Message: "Internal error occurred: failed calling webhook \"resourceinterpretercustomization.karmada.io\": failed to call webhook: Post \"[https://karmada-webhook.karmada-system.svc:443/validate-resourceinterpretercustomization?timeout=3s\](https://karmada-webhook.karmada-system.svc/validate-resourceinterpretercustomization?timeout=3s\)": context deadline exceeded",
Reason: "InternalError",
Details: {
Name: "",
Group: "",
Kind: "",
UID: "",
Causes: [
{
Type: "",
Message: "failed calling webhook \"resourceinterpretercustomization.karmada.io\": failed to call webhook: Post \"[https://karmada-webhook.karmada-system.svc:443/validate-resourceinterpretercustomization?timeout=3s\](https://karmada-webhook.karmada-system.svc/validate-resourceinterpretercustomization?timeout=3s\)": context deadline exceeded",
Field: "",
},
],
RetryAfterSeconds: 0,
},
Code: 500,
},
}
occurred
In [JustBeforeEach] at: /home/runner/work/karmada/karmada/test/e2e/framework/resourceinterpretercustomization.go:19 @ 07/21/23 04:22:49.09
Full Stack Trace
github.com/karmada-io/karmada/test/e2e/framework.CreateResourceInterpreterCustomization.func1()
/home/runner/work/karmada/karmada/test/e2e/framework/resourceinterpretercustomization.go:19 +0x154
github.com/karmada-io/karmada/test/e2e/framework.CreateResourceInterpreterCustomization({0x44afb78, 0xc000411ea0}, 0xc000eac2c0)
/home/runner/work/karmada/karmada/test/e2e/framework/resourceinterpretercustomization.go:17 +0xfe
github.com/karmada-io/karmada/test/e2e.glob..func34.1.2()
/home/runner/work/karmada/karmada/test/e2e/resourceinterpreter_test.go:589 +0x85
------------------------------
SSSSSSSSSSSSSSSSSSSSSSSSSSSS
------------------------------
[SynchronizedAfterSuite] [INTERRUPTED] [0.187 seconds]
[SynchronizedAfterSuite]
/home/runner/work/karmada/karmada/test/e2e/suite_test.go:140
[INTERRUPTED] Interrupted by Other Ginkgo Process
In [SynchronizedAfterSuite] at: /home/runner/work/karmada/karmada/test/e2e/suite_test.go:140 @ 07/21/23 04:22:49.336
Full Stack Trace
------------------------------
Summarizing 2 Failures:
[FAIL] Resource interpreter customization testing when Apply single ResourceInterpreterCustomization without DependencyInterpretation operation [JustBeforeEach] InterpreterOperation AggregateStatus testing AggregateStatus testing
/home/runner/work/karmada/karmada/test/e2e/framework/resourceinterpretercustomization.go:19
[INTERRUPTED] [SynchronizedAfterSuite]
/home/runner/work/karmada/karmada/test/e2e/suite_test.go:140
Ran 117 of 145 Specs in 899.996 seconds
FAIL! - Interrupted by Other Ginkgo Process -- 116 Passed | 1 Failed | 0 Pending | 28 Skipped
About this issue
- Original URL
- State: closed
- Created a year ago
- Reactions: 2
- Comments: 15 (14 by maintainers)
I did two things:
ValidatingAdmissionHandleas above, in order to record time cost of 【all function、Decode step、List step、Validate step】.Then, it can achieve a 50% probability of reproducing this problem in CI, and I got effiective findings.
First of all, I would like to show you the most typical log results:
As you see, the
timeoutissue is actually caused byValidatingAdmissionHandle:Decode stepcost little time, can be ignore.List stepaccidentally cost more than 90% time of full function.Validate stepdidn’t execute since the function has returned.Although the
ResourceInterpreterCustomizationListis actuallynull, theListinvocation ofkube-apiserveris indeed sometimes too slow, which caused the timeout of this webhook,Besides, does the
Validatestep is time-consuming?The time cost of
Validatestep should be relatively stable, and I have collected some successful logs, which the time cost was:I subjectively feel that the timeout of webhook has nothing to do with the
Validatestep, or rather mainly caused byListstep.So, how to solve the problem? I have verified that if we change
v.List(ctx, configs)tov.List(context.TODO(), configs)can only preventListstep from timeout, however, the caller (kube-apiserver) indeed has timeout in client side, and it would close the connection and interrupt the call. You known I was saying, as long as theValidatingAdmissiontruly execute more than 3s, the webhook shall failed with timeout.Do you have some good solutions?
@XiShanYongYe-Chang @chaunceyjiang @liangyuanpeng @yike21
I’ve noticed that the similar issue has been around for a long time and we can see what others have discussed. Back here, one key difference is that it tries to access the apiserver in
pkg/webhook/resourceinterpretercustomization, and other webhooks only do validation of fields. We can keep track of the deeper reasons.@chaosi-zju Thanks for your excellent analysis.
From the analysis results, there are two ways to prevent timeout. One is to extend the waiting time, and the other is to remove the list operation.
The first method may be more direct, but users may also need to modify it in their own environment (which may depend on the container network access speed). In addition, we need to change it to an appropriate value. We have tried using 10 seconds before, and this time has been running steadily for a long time.
The second method is related to our verification logic and may be more difficult to modify, or even require modifying the API.
Therefore, from my personal perspective, I prefer directly extending the timeout period.