operator-sdk: Operator-sdk scorecard basic spec check occasionally produces timeout with no logs
Bug Report
operator-sdk v.1.14.0
scorecard-test v1.14.0
I’m from the DCI team. As a part of our daily cert suite, we regularly run operator-sdk scorecard --selector=test=basic-check-spec-test for two operators, simple-demo-operator and testpmd-operator, 10 tests per day.
Normally, basic-check-spec-test should be green for both. But in about 20% of cases it fails with the timeout error "error running tests context deadline exceeded". Increasing the timeout with --wait-time up to 300s doesn’t help. Also, the test always fails for both operators in a row, so it looks like a 10-20 min outage of some internal API.
To help diagnose this, could you please add more information to the logs about where exactly the timeout happened?
What did you do?
operator-sdk scorecard \
  --output json \
  --selector=test=basic-check-spec-test \
  --kubeconfig {{ kubeconfig_path }} \
  --namespace scorecard-testing \
  --service-account default \
  --config {{ scorecard_config_path }} \
  --verbose \
  --wait-time 300s \
  {{ scorecard_operator_dir }}
What did you expect to see?
The results of basic-check-spec-test should be stable. In the case of a timeout, it would be nice to have logs that identify the reason for the timeout.
What did you see instead? Under which circumstances?
A timeout in 20% of cases ("error running tests context deadline exceeded") with no detailed logs.
Environment
Operator type:
- name: "testpmd-operator"
  version: "v0.2.9"
  image: "quay.io/rh-nfv-int/testpmd-operator-bundle@sha256:5e28f883faacefa847104ebba1a1a22ee897b7576f0af6b8253c68b5c8f42815"
  index_image: "quay.io/tkrishtop/index-testpmd-operator-bundle:v0.2.9"
- name: "simple-demo-operator"
  version: "v0.0.3"
  image: "quay.io/opdev/simple-demo-operator-bundle@sha256:eff7f86a54ef2a340dbf739ef955ab50397bef70f26147ed999e989cfc116b79"
  index_image: "quay.io/opdev/simple-demo-operator-catalog:v0.0.3"
Kubernetes cluster type:
Happens randomly for the latest stable OCP 4.7, OCP 4.8, OCP 4.9, OCP 4.10
$ operator-sdk version
operator-sdk v.1.14.0
scorecard-test v1.14.0
Possible Solution
It would be nice to have more detailed logs to identify what is the reason for this timeout.
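Until scorecard itself reports where the timeout occurs, one possible stopgap is to dump the state of the test namespace right after a failed run. The sketch below is only an illustration: the function name dump_scorecard_diagnostics and the KUBECTL override variable are made up here, and it assumes kubectl (or oc) is on the PATH and already points at the right cluster. Pod events such as image pulls or scheduling delays may hint at what was slow, although the scorecard pods themselves may already have been cleaned up by then.

```shell
# Hypothetical helper, not part of operator-sdk: print pods and events
# for the namespace that scorecard ran in.
dump_scorecard_diagnostics() {
  local namespace="$1"
  # KUBECTL is an assumed override, e.g. KUBECTL=oc on OpenShift.
  local kubectl="${KUBECTL:-kubectl}"

  echo "== pods in ${namespace} =="
  "$kubectl" get pods --namespace "$namespace" --output wide

  echo "== events in ${namespace} (sorted by last timestamp) =="
  "$kubectl" get events --namespace "$namespace" --sort-by=.lastTimestamp
}
```

For the setup described in this report, it could be called as dump_scorecard_diagnostics scorecard-testing whenever the scorecard command exits with a non-zero status.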
About this issue
- State: open
- Created 3 years ago
- Reactions: 1
- Comments: 22 (15 by maintainers)
Hi @acornett21, thank you for the information! Typically, we run the scorecard-sdk from preflight, which currently uses version 1.26.0. However, I do have the option to run the standalone scorecard-sdk, so perhaps I’ll activate it for our daily runs using version 1.28.0.
@tkrishtop Recently operator-sdk switched to using their own images for this cmd. So instead of using an image from docker.io and an image from registry.access.redhat.com, both images come from quay. This change is in 1.28.0; you can see the PR here. Have you upgraded to 1.28.0 yet?
@tkrishtop we used 1.14 a year ago, when I wrote that comment 😃 These days we just retry the command 4 times and we’ve never seen the error since. I guess the cause was long image pulls, but it would be great to confirm this by actually having a useful error message.
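The retry workaround mentioned above could look roughly like the sketch below. It is not the commenter's actual script: the retry function, its argument order, and the log format are assumptions made for illustration. It logs the exit code and duration of every attempt to stderr, which at least shows whether a failed attempt really ran until the --wait-time deadline.

```shell
# Hypothetical helper: retry <max_attempts> <delay_seconds> <command...>
# Runs the command until it succeeds or the attempts are used up.
retry() {
  local max_attempts="$1"
  local delay="$2"
  shift 2
  local attempt=1
  local rc=1
  local start end

  while [ "$attempt" -le "$max_attempts" ]; do
    start=$(date +%s)
    # Capture the exit code without aborting under `set -e`.
    "$@" && rc=0 || rc=$?
    end=$(date +%s)
    echo "attempt ${attempt}/${max_attempts}: exit code ${rc} after $((end - start))s" >&2

    if [ "$rc" -eq 0 ]; then
      return 0
    fi
    # Only sleep when another attempt will follow.
    if [ "$attempt" -lt "$max_attempts" ]; then
      sleep "$delay"
    fi
    attempt=$((attempt + 1))
  done

  # All attempts failed: propagate the last exit code.
  return "$rc"
}
```

A call matching the "4 times" described above might be retry 4 30 operator-sdk scorecard --selector=test=basic-check-spec-test --wait-time 300s "$bundle_dir", where the 30 second pause and the bundle_dir variable are placeholders.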
FTR, I encountered a failure recently with verbose logging turned on. AFAICT the interesting part (i.e. what it is that’s actually timing out) is not getting logged, but posting here in case it’s useful to someone investigating this.