kyverno: [Bug] Update Requests stuck in pending/fail infinite loop when deleting k8s resources too quickly
Kyverno Version
1.10.0
Kubernetes Version
1.24.x
Kubernetes Platform
EKS
Kyverno Rule Type
Mutate
Description
UpdateRequests (URs) get stuck in a pending/fail infinite loop when our CI/CD pipeline's test script does the following:
- Creates a temporary test namespace and test ingresses in the cluster
- Uses kubectl and jq to check that annotations have been applied to the cluster by Kyverno
- After jq confirms the annotations have been applied by Kyverno, the script immediately deletes the test namespace and test ingresses from the cluster
- When I go to the cluster and check the URs, I find that they are stuck in a pending/fail infinite loop, with errors like the ones below
Demo of the infinite pending/fail loop:
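A minimal way to watch the loop from the cluster, as a sketch (it assumes Kyverno is installed in the kyverno namespace, where the URs are created; the .status.state field name can be double-checked with kubectl explain updaterequest.status):

# Watch the UpdateRequests flip between Pending and Failed
kubectl get updaterequests -n kyverno -w

# Or print just the name and current state of each UR
kubectl get ur -n kyverno -o custom-columns=NAME:.metadata.name,STATE:.status.state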
Although my primary concern is EKS, I was also able to reproduce this bug with a deliberately simple setup on my local machine using Rancher's k3s.
Here is the sample mutate-existing policy, with sensitive values removed:
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: access-logs
spec:
  validationFailureAction: Enforce
  mutateExistingOnPolicyUpdate: true
  rules:
    - name: reconcile-load-balancer-annotations
      context:
        - name: nestedMetadata
          variable:
            value: "request.object.metadata"
        - name: ingNamespace
          variable:
            jmesPath: "{{ nestedMetadata }}.namespace"
            default: ""
        - name: ingName
          variable:
            jmesPath: "{{ nestedMetadata }}.name"
            default: ""
        - name: lbAttributes
          variable:
            jmesPath: "{{ nestedMetadata }}.annotations.\"alb.ingress.kubernetes.io/load-balancer-attributes\""
            default: ""
        - name: s3Bucket
          variable:
            value: "access_logs.s3.bucket=test-bucket"
        - name: s3Enabled
          variable:
            value: "access_logs.s3.enabled=true"
        - name: s3Prefix
          variable:
            value: "access_logs.s3.prefix=test-prefix"
      match:
        any:
          - resources:
              kinds:
                - Ingress
      mutate:
        targets:
          - apiVersion: networking.k8s.io/v1
            kind: Ingress
            name: "{{ ingName }}"
            namespace: "{{ ingNamespace }}"
            preconditions:
              all:
                - key: "{{
                    lbAttributes |
                    (
                      contains(@, '{{ s3Bucket }}') &&
                      contains(@, '{{ s3Enabled }}') &&
                      contains(@, '{{ s3Prefix }}')
                    )
                  }}"
                  operator: NotEquals
                  value: true
        patchStrategicMerge:
          metadata:
            annotations:
              alb.ingress.kubernetes.io/load-balancer-attributes: "{{
                lbAttributes |
                (
                  contains(@, '{{ s3Bucket }}') && @ ||
                  (
                    contains(@, 'access_logs.s3.bucket') &&
                    regex_replace_all_literal('access_logs.s3.bucket[^,]+|$]', @, '{{ s3Bucket }}') ||
                    (
                      length(@) > `0` &&
                      join(',', ['{{ s3Bucket }}', @]) ||
                      '{{ s3Bucket }}'
                    )
                  )
                ) |
                (
                  contains(@, '{{ s3Prefix }}') && @ ||
                  (
                    contains(@, 'access_logs.s3.prefix') &&
                    regex_replace_all_literal('access_logs.s3.prefix[^,]+|$]', @, '{{ s3Prefix }}') ||
                    join(',', ['{{ s3Prefix }}', @])
                  )
                ) |
                (
                  contains(@, '{{ s3Enabled }}') && @ ||
                  (
                    contains(@, 'access_logs.s3.enabled') &&
                    regex_replace_all_literal('access_logs.s3.enabled[^,]+|$]', @, '{{ s3Enabled }}') ||
                    join(',', ['{{ s3Enabled }}', @])
                  )
                )
              }}"
Steps to reproduce
- Make sure Kyverno v1.10.0 is up and running on Kubernetes 1.24.x
- Save the policy above as access-logs.yaml in the same directory as the following test bash script:
#!/usr/bin/env bash
set -Eeuo pipefail

die() {
  kubectl delete namespace $namespace
  echo -e "$0" ERROR: "$@" >&2; exit 1;
}

# shellcheck disable=2154
trap 's=$?; die "line $LINENO - $BASH_COMMAND"; exit $s' ERR

namespace="test"

kubectl apply -f access-logs.yaml

kubectl apply -f - <<EOF
apiVersion: v1
kind: Namespace
metadata:
  name: test
EOF

kubectl apply --namespace $namespace -f - <<EOF
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: simple-ingress
spec:
  defaultBackend:
    service:
      name: alpaca
      port:
        number: 8080
EOF

sleep 1s

kubectl get ingress --namespace="$namespace" -o json | \
  jq -e '.items[]
    | .metadata.annotations."alb.ingress.kubernetes.io/load-balancer-attributes"
    | . != null and contains("access_logs.s3.enabled=true") and contains("access_logs.s3.bucket") and contains("access_logs.s3.prefix")'

kubectl delete namespace "$namespace"
# Next, review the URs and the logs from Kyverno's background controller:
#   kubectl get ur -n kyverno
#   kubectl logs -n kyverno kyverno-background-controller-<hash>
# NOTE: to reproduce the bug again without errors from this script, delete the policy first:
#   kubectl delete cpol access-logs
# You should see logs like the following:
# I0721 19:20:51.694459 1 mutate.go:234] background/mutateExisting "msg"="cannot generate events for empty target resource" "policy"="access-logs" "rule"="reconcile-load-balancer-annotations"
# E0721 19:20:51.893001 1 labels.go:15] "msg"="failed to get the namespace" "error"="namespace \"test\" not found" "name"="test"
# E0721 19:20:51.923168 1 mutate.go:160] background "msg"="" "error"="failed to mutate existing resource, rule responseerror: : ingresses.networking.k8s.io \"simple-ingress\" not found" "name"="ur-dqckm" "policy"="access-logs" "resource"="networking.k8s.io/v1/Ingress/test/simple-ingress"
# I0721 19:20:51.923219 1 mutate.go:234] background/mutateExisting "msg"="cannot generate events for empty target resource" "policy"="access-logs" "rule"="reconcile-load-balancer-annotations"
# E0721 19:20:52.120501 1 labels.go:15] "msg"="failed to get the namespace" "error"="namespace \"test\" not found" "name"="test"
# E0721 19:20:52.146068 1 mutate.go:160] background "msg"="" "error"="failed to mutate existing resource, rule responseerror: : ingresses.networking.k8s.io \"simple-ingress\" not found" "name"="ur-dqckm" "policy"="access-logs" "resource"="networking.k8s.io/v1/Ingress/test/simple-ingress"
# I0721 19:20:52.146228 1 mutate.go:234] background/mutateExisting "msg"="cannot generate events for empty target resource" "policy"="access-logs" "rule"="reconcile-load-balancer-annotations"
- The script's comments cover this, but in short: after the test namespace and ingress have been created and deleted, check the UR that was generated for the ingress
- Check the background controller logs
- Once you have reviewed what you need, delete the policy before re-running the script if you want to reproduce the bug cleanly (see the command sketch below):
  kubectl delete cpol access-logs
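For convenience, a rough sketch of the inspection and cleanup commands referenced above (the UR name and the pod name hash are placeholders; adjust the namespace if Kyverno is installed elsewhere):

# Inspect the UpdateRequests left behind after the namespace was deleted
kubectl get ur -n kyverno
kubectl describe ur -n kyverno <ur-name>   # e.g. ur-dqckm from the logs above

# Follow the background controller while it retries
kubectl logs -n kyverno kyverno-background-controller-<hash> -f

# Clean up so the bug can be reproduced again from a clean slate
kubectl delete cpol access-logs
kubectl delete ur -n kyverno --all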
Expected behavior
I expect Kyverno to fail gracefully and update the URs accordingly when a Kubernetes resource has been deleted and clearly no longer exists in the cluster.
Screenshots
Kyverno logs
E0721 19:51:08.501074 1 labels.go:15] "msg"="failed to get the namespace" "error"="namespace \"test\" not found" "name"="test"
E0721 19:51:08.515471 1 mutate.go:160] background "msg"="" "error"="failed to mutate existing resource, rule responseerror: : ingresses.networking.k8s.io \"simple-ingress\" not found" "name"="ur-hgltv" "policy"="access-logs" "resource"="networking.k8s.io/v1/Ingress/test/simple-ingress"
I0721 19:51:08.515521 1 mutate.go:234] background/mutateExisting "msg"="cannot generate events for empty target resource" "policy"="access-logs" "rule"="reconcile-load-balancer-annotations"
E0721 19:51:08.748892 1 labels.go:15] "msg"="failed to get the namespace" "error"="namespace \"test\" not found" "name"="test"
E0721 19:51:08.762866 1 mutate.go:160] background "msg"="" "error"="failed to mutate existing resource, rule responseerror: : ingresses.networking.k8s.io \"simple-ingress\" not found" "name"="ur-hgltv" "policy"="access-logs" "resource"="networking.k8s.io/v1/Ingress/test/simple-ingress"
I0721 19:51:08.762892 1 mutate.go:234] background/mutateExisting "msg"="cannot generate events for empty target resource" "policy"="access-logs" "rule"="reconcile-load-balancer-annotations"
E0721 19:51:08.957749 1 labels.go:15] "msg"="failed to get the namespace" "error"="namespace \"test\" not found" "name"="test"
E0721 19:51:08.973111 1 mutate.go:160] background "msg"="" "error"="failed to mutate existing resource, rule responseerror: : ingresses.networking.k8s.io \"simple-ingress\" not found" "name"="ur-hgltv" "policy"="access-logs" "resource"="networking.k8s.io/v1/Ingress/test/simple-ingress"
I0721 19:51:08.973145 1 mutate.go:234] background/mutateExisting "msg"="cannot generate events for empty target resource" "policy"="access-logs" "rule"="reconcile-load-balancer-annotations"
Slack discussion
Troubleshooting
- I have read and followed the documentation AND the troubleshooting guide.
- I have searched other issues in this repository and mine is not recorded.
About this issue
- Original URL
- State: closed
- Created a year ago
- Reactions: 1
- Comments: 17 (7 by maintainers)
@benoitschipper new issue is #9089
FYI, I’m a member of the original reporter’s team and can confirm that this did NOT fix the issue for us…
Having examined the code change for the fix, I’m pretty certain this is because the updaterequests getting stuck are for UPDATE admission requests, which appear to go straight to a code branch that doesn’t include any of the new retry limit logic.
I’ll open a new issue with the relevant details.
@benoitschipper - Kyverno 1.11.0 has fixed this issue by adding a retry limit for mutate-existing policies: https://github.com/kyverno/kyverno/pull/8100. The UR will be deleted if it fails more than 3 times.
Hi, I tried this workaround (1.10.3). The first option (only with CREATE) works fine, but with the second I still get errors when I delete the namespace. Any more suggestions I could try?
By the way: I assume "UPDATE AND ALSO" was not to be taken literally; I implemented it as:
preconditions:
  all:
    - key: "{{ request.operation }}"
      operator: AnyIn
      value:
        - UPDATE
        - CREATE