kyverno: [Bug] Update Requests stuck in pending/fail infinite loop when deleting k8s resources too quickly

Kyverno Version

1.10.0

Kubernetes Version

1.24.x

Kubernetes Platform

EKS

Kyverno Rule Type

Mutate

Description

URs get stuck in a pending/fail infinite loop when our test script in a CI/CD pipeline does the following:

  • Creates a temporary test namespace and test ingresses in the cluster
  • Uses kubectl and jq to check that the expected annotations have been applied by Kyverno
  • As soon as jq confirms the annotations are in place, immediately deletes the test namespace and test ingresses

When I then check the URs in the cluster, I find that they are stuck in a pending/fail infinite loop, with errors like those below.

Demo of the infinite pending/fail loop: Screencast from 2023-07-21 14-39-56

Although EKS is what I am concerned with, I was also able to reproduce this bug with a deliberately simple setup on my local machine using Rancher's k3s.

Here is the sample mutate-existing policy, with sensitive values removed:

apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: access-logs
spec:
  validationFailureAction: Enforce
  mutateExistingOnPolicyUpdate: true
  rules:
  - name: reconcile-load-balancer-annotations
    context:
    - name: nestedMetadata
      variable:
        value: "request.object.metadata"
    - name: ingNamespace
      variable:
        jmesPath: "{{ nestedMetadata }}.namespace"
        default: ""
    - name: ingName
      variable:
        jmesPath: "{{ nestedMetadata }}.name"
        default: ""
    - name: lbAttributes
      variable:
        jmesPath: "{{ nestedMetadata }}.annotations.\"alb.ingress.kubernetes.io/load-balancer-attributes\""
        default: ""
    - name: s3Bucket
      variable:
        value: "access_logs.s3.bucket=test-bucket"
    - name: s3Enabled
      variable:
        value: "access_logs.s3.enabled=true"
    - name: s3Prefix
      variable:
        value: "access_logs.s3.prefix=test-prefix"
    match:
      any:
      - resources:
          kinds:
          - Ingress
    mutate:
      targets:
      - apiVersion: networking.k8s.io/v1
        kind: Ingress
        name: "{{ ingName }}"
        namespace: "{{ ingNamespace }}"
        preconditions:
          all:
          - key: "{{
                    lbAttributes |
                    (
                      contains(@, '{{ s3Bucket }}') &&
                      contains(@, '{{ s3Enabled }}') &&
                      contains(@, '{{ s3Prefix }}')
                    )
                  }}"
            operator: NotEquals
            value: true
      patchStrategicMerge:
        metadata:
          annotations:
            alb.ingress.kubernetes.io/load-balancer-attributes: "{{
                                                                  lbAttributes |
                                                                  (
                                                                    contains(@, '{{ s3Bucket }}') && @ ||
                                                                    (
                                                                      contains(@,'access_logs.s3.bucket') &&
                                                                      regex_replace_all_literal('access_logs.s3.bucket[^,]+|$]', @, '{{ s3Bucket }}') ||
                                                                      (
                                                                        length(@) > `0` &&
                                                                        join(',', ['{{ s3Bucket }}',@]) ||
                                                                        '{{ s3Bucket }}'
                                                                      )
                                                                    )
                                                                  ) |
                                                                  (
                                                                    contains(@, '{{ s3Prefix }}') && @ ||
                                                                    (
                                                                      contains(@,'access_logs.s3.prefix') &&
                                                                      regex_replace_all_literal('access_logs.s3.prefix[^,]+|$]', @, '{{ s3Prefix }}') ||
                                                                      join(',', ['{{ s3Prefix }}',@])
                                                                    )
                                                                  ) |
                                                                  (
                                                                    contains(@, '{{ s3Enabled }}') && @ ||
                                                                    (
                                                                      contains(@,'access_logs.s3.enabled') &&
                                                                      regex_replace_all_literal('access_logs.s3.enabled[^,]+|$]', @, '{{ s3Enabled}}') ||
                                                                      join(',', ['{{ s3Enabled }}',@])
                                                                    )
                                                                  )
                                                                }}"
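
For context, given an Ingress that starts with no load-balancer-attributes annotation (such as the simple-ingress created in the reproduction script below), my reading of the JMESPath above is that the mutation should leave the target looking roughly like this (the ordering of the joined values is illustrative, not exact):

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: simple-ingress
  namespace: test
  annotations:
    alb.ingress.kubernetes.io/load-balancer-attributes: "access_logs.s3.enabled=true,access_logs.s3.prefix=test-prefix,access_logs.s3.bucket=test-bucket"
spec:
  defaultBackend:
    service:
      name: alpaca
      port:
        number: 8080

This is the state the jq assertion in the script checks for right before the namespace is deleted.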

Steps to reproduce

  1. Make sure Kyverno v1.10.0 is up and running on Kubernetes 1.24.x
  2. Make sure you have the access-logs.yaml file in the same directory as the following test bash script:
#!/usr/bin/env bash
set -Eeuo pipefail
die() {
  kubectl delete namespace "$namespace"
  echo -e "$0" ERROR: "$@" >&2; exit 1;
}
# shellcheck disable=2154
trap 's=$?; die "line $LINENO - $BASH_COMMAND"; exit $s' ERR

namespace="test"

kubectl apply -f access-logs.yaml

kubectl apply -f - <<EOF
apiVersion: v1
kind: Namespace
metadata:
  name: test
EOF

kubectl apply --namespace "$namespace" -f - <<EOF
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: simple-ingress
spec:
  defaultBackend:
    service:
      name: alpaca
      port:
        number: 8080
EOF

sleep 1s

kubectl get ingress --namespace="$namespace" -o json | \
  jq -e '.items[]
    | .metadata.annotations."alb.ingress.kubernetes.io\/load-balancer-attributes"
    | . != null and contains("access_logs.s3.enabled=true") and contains("access_logs.s3.bucket") and contains("access_logs.s3.prefix")'

kubectl delete namespace "$namespace"

# Next review the UR and the logs from Kyverno's background controller.
# NOTE: make sure to delete the cpol access-logs before running this script again if you would like
# to reproduce the bug without errors from the script
# i.e. kubectl get ur -n kyverno, kubectl logs -n kyverno kyverno-background-controller-<hash>
# And later for reproducibility of the bug: kubectl delete cpol access-logs

# You should see logs like the following:
# I0721 19:20:51.694459       1 mutate.go:234] background/mutateExisting "msg"="cannot generate events for empty target resource" "policy"="access-logs" "rule"="reconcile-load-balancer-annotations"
# E0721 19:20:51.893001       1 labels.go:15]  "msg"="failed to get the namespace" "error"="namespace \"test\" not found" "name"="test"
# E0721 19:20:51.923168       1 mutate.go:160] background "msg"="" "error"="failed to mutate existing resource, rule responseerror: : ingresses.networking.k8s.io \"simple-ingress\" not found" "name"="ur-dqckm" "policy"="access-logs" "resource"="networking.k8s.io/v1/Ingress/test/simple-ingress"
# I0721 19:20:51.923219       1 mutate.go:234] background/mutateExisting "msg"="cannot generate events for empty target resource" "policy"="access-logs" "rule"="reconcile-load-balancer-annotations"
# E0721 19:20:52.120501       1 labels.go:15]  "msg"="failed to get the namespace" "error"="namespace \"test\" not found" "name"="test"
# E0721 19:20:52.146068       1 mutate.go:160] background "msg"="" "error"="failed to mutate existing resource, rule responseerror: : ingresses.networking.k8s.io \"simple-ingress\" not found" "name"="ur-dqckm" "policy"="access-logs" "resource"="networking.k8s.io/v1/Ingress/test/simple-ingress"
# I0721 19:20:52.146228       1 mutate.go:234] background/mutateExisting "msg"="cannot generate events for empty target resource" "policy"="access-logs" "rule"="reconcile-load-balancer-annotations"
  3. There are instructions/descriptions in the script comments above, but after creating the test namespace and ingress, check the UR for the ingress (kubectl get ur -n kyverno)
  4. Check the background controller logs (kubectl logs -n kyverno kyverno-background-controller-<hash>)
  5. After reviewing what is needed, if you want to reproduce this bug with fidelity, make sure to run kubectl delete cpol access-logs

Expected behavior

I expect Kyverno to fail gracefully, updating or cleaning up the URs when a target resource has been deleted and clearly no longer exists in the cluster.

Screenshots

Screenshot from 2023-07-21 14-48-59

Kyverno logs

E0721 19:51:08.501074       1 labels.go:15]  "msg"="failed to get the namespace" "error"="namespace \"test\" not found" "name"="test"
E0721 19:51:08.515471       1 mutate.go:160] background "msg"="" "error"="failed to mutate existing resource, rule responseerror: : ingresses.networking.k8s.io \"simple-ingress\" not found" "name"="ur-hgltv" "policy"="access-logs" "resource"="networking.k8s.io/v1/Ingress/test/simple-ingress"
I0721 19:51:08.515521       1 mutate.go:234] background/mutateExisting "msg"="cannot generate events for empty target resource" "policy"="access-logs" "rule"="reconcile-load-balancer-annotations"
E0721 19:51:08.748892       1 labels.go:15]  "msg"="failed to get the namespace" "error"="namespace \"test\" not found" "name"="test"
E0721 19:51:08.762866       1 mutate.go:160] background "msg"="" "error"="failed to mutate existing resource, rule responseerror: : ingresses.networking.k8s.io \"simple-ingress\" not found" "name"="ur-hgltv" "policy"="access-logs" "resource"="networking.k8s.io/v1/Ingress/test/simple-ingress"
I0721 19:51:08.762892       1 mutate.go:234] background/mutateExisting "msg"="cannot generate events for empty target resource" "policy"="access-logs" "rule"="reconcile-load-balancer-annotations"
E0721 19:51:08.957749       1 labels.go:15]  "msg"="failed to get the namespace" "error"="namespace \"test\" not found" "name"="test"
E0721 19:51:08.973111       1 mutate.go:160] background "msg"="" "error"="failed to mutate existing resource, rule responseerror: : ingresses.networking.k8s.io \"simple-ingress\" not found" "name"="ur-hgltv" "policy"="access-logs" "resource"="networking.k8s.io/v1/Ingress/test/simple-ingress"
I0721 19:51:08.973145       1 mutate.go:234] background/mutateExisting "msg"="cannot generate events for empty target resource" "policy"="access-logs" "rule"="reconcile-load-balancer-annotations"

Slack discussion

https://kubernetes.slack.com/archives/CLGR9BJU9/p1689951073735509?thread_ts=1689785601.572189&cid=CLGR9BJU9

Troubleshooting

  • I have read and followed the documentation AND the troubleshooting guide.
  • I have searched other issues in this repository and mine is not recorded.

About this issue

  • State: closed
  • Created a year ago
  • Reactions: 1
  • Comments: 17 (7 by maintainers)

Most upvoted comments

Thanks @nipr-jdoenges! Please tag me or otherwise keep me in the loop 😃 I appreciate your feedback. I sometimes struggle with race conditions with Kyverno, and it can be hard to pinpoint what causes them. Appreciate it!

@benoitschipper new issue is #9089

@benoitschipper - Kyverno 1.11.0 has fixed this issue by adding a retry limit for the mutate existing policy (#8100). The UR will be deleted if it fails more than 3 times.

FYI, I’m a member of the original reporter’s team and can confirm that this did NOT fix the issue for us…

Having examined the code change for the fix, I’m pretty certain this is because the updaterequests getting stuck are for UPDATE admission requests, which appear to go straight to a code branch that doesn’t include any of the new retry limit logic.

I’ll open a new issue with the relevant details.

@benoitschipper - Kyverno 1.11.0 has fixed this issue by adding a retry limit for the mutate existing policy (https://github.com/kyverno/kyverno/pull/8100). The UR will be deleted if it fails more than 3 times.

Per Slack:

Ok, so what looks like is happening here is that an UPDATE request is sent as a result of the mutation; it comes in and is recorded in a UR, but because the Namespace has already been deleted and the Ingress garbage collected, Kyverno goes into a never-ending UR storm where the URs oscillate between Pending and Failed states. We’ve seen this before in some situations. I think we need to make some tweaks to this system. That said, what you could do is this:

1. Modify the policy so as to match on CREATE requests only. Since you're on 1.10 already, you can do this conveniently with the operations[] list under resources[]. This will solve the issue you're seeing, however this will prevent subsequent updates of the policy from triggering the mutation on those existing Ingresses. For that, see no. 2.

2. When/if you need to make retroactive mutations against all those ingresses, assuming you've done no. 1 above, modify the policy to change the CREATE verb to UPDATE AND ALSO make your intended policy modification. This will correctly perform the mutation on the existing Ingress resources. If you intended to repeat your similar test process where you'll delete Namespaces quickly after an UPDATE, you'll want to go back into the policy and revert it to CREATE.
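
For reference, a minimal sketch of workaround no. 1, using the operations[] list under resources[] mentioned above; only the match block of the access-logs policy changes, the rest of the rule stays as written:

    match:
      any:
      - resources:
          kinds:
          - Ingress
          operations:
          - CREATE

For the retroactive pass described in no. 2, the same block would list UPDATE instead of CREATE (and be reverted to CREATE afterwards).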

Hi, I tried this workaround (1.10.3). The first option (CREATE only) works fine, but with the second I still get errors when I delete the namespace. Any more suggestions I could try?

By the way: I assume “UPDATE AND ALSO” was not to be taken literally; I implemented it as:

preconditions:
  all:
  - key: "{{ request.operation }}"
    operator: AnyIn
    value:
    - UPDATE
    - CREATE