kubernetes: "Error Cleaning Up Jobs" From Cronjob, Which is Not Really an Error

What happened:

This started happening when we upgraded to Kubernetes 1.21. The following messages appear in the kube-controller-manager logs:

Jul 08 07:00:01        1 cronjob_controllerv2.go:193] "error cleaning up jobs" cronjob="default/vet-messenger-batch-prod-hourly" resourceVersion="160274892" err="Operation cannot be fulfilled on cronjobs.batch \"vet-messenger-batch-prod-hourly\": the object has been modified; please apply your changes to the latest version and try again"
Jul 08 07:00:01        1 cronjob_controllerv2.go:154] error syncing CronJobController default/vet-messenger-batch-prod-hourly, requeuing: Operation cannot be fulfilled on cronjobs.batch "vet-messenger-batch-prod-hourly": the object has been modified; please apply your changes to the latest version and try again

The CronJob is pretty simple:

apiVersion: batch/v1
kind: CronJob
metadata:
  name: vet-messenger-batch-prod-hourly
spec:
  schedule: "0 * * * *"
  jobTemplate:
    spec:
      template:
        spec:
          containers:
            - name: vet-messenger-batch-prod
              image: REDACTED/vet-messenger:prod
              command: ["yarn", "batch:start:prod"]
              imagePullPolicy: Always
              env:
                - name: AWS_ACCESS_KEY_ID
                  valueFrom:
                    secretKeyRef:
                      name: k8s-container-runner-prod
                      key: aws_access_key_id
                - name: AWS_SECRET_ACCESS_KEY
                  valueFrom:
                    secretKeyRef:
                      name: k8s-container-runner-prod
                      key: aws_secret_access_key
                - name: AWS_REGION
                  value: us-east-1
          restartPolicy: Never
      backoffLimit: 1

What you expected to happen:

No error. In fact, jobs do appear to be removed properly despite the error message.
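
As a sanity check, a minimal client-go sketch like the one below can confirm this: it prints the CronJob's status.active count and the number of Jobs it still owns. The namespace and CronJob name come from the log lines above; cluster access via ~/.kube/config is an assumption, and this is only an illustration, not anything taken from the controller itself.

// checkcleanup.go: minimal sketch to confirm Job cleanup despite the logged conflict.
// Assumes cluster access via ~/.kube/config; the namespace and CronJob name are
// taken from the log lines above.
package main

import (
    "context"
    "fmt"

    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/kubernetes"
    "k8s.io/client-go/tools/clientcmd"
)

func main() {
    cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
    if err != nil {
        panic(err)
    }
    client, err := kubernetes.NewForConfig(cfg)
    if err != nil {
        panic(err)
    }
    ctx := context.Background()
    const ns, cronJobName = "default", "vet-messenger-batch-prod-hourly"

    // status.active holds references to Jobs the controller still considers running.
    cj, err := client.BatchV1().CronJobs(ns).Get(ctx, cronJobName, metav1.GetOptions{})
    if err != nil {
        panic(err)
    }
    fmt.Printf("status.active entries: %d\n", len(cj.Status.Active))

    // Count Jobs still owned by this CronJob. With the default
    // successfulJobsHistoryLimit of 3, this should stay small if cleanup works.
    jobs, err := client.BatchV1().Jobs(ns).List(ctx, metav1.ListOptions{})
    if err != nil {
        panic(err)
    }
    owned := 0
    for _, j := range jobs.Items {
        for _, ref := range j.OwnerReferences {
            if ref.Kind == "CronJob" && ref.Name == cronJobName {
                owned++
            }
        }
    }
    fmt.Printf("Jobs still owned by %s: %d\n", cronJobName, owned)
}

If the owned-Job count stays at or below the history limits across several runs, cleanup is in fact proceeding despite the log noise.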

How to reproduce it (as minimally and precisely as possible):

Use the above.

Anything else we need to know?:

We tried deleting and recreating the CronJob, both with kubectl create -f and with kubectl apply -f; neither made the message go away. Note that the API version was batch/v1beta1 before we upgraded, and we changed it to batch/v1 to silence the deprecation notice.

Environment:

  • Kubernetes version (use kubectl version):
Server Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.2", GitCommit:"092fbfbf53427de67cac1e9fa54aaa09a28371d7", GitTreeState:"clean", BuildDate:"2021-06-16T12:53:14Z", GoVersion:"go1.16.5", Compiler:"gc", Platform:"linux/amd64"}
  • Cloud provider or hardware configuration: AWS
  • OS (e.g: cat /etc/os-release): From nodes (master is identical):
NAME="Ubuntu"
VERSION="20.04.2 LTS (Focal Fossa)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 20.04.2 LTS"
VERSION_ID="20.04"
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
VERSION_CODENAME=focal
UBUNTU_CODENAME=focal
  • Kernel (e.g. uname -a): Linux ip-10-92-217-104 5.4.0-1047-aws #49-Ubuntu SMP Wed Apr 28 22:47:04 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
  • Install tools: Kops
  • Network plugin and version (if this is a network-related bug):
  • Others:

About this issue

  • State: closed
  • Created 3 years ago
  • Reactions: 1
  • Comments: 19 (10 by maintainers)

Most upvoted comments

Can you please confirm that this bug is what's causing CronJobs to get stuck with the JobAlreadyActive reason?

OK, I'll try to take a closer look at it soon.

@pere3 Thanks for reminding me. I’ve updated the PR.

We are also seeing this error message in our CloudWatch logs.

Some of our cronjobs report that there's an active job running, yet the referenced job has already completed. Could that be a result of this bug? (A read-only sketch for spotting such stale entries follows the log excerpt below.)

For example:

kubectl describe job check-saml2-metadata-27222245

Name:           check-saml2-metadata-27222245
Namespace:      cloud20
Selector:       controller-uid=864e4340-c9a8-45cb-a035-b9dac399cbf6
Labels:         instance=check-saml2-metadata
                name=cron
                tags.datadoghq.com/env=cloud20
                tags.datadoghq.com/service=cron
                tags.datadoghq.com/version=432b9e0
Annotations:    <none>
Controlled By:  CronJob/check-saml2-metadata
Parallelism:    1
Completions:    1
Start Time:     Mon, 04 Oct 2021 01:05:00 -0700
Completed At:   Mon, 04 Oct 2021 01:05:43 -0700
Duration:       43s
Pods Statuses:  0 Running / 1 Succeeded / 0 Failed

As you can see it successfully completed. Yet the cronjob lists it as active:

kubectl describe cronjob check-saml2-metadata

Active Jobs:         check-saml2-metadata-27222245
Events:
  Type    Reason            Age                  From                Message
  ----    ------            ----                 ----                -------
  Normal  JobAlreadyActive  12m (x24 over 3d5h)  cronjob-controller  Not starting job because prior execution is running and concurrency policy is Forbid

Relevant CloudWatch logs:

I1004 08:05:00.316858 1 event.go:291] "Event occurred" object="cloud20/check-saml2-metadata-27222245" kind="Job" apiVersion="batch/v1" type="Normal" reason="SuccessfulCreate" message="Created pod: check-saml2-metadata-27222245-kw2h7"
I1004 08:05:00.362801 1 cronjob_controllerv2.go:193] "error cleaning up jobs" cronjob="cloud20/check-saml2-metadata" resourceVersion="5137310" err="Operation cannot be fulfilled on cronjobs.batch \"check-saml2-metadata\": the object has been modified; please apply your changes to the latest version and try again"
E1004 08:05:00.362816 1 cronjob_controllerv2.go:154] error syncing CronJobController cloud20/check-saml2-metadata, requeuing: Operation cannot be fulfilled on cronjobs.batch "check-saml2-metadata": the object has been modified; please apply your changes to the latest version and try again
I1004 08:05:43.264713 1 event.go:291] "Event occurred" object="cloud20/check-saml2-metadata-27222245" kind="Job" apiVersion="batch/v1" type="Normal" reason="Completed" message="Job completed"
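
For reference, a read-only client-go sketch along these lines can flag that state by comparing the CronJob's status.active references with the actual Job conditions. The namespace and names match the describe output above; cluster access via ~/.kube/config is assumed, and this is only illustrative, not part of the controller.

// stalecheck.go: read-only sketch that reports status.active entries whose Job
// has in fact already finished.
package main

import (
    "context"
    "fmt"

    batchv1 "k8s.io/api/batch/v1"
    corev1 "k8s.io/api/core/v1"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/kubernetes"
    "k8s.io/client-go/tools/clientcmd"
)

// jobFinished mirrors the controller's notion of a finished Job: a Complete or
// Failed condition with status True.
func jobFinished(j *batchv1.Job) bool {
    for _, c := range j.Status.Conditions {
        if (c.Type == batchv1.JobComplete || c.Type == batchv1.JobFailed) && c.Status == corev1.ConditionTrue {
            return true
        }
    }
    return false
}

func main() {
    cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
    if err != nil {
        panic(err)
    }
    client, err := kubernetes.NewForConfig(cfg)
    if err != nil {
        panic(err)
    }
    ctx := context.Background()

    cj, err := client.BatchV1().CronJobs("cloud20").Get(ctx, "check-saml2-metadata", metav1.GetOptions{})
    if err != nil {
        panic(err)
    }
    for _, ref := range cj.Status.Active {
        job, err := client.BatchV1().Jobs(cj.Namespace).Get(ctx, ref.Name, metav1.GetOptions{})
        if err != nil {
            fmt.Printf("active entry %s: could not fetch Job (%v)\n", ref.Name, err)
            continue
        }
        if jobFinished(job) {
            fmt.Printf("stale active entry: %s has already finished\n", ref.Name)
        }
    }
}

Any entry reported as stale matches the symptom above: the Job carries a Complete condition, yet the CronJob still counts it as active and keeps emitting JobAlreadyActive.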

Seeing the same thing on GKE 1.21.5-gke.1302.

It's a bug in the v2 CronJob controller: when cleaning up jobs, it passes a CronJob with an old resourceVersion to the status update. The CronJobControllerV2 feature gate graduated to beta and is enabled by default in v1.21, which is why this started happening when you upgraded to Kubernetes 1.21. I will try to submit a PR to fix it. /assign
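
For anyone hitting the same class of conflict in their own controller or operator code, the usual client-go remedy is to re-fetch the object and retry the status update on conflict instead of reusing a cached copy. The sketch below shows that generic retry-on-conflict pattern; it is not the actual kube-controller-manager code or the fix in the PR mentioned above, and the helper name is made up for illustration.

// Package example: generic retry-on-conflict sketch for CronJob status updates.
// Not the actual cronjob controller code or the fix referenced in this issue.
package example

import (
    "context"

    batchv1 "k8s.io/api/batch/v1"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/kubernetes"
    "k8s.io/client-go/util/retry"
)

// updateCronJobStatus applies mutate to a freshly fetched CronJob and writes the
// status back, retrying whenever the API server answers with a 409 conflict.
func updateCronJobStatus(ctx context.Context, client kubernetes.Interface, ns, name string,
    mutate func(cj *batchv1.CronJob)) error {
    return retry.RetryOnConflict(retry.DefaultRetry, func() error {
        // Re-read the latest copy on every attempt so the update never carries
        // a stale resourceVersion, which is exactly what triggers the
        // "object has been modified" error above.
        cj, err := client.BatchV1().CronJobs(ns).Get(ctx, name, metav1.GetOptions{})
        if err != nil {
            return err
        }
        mutate(cj)
        _, err = client.BatchV1().CronJobs(ns).UpdateStatus(ctx, cj, metav1.UpdateOptions{})
        return err
    })
}

retry.RetryOnConflict re-runs the closure only when the error is a conflict, so a race with another writer resolves itself on the next attempt instead of surfacing as the log line reported here.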