kubernetes: Conflict error when creating a pod because the resource quota version is stale

Is this a BUG REPORT or FEATURE REQUEST?:

/kind bug

What happened:

1. In k8s.io/client-go/util/workqueue/queue.go (https://github.com/kubernetes/client-go/blob/master/util/workqueue/queue.go):

func (q *Type) Add(item interface{}) {
	q.cond.L.Lock()
	defer q.cond.L.Unlock()
	if q.shuttingDown {
		return
	}
	if q.dirty.has(item) {
		// Already queued (or parked) and not yet picked up: deduplicate.
		return
	}

	q.metrics.add(item)

	q.dirty.insert(item)
	if q.processing.has(item) {
		// A worker is handling this item: leave it in dirty; it will be
		// re-queued when the worker calls Done().
		return
	}

	q.queue = append(q.queue, item)
	q.cond.Signal()
}

func (q *Type) Get() (item interface{}, shutdown bool) {
	q.cond.L.Lock()
	defer q.cond.L.Unlock()
	for len(q.queue) == 0 && !q.shuttingDown {
		q.cond.Wait()
	}
	if len(q.queue) == 0 {
		// We must be shutting down.
		return nil, true
	}

	item, q.queue = q.queue[0], q.queue[1:]

	q.metrics.get(item)

	// Move the item from dirty to processing; until Done() is called,
	// further Adds of this item land in dirty only.
	q.processing.insert(item)
	q.dirty.delete(item)

	return item, false
}

In the Add() function, we want each item to appear in the queue only once. In the Get() function, if there is more than one consumer, they wait until a producer puts an item into the queue.

If 10 consumers are waiting and the same item is put into the queue 10 times, all 10 adds end up being consumed: the dirty set deduplicates only while the item is not being processed, and an Add that arrives while a worker holds the item re-queues it as soon as the worker calls Done(). We would expect only 1 consumer to consume the item, because it is a duplicate.
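A minimal sketch of this interleaving against client-go's workqueue (the key name is illustrative):

package main

import (
	"fmt"

	"k8s.io/client-go/util/workqueue"
)

func main() {
	q := workqueue.New()

	// While no worker holds the key, repeated Adds are deduplicated by
	// the dirty set: the queue keeps a single copy.
	q.Add("ns/pod-quota")
	q.Add("ns/pod-quota")
	fmt.Println(q.Len()) // 1

	// A worker picks the key up: it moves from dirty to processing.
	key, _ := q.Get()

	// An Add that arrives while the key is being processed is parked in
	// the dirty set; the queue itself stays empty for now.
	q.Add("ns/pod-quota")
	fmt.Println(q.Len()) // 0

	// Done() re-queues the parked key, so another worker syncs the same
	// quota again, racing with whatever the first worker just wrote.
	q.Done(key)
	fmt.Println(q.Len()) // 1

	q.ShutDown()
}

So the dirty set collapses duplicates only while the key is idle; once workers start picking it up, every overlapping Add translates into another sync of the same key.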
2. This by itself is fine, but it causes problems in kube-controller-manager. Take the resource quota controller as an example: https://github.com/kubernetes/kubernetes/blob/master/pkg/controller/resourcequota/resource_quota_controller.go There are 6 workers running:

for i := 0; i < workers; i++ {
	go wait.Until(rq.worker(rq.queue), time.Second, stopCh)
	go wait.Until(rq.worker(rq.missingUsageQueue), time.Second, stopCh)
}

func (rq *ResourceQuotaController) worker(queue workqueue.RateLimitingInterface) func() {
	workFunc := func() bool {
		key, quit := queue.Get()
		if quit {
			return true
		}
		defer queue.Done(key)
		err := rq.syncHandler(key.(string))
		if err == nil {
			queue.Forget(key)
			return false
		}
		utilruntime.HandleError(err)
		queue.AddRateLimited(key)
		return false
	}

	return func() {
		for {
			if quit := workFunc(); quit {
				glog.Infof("resource quota controller worker shutting down")
				return
			}
		}
	}
}

When we update or delete a deployment with 10 pods, the same key (the namespace/name of the quota) is put into the queue 10 times, and the 6 workers wake up to handle that same key.

A worker's copy of the quota may be stale by the time its recalculation finishes, and updating a stale quota fails because other workers have already updated it.

As a result, a large number of conflict errors are logged:

Operation cannot be fulfilled on resourcequotas \"pod-quota\": the object has been modified; please apply your changes to the latest version and try again
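For reference, the usual client-go pattern for resolving such 409s is to re-read the object and re-apply the change inside retry.RetryOnConflict; the sketch below is only illustrative (the helper name and the recalculated usage are assumptions, not the controller's actual code):

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/util/retry"
)

// updateQuotaUsed re-reads the quota on every conflict and re-applies the
// recalculated usage against the fresh copy. Hypothetical helper.
func updateQuotaUsed(ctx context.Context, client kubernetes.Interface, ns, name string, used corev1.ResourceList) error {
	return retry.RetryOnConflict(retry.DefaultRetry, func() error {
		quota, err := client.CoreV1().ResourceQuotas(ns).Get(ctx, name, metav1.GetOptions{})
		if err != nil {
			return err
		}
		quota.Status.Used = used
		_, err = client.CoreV1().ResourceQuotas(ns).UpdateStatus(ctx, quota, metav1.UpdateOptions{})
		return err
	})
}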

3. Up to this point these are only warnings; no hard error has occurred. But this time we may fail to create a pod, with this error:

{"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"Operation cannot be fulfilled on resourcequotas \"pod-quota\": the object has been modified; please apply your changes to the latest version and try again","reason":"Conflict","details":{"name":"pod-quota","kind":"resourcequotas"},"code":409}

This is because the resource quota is recalculated and updated before the pod is created. If the update fails, the function retries up to 3 times; if the API server still has not accepted the quota update after those 3 attempts, the error is returned to the user:

e.checkQuotas(quotas, admissionAttributes, 3)

func (e *quotaEvaluator) checkQuotas(quotas []api.ResourceQuota, admissionAttributes []*admissionWaiter, remainingRetries int) {
	// ...
	for i := range quotas {
		newQuota := quotas[i]

		// if this quota didn't have its status changed, skip it
		if quota.Equals(originalQuotas[i].Status.Used, newQuota.Status.Used) {
			continue
		}

		if err := e.quotaAccessor.UpdateQuotaStatus(&newQuota); err != nil {
			updatedFailedQuotas = append(updatedFailedQuotas, newQuota)
			lastErr = err
		}
	}
	// ...
	e.checkQuotas(quotasToCheck, admissionAttributes, remainingRetries-1)
}

How to reproduce it (as minimally and precisely as possible):

  1. Create a resource quota, and create 50 pods via 1 deployment.
  2. Delete the deployment, and you will see many log lines like: “please apply your changes to the latest version and try again”.

I hope that:

  1. Reduce the conflict logs and errors, perhaps by raising their log level?
  2. Return instead of waiting when the queue is empty, or add a function like peek() (see the sketch after this list)?
  3. Add a retry delay when an error occurs during pod creation?
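For point 2, a hypothetical peek() on the queue type from queue.go could look like the sketch below; client-go's workqueue has no such method today, so this is only an illustration of the request:

// Peek returns the head of the queue without moving it into the
// processing set, so a concurrent Add of the same item is still
// deduplicated by the dirty set. Hypothetical: not part of client-go.
func (q *Type) Peek() (item interface{}, ok bool) {
	q.cond.L.Lock()
	defer q.cond.L.Unlock()
	if len(q.queue) == 0 || q.shuttingDown {
		return nil, false
	}
	return q.queue[0], true
}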

About this issue

  • Original URL
  • State: open
  • Created 6 years ago
  • Reactions: 12
  • Comments: 37 (11 by maintainers)

Most upvoted comments

Just encountered this - seems like a lot of clients are building in their own retry logic 🤷‍♂️

Unsure if this is something that can be generalized on the API server, i.e. when creating an object hits a conflict with a dependent resource quota, retry.
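As a sketch of what that client-side retry logic usually looks like (assuming a client-go version whose typed clients take a context; the helper name is made up):

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/util/retry"
)

// createPodRetryingQuotaConflicts retries pod creation while the API
// server returns a 409 caused by a concurrent resourcequota update.
func createPodRetryingQuotaConflicts(ctx context.Context, client kubernetes.Interface, pod *corev1.Pod) error {
	return retry.OnError(retry.DefaultBackoff, apierrors.IsConflict, func() error {
		_, err := client.CoreV1().Pods(pod.Namespace).Create(ctx, pod, metav1.CreateOptions{})
		if apierrors.IsAlreadyExists(err) {
			return nil // an earlier attempt actually went through
		}
		return err
	})
}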

Another place where we see this happening: in our Airflow (https://github.com/apache/airflow) workflow there is a point when lots of pods are created at the same time to run some calculations in parallel. From time to time, we hit this bug:

HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"Operation cannot be fulfilled on resourcequotas \"delphi-resourcequota\": the object has been modified; please apply your changes to the latest version and try again","reason":"Conflict","details":{"name":"delphi-resourcequota","kind":"resourcequotas"},"code":409}

We are also facing the same issue in our production cluster. I was able to reproduce it even on local minikube as described above, with a resourcequota and a deployment with 50 replicas. When the deployment is deleted, there are many errors in the kube-controller-manager-minikube pod logs:

E0326 14:15:22.541986       1 resource_quota_controller.go:251] Operation cannot be fulfilled on resourcequotas "quota": the object has been modified; please apply your changes to the latest version and try again
E0326 14:15:57.295237       1 resource_quota_controller.go:251] Operation cannot be fulfilled on resourcequotas "quota": the object has been modified; please apply your changes to the latest version and try again
$ kubectl version --short
Client Version: v1.16.3
Server Version: v1.16.3

Is there any workaround?

We are experiencing this issue, specifically with Service resources for some reason, although we do have resource quota on many other types of resources.

Our cluster info: version.Info{ Major:"1", Minor:"15", GitVersion:"v1.15.6", GitCommit:"7015f71e75f670eb9e7ebd4b5749639d42e20079", GitTreeState:"clean", BuildDate:"2019-11-13T11:11:50Z", GoVersion:"go1.12.12", Compiler:"gc", Platform:"linux/amd64" }

We are facing a similar issue after upgrading Kubernetes to 1.20.15 and moving the Kube services to binaries. Has anyone found a workaround other than the Helm-based one raised above? I am a bit hesitant to roll this out in production with this bug, as most of our application deployments are Helm-based and go through pipelines. It would be great if someone could share how to fix or work around this issue. Thanks

Also see this when installing the GitLab helm chart in CI

Release "gitlab" does not exist. Installing it now.
Error: Operation cannot be fulfilled on resourcequotas "gke-resource-quotas": the object has been modified; please apply your changes to the latest version and try again

which causes that operation to fail.

@Nebulazhang Are you still around to reopen this issue? It aged-out, but clearly is still going on.