kubernetes: Conflict error when creating a pod because the resource quota version is stale

Is this a BUG REPORT or FEATURE REQUEST?:

/kind bug

What happened:

1. In k8s.io/client-go/util/workqueue/queue.go (https://github.com/kubernetes/client-go/blob/master/util/workqueue/queue.go):

func (q *Type) Add(item interface{}) {
	q.cond.L.Lock()
	defer q.cond.L.Unlock()
	if q.shuttingDown {
		return
	}
	if q.dirty.has(item) {
		// Already queued (or parked) and not yet picked up: deduplicate.
		return
	}

	q.metrics.add(item)

	q.dirty.insert(item)
	if q.processing.has(item) {
		// A worker is handling this item: leave it in dirty; it will be
		// re-queued when the worker calls Done().
		return
	}

	q.queue = append(q.queue, item)
	q.cond.Signal()
}

func (q *Type) Get() (item interface{}, shutdown bool) {
	q.cond.L.Lock()
	defer q.cond.L.Unlock()
	for len(q.queue) == 0 && !q.shuttingDown {
		q.cond.Wait()
	}
	if len(q.queue) == 0 {
		// We must be shutting down.
		return nil, true
	}

	item, q.queue = q.queue[0], q.queue[1:]

	q.metrics.get(item)

	// Move the item from dirty to processing; until Done() is called,
	// further Adds of this item land in dirty only.
	q.processing.insert(item)
	q.dirty.delete(item)

	return item, false
}

In the Add() function, we want each item to appear in the queue only once. In the Get() function, if there is more than one consumer, they wait until a producer puts an item into the queue.

If 10 consumers are waiting and the same item is put into the queue 10 times, all 10 adds end up being consumed: the dirty set deduplicates only while the item is not being processed, and an Add that arrives while a worker holds the item re-queues it as soon as the worker calls Done(). We would expect only 1 consumer to consume the item, because it is a duplicate.
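A minimal sketch of this interleaving against client-go's workqueue (the key name is illustrative):

package main

import (
	"fmt"

	"k8s.io/client-go/util/workqueue"
)

func main() {
	q := workqueue.New()

	// While no worker holds the key, repeated Adds are deduplicated by
	// the dirty set: the queue keeps a single copy.
	q.Add("ns/pod-quota")
	q.Add("ns/pod-quota")
	fmt.Println(q.Len()) // 1

	// A worker picks the key up: it moves from dirty to processing.
	key, _ := q.Get()

	// An Add that arrives while the key is being processed is parked in
	// the dirty set; the queue itself stays empty for now.
	q.Add("ns/pod-quota")
	fmt.Println(q.Len()) // 0

	// Done() re-queues the parked key, so another worker syncs the same
	// quota again, racing with whatever the first worker just wrote.
	q.Done(key)
	fmt.Println(q.Len()) // 1

	q.ShutDown()
}

So the dirty set collapses duplicates only while the key is idle; once workers start picking it up, every overlapping Add translates into another sync of the same key.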
2. This by itself is fine, but it causes problems in kube-controller-manager. Take the resource quota controller as an example: https://github.com/kubernetes/kubernetes/blob/master/pkg/controller/resourcequota/resource_quota_controller.go There are 6 workers running:

for i := 0; i < workers; i++ {
	go wait.Until(rq.worker(rq.queue), time.Second, stopCh)
	go wait.Until(rq.worker(rq.missingUsageQueue), time.Second, stopCh)
}

func (rq *ResourceQuotaController) worker(queue workqueue.RateLimitingInterface) func() {
	workFunc := func() bool {
		key, quit := queue.Get()
		if quit {
			return true
		}
		defer queue.Done(key)
		err := rq.syncHandler(key.(string))
		if err == nil {
			queue.Forget(key)
			return false
		}
		utilruntime.HandleError(err)
		queue.AddRateLimited(key)
		return false
	}

	return func() {
		for {
			if quit := workFunc(); quit {
				glog.Infof("resource quota controller worker shutting down")
				return
			}
		}
	}
}

When we update or delete a deployment with 10 pods, the same key (the namespace/name of the quota) is put into the queue 10 times, and the 6 workers wake up to handle that same key.

A worker's copy of the quota may be stale by the time its recalculation finishes, and updating a stale quota fails because other workers have already updated it.

As a result, a large number of conflict errors are logged:

Operation cannot be fulfilled on resourcequotas \"pod-quota\": the object has been modified; please apply your changes to the latest version and try again
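For reference, the usual client-go pattern for resolving such 409s is to re-read the object and re-apply the change inside retry.RetryOnConflict; the sketch below is only illustrative (the helper name and the recalculated usage are assumptions, not the controller's actual code):

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/util/retry"
)

// updateQuotaUsed re-reads the quota on every conflict and re-applies the
// recalculated usage against the fresh copy. Hypothetical helper.
func updateQuotaUsed(ctx context.Context, client kubernetes.Interface, ns, name string, used corev1.ResourceList) error {
	return retry.RetryOnConflict(retry.DefaultRetry, func() error {
		quota, err := client.CoreV1().ResourceQuotas(ns).Get(ctx, name, metav1.GetOptions{})
		if err != nil {
			return err
		}
		quota.Status.Used = used
		_, err = client.CoreV1().ResourceQuotas(ns).UpdateStatus(ctx, quota, metav1.UpdateOptions{})
		return err
	})
}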

3. Up to this point these are only warnings; no hard error has occurred. But this time we may fail to create a pod, with this error:

{"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"Operation cannot be fulfilled on resourcequotas \"pod-quota\": the object has been modified; please apply your changes to the latest version and try again","reason":"Conflict","details":{"name":"pod-quota","kind":"resourcequotas"},"code":409}

This is because the resource quota is recalculated and updated before the pod is created. If the update fails, the function retries up to 3 times; if the API server still has not accepted the quota update after those 3 attempts, the error is returned to the user:

e.checkQuotas(quotas, admissionAttributes, 3)

func (e *quotaEvaluator) checkQuotas(quotas []api.ResourceQuota, admissionAttributes []*admissionWaiter, remainingRetries int) {
	// ...
	for i := range quotas {
		newQuota := quotas[i]

		// if this quota didn't have its status changed, skip it
		if quota.Equals(originalQuotas[i].Status.Used, newQuota.Status.Used) {
			continue
		}

		if err := e.quotaAccessor.UpdateQuotaStatus(&newQuota); err != nil {
			updatedFailedQuotas = append(updatedFailedQuotas, newQuota)
			lastErr = err
		}
	}
	// ...
	e.checkQuotas(quotasToCheck, admissionAttributes, remainingRetries-1)
}

How to reproduce it (as minimally and precisely as possible):

  1. Create a resource quota, and create 50 pods via 1 deployment.
  2. Delete the deployment, and you will see many log lines like: “please apply your changes to the latest version and try again”.

I hope that:

  1. Reduce the conflict logs and errors, perhaps by raising their log level?
  2. Return instead of waiting when the queue is empty, or add a function like peek() (see the sketch after this list)?
  3. Add a retry delay when an error occurs during pod creation?
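For point 2, a hypothetical peek() on the queue type from queue.go could look like the sketch below; client-go's workqueue has no such method today, so this is only an illustration of the request:

// Peek returns the head of the queue without moving it into the
// processing set, so a concurrent Add of the same item is still
// deduplicated by the dirty set. Hypothetical: not part of client-go.
func (q *Type) Peek() (item interface{}, ok bool) {
	q.cond.L.Lock()
	defer q.cond.L.Unlock()
	if len(q.queue) == 0 || q.shuttingDown {
		return nil, false
	}
	return q.queue[0], true
}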

About this issue

  • Original URL
  • State: open
  • Created 6 years ago
  • Reactions: 12
  • Comments: 37 (11 by maintainers)

Most upvoted comments

Just encountered this - seems like a lot of clients are building in their own retry logic 🤷‍♂️

Unsure if this is something that can be generalized on the API server, i.e. when creating an object hits a conflict with a dependent resource quota, retry.
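As a sketch of what that client-side retry logic usually looks like (assuming a client-go version whose typed clients take a context; the helper name is made up):

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/util/retry"
)

// createPodRetryingQuotaConflicts retries pod creation while the API
// server returns a 409 caused by a concurrent resourcequota update.
func createPodRetryingQuotaConflicts(ctx context.Context, client kubernetes.Interface, pod *corev1.Pod) error {
	return retry.OnError(retry.DefaultBackoff, apierrors.IsConflict, func() error {
		_, err := client.CoreV1().Pods(pod.Namespace).Create(ctx, pod, metav1.CreateOptions{})
		if apierrors.IsAlreadyExists(err) {
			return nil // an earlier attempt actually went through
		}
		return err
	})
}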

Another place where we see this happening: in our Airflow (https://github.com/apache/airflow) workflow there is a point when lots of pods are created at the same time to run some calculations in parallel. From time to time, we hit this bug:

HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"Operation cannot be fulfilled on resourcequotas \"delphi-resourcequota\": the object has been modified; please apply your changes to the latest version and try again","reason":"Conflict","details":{"name":"delphi-resourcequota","kind":"resourcequotas"},"code":409}

We are also facing the same issue in our production cluster. I was able to reproduce it even on local minikube as described above, with a resourcequota and a deployment with 50 replicas. When the deployment is deleted, there are many errors in the kube-controller-manager-minikube pod logs:

E0326 14:15:22.541986       1 resource_quota_controller.go:251] Operation cannot be fulfilled on resourcequotas "quota": the object has been modified; please apply your changes to the latest version and try again
E0326 14:15:57.295237       1 resource_quota_controller.go:251] Operation cannot be fulfilled on resourcequotas "quota": the object has been modified; please apply your changes to the latest version and try again
$ kubectl version --short
Client Version: v1.16.3
Server Version: v1.16.3

Is there any workaround?

We are experiencing this issue, specifically with Service resources for some reason, although we do have resource quota on many other types of resources.

Our cluster info: version.Info{ Major:"1", Minor:"15", GitVersion:"v1.15.6", GitCommit:"7015f71e75f670eb9e7ebd4b5749639d42e20079", GitTreeState:"clean", BuildDate:"2019-11-13T11:11:50Z", GoVersion:"go1.12.12", Compiler:"gc", Platform:"linux/amd64" }

We are facing a similar issue after upgrading Kubernetes to 1.20.15 and moving the Kube services to binaries. Has anyone found a workaround other than the Helm-based one raised above? I am a bit hesitant to roll this out in production with this bug, as most of our application deployments are Helm-based and go through pipelines. It would be great if someone could share how to fix or work around this issue. Thanks

Also see this when installing the GitLab helm chart in CI

Release "gitlab" does not exist. Installing it now.
Error: Operation cannot be fulfilled on resourcequotas "gke-resource-quotas": the object has been modified; please apply your changes to the latest version and try again

which causes that operation to fail.

@Nebulazhang Are you still around to reopen this issue? It aged-out, but clearly is still going on.