kubernetes: EBS fails to detach after controller manager is restarted

Is this a request for help? (If yes, you should use our troubleshooting guide and community support channels, see http://kubernetes.io/docs/troubleshooting/.): No

What keywords did you search in Kubernetes issues before filing this one? (If you have found any duplicates, you should instead reply there.): EBS

Is this a BUG REPORT or FEATURE REQUEST? (choose one): BUG REPORT

Kubernetes version (use kubectl version): 1.4.8

Environment:

  • Cloud provider or hardware configuration: AWS
  • OS (e.g. from /etc/os-release): Ubuntu 16.04.2
  • Kernel (e.g. uname -a): 4.8.0
  • Install tools: custom ansible
  • Others:

What happened: We have an HA controller manager setup. After a controller manager was restarted, a new controller manager became the leader. After this happened, we observed that some EBS volumes fail to detach when a pod is redeployed. It seems that the controller manager forgets about some attached EBS volumes after it is restarted.

What you expected to happen: After a new controller manager starts, if an existing pod with an EBS volume gets redeployed, the volume should be detached first.

How to reproduce it (as minimally and precisely as possible): Unfortunately I’ve been unable to reproduce this consistently - it doesn’t always happen, but when it does, it is always after a controller manager was restarted.

Anything else we need to know:


Most upvoted comments

@jingxu97 @gnufied @jsravn

I am not sure I understand the comment that “Events will still be missed”.

Me neither.

Shouldn’t the informer replay the events from the period when the controller was stopped (not yet started)?

The informers resync their items periodically, as defined by either the global shared informer factory resync period or the per-event-handler resync period. But it’s better to use WaitForCacheSync in the controller’s Run function.
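A minimal sketch of the two knobs, using client-go style APIs (the handler bodies are placeholders; this issue’s vintage of the code predates some of these package paths):

```go
package main

import (
	"time"

	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
)

func setupInformers(client kubernetes.Interface, stopCh <-chan struct{}) {
	// Factory-wide default: every informer created here re-delivers its
	// full cache to its handlers every 30 seconds.
	factory := informers.NewSharedInformerFactory(client, 30*time.Second)

	podInformer := factory.Core().V1().Pods().Informer()

	// A per-handler resync period can override the factory default.
	podInformer.AddEventHandlerWithResyncPeriod(cache.ResourceEventHandlerFuncs{
		AddFunc: func(obj interface{}) { /* enqueue obj */ },
	}, 10*time.Second)

	factory.Start(stopCh)

	// The stronger guarantee: block until the initial list has populated
	// the cache before doing any work.
	cache.WaitForCacheSync(stopCh, podInformer.HasSynced)
}
```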

sharedInformers.Start() is called after Run() (see controllermanager.go#StartControllers), so I don’t think you can just call WaitForCacheSync in Run. You could call WaitForCacheSync in processPods instead (which is what I did).
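A sketch of the ordering being discussed (the names here are illustrative, not the exact controllermanager.go code): because each controller’s Run is launched in its own goroutine, a WaitForCacheSync inside Run simply blocks until the later Start populates the caches.

```go
package main

import (
	"k8s.io/client-go/informers"
)

func startControllers(run func(stopCh <-chan struct{}), sharedInformers informers.SharedInformerFactory, stopCh <-chan struct{}) {
	// Launched first; blocks internally on WaitForCacheSync, then starts
	// its workers.
	go run(stopCh)

	// Called after the controllers are launched; this is what eventually
	// unblocks the WaitForCacheSync calls above.
	sharedInformers.Start(stopCh)
}
```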

You can and should call WaitForCacheSync in Run. This is the common pattern in our controllers (a condensed sketch follows the list below). The typical flow is:

  1. In your controller’s constructor, call someInformer.Informer().AddEventHandler()
  2. The event handlers should add to a work queue
  3. In your controller’s Run, call WaitForCacheSync
  4. In your controller’s Run, once the caches are synced, start your worker goroutines
  5. The workers pop items off of the work queue and process them
  6. After executing all the controllers’ constructors and Run functions, we call Start on the shared informer factory
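Here is a condensed sketch of steps 1-5 using client-go informer and workqueue APIs. The Controller type and the reconcile logic in sync are hypothetical; the informer and workqueue calls are the standard ones:

```go
package controller

import (
	coreinformers "k8s.io/client-go/informers/core/v1"
	corelisters "k8s.io/client-go/listers/core/v1"
	"k8s.io/client-go/tools/cache"
	"k8s.io/client-go/util/workqueue"
)

type Controller struct {
	podLister  corelisters.PodLister
	podsSynced cache.InformerSynced
	queue      workqueue.RateLimitingInterface
}

// Steps 1-2: the constructor registers event handlers that do nothing but
// enqueue keys, so it is harmless if they fire before the cache has synced.
func NewController(podInformer coreinformers.PodInformer) *Controller {
	c := &Controller{
		podLister:  podInformer.Lister(),
		podsSynced: podInformer.Informer().HasSynced,
		queue:      workqueue.NewRateLimitingQueue(workqueue.DefaultControllerRateLimiter()),
	}
	podInformer.Informer().AddEventHandler(cache.ResourceEventHandlerFuncs{
		AddFunc:    c.enqueue,
		UpdateFunc: func(old, new interface{}) { c.enqueue(new) },
		DeleteFunc: c.enqueue,
	})
	return c
}

func (c *Controller) enqueue(obj interface{}) {
	// Handles tombstones from deletions as well as live objects.
	if key, err := cache.DeletionHandlingMetaNamespaceKeyFunc(obj); err == nil {
		c.queue.Add(key)
	}
}

// Steps 3-4: wait for the caches, then start the workers.
func (c *Controller) Run(workers int, stopCh <-chan struct{}) {
	defer c.queue.ShutDown()
	if !cache.WaitForCacheSync(stopCh, c.podsSynced) {
		return // stopCh closed before the caches synced
	}
	for i := 0; i < workers; i++ {
		go c.runWorker()
	}
	<-stopCh
}

// Step 5: workers pop keys off the queue and process them against the
// now-synced cache.
func (c *Controller) runWorker() {
	for {
		key, quit := c.queue.Get()
		if quit {
			return
		}
		if err := c.sync(key.(string)); err != nil {
			c.queue.AddRateLimited(key) // retry later
		} else {
			c.queue.Forget(key)
		}
		c.queue.Done(key)
	}
}

func (c *Controller) sync(key string) error {
	namespace, name, err := cache.SplitMetaNamespaceKey(key)
	if err != nil {
		return err
	}
	pod, err := c.podLister.Pods(namespace).Get(name)
	if err != nil {
		return nil // pod is gone; a real controller would reconcile detach here
	}
	_ = pod // hypothetical: reconcile attach/detach state for the pod's volumes
	return nil
}
```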

See https://github.com/kubernetes/kubernetes/blob/8e26fa25da6d3b1deb333fe2484f794795d1c6b9/staging/src/k8s.io/kube-aggregator/pkg/controllers/autoregister/autoregister_controller.go for an example of a typical controller using a work queue.

If all we do is uncomment the call to WaitForCacheSync and somehow fix the integration test I referenced, you’ll still have a situation where the event handler functions (e.g. podAdd/Update/Delete, nodeAdd/Update/Delete) may be called before WaitForCacheSync unblocks. I’m guessing that might cause issues, which is why you’d want to convert to using work queues.