kubernetes: EBS fails to detach after controller manager is restarted
Is this a request for help? (If yes, you should use our troubleshooting guide and community support channels, see http://kubernetes.io/docs/troubleshooting/.): No
What keywords did you search in Kubernetes issues before filing this one? (If you have found any duplicates, you should instead reply there.): EBS
Is this a BUG REPORT or FEATURE REQUEST? (choose one): BUG REPORT
Kubernetes version (use `kubectl version`): 1.4.8
Environment:
- Cloud provider or hardware configuration: AWS
- OS (e.g. from /etc/os-release): Ubuntu 16.04.2
- Kernel (e.g. `uname -a`): 4.8.0
- Install tools: custom ansible
- Others:
What happened: We have an HA controller manager setup. After a controller manager was restarted, a new controller manager became the leader. We observed that after this happens, some EBS volumes fail to detach when a pod is redeployed. It seems that the controller manager forgets about some attached EBS volumes after it is restarted.
What you expected to happen: After a new controller manager starts, if an existing pod with an attached EBS volume gets redeployed, the volume should be detached first.
How to reproduce it (as minimally and precisely as possible): Unfortunately I’ve been unable to reproduce this consistently - it doesn’t always happen, but when it does it’s always after a controller manager got restarted.
Anything else we need to know:
About this issue
- State: closed
- Created 7 years ago
- Comments: 52 (52 by maintainers)
Commits related to this issue
- Turn up logging for EBS problem https://github.com/kubernetes/kubernetes/issues/43300 — committed to jsravn/kubernetes by jsravn 7 years ago
@jingxu97 @gnufied @jsravn
Me either
The informers will resync their items every so often, as defined by either the global shared informer factory resync period or the per-event-handler resync period. But it's better to use `WaitForCacheSync` in the controller's `Run` function. You can and should call `WaitForCacheSync` in `Run`; this is the common pattern in our controllers. The typical flow is:
- call `someInformer.Informer().AddEventHandler()` to wire up your event handlers
- in `Run`, call `WaitForCacheSync`
- in `Run`, once the caches are synced, start your worker goroutines
- after calling the controllers' `Run` functions, call `Start` on the shared informer factory

See https://github.com/kubernetes/kubernetes/blob/8e26fa25da6d3b1deb333fe2484f794795d1c6b9/staging/src/k8s.io/kube-aggregator/pkg/controllers/autoregister/autoregister_controller.go for an example of a typical controller using a work queue.
If all we do is uncomment the call to `WaitForCacheSync` and somehow fix the integration test I referenced, you'll still have a situation where the event handler functions (e.g. `podAdd/Update/Delete`, `nodeAdd/Update/Delete`) may be called before `WaitForCacheSync` unblocks. I'm guessing that might cause issues, which is why you'd want to convert to using work queues.
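As a rough illustration of that conversion (a hedged sketch, not the actual attach/detach controller code; `enqueuePod` and `syncHandler` are made-up names), the event handlers would only enqueue object keys, so no real work happens until the workers start after `WaitForCacheSync` unblocks:

```go
package example

import (
	"k8s.io/client-go/tools/cache"
	"k8s.io/client-go/util/workqueue"
)

type controller struct {
	queue      workqueue.RateLimitingInterface
	podsSynced cache.InformerSynced
}

// enqueuePod can serve as the podAdd/Update/Delete handler even before the
// caches have synced: the key just sits in the queue until a worker picks it
// up, and workers are only started after WaitForCacheSync unblocks.
func (c *controller) enqueuePod(obj interface{}) {
	key, err := cache.DeletionHandlingMetaNamespaceKeyFunc(obj)
	if err != nil {
		return
	}
	c.queue.Add(key)
}

// runWorker is started as a goroutine from Run, after WaitForCacheSync.
func (c *controller) runWorker() {
	for c.processNextItem() {
	}
}

func (c *controller) processNextItem() bool {
	key, quit := c.queue.Get()
	if quit {
		return false
	}
	defer c.queue.Done(key)

	if err := c.syncHandler(key.(string)); err != nil {
		c.queue.AddRateLimited(key) // retry with backoff
		return true
	}
	c.queue.Forget(key)
	return true
}

func (c *controller) syncHandler(key string) error {
	// Look the object up via the informer's lister and reconcile it here.
	return nil
}
```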