kubernetes: Dynamic informers do not stop when custom resource definition is removed

What happened:

Once started, dynamic informers for custom resources are not stopped.

What you expected to happen:

After the dynamic informers resynced, they would stop the informers belonging to no-longer-existing resources.

How to reproduce it (as minimally and precisely as possible):

  1. Start a cluster with garbage collection and quota enabled, with verbosity high enough to see API requests logged
  2. Create a custom resource definition
  3. Observe list/watch requests from the dynamic informers:
I0701 14:31:54.604559   77309 wrap.go:47] GET /apis/mygroup.example.com/v1/myresources?limit=500&resourceVersion=0: (382.601µs) 200 [hyperkube/v1.16.0 (darwin/amd64) kubernetes/40038d5/dynamic-informers 127.0.0.1:62616]
  4. Delete the custom resource definition
  5. Observe continuous list requests from the dynamic informers receiving a 404:
I0701 14:33:05.857207   77309 wrap.go:47] GET /apis/mygroup.example.com/v1/myresources?limit=500&resourceVersion=0: (346.987µs) 404 [hyperkube/v1.16.0 (darwin/amd64) kubernetes/40038d5/dynamic-informers 127.0.0.1:62616]
I0701 14:33:06.861218   77309 wrap.go:47] GET /apis/mygroup.example.com/v1/myresources?limit=500&resourceVersion=0: (349.903µs) 404 [hyperkube/v1.16.0 (darwin/amd64) kubernetes/40038d5/dynamic-informers 127.0.0.1:62616]
...

After the garbage collector and quota controllers switched to the generic metadata client, the controller-manager logs around this became even harder to understand:

2019-11-16T01:18:20.15247609Z stderr F E1116 01:18:20.152307       1 reflector.go:156] k8s.io/client-go/metadata/metadatainformer/informer.go:89: Failed to list *v1.PartialObjectMetadata: the server could not find the requested resource
2019-11-16T01:18:20.450468558Z stderr F E1116 01:18:20.450095       1 reflector.go:156] k8s.io/client-go/metadata/metadatainformer/informer.go:89: Failed to list *v1.PartialObjectMetadata: the server could not find the requested resource
2019-11-16T01:18:20.632847838Z stderr F E1116 01:18:20.625706       1 reflector.go:156] k8s.io/client-go/metadata/metadatainformer/informer.go:89: Failed to list *v1.PartialObjectMetadata: the server could not find the requested resource
...

Anything else we need to know?:

Environment:

  • Kubernetes version (use kubectl version):
  • Cloud provider or hardware configuration:
  • OS (e.g: cat /etc/os-release):
  • Kernel (e.g. uname -a):
  • Install tools:
  • Network plugin and version (if this is a network-related bug):
  • Others:

/sig api-machinery
/priority important-soon
/cc @sttts @jpbetz @deads2k

About this issue

  • Original URL
  • State: open
  • Created 5 years ago
  • Reactions: 26
  • Comments: 35 (21 by maintainers)

Most upvoted comments

I think we might have found a workaround for this issue. We were seeing the same errors in our k8s clusters after we deleted some old CRDs to install new ones. It turns out that cycling the kube-controller-manager leader (docker kill on the master) fixed the issue for us.

Is there a new status or any known workaround for this issue? We also have this problem, and it spams about one message per second into the kube-controller-manager and audit logs.

👀

I'm seeing this too.

Hi! I’m sorry to bump this again, but could you advise on a workaround if there’s no solution yet? It seems that added CRDs (particularly in the Operator case) have to stay in the cluster forever even when they’re no longer needed; otherwise, there’s a lot of spam in kube-controller-manager.

I ran into this recently and was thinking about ways to fix it. I agree with @deads2k that “unwatching” (or whether to unwatch) needs to be an explicit decision by the user of the client. Given that, I see 2 potential approaches:

  1. A per-informer Stop() function, with reference counting on the informer cache. Each call to get a new informer would actually return a wrapper around the underlying shared informer, with its own Stop() implementation. The reference count is incremented on each instantiation, and only when all wrappers are stopped is the informer actually terminated (see the sketch after this list).
  2. A shared-informer-cache-level “stop watching when the resource goes away” setting. This is less powerful than (1) but would be sufficient to cover the kube-controller-manager use case. It seems simpler to implement, but we would need to be careful to avoid race conditions around rapidly deleting and recreating CRDs.
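
A minimal sketch of approach (1), assuming a hypothetical reference-counted wrapper; none of these names are real client-go API:

import "sync"

// refCountedInformer pairs a shared informer's stop channel with a count of
// live handles. Each caller gets its own handle, and only the final Stop()
// actually tears the informer down.
type refCountedInformer struct {
    mu     sync.Mutex
    refs   int
    stopCh chan struct{} // closing this terminates the underlying informer's Run loop
}

type informerHandle struct {
    parent  *refCountedInformer
    stopped bool
}

// Acquire increments the reference count and returns a new handle.
func (r *refCountedInformer) Acquire() *informerHandle {
    r.mu.Lock()
    defer r.mu.Unlock()
    r.refs++
    return &informerHandle{parent: r}
}

// Stop releases one reference; the last Stop terminates the informer.
func (h *informerHandle) Stop() {
    h.parent.mu.Lock()
    defer h.parent.mu.Unlock()
    if h.stopped {
        return // Stop is idempotent per handle
    }
    h.stopped = true
    h.parent.refs--
    if h.parent.refs == 0 {
        close(h.parent.stopCh)
    }
}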

I haven’t contributed much to our client code - is this the sort of decision that needs a KEP?

An alternative that works in my case, until this issue gets fixed, is to use watch.Interface directly instead of the dynamic informer:

// Assumed imports:
//   metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
//   "k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
//   "k8s.io/apimachinery/pkg/watch"
//   "github.com/pkg/errors"
//
// GVR is a schema.GroupVersionResource
// client is a dynamic.Interface

wi, err := client.Resource(GVR).Watch(metav1.ListOptions{})
if err != nil {
    return nil, errors.Wrapf(err, "error creating watch for target resource: %v", GVR)
}

for event := range wi.ResultChan() {
    switch event.Type {
    case watch.Added:
        curobj := event.Object.(*unstructured.Unstructured).DeepCopy()
        // ... handle the added object ...
    case watch.Modified:
        // NOTICE: only the new object is delivered; a watch event carries no
        // "old object" the way an informer's update handler does.
        curobj := event.Object.(*unstructured.Unstructured).DeepCopy()
        // ... handle the modified object ...
    case watch.Deleted:
        curobj := event.Object.(*unstructured.Unstructured).DeepCopy()
        // ... handle the deleted object ...
    }
}

// Stop the watch when you delete the CRD.
wi.Stop()

This way, you can roll back to the informer implementation later without much effort.

@tallclair this is sounding a bit like Gatekeeper’s watch manager:

  • A central watch manager handles the setup/teardown of dynamic watches (which is just an informer cache with the ability to remove GVKs)

  • Individual controllers can sign up for “registrars”, which is a controller-specific object that allows them to update the set of GVKs they are watching via set operations (add/remove/replace one or many)

  • The watch manager distributes watch events to controllers by posting to their individual channels

  • The central watch manager handles updates to registrars, setting up and/or tearing down watches based on the composite results

Happy to share experiences with it if desired. The biggest challenge would be reconciling events for a watch that has already been torn down, and avoiding race conditions when two reconcilers act on the same resource but only one uses a registrar.
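
For illustration, the registrar pattern described above might look roughly like this (hypothetical names, not Gatekeeper’s actual code):

import (
    "sync"

    "k8s.io/apimachinery/pkg/runtime/schema"
)

// WatchManager owns the dynamic watches. Whenever any registrar changes, it
// recomputes the union of all registrars' GVK sets and starts or stops
// informers so the running watches match that composite set.
type WatchManager struct {
    mu         sync.Mutex
    registrars map[string]map[schema.GroupVersionKind]bool
}

// Registrar is a controller-specific handle for declaring which GVKs that
// controller wants watched; matching events are posted to its channel.
type Registrar struct {
    name    string
    manager *WatchManager
    events  chan interface{}
}

// ReplaceWatch swaps this registrar's entire GVK set in one operation
// (add/remove variants would mutate the set instead of replacing it).
func (r *Registrar) ReplaceWatch(gvks []schema.GroupVersionKind) {
    set := make(map[schema.GroupVersionKind]bool, len(gvks))
    for _, gvk := range gvks {
        set[gvk] = true
    }
    r.manager.mu.Lock()
    defer r.manager.mu.Unlock()
    r.manager.registrars[r.name] = set
    // Here the manager would reconcile running watches against the new
    // composite set, tearing down any watch no registrar wants anymore.
}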

I was thinking about adding two funcs to the SharedInformer interface:

diff --git a/staging/src/k8s.io/client-go/tools/cache/shared_informer.go b/staging/src/k8s.io/client-go/tools/cache/shared_informer.go
index f59a0852fe..8025408a40 100644
--- a/staging/src/k8s.io/client-go/tools/cache/shared_informer.go
+++ b/staging/src/k8s.io/client-go/tools/cache/shared_informer.go
@@ -148,6 +148,11 @@ type SharedInformer interface {
        // store. The value returned is not synchronized with access to the underlying store and is not
        // thread-safe.
        LastSyncResourceVersion() string
+       // RemoveEventHandler removes an event handler from the shared informer
+       // It returns true if removal is successful
+       RemoveEventHandler(handler ResourceEventHandler) bool
+       // EventHandlerCount returns the number of event handlers currently registered with the shared informer
+       EventHandlerCount() int
 }

 // SharedIndexInformer provides add and get Indexers ability based on SharedInformer.

Given the above, sharedInformerFactory would be able to sideline informers whose count of event handlers reaches zero. In the following snippet, such sidelining is done in WaitForCacheSync (it can also be done in a new factory func):

diff --git a/staging/src/k8s.io/client-go/informers/factory.go b/staging/src/k8s.io/client-go/informers/factory.go
index c5230a491d..fcca7087d6 100644
--- a/staging/src/k8s.io/client-go/informers/factory.go
+++ b/staging/src/k8s.io/client-go/informers/factory.go
@@ -61,6 +61,8 @@ type sharedInformerFactory struct {
        customResync     map[reflect.Type]time.Duration

        informers map[reflect.Type]cache.SharedIndexInformer
+       // sidelined informers
+       sidelinedInformers  map[reflect.Type]cache.SharedIndexInformer
        // startedInformers is used for tracking which informers have been started.
        // This allows Start() to be called multiple times safely.
        startedInformers map[reflect.Type]bool
@@ -153,6 +155,16 @@ func (f *sharedInformerFactory) WaitForCacheSync(stopCh <-chan struct{}) map[ref
        }()

        res := map[reflect.Type]bool{}
+       toSideline := map[reflect.Type]cache.SharedIndexInformer{}
+       for informType, informer := range informers {
+               if informer.EventHandlerCount() == 0 {
+                       toSideline[informType] = informer
+               }
+       }
+       for informType, informer := range toSideline {
+               delete(f.informers, informType)
+               f.sidelinedInformers[informType] = informer
+       }
        for informType, informer := range informers {
                res[informType] = cache.WaitForCacheSync(stopCh, informer.HasSynced)
        }

This way, we don’t need to introduce a stop channel for each informer. We would also need to add this method:

// removeListener returns true if the listener was removed, false otherwise.
func (p *sharedProcessor) removeListener(listener *processorListener) bool {
	p.listenersLock.Lock()
	defer p.listenersLock.Unlock()
	removed := false
	for i := range p.listeners {
		if p.listeners[i] == listener {
			p.listeners = append(p.listeners[:i], p.listeners[i+1:]...)
			removed = true
			break
		}
	}
	// The listener's index in syncingListeners can differ from its index in
	// listeners (and it may not be present at all), so search that slice
	// independently rather than reusing the index found above.
	for i := range p.syncingListeners {
		if p.syncingListeners[i] == listener {
			p.syncingListeners = append(p.syncingListeners[:i], p.syncingListeners[i+1:]...)
			break
		}
	}
	return removed
}

PR #77480 has more work to do:

  1. The interface is not structured the way we would like it to be; @caesarxuchao raised this in the comments. Since a dynamic informer can be used outside the resource quota and garbage collector controllers, it should have a generic interface (see the sketch after this list).

  2. Since the dynamic informer is shared, it can only be stopped when it is ‘safe’ to do so, i.e. when all of its users are aware that it is being stopped and no longer use it afterwards.
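
To make point 1 concrete, one generic shape for such an interface might look something like this (a hypothetical sketch, not the actual API from the PR):

import (
    "k8s.io/apimachinery/pkg/runtime/schema"
    "k8s.io/client-go/informers"
)

// StoppableInformerFactory is a hypothetical generic interface: any consumer
// (quota, garbage collector, or a third-party controller) can obtain an
// informer and later release it, and the factory stops the shared informer
// only after every consumer has released it (addressing point 2).
type StoppableInformerFactory interface {
    // ForResource returns (starting it if necessary) the informer for gvr.
    ForResource(gvr schema.GroupVersionResource) informers.GenericInformer
    // Release signals that this consumer no longer needs the informer for
    // gvr; the factory stops the informer once the last consumer releases it.
    Release(gvr schema.GroupVersionResource)
}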