kubernetes: CRD age does not change from <invalid> as soon as it becomes Established

Is this a BUG REPORT or FEATURE REQUEST?: /kind bug

What happened: When deploying https://github.com/istio/istio/blob/master/install/kubernetes/istio.yaml on minikube, I get the following errors:

unable to recognize "install/kubernetes/istio.yaml": no matches for config.istio.io/, Kind=attributemanifest

for each of the custom resources specified in https://github.com/istio/istio/blob/master/install/kubernetes/istio.yaml

Running kubectl get crd returns:

NAME                                  AGE
attributemanifests.config.istio.io    <invalid>

After several seconds, kubectl get crd returns a valid age for the CRDs. Then the custom resources are added successfully.

What you expected to happen: I expect this to be explicitly documented: can custom resources be added immediately after their CRDs are created, or must the user wait until kubectl get crd returns a valid age?

How to reproduce it (as minimally and precisely as possible): Deploy https://github.com/istio/istio/blob/master/install/kubernetes/istio.yaml on a Minikube instance with limited resources, so that the CRDs take time to become valid.

Anything else we need to know?:

Environment:

  • Kubernetes version (use kubectl version): 1.8.0
  • Cloud provider or hardware configuration: Minikube 0.23.0
  • OS (e.g. from /etc/os-release): Minikube on Virtual Box on Mac OS
  • Kernel (e.g. uname -a):
  • Install tools:
  • Others:

About this issue

  • Original URL
  • State: closed
  • Created 7 years ago
  • Comments: 34 (24 by maintainers)

Most upvoted comments

Working on reproducing the issue and implementing tests for this case.

I’ve worked with @sttts on reproducing the issue and finding the root of the problem. I’ll try my best to explain what I have found.

It’s hard to define what limited resources are. In most cases, everything worked flawlessly. To be able to reproduce the issue, I had to put my local machine (which was running the cluster via ./hack/local-up-cluster.sh) under heavy load (100% CPU and a lot of I/O operations). To accomplish this I used the stress command.

Under normal conditions everything works as expected: the age is valid and CRs are created successfully. Under heavy load, it can happen that CRs fail to be created, but the <invalid> age is very hard to reproduce.

It is expected that CRDs take some time to become fully created/provisioned, but under normal conditions that is not a problem. If the cluster has limited resources it can become one, but there is a possible fix we can discuss.

About the <invalid> age… the age string is produced by the ShortHumanDuration function in the duration package, and <invalid> is printed only when there is some deviation between machine clocks, i.e. the creation timestamp appears to lie in the future relative to the clock doing the printing.

This function is invoked by the translateTimestamp function with now - creationTimestamp as its argument.

Because the translateTimestamp function checks whether the time is zero (and if so returns <unknown> instead of <invalid>), we can be sure that some creation timestamp is returned by the API.

To analyze this further, I checked whether the API indeed returns the creationTimestamp even under load, and yes, it does. The time is set by the BeforeCreate function, specifically by the FillObjectMetaSystemFields function, which is defined in the same package and sets the creationTimestamp to Now().
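
To make that concrete, here is a minimal sketch of the printing path in Go, paraphrased from kubectl's printer and the apimachinery duration helper rather than copied verbatim; a creationTimestamp that lies a couple of seconds in the future relative to the local clock is exactly what produces <invalid>:

package main

import (
	"fmt"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// translateTimestamp mirrors the printer logic: a zero timestamp prints as
// <unknown>; otherwise the AGE column is derived from now - creationTimestamp.
func translateTimestamp(ts metav1.Time) string {
	if ts.IsZero() {
		return "<unknown>"
	}
	return shortHumanDuration(time.Since(ts.Time))
}

// shortHumanDuration paraphrases ShortHumanDuration: a negative duration of
// more than a second (a creationTimestamp in the future, caused by clock
// deviation between the apiserver and the machine running kubectl) prints
// as <invalid>.
func shortHumanDuration(d time.Duration) string {
	switch seconds := int(d.Seconds()); {
	case seconds < -1:
		return "<invalid>"
	case seconds < 0:
		return "0s"
	case seconds < 60:
		return fmt.Sprintf("%ds", seconds)
	case d.Minutes() < 60:
		return fmt.Sprintf("%dm", int(d.Minutes()))
	case d.Hours() < 24:
		return fmt.Sprintf("%dh", int(d.Hours()))
	default:
		return fmt.Sprintf("%dd", int(d.Hours())/24)
	}
}

func main() {
	// Pretend the apiserver's Now() was two seconds ahead of this machine.
	future := metav1.NewTime(time.Now().Add(2 * time.Second))
	fmt.Println(translateTimestamp(future)) // prints <invalid>
}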

This Playground is from the beginning of my research, but it may contain some useful information, along with a test that confirms that the creationTimestamp is not zero.

To fix the CR creation failure, @sttts proposed the following solution when I discussed the issue with him:

there is a fix for the race. Currently we have this: 1) the CRD is created. 2) The CRD names are checked and validated. 3) The CRD is set to “Established”. 4) The HTTP handler sees the “Established” condition on the CRD and 5) starts to serve it. The race is between 3 and 5: the user observes 3, but 5 is not done yet. Instead we have to turn 3 and 5 around, i.e. when the handler starts serving the CRD, it should set the Established condition. For that we need another signal in 3. Maybe we already have a signal like “NoNameConflict” or something similar that could be used here.
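
Until something like that lands on the server side, a client-side workaround is to retry CR creation while the kind is not served yet. A minimal sketch, assuming only the apimachinery wait/errors/meta helpers; createWithRetry and the timeouts are made up here for illustration:

package crdutil

import (
	"time"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"k8s.io/apimachinery/pkg/api/meta"
	"k8s.io/apimachinery/pkg/util/wait"
)

// createWithRetry retries the given create call as long as the error indicates
// the new kind is not served yet -- exactly the window between the Established
// condition being set and the handler actually serving the group/kind.
func createWithRetry(create func() error) error {
	return wait.PollImmediate(time.Second, 30*time.Second, func() (bool, error) {
		err := create()
		switch {
		case err == nil:
			return true, nil
		case meta.IsNoMatchError(err), apierrors.IsNotFound(err):
			// "no matches for kind ..." from the RESTMapper, or a 404 from the
			// apiserver: keep retrying until the resource is actually served.
			return false, nil
		default:
			return false, err
		}
	})
}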

I would also like to mention that the <invalid> age is not limited to CRDs. I had the same problem with the local cluster, but with pods instead of CRDs. The command kubectl get pods -w --all-namespaces returned:

kube-system   kube-dns-86f6f55dd5-8kbzn   0/3       Pending   0         <invalid>
kube-system   kube-dns-86f6f55dd5-8kbzn   0/3       ContainerCreating   0         <invalid>
kube-system   kube-dns-86f6f55dd5-8kbzn   2/3       Running   0         <invalid>
kube-system   kube-dns-86f6f55dd5-8kbzn   0/3       Evicted   0         37s
kube-system   kube-dns-86f6f55dd5-mgfng   0/3       Pending   0         <invalid>
kube-system   kube-dns-86f6f55dd5-mgfng   0/3       Pending   0         <invalid>
kube-system   kube-dns-86f6f55dd5-mgfng   0/3       ContainerCreating   0         <invalid>
kube-system   kube-dns-86f6f55dd5-mgfng   2/3       Running   0         <invalid>

Just to note, the stress command was running while the cluster was provisioning, so the local machine had limited resources.

And of course this has no connection to Minikube; it behaves this way by design on every cluster.

@vadimeisenbergibm

So what are users of kubectl advised to do with regard to new CRDs and their resources? Define the CRDs first, wait several seconds, and then add the resources? Or just add the new CRDs and their resources, and retry in case of failure? I would like this advice to be documented somewhere.

I’m not sure what our guidance is beyond waiting. You could also programmatically poll/watch the CRD and wait for it to be established. cc @kubernetes/sig-api-machinery-misc for increased visibility & guidance
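
For the programmatic option, here is a minimal sketch using the apiextensions clientset; the function name and timeouts are mine, and the pre-context v1beta1 import paths and method signatures shown here vary by Kubernetes client version:

package crdutil

import (
	"time"

	apiextensionsv1beta1 "k8s.io/apiextensions-apiserver/pkg/apis/apiextensions/v1beta1"
	apiextensionsclient "k8s.io/apiextensions-apiserver/pkg/client/clientset/clientset"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
)

// waitForEstablished polls a CRD until its Established condition is True --
// the same signal the jsonpath query below reads -- before any CRs are created.
func waitForEstablished(client apiextensionsclient.Interface, name string) error {
	return wait.PollImmediate(500*time.Millisecond, 60*time.Second, func() (bool, error) {
		crd, err := client.ApiextensionsV1beta1().CustomResourceDefinitions().Get(name, metav1.GetOptions{})
		if err != nil {
			return false, err
		}
		for _, cond := range crd.Status.Conditions {
			if cond.Type == apiextensionsv1beta1.Established {
				return cond.Status == apiextensionsv1beta1.ConditionTrue, nil
			}
		}
		return false, nil
	})
}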

Which command can I use to check this?

Here is a sample I just ran:

$ kubectl get crd/backups.ark.heptio.com -o jsonpath='{.status.conditions[?(@.type=="Established")].status}{"\n"}'
True