crossplane: crossplane core fails to start up: "cannot apply crd"
What happened?
After upgrading from 1.7.0 to 1.12.2, the Crossplane core doesn't start up. We see error messages like:
crossplane: error: cannot initialize core: cannot apply crd:
cannot patch object: CustomResourceDefinition.apiextensions.k8s.io
"compositionrevisions.apiextensions.crossplane.io" is invalid:
status.storedVersions[1]: Invalid value: "v1": must appear in spec.versions
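For reference, one way to confirm the mismatch this error describes is to compare the versions the CRD serves against the versions recorded as stored (a read-only check, assuming kubectl access to the affected cluster):

```shell
# The error means an entry in status.storedVersions is missing from spec.versions.
# Print both lists for the affected CRD to see which version is the leftover.
kubectl get crd compositionrevisions.apiextensions.crossplane.io \
  -o jsonpath='{.spec.versions[*].name}{"\n"}{.status.storedVersions[*]}{"\n"}'
```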
How can we reproduce it?
TBD. We have at least one deploy where the problem didn't happen and one where it did. Both went through the same Crossplane upgrade. However, the one showing the problem is at least a couple of years old and has gone through Kubernetes upgrades, while the other was created within the last month.
NOTE: We saw a similar situation with the service-catalog CRD in May of last year. The symptoms were very different back then, but the determining factor for which deployments showed them was the same: whether the Kubernetes cluster itself was an 'old' deploy that had been upgraded or a 'young' deploy that had not.
What environment did it happen in?
- Crossplane version: 1.12.2
- Kubernetes version: 1.25.10 (will try to get the history of k8s upgrades)
- Kubernetes distro: AWS (a teammate is checking how widespread this problem is; it could affect others as well)
- OS: unavailable at this time
- Kernel: unavailable at this time
About this issue
- State: closed
- Created a year ago
- Reactions: 2
- Comments: 40 (36 by maintainers)
Wow @dee0sap that process sounds like the perfect stress test for your control plane and all of your deployments… 😅
I can confirm that 1.13.1 fixed the issue I had upgrading from 1.12. 🎉
FYI: just cut v1.13.1, we are still working on the release notes, but you should already be able to install it. 🙏
@dee0sap we are planning the v1.13.1 patch with the fix. I believe all you would need to do is upgrade to that version so that the newly introduced migrator could take care of the problem. @sttts or @phisco can correct me if I am wrong.
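For context, what that migrator automates is essentially reconciling status.storedVersions with the versions the chart still ships. A rough manual sketch of that cleanup (not the actual Crossplane implementation, and only safe once any objects stored in the old version have been migrated) would look something like:

```shell
# Illustrative only: remove the stale entry from storedVersions via the status
# subresource, keeping only versions still present in spec.versions. The exact
# version list depends on the CRD and release; upgrading to v1.13.1 and letting
# the built-in migrator handle this is the supported path.
kubectl patch crd compositionrevisions.apiextensions.crossplane.io \
  --subresource=status --type=merge \
  -p '{"status":{"storedVersions":["v1"]}}'
```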
Yeah, the cluster I tried upgrading today has been running Crossplane forever so it gets to be the canary for issues like these… 😅
Just a note: we dropped v1alpha1 for Lock in v1.11.0: https://github.com/crossplane/crossplane/pull/3479
No, it won't fix the Locks issue. Can you open a dedicated issue for that?
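For anyone who wants to check whether their cluster carries the same leftover for the Lock CRD (assuming the default name locks.pkg.crossplane.io), the same storedVersions comparison applies:

```shell
# If v1alpha1 still appears in status.storedVersions but not in spec.versions
# after the upgrade, the Lock CRD is affected by the same problem.
kubectl get crd locks.pkg.crossplane.io \
  -o jsonpath='{.spec.versions[*].name}{"\n"}{.status.storedVersions[*]}{"\n"}'
```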
Ok… good news, I think. I managed to get logs and kube_pod_container_info metrics from Grafana. Looking only at the timeframe when the "v1" error was occurring, and only at init container logs, I see that a 1.7.0 init container was logging the "v1" must-appear error while a 1.12.2 init container was logging the "v1alpha1" must-appear error.
Not 100% sure why both init containers would be running at the same time but I can guess…
The cluster autoscaler is set up for these clusters, and our rollouts are goofy. Basically we have a process that builds one big YAML with absolutely everything for the cluster in it, and that is passed to kubectl apply. Which means absolutely everything changes at once. And there is one service in particular that has tens (hundreds?) of workloads that can be updated in the process. This means the cluster autoscaler goes into overdrive adding nodes and replacing pods as it tries to redistribute the workload across the nodes. (And everything is put into the same node work group, btw, just to maximize churn 😃) I could imagine that the autoscaler tried to replace the 1.7.0 pod on node X with a new pod on node Y while at the same time trying to spin up a 1.12.2 pod on node Z. And then hilarity ensued.
What do you guys think about this hypothesis? Could this actually explain what we have observed?
(Btw @haarchri I won't be able to run those kubectl commands… However, I believe the above probably gives you what you're looking for.)
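For reference, a check for that kind of mixed-version rollout could look like the following (assuming Crossplane runs in the default crossplane-system namespace):

```shell
# Print each pod together with its init container image(s); seeing a 1.7.0 and a
# 1.12.2 image side by side at the same time would support the hypothesis above.
kubectl -n crossplane-system get pods \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.initContainers[*].image}{"\n"}{end}'
```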
Yeah, I am trying to collect the log data from Grafana now.
I can reproduce the issue with my test: https://github.com/haarchri/crossplane-issue-4400
Full log output: https://github.com/haarchri/crossplane-issue-4400/blob/main/log.txt#L183