rancher: 2.4.3 upgrade monitoring to 0.1.0 fails
Just upgraded Rancher to 2.4.3 from 2.3.6. My clusters were previously running Kubernetes 1.17.4 and I have upgraded them to 1.17.5. Rancher indicated there is a monitoring upgrade to 0.1.0; I was previously running 0.0.7 without any issues. However, when I proceeded with the upgrade, the prometheus-cluster-monitoring StatefulSet never comes up and is constantly created and then removed. It's not quite clear what the issue is. All the sidecars in the pod come up healthy except prometheus-agent. I'm running CentOS 7 VMs and Docker 19.03.1/19.03.8.
I am using persistent storage. Of the clusters where I tried the upgrade, one is using Rook/Ceph and one is on AWS using EBS. Both exhibit exactly the same problem. However, if I turn off persistent storage, monitoring comes online fine.
Reproduce Step
It’s really easy to reproduce in Rancher v2.4.3 with k8s v1.17.5 and monitoring v0.1.0. Just enable cluster monitoring with PV enabled.
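For reference, a minimal sketch of the cluster-monitoring answers I'm assuming for this reproduce step; the persistence key names are taken from the rancher-monitoring chart as far as I know, and the values are only illustrative:

```yaml
# Illustrative cluster-monitoring answers for the reproduce step.
# The only point that matters is that Prometheus persistence (PV) is enabled.
prometheus.persistence.enabled: "true"
prometheus.persistence.storageClass: "default"
prometheus.persistence.size: "50Gi"
prometheus.retention: "12h"
```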
Reason
Tried Rancher monitoring 0.0.7, same problem.
Here is the change between v0.0.7 and v0.1.0: https://github.com/rancher/system-charts/pull/204/commits/32907f0b63f6c30f1574247696525d7c77d3c864
The change is small and looks good.
It looks like it’s related to Rancher version and Prometheus Operator version.
Since we introduced https://github.com/rancher/rancher/commit/ac7471754e6301262dfc2a696195d5bd099cbf8f in 2.4, it triggers a bug in Prometheus Operator that was fixed after 0.32.0: https://github.com/coreos/prometheus-operator/commit/8b8a2470726df00f1e8042ac5b16161a81261e1b
Workaround
Option 1:
Enable cluster monitoring without PV enabled
Option 2:
Change the system-library catalog to use https://github.com/loganhz/system-charts with branch fix.
Related fix commit: https://github.com/loganhz/system-charts/commit/a41ce9f9e65778efc5cb05669486173449b1ee6c
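A rough sketch of what that catalog change could look like as a management-cluster object; the field names are my assumption based on the stock system-library Catalog resource, so verify against your cluster or simply make the same edit in the Rancher UI under Global > Tools > Catalogs:

```yaml
# Sketch only: point the system-library catalog at the forked charts.
# Field names (spec.url, spec.branch) are assumed from the stock catalog object.
apiVersion: management.cattle.io/v3
kind: Catalog
metadata:
  name: system-library
spec:
  url: https://github.com/loganhz/system-charts
  branch: fix
```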
Option 3:
In the View/Edit YAML view of prometheus-operator-monitoring-operator, find the container with the - --crd-apigroup=monitoring.coreos.com argument, change its image to quay.io/coreos/prometheus-operator:v0.38.1, and then Save.
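In case it helps, here is a minimal sketch of the edited fragment; the container name and surrounding field layout are assumptions rather than copied from an actual cluster, and the existing --crd-apigroup argument is only shown to identify the right container:

```yaml
# Sketch of the prometheus-operator-monitoring-operator workload after the edit
# (container name and layout assumed; only the image line is changed).
spec:
  template:
    spec:
      containers:
        - name: prometheus-operator
          args:
            - --crd-apigroup=monitoring.coreos.com
          image: quay.io/coreos/prometheus-operator:v0.38.1  # bumped from the 0.32.x image
```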
Solutions
Option 1:
Upgrade the Prometheus Operator version
At a minimum, we need to upgrade the Prometheus Operator version.
Maybe we can also bump the other component versions for Rancher monitoring, such as prometheus, operator, exporter, configmap-reload, and so on.
You can find the whole image list via https://github.com/rancher/system-charts/blob/release-v2.4/charts/rancher-monitoring/v0.1.0/values.yaml
We need to verify monitoring and alerting in many scenarios, such as different K8s versions, PV enabled or not, and upgrading from an older monitoring version.
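For illustration only, such a bump might look roughly like this in the chart values; the key paths, repositories, and tags here are assumptions, and the authoritative list is in the values.yaml linked above:

```yaml
# Hypothetical per-component version bump; keys and image references are
# illustrative, not copied from the actual rancher-monitoring v0.1.0 values.yaml.
operator:
  image:
    repository: quay.io/coreos/prometheus-operator
    tag: v0.38.1
prometheus:
  image:
    repository: prom/prometheus
    tag: v2.18.1
```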
Option 2:
Revert https://github.com/rancher/rancher/commit/ac7471754e6301262dfc2a696195d5bd099cbf8f, or find a better fix for it.
There is no retry logic in our catalog app controller, which means that in some cases monitoring might not be installed successfully. So we introduced that fix to force our catalog app controller to re-install the cluster monitoring app: https://github.com/rancher/rancher/issues/26440.
However, this fix causes the current issue. It updates the Prometheus CRD with a new storage.volumeClaimTemplate.metadata.creationTimestamp value every time the Rancher controller force-deploys the monitoring app. The field is ignored by Prometheus Operator 0.38, but with Prometheus Operator 0.32 the operator will remove and redeploy the Prometheus StatefulSet. Then the Rancher controller notices that the Prometheus StatefulSet is gone and force-updates cluster monitoring again, so Prometheus Operator removes and redeploys it again… and this just happens over and over. The force re-deploy logic should be refined; it's far too aggressive right now.
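To make the loop concrete, this is the part of the Prometheus custom resource I understand to be involved; the values shown are placeholders:

```yaml
# Fragment of the Prometheus CR (monitoring.coreos.com/v1). The creationTimestamp
# under the volume claim template metadata is what gets rewritten on every forced
# redeploy (placeholder values below).
spec:
  storage:
    volumeClaimTemplate:
      metadata:
        creationTimestamp: "2020-05-12T00:00:00Z"  # changes each time Rancher force-deploys
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 50Gi
```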
Option 3:
Maybe we can hard-code the creationTimestamp value for https://github.com/rancher/system-charts/blob/dev-v2.5/charts/rancher-monitoring/v0.1.0/charts/prometheus/templates/prometheus.yaml#L154 so that it never changes. It's easy, but not ideal, since it leaves a fixed creationTimestamp in place.
Thanks @jiaqiluo I am trying to enable monitoring without a PV as a workaround, since I don't really need persistence, I think. But I am now hitting another issue with monitoring, because my cloud provider's controller manager unsets the internal IPs of the nodes, causing various issues with metrics etc…
Hi @vitobotta
The new monitoring chart 0.1.1 is available for rancher:v2.4.5
@jiaqiluo I’m on rancher 2.4.4 but I cannot see the monitoring version 0.1.1. Is there a way to force it?