harvester: [BUG] Monitoring pod OOM after running for a period of time

Describe the bug After the Prometheus pod has been running for a period of time (e.g. 9 days), the monitoring dashboard is empty because the Prometheus pod is killed by OOM.

To Reproduce

ks get pod
NAME                                                     READY   STATUS             RESTARTS   AGE
prometheus-rancher-monitoring-prometheus-0               2/3     CrashLoopBackOff   8          19m
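
One quick way to confirm that the restarts are actually OOM kills (rather than crashes for another reason) is to look at the container's last terminated state. The commands below are a minimal sketch, assuming kubectl access to the cluster and the default cattle-monitoring-system namespace:

# Print the reason each container in the pod last terminated (expect "OOMKilled")
kubectl -n cattle-monitoring-system get pod prometheus-rancher-monitoring-prometheus-0 \
  -o jsonpath='{range .status.containerStatuses[*]}{.name}{"\t"}{.lastState.terminated.reason}{"\n"}{end}'

# Or inspect the "Last State" and events in the full pod description
kubectl -n cattle-monitoring-system describe pod prometheus-rancher-monitoring-prometheus-0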


Expected behavior

Support bundle

Environment:

  • Harvester ISO version: v0.3.0
  • Underlying Infrastructure (e.g. Baremetal with Dell PowerEdge R630):

Additional context

About this issue

  • Original URL
  • State: closed
  • Created 3 years ago
  • Comments: 34 (20 by maintainers)

Most upvoted comments

Validated Test Scenario 2 with Harvester master-9664bf67-head. The following test cases pass.

Case 1: prometheus-node-exporter pods don’t crash

  1. Create a 3-node cluster.
  2. Create a VM.
  3. Open the VM Metrics page and stay on the page for hours.
  4. We can see metrics on the page and the page doesn’t crash.
  5. Pods in the rancher-monitoring-prometheus-node-exporter DaemonSet don’t crash (see the check below).
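
For step 5, a simple check is to watch the restart counts of the node-exporter pods while the Metrics page stays open (a sketch, assuming the default cattle-monitoring-system namespace):

# Watch restart counts for the node-exporter DaemonSet pods; they should stay at 0
kubectl -n cattle-monitoring-system get pods -w | grep node-exporter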

Case 2: Update prometheus-node-exporter in the rancher-monitoring ManagedChart

  1. Update prometheus-node-exporter.resources.limits.memory from 180Mi to 200Mi (see the sketch after this list).
  2. Check that resources.limits.memory in the rancher-monitoring-prometheus-node-exporter DaemonSet is now 200Mi.
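
A minimal sketch of one way to do this, assuming the rancher-monitoring ManagedChart lives in the fleet-local namespace (the Harvester default) and exposes the limit under prometheus-node-exporter.resources.limits.memory in its values:

# Edit the ManagedChart values and change the node-exporter memory limit to 200Mi
kubectl -n fleet-local edit managedcharts.management.cattle.io rancher-monitoring

# After the chart reconciles, verify the limit that landed on the DaemonSet
kubectl -n cattle-monitoring-system get daemonset rancher-monitoring-prometheus-node-exporter \
  -o jsonpath='{.spec.template.spec.containers[0].resources.limits.memory}'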

According to the investigation in https://github.com/harvester/harvester/issues/1531#issuecomment-1098545912, the Prometheus monitoring pod runs without crashing and the monitoring node-exporter did not encounter restarts either.

When the issue occurs, the front end is stuck on “Loading…”, but there are no ongoing or failed HTTP requests. To narrow down the scope, we will close this issue and continue tracking in #2150

@w13915984028, @guangbochen Re-confirmed this issue on Harvester v1.0.1; the monitoring chart still shows the empty symptom. Please check and advise whether we should look deeper into this issue.

  • During the 4-hour checking period, the monitoring chart still appeared empty several times, as shown in the attached image.

  • The Prometheus pod is running as expected, and the node-exporter pods did not restart.

rancher@harvester-node-2:~> sudo -i kubectl get pods -n cattle-monitoring-system
NAME                                                     READY   STATUS    RESTARTS   AGE
prometheus-rancher-monitoring-prometheus-0               3/3     Running   0          4h52m
rancher-monitoring-grafana-d9c56d79b-kcjwh               3/3     Running   0          4h52m
rancher-monitoring-kube-state-metrics-5bc8bb48bd-49l2q   1/1     Running   0          4h52m
rancher-monitoring-operator-559767d69b-sq58h             1/1     Running   0          4h52m
rancher-monitoring-prometheus-adapter-8846d4757-d65xh    1/1     Running   0          4h52m
rancher-monitoring-prometheus-node-exporter-d7xsn        1/1     Running   0          4h52m
rancher-monitoring-prometheus-node-exporter-lxrrv        1/1     Running   0          4h24m
rancher-monitoring-prometheus-node-exporter-pspnn        1/1     Running   0          4h39m
  • Using the default monitoring settings:
 {
  "evaluationInterval": "1m",
  "resources": {
    "limits": {
      "cpu": "1000m",
      "memory": "2500Mi"
    },
    "requests": {
      "cpu": "750m",
      "memory": "1750Mi"
    }
  },
  "retention": "5d",
  "retentionSize": "50GiB",
  "scrapeInterval": "1m",
  "storageSpec": {
    "volumeClaimTemplate": {
      "spec": {
        "accessModes": [
          "ReadWriteOnce"
        ],
        "resources": {
          "requests": {
            "storage": "50Gi"
          }
        },
        "storageClassName": "longhorn",
        "volumeMode": "Filesystem"
      }
    }
  }
}
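
With these defaults the Prometheus container is capped at 2500Mi of memory, so one way to see how close the pod gets to that limit over time is to sample its live usage (a sketch, assuming pod metrics are available through kubectl top):

# Sample per-container memory usage of the Prometheus pod and compare against the 2500Mi limit
kubectl -n cattle-monitoring-system top pod prometheus-rancher-monitoring-prometheus-0 --containers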

Support Bundle

supportbundle_8c4a93cf-b167-4ef8-b4d4-acb17de158fe_2022-04-13T13-15-14Z.zip

I think there are two approaches to updating the related chart configs; could you please help verify them and choose the most feasible one: