kepler: Failed to open path to msr and to attach perf events on VMs

Describe the bug I’m working with Azure AKS (VMs) and I’m getting errors related to msr and perf events. Also, even when I do see a few metric values, they are mostly 0.

To Reproduce Steps to reproduce the behavior:

  1. Deployed it manually, using one of the available manifests, with slight modifications:
apiVersion: v1
kind: Namespace
metadata:
  labels:
    sustainable-computing.io/app: kepler
  name: kepler
---
apiVersion: v1
kind: ServiceAccount
metadata:
  labels:
    sustainable-computing.io/app: kepler
  name: kepler-sa
  namespace: monitoring
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  labels:
    app.kubernetes.io/component: prometheus
    app.kubernetes.io/instance: k8s
    app.kubernetes.io/name: prometheus
    sustainable-computing.io/app: kepler
  name: prometheus-k8s
  namespace: monitoring
rules:
- apiGroups:
  - ""
  resources:
  - services
  - endpoints
  - pods
  verbs:
  - get
  - list
  - watch
- apiGroups:
  - extensions
  resources:
  - ingresses
  verbs:
  - get
  - list
  - watch
- apiGroups:
  - networking.k8s.io
  resources:
  - ingresses
  verbs:
  - get
  - list
  - watch
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  labels:
    sustainable-computing.io/app: kepler
  name: kepler-clusterrole
rules:
- apiGroups:
  - ""
  resources:
  - nodes/metrics
  - nodes/proxy
  - nodes/stats
  verbs:
  - get
  - watch
  - list
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  labels:
    app.kubernetes.io/component: prometheus
    app.kubernetes.io/instance: k8s
    app.kubernetes.io/name: prometheus
    sustainable-computing.io/app: kepler
  name: prometheus-k8s
  namespace: monitoring
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: prometheus-k8s
subjects:
- kind: ServiceAccount
  name: prometheus-k8s
  namespace: monitoring
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  labels:
    sustainable-computing.io/app: kepler
  name: kepler-clusterrole-binding
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: kepler-clusterrole
subjects:
- kind: ServiceAccount
  name: kepler-sa
  namespace: monitoring
---
apiVersion: v1
data:
  BIND_ADDRESS: 0.0.0.0:9102
  CGROUP_METRICS: '*'
  CPU_ARCH_OVERRIDE: ""
  ENABLE_EBPF_CGROUPID: "true"
  ENABLE_GPU: "true"
  KEPLER_LOG_LEVEL: "1"
  KEPLER_namespace: monitoring
  METRIC_PATH: /metrics
  MODEL_CONFIG: |
    CONTAINER_COMPONENTS_ESTIMATOR=false
    CONTAINER_COMPONENTS_INIT_URL=https://raw.githubusercontent.com/sustainable-computing-io/kepler-model-server/main/tests/test_models/DynComponentModelWeight/CgroupOnly/ScikitMixed/ScikitMixed.json
kind: ConfigMap
metadata:
  labels:
    sustainable-computing.io/app: kepler
  name: kepler-cfm
  namespace: monitoring
---
apiVersion: v1
kind: Service
metadata:
  labels:
    app: kepler-exporter-service
    app.kubernetes.io/component: exporter
    app.kubernetes.io/name: kepler-exporter
    sustainable-computing.io/app: kepler
  name: kepler-exporter
  namespace: monitoring
spec:
  ports:
  - name: http
    port: 9102
    targetPort: http
  selector:
    app.kubernetes.io/component: exporter
    app.kubernetes.io/name: kepler-exporter
    sustainable-computing.io/app: kepler
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  labels:
    app: kepler-exporter-service
    sustainable-computing.io/app: kepler
  name: kepler-exporter
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: kepler-exporter-service
      app.kubernetes.io/component: exporter
      app.kubernetes.io/name: kepler-exporter
      sustainable-computing.io/app: kepler
  template:
    metadata:
      labels:
        app: kepler-exporter-service
        app.kubernetes.io/component: exporter
        app.kubernetes.io/name: kepler-exporter
        sustainable-computing.io/app: kepler
    spec:
      containers:
      - args:
        - /usr/bin/kepler -v=1
        command:
        - /bin/sh
        - -c
        env:
        - name: NODE_NAME
          valueFrom:
            fieldRef:
              fieldPath: spec.nodeName
        image: quay.io/sustainable_computing_io/kepler:latest
        imagePullPolicy: Always
        livenessProbe:
          failureThreshold: 5
          httpGet:
            path: /healthz
            port: 9102
            scheme: HTTP
          initialDelaySeconds: 10
          periodSeconds: 60
          successThreshold: 1
          timeoutSeconds: 10
        name: kepler-exporter
        ports:
        - containerPort: 9102
          name: http
        resources:
          requests:
            cpu: 100m
            memory: 400Mi
        securityContext:
          privileged: true
        volumeMounts:
        - mountPath: /lib/modules
          name: lib-modules
        - mountPath: /sys
          name: tracing
        - mountPath: /proc
          name: proc
        - mountPath: /etc/config
          name: cfm
          readOnly: true
      dnsPolicy: ClusterFirstWithHostNet
      serviceAccountName: kepler-sa
      tolerations:
      - effect: NoSchedule
        key: node-role.kubernetes.io/master
      volumes:
      - hostPath:
          path: /lib/modules
          type: Directory
        name: lib-modules
      - hostPath:
          path: /sys
          type: Directory
        name: tracing
      - hostPath:
          path: /proc
          type: Directory
        name: proc
      - configMap:
          name: kepler-cfm
        name: cfm
---
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  labels:
    app: prometheus-operator
    release: prometheus
    app.kubernetes.io/component: exporter
    app.kubernetes.io/name: kepler-exporter
    sustainable-computing.io/app: kepler
  name: kepler-exporter
  namespace: monitoring
spec:
  endpoints:
  - interval: 1s
    port: http
  namespaceSelector:
    matchNames:
    - ryaxns-monitoring
  selector:
    matchLabels:
      app: kepler-exporter-service
  2. See error

I0301 14:42:12.695339       1 gpu_nvml.go:45] could not init nvml: <nil>
Failed to init nvml: could not init nvml: <nil>, using dummy source to obtain gpu power
I0301 14:42:12.708031       1 exporter.go:150] Kepler running on version: 0d3e6ce
I0301 14:42:12.708117       1 config.go:153] using gCgroup ID in the BPF program: true
I0301 14:42:12.708205       1 config.go:154] kernel version: 5.4
I0301 14:42:12.708271       1 config.go:172] EnabledGPU: true
I0301 14:42:12.708435       1 rapl_msr_util.go:143] failed to open path /dev/cpu/1/msr: no such file or directory
I0301 14:42:12.708526       1 power.go:64] Not able to obtain power, use estimate method
I0301 14:42:12.711548       1 slice_handler.go:179] Not able to find any valid .scope file in /sys/fs/cgroup/cpu/system.slice, this likely cause all cgroup metrics to be 0
I0301 14:42:13.016250       1 exporter.go:168] Initializing the GPU collector
perf_event_open: No such file or directory
I0301 14:42:15.542780       1 bcc_attacher.go:98] failed to attach perf event cpu_cycles_hc_reader: failed to open bpf perf event: no such file or directory
perf_event_open: No such file or directory
I0301 14:42:15.542866       1 bcc_attacher.go:98] failed to attach perf event cpu_ref_cycles_hc_reader: failed to open bpf perf event: no such file or directory
perf_event_open: No such file or directory
I0301 14:42:15.542922       1 bcc_attacher.go:98] failed to attach perf event cpu_instr_hc_reader: failed to open bpf perf event: no such file or directory
perf_event_open: No such file or directory
I0301 14:42:15.542990       1 bcc_attacher.go:98] failed to attach perf event cache_miss_hc_reader: failed to open bpf perf event: no such file or directory
I0301 14:42:15.543017       1 bcc_attacher.go:132] Successfully load eBPF module with option: [-DNUM_CPUS=2]
I0301 14:42:15.601337       1 exporter.go:210] Started Kepler in 2.89332793s
  3. Possible workaround for the msr problem:

I was able to “solve” the msr path problem by deploying a simple pod, opening a terminal inside it, and installing msr manually with: apt-get update -y; apt-get install msr-tools; modprobe msr. However, I’m not sure this is the correct way to do it, and the other errors persist.

  4. More details: Even with the errors above, I get a few metric values, but the majority are still 0.
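To put a number on "the majority are 0", the text scraped from the exporter's /metrics endpoint can be tallied. The following is a minimal sketch, assuming the standard Prometheus text exposition format; the function name count_zero_samples, the kepler_ prefix filter, and the sample lines (including their label names and values) are illustrative assumptions, not output from this cluster.

```python
def count_zero_samples(metrics_text, prefix="kepler_"):
    """Return (zero_count, total_count) for samples whose name starts with prefix."""
    zero = total = 0
    for line in metrics_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        if not line.startswith(prefix):
            continue  # skip metrics from other exporters
        # Everything after the last closing brace (or after the bare metric
        # name) is "value [timestamp]".
        rest = line[line.rindex("}") + 1:] if "}" in line else line.split(None, 1)[1]
        value = float(rest.split()[0])
        total += 1
        if value == 0.0:
            zero += 1
    return zero, total


# Hypothetical sample data, for illustration only.
SAMPLE = """\
# HELP kepler_container_joules_total illustrative sample only
# TYPE kepler_container_joules_total counter
kepler_container_joules_total{pod_name="a",mode="dynamic"} 0
kepler_container_joules_total{pod_name="b",mode="dynamic"} 12.5
kepler_container_joules_total{pod_name="c",mode="dynamic"} 0
"""

if __name__ == "__main__":
    zero, total = count_zero_samples(SAMPLE)
    print(f"{zero} of {total} samples are zero")
```

In practice the input would be the body returned by the exporter on port 9102 at /metrics (the BIND_ADDRESS and METRIC_PATH values in the ConfigMap above) rather than the inline sample.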

Expected behavior To not see those errors and to get energy metrics.

Azure AKS (VMs):

  • OS: Ubuntu 20.04.5
  • Kernel: 5.14.0-1057-oem

Additional context Also, Kepler does not export metrics of new pods. I can just see the metrics for those pods that were already in the platform before Kepler’s deployment. I need to re-deploy Kepler to see the metrics of such new pods.

About this issue

  • State: closed
  • Created a year ago
  • Comments: 19 (8 by maintainers)

Most upvoted comments

@andersonandrei the msr error is benign: if msr cannot be accessed (typical on VMs), the power calculation falls back to a linear-regression-based estimate.
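The fallback described here can be pictured roughly as follows. This is only a sketch of the decision, not Kepler's actual code (which, per the log, lives in rapl_msr_util.go and power.go); the function name choose_power_source and the return labels "msr" and "estimate" are hypothetical.

```python
import glob
import os


def choose_power_source(dev_root="/dev/cpu"):
    """Return "msr" if any per-CPU msr device is readable, else "estimate".

    Both labels are made up for this sketch; they only mirror the two paths
    seen in the log ("failed to open path /dev/cpu/1/msr" followed by
    "Not able to obtain power, use estimate method").
    """
    for path in sorted(glob.glob(os.path.join(dev_root, "*", "msr"))):
        if os.access(path, os.R_OK):
            return "msr"
    return "estimate"


if __name__ == "__main__":
    print(choose_power_source())
```

On a VM where the msr kernel module is not loaded, no /dev/cpu/*/msr device exists, so a check like this takes the "estimate" branch.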

What metrics can you see, can you post them here? In addition, can you change the verbosity to 5 (like below) and share the log?

      containers:
      - args:
        - /usr/bin/kepler -v=5

From those logs, I suspect that (1) cgroup v2 is not enabled and (2) eBPF is not enabled. It seems you are using an Azure VM, and I don’t know whether those can be enabled there.

I0301 14:42:12.708526       1 power.go:64] Not able to obtain power, use estimate method
I0301 14:42:12.711548       1 slice_handler.go:179] Not able to find any valid .scope file in /sys/fs/cgroup/cpu/system.slice, this likely cause all cgroup metrics to be 0
I0301 14:42:13.016250       1 exporter.go:168] Initializing the GPU collector
perf_event_open: No such file or directory
I0301 14:42:15.542780       1 bcc_attacher.go:98] failed to attach perf event cpu_cycles_hc_reader: failed to open bpf perf event: no such file or directory
perf_event_open: No such file or directory
I0301 14:42:15.542866       1 bcc_attacher.go:98] failed to attach perf event cpu_ref_cycles_hc_reader: failed to open bpf perf event: no such file or directory
perf_event_open: No such file or directory
I0301 14:42:15.542922       1 bcc_attacher.go:98] failed to attach perf event cpu_instr_hc_reader: failed to open bpf perf event: no such file or directory
perf_event_open: No such file or directory
I0301 14:42:15.542990       1 bcc_attacher.go:98] failed to attach perf event cache_miss_hc_reader: failed to open bpf perf event: no such file or directory
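One way to check the cgroup question on a node is to look at the layout of /sys/fs/cgroup: a unified (v2) hierarchy exposes a cgroup.controllers file at its root, while v1 mounts each controller in its own subdirectory (the /sys/fs/cgroup/cpu/system.slice path in the log is a v1-style path). The sketch below assumes that layout; the function name cgroup_version and its 0/1/2 return convention are hypothetical.

```python
import os


def cgroup_version(root="/sys/fs/cgroup"):
    """Return 2 for a unified (v2) hierarchy, 1 for a v1 layout, 0 if unknown."""
    # cgroup v2 places cgroup.controllers at the root of the unified mount.
    if os.path.isfile(os.path.join(root, "cgroup.controllers")):
        return 2
    # cgroup v1 mounts each controller (cpu, memory, ...) as a subdirectory.
    if os.path.isdir(os.path.join(root, "cpu")) or os.path.isdir(os.path.join(root, "memory")):
        return 1
    return 0


if __name__ == "__main__":
    print("cgroup version:", cgroup_version())
```

This would need to run on the node itself, or in a pod that mounts the host's /sys as the DaemonSet above does. It says nothing about the perf_event_open failures, which are a separate matter.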

Also, Kepler does not export metrics of new pods. I can just see the metrics for those pods that were already in the platform before Kepler’s deployment. I need to re-deploy Kepler to see the metrics of such new pods.

I reported this before, but your version is 0d3e6ce, which is very recent, so could you report this with more detail in a separate issue? @andersonandrei