kepler: Failed to open path to msr and to attach perf events on VMs
Describe the bug: I’m working with Azure AK8 (VMs) and I’m getting errors related to msr and perf events. Even when I do see a few metric values, they are mostly 0.
To Reproduce: Steps to reproduce the behavior:
- Deployed it manually, using one of the available manifests, with slight modifications:
apiVersion: v1
kind: Namespace
metadata:
  labels:
    sustainable-computing.io/app: kepler
  name: kepler
---
apiVersion: v1
kind: ServiceAccount
metadata:
  labels:
    sustainable-computing.io/app: kepler
  name: kepler-sa
  namespace: monitoring
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  labels:
    app.kubernetes.io/component: prometheus
    app.kubernetes.io/instance: k8s
    app.kubernetes.io/name: prometheus
    sustainable-computing.io/app: kepler
  name: prometheus-k8s
  namespace: monitoring
rules:
- apiGroups:
  - ""
  resources:
  - services
  - endpoints
  - pods
  verbs:
  - get
  - list
  - watch
- apiGroups:
  - extensions
  resources:
  - ingresses
  verbs:
  - get
  - list
  - watch
- apiGroups:
  - networking.k8s.io
  resources:
  - ingresses
  verbs:
  - get
  - list
  - watch
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  labels:
    sustainable-computing.io/app: kepler
  name: kepler-clusterrole
rules:
- apiGroups:
  - ""
  resources:
  - nodes/metrics
  - nodes/proxy
  - nodes/stats
  verbs:
  - get
  - watch
  - list
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  labels:
    app.kubernetes.io/component: prometheus
    app.kubernetes.io/instance: k8s
    app.kubernetes.io/name: prometheus
    sustainable-computing.io/app: kepler
  name: prometheus-k8s
  namespace: monitoring
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: prometheus-k8s
subjects:
- kind: ServiceAccount
  name: prometheus-k8s
  namespace: monitoring
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  labels:
    sustainable-computing.io/app: kepler
  name: kepler-clusterrole-binding
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: kepler-clusterrole
subjects:
- kind: ServiceAccount
  name: kepler-sa
  namespace: monitoring
---
apiVersion: v1
data:
  BIND_ADDRESS: 0.0.0.0:9102
  CGROUP_METRICS: '*'
  CPU_ARCH_OVERRIDE: ""
  ENABLE_EBPF_CGROUPID: "true"
  ENABLE_GPU: "true"
  KEPLER_LOG_LEVEL: "1"
  KEPLER_namespace: monitoring
  METRIC_PATH: /metrics
  MODEL_CONFIG: |
    CONTAINER_COMPONENTS_ESTIMATOR=false
    CONTAINER_COMPONENTS_INIT_URL=https://raw.githubusercontent.com/sustainable-computing-io/kepler-model-server/main/tests/test_models/DynComponentModelWeight/CgroupOnly/ScikitMixed/ScikitMixed.json
kind: ConfigMap
metadata:
  labels:
    sustainable-computing.io/app: kepler
  name: kepler-cfm
  namespace: monitoring
---
apiVersion: v1
kind: Service
metadata:
  labels:
    app: kepler-exporter-service
    app.kubernetes.io/component: exporter
    app.kubernetes.io/name: kepler-exporter
    sustainable-computing.io/app: kepler
  name: kepler-exporter
  namespace: monitoring
spec:
  ports:
  - name: http
    port: 9102
    targetPort: http
  selector:
    app.kubernetes.io/component: exporter
    app.kubernetes.io/name: kepler-exporter
    sustainable-computing.io/app: kepler
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  labels:
    app: kepler-exporter-service
    sustainable-computing.io/app: kepler
  name: kepler-exporter
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: kepler-exporter-service
      app.kubernetes.io/component: exporter
      app.kubernetes.io/name: kepler-exporter
      sustainable-computing.io/app: kepler
  template:
    metadata:
      labels:
        app: kepler-exporter-service
        app.kubernetes.io/component: exporter
        app.kubernetes.io/name: kepler-exporter
        sustainable-computing.io/app: kepler
    spec:
      containers:
      - args:
        - /usr/bin/kepler -v=1
        command:
        - /bin/sh
        - -c
        env:
        - name: NODE_NAME
          valueFrom:
            fieldRef:
              fieldPath: spec.nodeName
        image: quay.io/sustainable_computing_io/kepler:latest
        imagePullPolicy: Always
        livenessProbe:
          failureThreshold: 5
          httpGet:
            path: /healthz
            port: 9102
            scheme: HTTP
          initialDelaySeconds: 10
          periodSeconds: 60
          successThreshold: 1
          timeoutSeconds: 10
        name: kepler-exporter
        ports:
        - containerPort: 9102
          name: http
        resources:
          requests:
            cpu: 100m
            memory: 400Mi
        securityContext:
          privileged: true
        volumeMounts:
        - mountPath: /lib/modules
          name: lib-modules
        - mountPath: /sys
          name: tracing
        - mountPath: /proc
          name: proc
        - mountPath: /etc/config
          name: cfm
          readOnly: true
      dnsPolicy: ClusterFirstWithHostNet
      serviceAccountName: kepler-sa
      tolerations:
      - effect: NoSchedule
        key: node-role.kubernetes.io/master
      volumes:
      - hostPath:
          path: /lib/modules
          type: Directory
        name: lib-modules
      - hostPath:
          path: /sys
          type: Directory
        name: tracing
      - hostPath:
          path: /proc
          type: Directory
        name: proc
      - configMap:
          name: kepler-cfm
        name: cfm
---
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  labels:
    app: prometheus-operator
    release: prometheus
    app.kubernetes.io/component: exporter
    app.kubernetes.io/name: kepler-exporter
    sustainable-computing.io/app: kepler
  name: kepler-exporter
  namespace: monitoring
spec:
  endpoints:
  - interval: 1s
    port: http
  namespaceSelector:
    matchNames:
    - ryaxns-monitoring
  selector:
    matchLabels:
      app: kepler-exporter-service
- See error
I0301 14:42:12.695339 1 gpu_nvml.go:45] could not init nvml: <nil>
Failed to init nvml: could not init nvml: <nil>, using dummy source to obtain gpu power
I0301 14:42:12.708031 1 exporter.go:150] Kepler running on version: 0d3e6ce
I0301 14:42:12.708117 1 config.go:153] using gCgroup ID in the BPF program: true
I0301 14:42:12.708205 1 config.go:154] kernel version: 5.4
I0301 14:42:12.708271 1 config.go:172] EnabledGPU: true
I0301 14:42:12.708435 1 rapl_msr_util.go:143] failed to open path /dev/cpu/1/msr: no such file or directory
I0301 14:42:12.708526 1 power.go:64] Not able to obtain power, use estimate method
I0301 14:42:12.711548 1 slice_handler.go:179] Not able to find any valid .scope file in /sys/fs/cgroup/cpu/system.slice, this likely cause all cgroup metrics to be 0
I0301 14:42:13.016250 1 exporter.go:168] Initializing the GPU collector
perf_event_open: No such file or directory
I0301 14:42:15.542780 1 bcc_attacher.go:98] failed to attach perf event cpu_cycles_hc_reader: failed to open bpf perf event: no such file or directory
perf_event_open: No such file or directory
I0301 14:42:15.542866 1 bcc_attacher.go:98] failed to attach perf event cpu_ref_cycles_hc_reader: failed to open bpf perf event: no such file or directory
perf_event_open: No such file or directory
I0301 14:42:15.542922 1 bcc_attacher.go:98] failed to attach perf event cpu_instr_hc_reader: failed to open bpf perf event: no such file or directory
perf_event_open: No such file or directory
I0301 14:42:15.542990 1 bcc_attacher.go:98] failed to attach perf event cache_miss_hc_reader: failed to open bpf perf event: no such file or directory
I0301 14:42:15.543017 1 bcc_attacher.go:132] Successfully load eBPF module with option: [-DNUM_CPUS=2]
I0301 14:42:15.601337 1 exporter.go:210] Started Kepler in 2.89332793s
- Possible workaround for the msr problem:
I was able to “solve” the msr path problem by deploying a simple pod, opening a terminal inside it, and installing msr manually with:
apt-get update -y
apt-get install msr-tools
modprobe msr
However, I’m not sure this is the correct way to do it, and the other errors persist.
- More details: Even with the errors above, I get a few metric values, but most of them are 0.
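To put a number on how many values are 0, one option is to count the zero-valued samples in the exporter output. The sketch below is only an illustration: `count_zero_samples` is a hypothetical helper (not part of Kepler), it assumes each sample line ends with its value (no trailing timestamp), and the usage comment assumes the Service, port 9102 and /metrics path from the manifest above.

```shell
#!/bin/sh
# Hypothetical helper: reads Prometheus text-format metrics on stdin and
# prints "<zero-valued samples> <total samples>".
# Assumes each sample line ends with its value (no trailing timestamp).
count_zero_samples() {
  awk '
    /^#/ || NF < 2 { next }   # skip HELP/TYPE comments and blank lines
    { total++; if ($NF ~ /^[-+]?[0-9.]/ && $NF + 0 == 0) zero++ }
    END { printf "%d %d\n", zero, total }
  '
}

# Usage, assuming kubectl access and the Service from the manifest above:
#   kubectl -n monitoring port-forward service/kepler-exporter 9102:9102 &
#   curl -s http://localhost:9102/metrics | count_zero_samples
```

Posting the two numbers together with a few of the non-zero series gives a more concrete picture than "mostly 0".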
Expected behavior: No such errors, and energy metrics are reported.
Azure AK8 (VMs):
- OS: Ubuntu 20.04.5
- Kernel: 5.14.0-1057-oem
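The msr message in the log comes from a missing /dev/cpu/1/msr device node. A minimal sketch for checking whether such nodes exist on a node (or inside a privileged pod) follows; `count_msr_nodes` is a hypothetical helper name. Even when the nodes appear after `modprobe msr`, a hypervisor may still not expose the power registers, so a non-zero count does not by itself mean that power readings are available.

```shell
#!/bin/sh
# Hypothetical helper: counts msr device nodes under a /dev/cpu-style directory.
# The Kepler log above complains about /dev/cpu/1/msr, so a count of 0
# is consistent with that message.
count_msr_nodes() {
  root=${1:-/dev/cpu}
  n=0
  for f in "$root"/*/msr; do
    # An unmatched glob stays literal, so test that the path really exists.
    [ -e "$f" ] && n=$((n + 1))
  done
  echo "$n"
}

echo "msr device nodes found: $(count_msr_nodes)"
```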
Additional context: Also, Kepler does not export metrics for new pods. I can only see metrics for pods that already existed on the platform before Kepler was deployed; I need to redeploy Kepler to see metrics for newer pods.
About this issue
- State: closed
- Created a year ago
- Comments: 19 (8 by maintainers)
@andersonandrei the msr error is benign: if msr cannot be accessed (typical on VMs), the power calculation model switches to a linear-regression-based method. What metrics can you see? Can you post them here? In addition, can you change the verbosity to 5 (like below) and share the log?
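One way to raise the verbosity without editing the manifest by hand is a JSON patch on the DaemonSet. This is a sketch under assumptions, not the maintainers' snippet: `kepler_verbosity_patch` is a hypothetical helper, and the patch path assumes the Kepler command line is the first argument of the first container, as in the manifest posted in this issue.

```shell
#!/bin/sh
# Hypothetical helper: prints a JSON patch that replaces the first container
# argument with the Kepler command line at the requested verbosity.
kepler_verbosity_patch() {
  printf '[{"op":"replace","path":"/spec/template/spec/containers/0/args/0","value":"/usr/bin/kepler -v=%s"}]\n' "$1"
}

# Print the patch for verbosity 5 so it can be inspected before applying.
kepler_verbosity_patch 5

# Usage, assuming kubectl access and the names from the manifest above:
#   kubectl -n monitoring patch daemonset kepler-exporter --type=json -p "$(kepler_verbosity_patch 5)"
#   kubectl -n monitoring logs -l app.kubernetes.io/name=kepler-exporter
```

Editing `-v=1` to `-v=5` in the DaemonSet args and re-applying the manifest has the same effect.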
From those logs, I wonder whether 1) cgroup v2 is not enabled and 2) eBPF is not enabled. It seems you are using an Azure VM, and I don’t know whether those can be enabled there.
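The cgroup question can be checked on the node itself. The sketch below maps the filesystem type mounted at /sys/fs/cgroup to a cgroup version; `cgroup_version` is a hypothetical helper, and `stat -fc %T` assumes GNU coreutils. The /sys/fs/cgroup/cpu/system.slice path in the reporter's log looks like a v1-style layout, but that is an inference from the path, not something the log states.

```shell
#!/bin/sh
# Hypothetical helper: maps the filesystem type mounted at /sys/fs/cgroup
# to a cgroup version. cgroup2fs is the unified (v2) hierarchy; tmpfs is
# what a v1 or hybrid layout normally shows at that mount point.
cgroup_version() {
  case "$1" in
    cgroup2fs) echo v2 ;;
    tmpfs)     echo v1 ;;
    *)         echo unknown ;;
  esac
}

# stat -fc %T prints the filesystem type (GNU coreutils syntax).
fstype=$(stat -fc %T /sys/fs/cgroup 2>/dev/null)
echo "cgroup version: $(cgroup_version "$fstype")"
```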
I reported this before, but I think your version is 0d3e6ce, which is a really new version, so can you report this with more detail in another issue? @andersonandrei