kepler: error loading BPF program: invalid argument

What happened?

Hello, I’m trying to use Kepler on a machine with access to the hardware counters, but it does not seem to be working. On my VMs, I can see it working with the estimations, but on these new machines every measurement is 0.

I tried installing Kepler both with the Helm chart and by building it and applying the deployment file afterwards, but neither worked.

When I install it with helm, I can see the following logs:

> kubectl logs kepler-wdsjr -n monitoring
I0713 14:19:44.756647       1 container_hc_collector.go:164] could not delete bpf table elements, err: Table.Delete: key 0xd6: no such file or directory
I0713 14:19:44.756672       1 container_hc_collector.go:164] could not delete bpf table elements, err: Table.Delete: key 0xd6: no such file or directory
I0713 14:19:44.756696       1 container_hc_collector.go:164] could not delete bpf table elements, err: Table.Delete: key 0xd6: no such file or directory
I0713 14:19:44.756735       1 container_hc_collector.go:164] could not delete bpf table elements, err: Table.Delete: key 0xd6: no such file or directory
I0713 14:19:44.756764       1 container_hc_collector.go:164] could not delete bpf table elements, err: Table.Delete: key 0xd6: no such file or directory
I0713 14:19:44.756791       1 container_hc_collector.go:164] could not delete bpf table elements, err: Table.Delete: key 0xd6: no such file or directory
I0713 14:19:44.756817       1 container_hc_collector.go:164] could not delete bpf table elements, err: Table.Delete: key 0xd6: no such file or directory
I0713 14:19:44.756846       1 container_hc_collector.go:164] could not delete bpf table elements, err: Table.Delete: key 0xd6: no such file or directory
I0713 14:19:44.756873       1 container_hc_collector.go:164] could not delete bpf table elements, err: Table.Delete: key 0xd6: no such file or directory
I0713 14:19:44.756911       1 container_hc_collector.go:164] could not delete bpf table elements, err: Table.Delete: key 0xd6: no such file or directory
I0713 14:19:44.756939       1 container_hc_collector.go:164] could not delete bpf table elements, err: Table.Delete: key 0xd6: no such file or directory
I0713 14:19:44.756965       1 container_hc_collector.go:164] could not delete bpf table elements, err: Table.Delete: key 0xd6: no such file or directory
I0713 14:19:44.756990       1 container_hc_collector.go:164] could not delete bpf table elements, err: Table.Delete: key 0xd6: no such file or directory
I0713 14:19:44.757054       1 container_hc_collector.go:164] could not delete bpf table elements, err: Table.Delete: key 0xd6: no such file or directory
I0713 14:19:44.757081       1 container_hc_collector.go:164] could not delete bpf table elements, err: Table.Delete: key 0xd6: no such file or directory
I0713 14:19:44.757113       1 container_hc_collector.go:164] could not delete bpf table elements, err: Table.Delete: key 0xd6: no such file or directory
I0713 14:19:44.757140       1 container_hc_collector.go:164] could not delete bpf table elements, err: Table.Delete: key 0xd6: no such file or directory
I0713 14:19:44.757167       1 container_hc_collector.go:164] could not delete bpf table elements, err: Table.Delete: key 0xd6: no such file or directory
I0713 14:19:44.757196       1 container_hc_collector.go:164] could not delete bpf table elements, err: Table.Delete: key 0xd6: no such file or directory
I0713 14:19:44.757225       1 container_hc_collector.go:164] could not delete bpf table elements, err: Table.Delete: key 0xd6: no such file or directory
I0713 14:19:44.757255       1 container_hc_collector.go:164] could not delete bpf table elements, err: Table.Delete: key 0xd6: no such file or directory
I0713 14:19:44.757283       1 container_hc_collector.go:164] could not delete bpf table elements, err: Table.Delete: key 0xd6: no such file or directory
I0713 14:19:44.757322       1 container_hc_collector.go:164] could not delete bpf table elements, err: Table.Delete: key 0xd6: no such file or directory
I0713 14:19:44.757348       1 container_hc_collector.go:164] could not delete bpf table elements, err: Table.Delete: key 0xd6: no such file or directory
I0713 14:19:44.757373       1 container_hc_collector.go:164] could not delete bpf table elements, err: Table.Delete: key 0xd6: no such file or directory
I0713 14:19:44.757399       1 container_hc_collector.go:164] could not delete bpf table elements, err: Table.Delete: key 0xd6: no such file or directory
I0713 14:19:44.757428       1 container_hc_collector.go:164] could not delete bpf table elements, err: Table.Delete: key 0xd6: no such file or directory

Then, when I query:

> kubectl exec -ti -n monitoring daemonset/kepler -- bash  -c "curl localhost:9102/metrics" | grep kepler_container_core_joules_total | grep wskow
# HELP kepler_container_core_joules_total Aggregated RAPL value in core in joules
# TYPE kepler_container_core_joules_total counter
kepler_container_core_joules_total{command="",container_id="13760f6f9db879378c267c91f6a6cec71a3111f8c9a73a39e457756702516919",container_name="user-action",container_namespace="openwhisk",mode="dynamic",pod_name="wskow-invoker-00-1-prewarm-nodejs10"} 0
kepler_container_core_joules_total{command="",container_id="13760f6f9db879378c267c91f6a6cec71a3111f8c9a73a39e457756702516919",container_name="user-action",container_namespace="openwhisk",mode="idle",pod_name="wskow-invoker-00-1-prewarm-nodejs10"} 0
kepler_container_core_joules_total{command="",container_id="32b2e453cae162b208d13859bdfc4b4726d22186de5fefe949759c6b4ee6b4af",container_name="user-action",container_namespace="openwhisk",mode="dynamic",pod_name="wskow-invoker-00-2-prewarm-nodejs10"} 0
kepler_container_core_joules_total{command="",container_id="32b2e453cae162b208d13859bdfc4b4726d22186de5fefe949759c6b4ee6b4af",container_name="user-action",container_namespace="openwhisk",mode="idle",pod_name="wskow-invoker-00-2-prewarm-nodejs10"} 0
kepler_container_core_joules_total{command="",container_id="808040772cbb31c91c4f4084f1a680c97c237879c83288831ed9614b05f1cb7c",container_name="user-action",container_namespace="openwhisk",mode="dynamic",pod_name="wskow-invoker-00-9-guest-linpack"} 0
kepler_container_core_joules_total{command="",container_id="808040772cbb31c91c4f4084f1a680c97c237879c83288831ed9614b05f1cb7c",container_name="user-action",container_namespace="openwhisk",mode="idle",pod_name="wskow-invoker-00-9-guest-linpack"} 0
kepler_container_core_joules_total{command="",container_id="99ac37763d4a96b89c29e2e079796876fc2cb2d08d3febf54346ebafce0d6d96",container_name="user-action",container_namespace="openwhisk",mode="dynamic",pod_name="wskow-invoker-00-8-whisksystem-invokerhealthtestaction0"} 0
kepler_container_core_joules_total{command="",container_id="99ac37763d4a96b89c29e2e079796876fc2cb2d08d3febf54346ebafce0d6d96",container_name="user-action",container_namespace="openwhisk",mode="idle",pod_name="wskow-invoker-00-8-whisksystem-invokerhealthtestaction0"} 0
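Rather than reading every sample by eye, the "all zeros" symptom can be checked with a small filter. The following is a minimal sketch, not part of Kepler: `count_nonzero_samples` is a hypothetical helper name, and it assumes each sample line ends with its value (no trailing timestamp) and that label values contain no spaces, as in the output above.

```shell
# Hypothetical helper: read Prometheus text-format metrics on stdin and
# print how many samples of the given metric have a non-zero value.
# Assumes the value is the last field on the line (no timestamp column).
count_nonzero_samples() {
  metric="$1"
  awk -v m="$metric" '
    # Match "metric{...} value" or "metric value"; HELP/TYPE lines start with "#".
    ($1 == m || index($0, m "{") == 1) && $NF + 0 != 0 { n++ }
    END { print n + 0 }
  '
}
```

For example, `kubectl exec -n monitoring daemonset/kepler -- curl -s localhost:9102/metrics | count_nonzero_samples kepler_container_core_joules_total` prints 0 when every sample is zero.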

When I build it myself using make build-manifest OPTS="PROMETHEUS_DEPLOY", I see the following in the logs:

> kubectl logs kepler-exporter-vrvmg -n monitoring
I0713 10:29:47.059485       1 gpu.go:46] Failed to init nvml, err: could not init nvml: error opening libnvidia-ml.so.1: libnvidia-ml.so.1: cannot open shared object file: No such file or directory
I0713 10:29:47.065771       1 acpi.go:67] Could not find any ACPI power meter path. Is it a VM?
I0713 10:29:47.074301       1 exporter.go:151] Kepler running on version: 22f2c84
I0713 10:29:47.074322       1 config.go:212] using gCgroup ID in the BPF program: true
I0713 10:29:47.074352       1 config.go:214] kernel version: 4.19
I0713 10:29:47.074402       1 config.go:174] kernel source dir is set to /usr/share/kepler/kernel_sources
I0713 10:29:47.074449       1 exporter.go:171] EnabledBPFBatchDelete: true
I0713 10:29:47.074505       1 power.go:53] use sysfs to obtain power
I0713 10:29:47.604160       1 exporter.go:184] Initializing the GPU collector
I0713 10:29:47.604444       1 watcher.go:67] Using in cluster k8s config
modprobe: FATAL: Module kheaders not found in directory /lib/modules/4.19.0-24-amd64
chdir(/lib/modules/4.19.0-24-amd64/build): No such file or directory
I0713 10:29:47.710306       1 bcc_attacher.go:74] failed to attach the bpf program: <nil>
I0713 10:29:47.710331       1 bcc_attacher.go:143] failed to attach perf module with options [-DMAP_SIZE=10240 -DNUM_CPUS=32]: failed to attach the bpf program: <nil>, from default kernel source.
I0713 10:29:47.710357       1 bcc_attacher.go:146] trying to load eBPF module with kernel source dir /usr/share/kepler/kernel_sources/4.18.0-477.13.1.el8_8.x86_64
bpf: Failed to load program: Invalid argument

I0713 10:29:48.571949       1 bcc_attacher.go:150] failed to attach perf module with options [-DMAP_SIZE=10240 -DNUM_CPUS=32]: failed to load kprobe__finish_task_switch: error loading BPF program: invalid argument, from kernel source "/usr/share/kepler/kernel_sources/4.18.0-477.13.1.el8_8.x86_64"
I0713 10:29:48.571986       1 bcc_attacher.go:146] trying to load eBPF module with kernel source dir /usr/share/kepler/kernel_sources/5.14.0-284.11.1.el9_2.x86_64
bpf: Failed to load program: Invalid argument

I0713 10:29:49.392366       1 bcc_attacher.go:150] failed to attach perf module with options [-DMAP_SIZE=10240 -DNUM_CPUS=32]: failed to load kprobe__finish_task_switch: error loading BPF program: invalid argument, from kernel source "/usr/share/kepler/kernel_sources/5.14.0-284.11.1.el9_2.x86_64"
I0713 10:29:49.392431       1 bcc_attacher.go:158] failed to attach perf module with options [-DMAP_SIZE=10240 -DNUM_CPUS=32]: failed to load kprobe__finish_task_switch: error loading BPF program: invalid argument, not able to load eBPF modules
I0713 10:29:49.392483       1 exporter.go:201] failed to start : failed to attach bpf assets: failed to attach perf module with options [-DMAP_SIZE=10240 -DNUM_CPUS=32]: failed to load kprobe__finish_task_switch: error loading BPF program: invalid argument, not able to load eBPF modules
I0713 10:29:49.392628       1 exporter.go:228] Started Kepler in 2.318348644s
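The log reports kernel version 4.19, while the fallback attempts use bundled kernel sources for other kernels. To list the fallback directories that were tried, the log can be filtered as sketched below. `tried_kernel_sources` is a hypothetical helper name, and the sketch assumes the log wording shown above; whether the version mismatch is the actual cause of the failure is not established here.

```shell
# Hypothetical helper: read a Kepler log on stdin and print each kernel
# source directory that the BCC attacher tried as a fallback, once.
tried_kernel_sources() {
  sed -n 's/.*trying to load eBPF module with kernel source dir \(.*\)$/\1/p' | sort -u
}
```

For example, `kubectl logs kepler-exporter-vrvmg -n monitoring | tried_kernel_sources` lists the directories, which can then be compared with the output of `uname -r` on the node.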

What is weird is that it complains about /lib/modules, which exists on both of the machines that I’m using:

> kubectl exec -ti debug-9trwq bash
kubectl exec [POD] [COMMAND] is DEPRECATED and will be removed in a future version. Use kubectl kubectl exec [POD] -- [COMMAND] instead.
root@paravance-49:/# nsenter --mount=/proc/1/ns/mnt -- sh -s
# ls /lib/modules
4.19.0-24-amd64
# ls /usr/lib/modules
4.19.0-24-amd64

> kubectl exec -ti debug-x6wlt bash
kubectl exec [POD] [COMMAND] is DEPRECATED and will be removed in a future version. Use kubectl kubectl exec [POD] -- [COMMAND] instead.
root@paravance-40:/# nsenter --mount=/proc/1/ns/mnt -- sh -s
# ls /lib/modules
4.19.0-24-amd64
# ls /usr/lib/modules
4.19.0-24-amd64
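Note that the log complains about /lib/modules/4.19.0-24-amd64/build specifically, not about /lib/modules itself, so listing the parent directory does not show whether the kernel headers are there. A minimal sketch of a more targeted check follows; `check_kernel_headers` is a hypothetical helper name, and it would need to be run both on the node and inside the Kepler container, since the container only sees what is mounted into it.

```shell
# Hypothetical helper: report whether the "build" directory (kernel headers)
# exists for a kernel release under a modules root.
# Arguments: [release, default: uname -r] [modules root, default: /lib/modules]
check_kernel_headers() {
  release="${1:-$(uname -r)}"
  root="${2:-/lib/modules}"
  build_dir="$root/$release/build"
  # -d follows symlinks, so a dangling "build" symlink counts as missing.
  if [ -d "$build_dir" ]; then
    echo "found: $build_dir"
  else
    echo "missing: $build_dir"
    return 1
  fi
}
```

Calling `check_kernel_headers` with no arguments checks the running kernel. How to install the headers depends on the distribution and is not covered by this sketch.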

And finally, the result of the query is the same as above.

Can you help me, please?

PS: In fact, my goal is not to get measurements from the real counters; I want to validate Kepler’s estimations by comparing them with the power meters that are installed in these machines. So, if possible, I would like to keep using the estimations, but I don’t know how to specify that.

Can you help me to solve both issues (the main one and the PS), please?

Thank you very much!!

What did you expect to happen?

To get the estimations from Kepler.

How can we reproduce it (as minimally and precisely as possible)?

Run the commands shown above.

Anything else we need to know?

No response

Kepler image tag

image: quay.io/sustainable_computing_io/kepler:latest

Kubernetes version

> kubectl version
Client Version: version.Info{Major:"1", Minor:"18", GitVersion:"v1.18.0", GitCommit:"9e991415386e4cf155a24b1da15becaa390438d8", GitTreeState:"clean", BuildDate:"2020-03-25T14:58:59Z", GoVersion:"go1.13.8", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"22", GitVersion:"v1.22.4", GitCommit:"b695d79d4f967c403a96986f1750a35eb75e75f1", GitTreeState:"clean", BuildDate:"2021-11-17T15:42:41Z", GoVersion:"go1.16.10", Compiler:"gc", Platform:"linux/amd64"}

Cloud provider or bare metal

bare metal

About this issue

  • State: closed
  • Created a year ago
  • Comments: 31 (9 by maintainers)

Most upvoted comments

@andersonandrei can you share your YAML? Or can you try this one (generated from the main branch)?

kubectl apply -f https://gist.githubusercontent.com/rootfs/24f30eec07d955df9da1b10f8d403d8d/raw/98ab2d2ab93e85a1ed0d3f921f40a2ef013babcd/kepler-eks-bm.yaml

I agree, please try this YAML.

RBAC looks good to me… Before changing the YAML, could you also share the output of the following commands, as a final check of the RBAC setup?

kubectl get clusterrolebinding kepler-clusterrole-binding -oyaml
kubectl auth can-i list pod --as system:serviceaccount:monitoring:kepler-sa
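The ClusterRole is expected to grant get, watch, and list, so the same check can be repeated for all three verbs. A minimal sketch, assuming `kubectl` is configured for the cluster; `check_pod_rbac` is a hypothetical helper name.

```shell
# Hypothetical helper: ask the API server whether a user or service account
# may get, watch, and list a resource (default: pods) at cluster scope.
# Arguments: <user, e.g. system:serviceaccount:monitoring:kepler-sa> [resource]
check_pod_rbac() {
  subject="$1"
  resource="${2:-pods}"
  for verb in get watch list; do
    printf '%s %s: ' "$verb" "$resource"
    # Prints "yes" or "no"; kubectl exits non-zero on "no".
    kubectl auth can-i "$verb" "$resource" --as "$subject"
  done
}
```

For example, `check_pod_rbac system:serviceaccount:monitoring:kepler-sa` prints one yes/no line per verb.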

Is the following message still showing in the log?

W0714 14:02:12.989342       1 reflector.go:424] github.com/sustainable-computing-io/kepler/pkg/kubernetes/watcher.go:123: failed to list <unspecified>: 
pods is forbidden: User "system:serviceaccount:monitoring:kepler-sa" cannot list resource "pods" in API group "" at the cluster scope

@sunya-ch, I just switched back to my original YAML to run the checks you suggested. Here are the results:

clusterrolebinding:

> kubectl get clusterrolebinding kepler-clusterrole-binding -o yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"rbac.authorization.k8s.io/v1","kind":"ClusterRoleBinding","metadata":{"annotations":{},"labels":{"sustainable-computing.io/app":"kepler"},"name":"kepler-clusterrole-binding"},"roleRef":{"apiGroup":"rbac.authorization.k8s.io","kind":"ClusterRole","name":"kepler-clusterrole"},"subjects":[{"kind":"ServiceAccount","name":"kepler-sa","namespace":"monitoring"}]}
  creationTimestamp: "2023-09-14T15:40:08Z"
  labels:
    sustainable-computing.io/app: kepler
  name: kepler-clusterrole-binding
  resourceVersion: "33876"
  uid: d76b8c91-277a-471a-8848-5aa69fb92cce
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: kepler-clusterrole
subjects:
- kind: ServiceAccount
  name: kepler-sa
  namespace: monitoring

Authorization:

> kubectl auth can-i list pod --as system:serviceaccount:monitoring:kepler-sa
yes

The Kepler logs no longer show the ‘cannot list resource "pods"’ message. Here are the full logs:

> kubectl logs kepler-exporter-rtcb2 -n monitoring
I0914 15:40:11.639224       1 gpu.go:46] Failed to init nvml, err: could not init nvml: error opening libnvidia-ml.so.1: libnvidia-ml.so.1: cannot open shared object file: No such file or directory
I0914 15:40:11.646136       1 qat.go:35] Failed to init qat-telemtry err: could not get qat status exit status 127
I0914 15:40:11.652823       1 exporter.go:158] Kepler running on version: 5f33240
I0914 15:40:11.652836       1 config.go:272] using gCgroup ID in the BPF program: true
I0914 15:40:11.652856       1 config.go:274] kernel version: 4.19
I0914 15:40:11.652907       1 config.go:299] The Idle power will be exposed. Are you running on Baremetal or using single VM per node?
I0914 15:40:11.652912       1 exporter.go:170] LibbpfBuilt: false, BccBuilt: true
I0914 15:40:11.652932       1 config.go:205] kernel source dir is set to /usr/share/kepler/kernel_sources
I0914 15:40:11.652980       1 exporter.go:189] EnabledBPFBatchDelete: true
I0914 15:40:11.653014       1 power.go:54] use sysfs to obtain power
I0914 15:40:11.653022       1 redfish.go:169] failed to get redfish credential file path
I0914 15:40:11.656419       1 acpi.go:67] Could not find any ACPI power meter path. Is it a VM?
I0914 15:40:11.680453       1 exporter.go:204] Initializing the GPU collector
I0914 15:40:17.686068       1 watcher.go:66] Using in cluster k8s config
modprobe: FATAL: Module kheaders not found in directory /lib/modules/4.19.0-25-amd64
chdir(/lib/modules/4.19.0-25-amd64/build): No such file or directory
I0914 15:40:17.814031       1 bcc_attacher.go:80] failed to attach the bpf program: <nil>
I0914 15:40:17.814043       1 bcc_attacher.go:159] failed to attach perf module with options [-DMAP_SIZE=10240 -DNUM_CPUS=64 -DSAMPLE_RATE=0]: failed to attach the bpf program: <nil>, from default kernel source.
I0914 15:40:17.814053       1 bcc_attacher.go:162] trying to load eBPF module with kernel source dir /usr/share/kepler/kernel_sources/4.18.0-477.13.1.el8_8.x86_64
bpf: Failed to load program: Invalid argument

I0914 15:40:18.419320       1 bcc_attacher.go:166] failed to attach perf module with options [-DMAP_SIZE=10240 -DNUM_CPUS=64 -DSAMPLE_RATE=0]: failed to load kprobe__finish_task_switch: error loading BPF program: invalid argument, from kernel source "/usr/share/kepler/kernel_sources/4.18.0-477.13.1.el8_8.x86_64"
I0914 15:40:18.419346       1 bcc_attacher.go:162] trying to load eBPF module with kernel source dir /usr/share/kepler/kernel_sources/5.14.0-284.11.1.el9_2.x86_64
bpf: Failed to load program: Invalid argument

I0914 15:40:18.980164       1 bcc_attacher.go:166] failed to attach perf module with options [-DMAP_SIZE=10240 -DNUM_CPUS=64 -DSAMPLE_RATE=0]: failed to load kprobe__finish_task_switch: error loading BPF program: invalid argument, from kernel source "/usr/share/kepler/kernel_sources/5.14.0-284.11.1.el9_2.x86_64"
I0914 15:40:18.980215       1 bcc_attacher.go:174] failed to attach perf module with options [-DMAP_SIZE=10240 -DNUM_CPUS=64 -DSAMPLE_RATE=0]: failed to load kprobe__finish_task_switch: error loading BPF program: invalid argument, not able to load eBPF modules
I0914 15:40:18.980246       1 exporter.go:241] failed to start : failed to attach bpf assets: failed to attach perf module with options [-DMAP_SIZE=10240 -DNUM_CPUS=64 -DSAMPLE_RATE=0]: failed to load kprobe__finish_task_switch: error loading BPF program: invalid argument, not able to load eBPF modules
I0914 15:40:18.980393       1 container_energy.go:109] Using the Ratio/DynPower Power Model to estimate Container Platform Power
I0914 15:40:18.980403       1 container_energy.go:118] Using the Ratio/DynPower Power Model to estimate Container Component Power
I0914 15:40:18.980430       1 process_power.go:108] Using the Ratio/DynPower Power Model to estimate Process Platform Power
I0914 15:40:18.980445       1 process_power.go:117] Using the Ratio/DynPower Power Model to estimate Process Component Power
I0914 15:40:18.980608       1 node_platform_energy.go:53] Using the LinearRegressor/AbsPower Power Model to estimate Node Platform Power
I0914 15:40:18.980798       1 exporter.go:276] Started Kepler in 7.327992991s

@andersonandrei Could you share the result of

kubectl get clusterrole kepler-clusterrole -o yaml

pods should have been added to the resources list by bc981ed, for the apiserver update.

If pods is not there, you can add it to the list manually and restart the pod.

@sunya-ch, here is the output of the command:
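One way to make that manual change is a JSON patch on the ClusterRole. The sketch below is hedged: `add_pods_to_clusterrole` is a hypothetical helper name, and it assumes the ClusterRole has a single rule (index 0) holding the resources list, as in the manifests discussed in this thread.

```shell
# Hypothetical helper: append "pods" to the first rule of a ClusterRole
# unless it is already listed there.
# Argument: [ClusterRole name, default: kepler-clusterrole]
add_pods_to_clusterrole() {
  role="${1:-kepler-clusterrole}"
  # [*] prints the resources separated by spaces; put one per line for grep -x.
  if kubectl get clusterrole "$role" -o jsonpath='{.rules[0].resources[*]}' \
      | tr ' ' '\n' | grep -qx pods; then
    echo "pods already listed in $role"
  else
    kubectl patch clusterrole "$role" --type=json \
      -p '[{"op":"add","path":"/rules/0/resources/-","value":"pods"}]'
  fi
}
```

After patching, deleting the Kepler pod (for example `kubectl delete pod <kepler-pod> -n monitoring`) lets its controller recreate it with the updated permissions in effect.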

kubectl get clusterrole kepler-clusterrole -oyaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"rbac.authorization.k8s.io/v1","kind":"ClusterRole","metadata":{"annotations":{},"labels":{"sustainable-computing.io/app":"kepler"},"name":"kepler-clusterrole"},"rules":[{"apiGroups":[""],"resources":["nodes/metrics","nodes/proxy","nodes/stats","pods"],"verbs":["get","watch","list"]}]}
  creationTimestamp: "2023-09-14T09:37:35Z"
  labels:
    sustainable-computing.io/app: kepler
  name: kepler-clusterrole
  resourceVersion: "2314"
  uid: 47353cf2-9466-457d-b4eb-71449333fe83
rules:
- apiGroups:
  - ""
  resources:
  - nodes/metrics
  - nodes/proxy
  - nodes/stats
  - pods
  verbs:
  - get
  - watch
  - list