kepler: Unable to access Kepler metrics from Prometheus
What happened?
Kepler metrics are not being exported to Prometheus: the Kepler target appears as "Dropped" on Prometheus's /service-discovery page.
I am reporting this as a follow-up to the previous issue "How to select the tag of kepler-helm-chart to install Kepler?" about the Kepler Helm deployment, at the suggestion of @rootfs and @LAI-chuchi.
However, the metrics are available on port 9102 of the Kepler pod at http://<cluster-url>:9102/metrics.
I am using kind for my Kubernetes cluster, with Prometheus and Grafana already deployed. I also use Cilium without any problems.
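A minimal sketch of checks that can narrow this down, assuming the Prometheus Operator CRDs are installed and Kepler runs in the kepler namespace (the names are assumptions, not taken from this report):

# Does a ServiceMonitor for Kepler exist, and which labels does it carry?
kubectl get servicemonitor -n kepler -o yaml

# Which labels does Prometheus require ServiceMonitors to have?
kubectl get prometheus -A -o jsonpath='{range .items[*]}{.metadata.namespace}/{.metadata.name}{"\t"}{.spec.serviceMonitorSelector}{"\n"}{end}'

# Confirm the metrics endpoint itself responds (it does in this case)
kubectl -n kepler port-forward svc/kepler 9102:9102 &
curl -s http://localhost:9102/metrics | grep ^kepler_ | head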
What did you expect to happen?
To have Kepler metrics available for queries in my Prometheus dashboard.
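Once scraping works, a query like the following should return data (a sketch using the Prometheus HTTP API; the Prometheus address is a placeholder, and the metric/label names are taken from the /metrics output shown below):

curl -sG 'http://<prometheus-url>:9090/api/v1/query' \
  --data-urlencode 'query=sum by (pod_name) (rate(kepler_container_package_joules_total[2m]))'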
How can we reproduce it (as minimally and precisely as possible)?
Ubuntu 20.04 (see OS version below), kernel 5.5.0-050500-generic, kind v0.18.0
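A rough sketch of the reproduction steps; only the parts stated in this report are shown, and the kind cluster name is an assumption:

# kind v0.18.0
kind create cluster --name kepler-test

# Prometheus and Grafana were already deployed on the cluster (charts not specified in this report).

# Kepler, as in the "Kepler deployment config" section below:
helm repo add kepler https://sustainable-computing-io.github.io/kepler-helm-chart
helm install kepler kepler/kepler --namespace kepler --create-namespace \
  --values config/kepler/kepler-value.yaml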
Anything else we need to know?
Logs from the Kepler pod:
I0706 13:50:55.221651 1 gpu.go:46] Failed to init nvml, err: could not init nvml: error opening libnvidia-ml.so.1: libnvidia-ml.so.1: cannot open shared object file: No such file or directory
I0706 13:50:55.416231 1 acpi.go:67] Could not find any ACPI power meter path. Is it a VM?
I0706 13:50:55.714345 1 exporter.go:151] Kepler running on version: 364c44f
I0706 13:50:55.714388 1 config.go:212] using gCgroup ID in the BPF program: true
I0706 13:50:55.714445 1 config.go:214] kernel version: 5.5
I0706 13:50:55.714484 1 exporter.go:171] EnabledBPFBatchDelete: true
I0706 13:50:55.714602 1 power.go:53] use sysfs to obtain power
I0706 13:51:03.936389 1 exporter.go:184] Initializing the GPU collector
I0706 13:51:03.937016 1 watcher.go:67] Using in cluster k8s config
I0706 13:51:30.137397 1 bcc_attacher.go:171] Successfully load eBPF module with option: [-DMAP_SIZE=10240 -DNUM_CPUS=16]
I0706 13:51:30.717792 1 exporter.go:228] Started Kepler in 35.003475726s
I0706 13:51:33.720063 1 container_hc_collector.go:130] failed to get bpf table elemets, err: failed to batch get: invalid argument
I0706 13:51:33.728652 1 container_hc_collector.go:211] resetting EnabledBPFBatchDelete to false
I0706 13:51:36.914588 1 container_hc_collector.go:164] could not delete bpf table elements, err: Table.Delete: key 0x3b124e: no such file or directory
I0706 13:51:36.914707 1 container_hc_collector.go:164] could not delete bpf table elements, err: Table.Delete: key 0x3b124e: no such file or directory
I0706 13:51:36.914739 1 container_hc_collector.go:164] could not delete bpf table elements, err: Table.Delete: key 0x3b124e: no such file or directory
I0706 13:51:36.914800 1 container_hc_collector.go:164] could not delete bpf table elements, err: Table.Delete: key 0x3b124e: no such file or directory
I0706 13:51:36.914850 1 container_hc_collector.go:164] could not delete bpf table elements, err: Table.Delete: key 0x3b124e: no such file or directory
I0706 13:51:36.914881 1 container_hc_collector.go:164] could not delete bpf table elements, err: Table.Delete: key 0x3b124e: no such file or directory
I0706 13:51:36.914949 1 container_hc_collector.go:164] could not delete bpf table elements, err: Table.Delete: key 0x3b124e: no such file or directory
I0706 13:51:36.914985 1 container_hc_collector.go:164] could not delete bpf table elements, err: Table.Delete: key 0x3b124e: no such file or directory
I0706 13:51:36.915014 1 container_hc_collector.go:164] could not delete bpf table elements, err: Table.Delete: key 0x3b124e: no such file or directory
I0706 13:51:36.915042 1 container_hc_collector.go:164] could not delete bpf table elements, err: Table.Delete: key 0x3b124e: no such file or directory
I0706 13:51:36.915070 1 container_hc_collector.go:164] could not delete bpf table elements, err: Table.Delete: key 0x3b124e: no such file or directory
I0706 13:51:36.915103 1 container_hc_collector.go:164] could not delete bpf table elements, err: Table.Delete: key 0x3b124e: no such file or directory
I0706 13:51:36.915132 1 container_hc_collector.go:164] could not delete bpf table elements, err: Table.Delete: key 0x3b124e: no such file or directory
I0706 13:51:36.915160 1 container_hc_collector.go:164] could not delete bpf table elements, err: Table.Delete: key 0x3b124e: no such file or directory
I0706 13:51:36.915188 1 container_hc_collector.go:164] could not delete bpf table elements, err: Table.Delete: key 0x3b124e: no such file or directory
I0706 13:51:36.915216 1 container_hc_collector.go:164] could not delete bpf table elements, err: Table.Delete: key 0x3b124e: no such file or directory
I0706 13:51:36.915243 1 container_hc_collector.go:164] could not delete bpf table elements, err: Table.Delete: key 0x3b124e: no such file or directory
I0706 13:51:36.915272 1 container_hc_collector.go:164] could not delete bpf table elements, err: Table.Delete: key 0x3b124e: no such file or directory
I0706 13:51:36.915300 1 container_hc_collector.go:164] could not delete bpf table elements, err: Table.Delete: key 0x3b124e: no such file or directory
I0706 13:51:36.915328 1 container_hc_collector.go:164] could not delete bpf table elements, err: Table.Delete: key 0x3b124e: no such file or directory
I0706 13:51:36.915356 1 container_hc_collector.go:164] could not delete bpf table elements, err: Table.Delete: key 0x3b124e: no such file or directory
I0706 13:51:36.915383 1 container_hc_collector.go:164] could not delete bpf table elements, err: Table.Delete: key 0x3b124e: no such file or directory
I0706 13:51:36.915411 1 container_hc_collector.go:164] could not delete bpf table elements, err: Table.Delete: key 0x3b124e: no such file or directory
I0706 13:51:36.915442 1 container_hc_collector.go:164] could not delete bpf table elements, err: Table.Delete: key 0x3b124e: no such file or directory
[...] (goes on forever)
The metrics from my pod (http://172.18.0.2:9102/metrics):
# HELP go_gc_duration_seconds A summary of the pause duration of garbage collection cycles.
# TYPE go_gc_duration_seconds summary
go_gc_duration_seconds{quantile="0"} 5.4665e-05
go_gc_duration_seconds{quantile="0.25"} 9.1751e-05
go_gc_duration_seconds{quantile="0.5"} 0.000116279
go_gc_duration_seconds{quantile="0.75"} 0.000145458
go_gc_duration_seconds{quantile="1"} 0.100348473
go_gc_duration_seconds_sum 7.546713807
go_gc_duration_seconds_count 919
# HELP go_goroutines Number of goroutines that currently exist.
# TYPE go_goroutines gauge
go_goroutines 20
# HELP go_info Information about the Go environment.
# TYPE go_info gauge
go_info{version="go1.18.1"} 1
# HELP go_memstats_alloc_bytes Number of bytes allocated and still in use.
# TYPE go_memstats_alloc_bytes gauge
go_memstats_alloc_bytes 7.207032e+06
# HELP go_memstats_alloc_bytes_total Total number of bytes allocated, even if freed.
# TYPE go_memstats_alloc_bytes_total counter
go_memstats_alloc_bytes_total 3.389494576e+09
# HELP go_memstats_buck_hash_sys_bytes Number of bytes used by the profiling bucket hash table.
# TYPE go_memstats_buck_hash_sys_bytes gauge
go_memstats_buck_hash_sys_bytes 1.537498e+06
# HELP go_memstats_frees_total Total number of frees.
# TYPE go_memstats_frees_total counter
go_memstats_frees_total 5.0197247e+07
# HELP go_memstats_gc_sys_bytes Number of bytes used for garbage collection system metadata.
# TYPE go_memstats_gc_sys_bytes gauge
go_memstats_gc_sys_bytes 5.604088e+06
# HELP go_memstats_heap_alloc_bytes Number of heap bytes allocated and still in use.
# TYPE go_memstats_heap_alloc_bytes gauge
go_memstats_heap_alloc_bytes 7.207032e+06
# HELP go_memstats_heap_idle_bytes Number of heap bytes waiting to be used.
# TYPE go_memstats_heap_idle_bytes gauge
go_memstats_heap_idle_bytes 6.864896e+06
# HELP go_memstats_heap_inuse_bytes Number of heap bytes that are in use.
# TYPE go_memstats_heap_inuse_bytes gauge
go_memstats_heap_inuse_bytes 8.470528e+06
# HELP go_memstats_heap_objects Number of allocated objects.
# TYPE go_memstats_heap_objects gauge
go_memstats_heap_objects 49085
# HELP go_memstats_heap_released_bytes Number of heap bytes released to OS.
# TYPE go_memstats_heap_released_bytes gauge
go_memstats_heap_released_bytes 4.243456e+06
# HELP go_memstats_heap_sys_bytes Number of heap bytes obtained from system.
# TYPE go_memstats_heap_sys_bytes gauge
go_memstats_heap_sys_bytes 1.5335424e+07
# HELP go_memstats_last_gc_time_seconds Number of seconds since 1970 of last garbage collection.
# TYPE go_memstats_last_gc_time_seconds gauge
go_memstats_last_gc_time_seconds 1.6886531940142617e+09
# HELP go_memstats_lookups_total Total number of pointer lookups.
# TYPE go_memstats_lookups_total counter
go_memstats_lookups_total 0
# HELP go_memstats_mallocs_total Total number of mallocs.
# TYPE go_memstats_mallocs_total counter
go_memstats_mallocs_total 5.0246332e+07
# HELP go_memstats_mcache_inuse_bytes Number of bytes in use by mcache structures.
# TYPE go_memstats_mcache_inuse_bytes gauge
go_memstats_mcache_inuse_bytes 19200
# HELP go_memstats_mcache_sys_bytes Number of bytes used for mcache structures obtained from system.
# TYPE go_memstats_mcache_sys_bytes gauge
go_memstats_mcache_sys_bytes 31200
# HELP go_memstats_mspan_inuse_bytes Number of bytes in use by mspan structures.
# TYPE go_memstats_mspan_inuse_bytes gauge
go_memstats_mspan_inuse_bytes 267920
# HELP go_memstats_mspan_sys_bytes Number of bytes used for mspan structures obtained from system.
# TYPE go_memstats_mspan_sys_bytes gauge
go_memstats_mspan_sys_bytes 375360
# HELP go_memstats_next_gc_bytes Number of heap bytes when next garbage collection will take place.
# TYPE go_memstats_next_gc_bytes gauge
go_memstats_next_gc_bytes 9.534512e+06
# HELP go_memstats_other_sys_bytes Number of bytes used for other system allocations.
# TYPE go_memstats_other_sys_bytes gauge
go_memstats_other_sys_bytes 3.28271e+06
# HELP go_memstats_stack_inuse_bytes Number of bytes in use by the stack allocator.
# TYPE go_memstats_stack_inuse_bytes gauge
go_memstats_stack_inuse_bytes 1.441792e+06
# HELP go_memstats_stack_sys_bytes Number of bytes obtained from system for stack allocator.
# TYPE go_memstats_stack_sys_bytes gauge
go_memstats_stack_sys_bytes 1.441792e+06
# HELP go_memstats_sys_bytes Number of bytes obtained from system.
# TYPE go_memstats_sys_bytes gauge
go_memstats_sys_bytes 2.7608072e+07
# HELP go_threads Number of OS threads created.
# TYPE go_threads gauge
go_threads 21
# HELP kepler_container_package_joules_total Aggregated RAPL value in package (socket) in joules
# TYPE kepler_container_package_joules_total counter
kepler_container_package_joules_total{command="",container_id="",container_name="upf",container_namespace="core5g-sys",mode="dynamic",pod_name="core5g-free5gc-upf-upf-6cb555dd9-fcl68"} 0
kepler_container_package_joules_total{command="",container_id="",container_name="upf",container_namespace="core5g-sys",mode="idle",pod_name="core5g-free5gc-upf-upf-6cb555dd9-fcl68"} 1113.33
kepler_container_package_joules_total{command="",container_id="0148e9f8e423c9eee15a211f75f7fda26e946d3a803d145c0ad51192685990e1",container_name="amf",container_namespace="core5g-sys",mode="dynamic",pod_name="core5g-free5gc-amf-amf-d67fc97bf-z5qwf"} 0
kepler_container_package_joules_total{command="",container_id="0148e9f8e423c9eee15a211f75f7fda26e946d3a803d145c0ad51192685990e1",container_name="amf",container_namespace="core5g-sys",mode="idle",pod_name="core5g-free5gc-amf-amf-d67fc97bf-z5qwf"} 1113.33
kepler_container_package_joules_total{command="",container_id="0164b39138b37d4d36334522c0f862cf5ed610a0cb162371bfb286a9f891497c",container_name="install-cni-binaries",container_namespace="kube-system",mode="dynamic",pod_name="cilium-kjj2f"} 0
kepler_container_package_joules_total{command="",container_id="0164b39138b37d4d36334522c0f862cf5ed610a0cb162371bfb286a9f891497c",container_name="install-cni-binaries",container_namespace="kube-system",mode="idle",pod_name="cilium-kjj2f"} 1113.33
kepler_container_package_joules_total{command="",container_id="0267587b6b0919ac7297dcedf9378831c8d0538e0d2686608b3f1d92ec617151",container_name="wait-mongo",container_namespace="core5g-sys",mode="dynamic",pod_name="core5g-free5gc-nrf-nrf-f89c6b99b-ntdkj"} 0
kepler_container_package_joules_total{command="",container_id="0267587b6b0919ac7297dcedf9378831c8d0538e0d2686608b3f1d92ec617151",container_name="wait-mongo",container_namespace="core5g-sys",mode="idle",pod_name="core5g-free5gc-nrf-nrf-f89c6b99b-ntdkj"} 1113.33
kepler_container_package_joules_total{command="",container_id="0ab33bc2b900f87b20a41c4f848f29314ba61eda210ba66bce841734136038ef",container_name="prometheus-server-configmap-reload",container_namespace="monitoring",mode="dynamic",pod_name="prometheus-server-
[...]
Kepler image tag:
kepler-0.4.3 release-0.5.1
Kubernetes version
Client Version: version.Info{Major:"1", Minor:"27", GitVersion:"v1.27.3", GitCommit:"25b4e43193bcda6c7328a6d147b1fb73a33f1598", GitTreeState:"clean", BuildDate:"2023-06-14T09:47:38Z", GoVersion:"go1.20.5", Compiler:"gc", Platform:"linux/amd64"}
Kustomize Version: v5.0.1
Server Version: version.Info{Major:"1", Minor:"26", GitVersion:"v1.26.3", GitCommit:"9e644106593f3f4aa98f8a84b23db5fa378900bd", GitTreeState:"clean", BuildDate:"2023-03-30T06:34:50Z", GoVersion:"go1.19.7", Compiler:"gc", Platform:"linux/amd64"}
OS version
# On Linux:
$ cat /etc/os-release
NAME="Ubuntu"
VERSION="20.04.6 LTS (Focal Fossa)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 20.04.6 LTS"
VERSION_ID="20.04"
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
VERSION_CODENAME=focal
UBUNTU_CODENAME=focal
$ uname -a
Linux yd-CZC34914Z1 5.5.0-050500-generic #202001262030 SMP Mon Jan 27 01:33:36 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
Kepler deployment config
helm repo add kepler https://sustainable-computing-io.github.io/kepler-helm-chart
helm install kepler kepler/kepler --namespace kepler --create-namespace --values config/kepler/kepler-value.yaml
kepler-value.yaml:
# -- Replaces the name of the chart in the Chart.yaml file
nameOverride: ""
# -- Replaces the generated name
fullnameOverride: ""
image:
  # -- Repository to pull the image from
  repository: "quay.io/sustainable_computing_io/kepler"
  # -- Image tag, if empty it will get it from the chart's appVersion
  tag: ""
  # -- Pull policy
  pullPolicy: Always
# -- Secret name for pulling images from private repository
imagePullSecrets: []
# -- Additional DaemonSet annotations
annotations: {}
# -- Additional pod annotations
podAnnotations: {}
# -- Privileges and access control settings for a Pod (all containers in a pod)
podSecurityContext: {}
  # fsGroup: 2000
# -- Privileges and access control settings for a container
securityContext:
  privileged: true
# -- Node selection constraint
nodeSelector: {}
# -- Toleration for taints
tolerations:
  - effect: NoSchedule
    key: node-role.kubernetes.io/master
# -- Affinity rules
affinity: {}
# -- CPU/MEM resources
resources:
  requests:
    cpu: 100m
    memory: 200Mi
  limits:
    cpu: 100m
    memory: 200Mi
# -- Extra environment variables
extraEnvVars:
  KEPLER_LOG_LEVEL: "1"
  ENABLE_GPU: "true"
  ENABLE_EBPF_CGROUPID: "true"
  EXPOSE_IRQ_COUNTER_METRICS: "true"
  EXPOSE_KUBELET_METRICS: "true"
  ENABLE_PROCESS_METRICS: "true"
  CPU_ARCH_OVERRIDE: ""
  CGROUP_METRICS: "*"
service:
  annotations: {}
  type: ClusterIP
  port: 9102
serviceAccount:
  # Specifies whether a service account should be created
  create: true
  # Annotations to add to the service account
  annotations: {}
  # The name of the service account to use.
  # If not set and create is true, a name is generated using the fullname template
  name: ""
serviceMonitor:
  enabled: true
  namespace: ""
  interval: 1m
  scrapeTimeout: 10s
  labels: {}
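Note that serviceMonitor.labels is left empty here. As the comments below suggest, this is likely why the target is dropped. A hedged example of setting a matching label at install time ("release: prometheus" is only an illustration; the key/value must match your Prometheus serviceMonitorSelector):

helm upgrade --install kepler kepler/kepler --namespace kepler \
  --values config/kepler/kepler-value.yaml \
  --set serviceMonitor.labels.release=prometheus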
Related plugins (CNI, CSI, …) and versions (if applicable)
MULTUS CNI
About this issue
- State: closed
- Created a year ago
- Comments: 19 (9 by maintainers)
I came across the same issue when using the kube-prometheus-stack helm chart. As @clin4 mentioned, in my case it was due to a missing label. By default, the helm chart adds a ServiceMonitor selector (preventing Prometheus from matching all ServiceMonitors) unless you disable this configuration: https://github.com/prometheus-community/helm-charts/blob/f36d97ed314926a8a735a4d97f37af756ebc0bcb/charts/kube-prometheus-stack/values.yaml#L3021

To me, it sounds like your ServiceMonitor is missing a label, which prevents Prometheus from discovering it. When you install Kepler via Helm, besides enabling the ServiceMonitor you also need to provide a label. Since I am using Terraform, the code looks like [...]
In the end, you can find this label in the resulting ServiceMonitor.
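The Terraform snippets referenced above were not captured in this thread. Purely for illustration, the equivalent Helm-based fixes might look like the following (the release label value is an assumption and must match your Prometheus serviceMonitorSelector):

# Option A: let kube-prometheus-stack select ServiceMonitors regardless of their labels
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm upgrade --install prometheus prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --set prometheus.prometheusSpec.serviceMonitorSelectorNilUsesHelmValues=false

# Option B: add the label the Prometheus serviceMonitorSelector expects to Kepler's ServiceMonitor
helm upgrade --install kepler kepler/kepler --namespace kepler \
  --set serviceMonitor.enabled=true \
  --set serviceMonitor.labels.release=prometheus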
Thank you @juangascon for the pointer!
@edoblette Can you get the service monitor yaml from your setup?
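For example (assuming the release is named kepler and lives in the kepler namespace):

kubectl get servicemonitor -A | grep -i kepler           # locate it
kubectl get servicemonitor kepler -n kepler -o yaml      # dump it (adjust name/namespace)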
If it helps, can you use this yaml I just created with OPTS="PROMETHEUS_DEPLOY"?
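Presumably the referenced YAML was generated from the Kepler repository; the command below is an assumption based on the OPTS value mentioned, not something shown in this thread:

# Assumption: generate the deployment manifests (including the Prometheus pieces) from the Kepler source tree
make build-manifest OPTS="PROMETHEUS_DEPLOY"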