VictoriaMetrics: vmagent: failed to recognize scraping targets

Describe the bug

VMAgent does not start scraping some target Endpoints even though an Endpoints object exists that matches the label selectors in the VMAgent config file.

To Reproduce

Unfortunately I don’t have simple, concrete reproduction steps; the issue does not reproduce reliably. I have run the same configuration in a CI test workflow multiple times and have seen 3 successes and 4 failures. The K8s cluster is created from scratch for each CI run using virtual machines. The configuration is not simple: I’m using the VictoriaMetrics operator with two VM instances, one defined by VMSingle and the other by VMCluster. The two VM instances monitor each other, and the VMCluster instance also monitors many other metrics in the K8s cluster. Under this configuration, the VMCluster instance fails to start scraping the AlertManager of the VMSingle instance.

The configuration works well with VM 1.55.1 and fails with VM 1.58.0.

I had a similar issue with VM 1.57.2. There I could work around the problem by updating the target K8s objects, e.g. adding an annotation to the Endpoints. I tried the same for VM 1.58.0 with no luck. As a last resort, restarting the VMAgent Pods (as sketched below) solves the problem.
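The restart here is just a normal rollout of the operator-managed Deployment; the Deployment name below is inferred from the Pod name and may differ in other clusters, so treat this as a sketch:

# restart the VMAgent Pods and wait until the new ReplicaSet is ready
$ kubectl -n monitoring rollout restart deployment vmagent-vmagent-largeset
$ kubectl -n monitoring rollout status deployment vmagent-vmagent-largeset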

Expected behavior

VMAgent starts scraping the targets that are specified in the configuration file and exist in the K8s cluster, without requiring a restart.

Screenshots

The configuration file for the VMAgent in the VMCluster instance contains the following job definition (a cross-check of its keep rules against the target Service is sketched right after the block).

- job_name: monitoring/vmalertmanager-smallset/0
  honor_labels: false
  kubernetes_sd_configs:
  - role: endpoints
    namespaces:
      names:
      - monitoring
  relabel_configs:
  - action: keep
    source_labels:
    - __meta_kubernetes_service_label_app_kubernetes_io_component
    regex: monitoring
  - action: keep
    source_labels:
    - __meta_kubernetes_service_label_app_kubernetes_io_instance
    regex: vmalertmanager-smallset
  - action: keep
    source_labels:
    - __meta_kubernetes_service_label_app_kubernetes_io_name
    regex: vmalertmanager
  - action: keep
    source_labels:
    - __meta_kubernetes_service_label_managed_by
    regex: vm-operator
  - action: keep
    source_labels:
    - __meta_kubernetes_endpoint_port_name
    regex: http
  - source_labels:
    - __meta_kubernetes_endpoint_address_target_kind
    - __meta_kubernetes_endpoint_address_target_name
    separator: ;
    regex: Node;(.*)
    replacement: ${1}
    target_label: node
  - source_labels:
    - __meta_kubernetes_endpoint_address_target_kind
    - __meta_kubernetes_endpoint_address_target_name
    separator: ;
    regex: Pod;(.*)
    replacement: ${1}
    target_label: pod
  - source_labels:
    - __meta_kubernetes_pod_name
    target_label: pod
  - source_labels:
    - __meta_kubernetes_namespace
    target_label: namespace
  - source_labels:
    - __meta_kubernetes_service_name
    target_label: service
  - source_labels:
    - __meta_kubernetes_service_name
    target_label: job
    replacement: ${1}
  - target_label: endpoint
    replacement: http
  - target_label: job
    replacement: vmalertmanager-smallset
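The first four keep rules above match on Service labels and the fifth on the endpoint port name, so a quick cross-check is to query the cluster with the same label selector and to list the port names of the Endpoints object (a sketch only, using the object names shown below):

# the Service must carry all four labels required by the keep rules
$ kubectl -n monitoring get service \
    -l app.kubernetes.io/component=monitoring,app.kubernetes.io/instance=vmalertmanager-smallset,app.kubernetes.io/name=vmalertmanager,managed-by=vm-operator
# the Endpoints object must expose a port named "http"
$ kubectl -n monitoring get endpoints vmalertmanager-vmalertmanager-smallset \
    -o jsonpath='{range .subsets[*].ports[*]}{.name}{"\n"}{end}'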

The VMAgent Pod is running, and the target Service and Endpoints exist; they were created at almost the same time.

$ kubectl -n monitoring get pod vmagent-vmagent-largeset-768f7c64d5-26ckz
NAME                                        READY   STATUS    RESTARTS   AGE
vmagent-vmagent-largeset-768f7c64d5-26ckz   2/2     Running   0          63m
$ kubectl -n monitoring get service vmalertmanager-vmalertmanager-smallset --show-labels
NAME                                     TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)                      AGE   LABELS
vmalertmanager-vmalertmanager-smallset   ClusterIP   None         <none>        9093/TCP,9094/TCP,9094/UDP   63m   app.kubernetes.io/component=monitoring,app.kubernetes.io/instance=vmalertmanager-smallset,app.kubernetes.io/name=vmalertmanager,managed-by=vm-operator
$ kubectl -n monitoring get endpoints vmalertmanager-vmalertmanager-smallset --show-labels
NAME                                     ENDPOINTS                                            AGE   LABELS
vmalertmanager-vmalertmanager-smallset   10.64.0.190:9094,10.64.0.190:9094,10.64.0.190:9093   63m   app.kubernetes.io/component=monitoring,app.kubernetes.io/instance=vmalertmanager-smallset,app.kubernetes.io/name=vmalertmanager,managed-by=vm-operator,service.kubernetes.io/headless=

However, the VMAgent Pod does not recognize the target. The following grep returns a non-empty result in a successful environment.

$ kubectl -n monitoring exec vmagent-vmagent-largeset-768f7c64d5-26ckz -c vmagent -- curl -s http://localhost:8429/api/v1/targets | jq . | grep 'job.*vmalertmanager-smallset'
(empty)
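To distinguish "discovered but dropped by relabeling" from "not discovered at all", the droppedTargets section can be inspected on its own. This is a sketch that assumes the Prometheus-compatible activeTargets/droppedTargets layout of the /api/v1/targets response:

$ kubectl -n monitoring exec vmagent-vmagent-largeset-768f7c64d5-26ckz -c vmagent -- \
    curl -s http://localhost:8429/api/v1/targets | \
    jq '.data.droppedTargets[]? | select((.discoveredLabels.job // "") | test("vmalertmanager-smallset"))'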

Version

$ kubectl -n monitoring exec vmagent-vmagent-largeset-768f7c64d5-26ckz -c vmagent -- /vmagent --version
vmagent-20210409-110604-v1.58.0

Used command-line flags

Explicit command-line flags in the Pod definition:

    - -httpListenAddr=:8429
    - -promscrape.config=/etc/vmagent/config_out/vmagent.env.yaml
    - -promscrape.maxScrapeSize=33554432
    - -remoteWrite.maxDiskUsagePerURL=1073741824
    - -remoteWrite.tmpDataPath=/tmp/vmagent-remotewrite-data
    - -remoteWrite.url=http://vminsert-vmcluster-largeset.monitoring.svc:8480/insert/0/prometheus/api/v1/write

Flags in /metrics:

$ kubectl -n monitoring exec vmagent-vmagent-largeset-768f7c64d5-26ckz -c vmagent -- curl -s http://localhost:8429/metrics | grep '^flag'
flag{name="csvTrimTimestamp", value="1ms"} 1
flag{name="dryRun", value="false"} 1
flag{name="enableTCP6", value="false"} 1
flag{name="envflag.enable", value="false"} 1
flag{name="envflag.prefix", value=""} 1
flag{name="fs.disableMmap", value="false"} 1
flag{name="graphiteListenAddr", value=""} 1
flag{name="graphiteTrimTimestamp", value="1s"} 1
flag{name="http.connTimeout", value="2m0s"} 1
flag{name="http.disableResponseCompression", value="false"} 1
flag{name="http.idleConnTimeout", value="1m0s"} 1
flag{name="http.maxGracefulShutdownDuration", value="7s"} 1
flag{name="http.pathPrefix", value=""} 1
flag{name="http.shutdownDelay", value="0s"} 1
flag{name="httpAuth.password", value="secret"} 1
flag{name="httpAuth.username", value=""} 1
flag{name="httpListenAddr", value=":8429"} 1
flag{name="import.maxLineLen", value="104857600"} 1
flag{name="influx.databaseNames", value=""} 1
flag{name="influx.maxLineSize", value="262144"} 1
flag{name="influxListenAddr", value=""} 1
flag{name="influxMeasurementFieldSeparator", value="_"} 1
flag{name="influxSkipMeasurement", value="false"} 1
flag{name="influxSkipSingleField", value="false"} 1
flag{name="influxTrimTimestamp", value="1ms"} 1
flag{name="insert.maxQueueDuration", value="1m0s"} 1
flag{name="loggerDisableTimestamps", value="false"} 1
flag{name="loggerErrorsPerSecondLimit", value="0"} 1
flag{name="loggerFormat", value="default"} 1
flag{name="loggerLevel", value="INFO"} 1
flag{name="loggerOutput", value="stderr"} 1
flag{name="loggerTimezone", value="UTC"} 1
flag{name="loggerWarnsPerSecondLimit", value="0"} 1
flag{name="maxConcurrentInserts", value="40"} 1
flag{name="maxInsertRequestSize", value="33554432"} 1
flag{name="memory.allowedBytes", value="0"} 1
flag{name="memory.allowedPercent", value="60"} 1
flag{name="metricsAuthKey", value="secret"} 1
flag{name="opentsdbHTTPListenAddr", value=""} 1
flag{name="opentsdbListenAddr", value=""} 1
flag{name="opentsdbTrimTimestamp", value="1s"} 1
flag{name="opentsdbhttp.maxInsertRequestSize", value="33554432"} 1
flag{name="opentsdbhttpTrimTimestamp", value="1ms"} 1
flag{name="pprofAuthKey", value="secret"} 1
flag{name="promscrape.cluster.memberNum", value="0"} 1
flag{name="promscrape.cluster.membersCount", value="0"} 1
flag{name="promscrape.cluster.replicationFactor", value="1"} 1
flag{name="promscrape.config", value="/etc/vmagent/config_out/vmagent.env.yaml"} 1
flag{name="promscrape.config.dryRun", value="false"} 1
flag{name="promscrape.config.strictParse", value="false"} 1
flag{name="promscrape.configCheckInterval", value="0s"} 1
flag{name="promscrape.consul.waitTime", value="0s"} 1
flag{name="promscrape.consulSDCheckInterval", value="30s"} 1
flag{name="promscrape.disableCompression", value="false"} 1
flag{name="promscrape.disableKeepAlive", value="false"} 1
flag{name="promscrape.discovery.concurrency", value="100"} 1
flag{name="promscrape.discovery.concurrentWaitTime", value="1m0s"} 1
flag{name="promscrape.dnsSDCheckInterval", value="30s"} 1
flag{name="promscrape.dockerswarmSDCheckInterval", value="30s"} 1
flag{name="promscrape.dropOriginalLabels", value="false"} 1
flag{name="promscrape.ec2SDCheckInterval", value="1m0s"} 1
flag{name="promscrape.eurekaSDCheckInterval", value="30s"} 1
flag{name="promscrape.fileSDCheckInterval", value="30s"} 1
flag{name="promscrape.gceSDCheckInterval", value="1m0s"} 1
flag{name="promscrape.kubernetes.apiServerTimeout", value="30m0s"} 1
flag{name="promscrape.kubernetesSDCheckInterval", value="30s"} 1
flag{name="promscrape.maxDroppedTargets", value="1000"} 1
flag{name="promscrape.maxScrapeSize", value="33554432"} 1
flag{name="promscrape.openstackSDCheckInterval", value="30s"} 1
flag{name="promscrape.streamParse", value="false"} 1
flag{name="promscrape.suppressDuplicateScrapeTargetErrors", value="false"} 1
flag{name="promscrape.suppressScrapeErrors", value="false"} 1
flag{name="remoteWrite.basicAuth.password", value="secret"} 1
flag{name="remoteWrite.basicAuth.username", value=""} 1
flag{name="remoteWrite.bearerToken", value="secret"} 1
flag{name="remoteWrite.flushInterval", value="1s"} 1
flag{name="remoteWrite.label", value=""} 1
flag{name="remoteWrite.maxBlockSize", value="8388608"} 1
flag{name="remoteWrite.maxDiskUsagePerURL", value="1073741824"} 1
flag{name="remoteWrite.proxyURL", value=""} 1
flag{name="remoteWrite.queues", value="4"} 1
flag{name="remoteWrite.rateLimit", value=""} 1
flag{name="remoteWrite.relabelConfig", value=""} 1
flag{name="remoteWrite.roundDigits", value=""} 1
flag{name="remoteWrite.sendTimeout", value=""} 1
flag{name="remoteWrite.showURL", value="false"} 1
flag{name="remoteWrite.significantFigures", value=""} 1
flag{name="remoteWrite.tlsCAFile", value=""} 1
flag{name="remoteWrite.tlsCertFile", value=""} 1
flag{name="remoteWrite.tlsInsecureSkipVerify", value=""} 1
flag{name="remoteWrite.tlsKeyFile", value="secret"} 1
flag{name="remoteWrite.tlsServerName", value=""} 1
flag{name="remoteWrite.tmpDataPath", value="/tmp/vmagent-remotewrite-data"} 1
flag{name="remoteWrite.url", value="secret"} 1
flag{name="remoteWrite.urlRelabelConfig", value=""} 1
flag{name="sortLabels", value="false"} 1
flag{name="tls", value="false"} 1
flag{name="tlsCertFile", value=""} 1
flag{name="tlsKeyFile", value="secret"} 1
flag{name="version", value="false"} 1

Additional context

The target namespace eventually contains 16 Endpoints objects, but the VMAgent reports finding only 14 objects at startup (a way to diff the two sets is sketched after the log excerpt).

$ kubectl -n monitoring get endpoints --no-headers | wc -l
16
$ kubectl -n monitoring logs vmagent-vmagent-largeset-768f7c64d5-26ckz -c vmagent
(snip)
2021-04-22T10:00:20.675Z        info    VictoriaMetrics/lib/promscrape/discovery/kubernetes/api_watcher.go:419  started endpoints watcher for "https://10.68.0.1:443/api/v1/namespaces/monitoring/endpoints"
2021-04-22T10:00:20.954Z        info    VictoriaMetrics/lib/promscrape/discovery/kubernetes/api_watcher.go:522  reloaded 14 objects from "https://10.68.0.1:443/api/v1/namespaces/monitoring/endpoints"
(snip)
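To identify which two Endpoints objects are missing from discovery, one option is to diff the names known to the API server against the names found in vmagent's discovered labels. The jq path below is an assumption based on the Prometheus-compatible /api/v1/targets response and the standard __meta_kubernetes_endpoints_name label:

# names of Endpoints objects in the namespace, according to the API server
$ kubectl -n monitoring get endpoints -o name | sed 's|^endpoints/||' | sort > /tmp/endpoints.k8s
# names of Endpoints objects that vmagent actually discovered in that namespace
$ kubectl -n monitoring exec vmagent-vmagent-largeset-768f7c64d5-26ckz -c vmagent -- \
    curl -s http://localhost:8429/api/v1/targets | \
    jq -r '.data[][]? | select(.discoveredLabels.__meta_kubernetes_namespace == "monitoring")
           | .discoveredLabels.__meta_kubernetes_endpoints_name // empty' | sort -u > /tmp/endpoints.vmagent
$ diff /tmp/endpoints.k8s /tmp/endpoints.vmagent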

The VMAgent also kept failing to scrape some other targets because their metrics endpoints were not listening. I suppose this is unrelated to the issue.

2021-04-22T10:45:07.447Z        error   VictoriaMetrics/lib/promscrape/scrapework.go:258        error when scraping "http://10.72.48.65:8080/metrics" from job "coil-egress" with labels {container="egress",endpoint="metrics",instance="10.72.48.65:8080",job="coil-egress",namespace="domestic-egress",pod="nat-57bbb5f748-k6bp4",prometheus="monitoring/vmagent-largeset"}: error when scraping "http://10.72.48.65:8080/metrics": dial tcp4 10.72.48.65:8080: i/o timeout; try -enableTCP6 command-line flag if you scrape ipv6 addresses
(and the same error messages...)

About this issue

  • Original URL
  • State: closed
  • Created 3 years ago
  • Comments: 24 (4 by maintainers)

Most upvoted comments

@f41gh7 @valyala I executed my CI test 8 times with v1.60.0 and got 8 successes and 0 failures. Great! I really appreciate your consideration and support!

I got the following two types of logs. In both cases, the desired targets were detected.

Type 1: ‘found changes in “/etc/vmagent/config_out/vmagent.env.yaml”’

...
2021-05-26T14:51:43.732Z        info    VictoriaMetrics/lib/promscrape/discovery/kubernetes/api_watcher.go:378  started node watcher for "https://10.68.0.1:443/api/v1/nodes"
2021-05-26T14:51:43.785Z        info    VictoriaMetrics/lib/promscrape/discovery/kubernetes/api_watcher.go:467  reloaded 7 objects from "https://10.68.0.1:443/api/v1/nodes"; updated=0, removed=0, added=7, resourceVersion="10672"
2021-05-26T14:51:43.786Z        info    VictoriaMetrics/lib/promscrape/config.go:69     started service discovery routines in 3.115 seconds
2021-05-26T14:51:43.868Z        info    VictoriaMetrics/lib/promscrape/scraper.go:359   kubernetes_sd_configs: added targets: 131, removed targets: 0; total targets: 131
2021-05-26T14:51:45.046Z        info    VictoriaMetrics/lib/promscrape/scraper.go:126   SIGHUP received; reloading Prometheus configs from "/etc/vmagent/config_out/vmagent.env.yaml"
2021-05-26T14:51:45.046Z        info    VictoriaMetrics/app/vmagent/remotewrite/remotewrite.go:148      SIGHUP received; reloading relabel configs pointed by -remoteWrite.relabelConfig and -remoteWrite.urlRelabelConfig
2021-05-26T14:51:45.046Z        info    VictoriaMetrics/app/vmagent/remotewrite/remotewrite.go:155      Successfully reloaded relabel configs
2021-05-26T14:51:45.072Z        info    VictoriaMetrics/lib/promscrape/scraper.go:133   nothing changed in "/etc/vmagent/config_out/vmagent.env.yaml"
2021-05-26T14:51:46.052Z        info    VictoriaMetrics/app/vmagent/remotewrite/remotewrite.go:148      SIGHUP received; reloading relabel configs pointed by -remoteWrite.relabelConfig and -remoteWrite.urlRelabelConfig
2021-05-26T14:51:46.052Z        info    VictoriaMetrics/app/vmagent/remotewrite/remotewrite.go:155      Successfully reloaded relabel configs
2021-05-26T14:51:46.052Z        info    VictoriaMetrics/lib/promscrape/scraper.go:126   SIGHUP received; reloading Prometheus configs from "/etc/vmagent/config_out/vmagent.env.yaml"
2021-05-26T14:51:46.071Z        info    VictoriaMetrics/lib/promscrape/config.go:74     stopping service discovery routines...
2021-05-26T14:51:46.071Z        info    VictoriaMetrics/lib/promscrape/config.go:78     stopped service discovery routines in 0.000 seconds
2021-05-26T14:51:46.071Z        info    VictoriaMetrics/lib/promscrape/config.go:65     starting service discovery routines...
2021-05-26T14:51:46.076Z        info    VictoriaMetrics/lib/promscrape/config.go:69     started service discovery routines in 0.005 seconds
2021-05-26T14:51:46.077Z        info    VictoriaMetrics/lib/promscrape/scraper.go:162   found changes in "/etc/vmagent/config_out/vmagent.env.yaml"; applying these changes
2021-05-26T14:51:46.143Z        info    VictoriaMetrics/lib/promscrape/scraper.go:359   kubernetes_sd_configs: added targets: 1, removed targets: 0; total targets: 132
...

Type 2: ‘nothing changed in “/etc/vmagent/config_out/vmagent.env.yaml”’

...
2021-05-26T14:52:56.960Z        info    VictoriaMetrics/lib/promscrape/discovery/kubernetes/api_watcher.go:378  started pod watcher for "https://10.68.0.1:443/api/v1/namespaces/external-dns/pods"
2021-05-26T14:52:57.060Z        info    VictoriaMetrics/lib/promscrape/discovery/kubernetes/api_watcher.go:467  reloaded 0 objects from "https://10.68.0.1:443/api/v1/namespaces/external-dns/pods"; updated=0, removed=0, added=0, resourceVersion="10438"
2021-05-26T14:52:57.061Z        info    VictoriaMetrics/lib/promscrape/discovery/kubernetes/api_watcher.go:378  started pod watcher for "https://10.68.0.1:443/api/v1/namespaces/logging/pods"
2021-05-26T14:52:57.163Z        info    VictoriaMetrics/app/vmagent/remotewrite/remotewrite.go:148      SIGHUP received; reloading relabel configs pointed by -remoteWrite.relabelConfig and -remoteWrite.urlRelabelConfig
2021-05-26T14:52:57.163Z        info    VictoriaMetrics/app/vmagent/remotewrite/remotewrite.go:155      Successfully reloaded relabel configs
2021-05-26T14:52:57.298Z        info    VictoriaMetrics/lib/promscrape/discovery/kubernetes/api_watcher.go:467  reloaded 36 objects from "https://10.68.0.1:443/api/v1/namespaces/logging/pods"; updated=0, removed=0, added=36, resourceVersion="10442"
2021-05-26T14:52:57.299Z        info    VictoriaMetrics/lib/promscrape/discovery/kubernetes/api_watcher.go:378  started pod watcher for "https://10.68.0.1:443/api/v1/namespaces/metallb-system/pods"
2021-05-26T14:52:57.393Z        info    VictoriaMetrics/lib/promscrape/discovery/kubernetes/api_watcher.go:467  reloaded 5 objects from "https://10.68.0.1:443/api/v1/namespaces/metallb-system/pods"; updated=0, removed=0, added=5, resourceVersion="10444"
...
2021-05-26T14:52:57.590Z        info    VictoriaMetrics/lib/promscrape/discovery/kubernetes/api_watcher.go:378  started node watcher for "https://10.68.0.1:443/api/v1/nodes"
2021-05-26T14:52:57.647Z        info    VictoriaMetrics/lib/promscrape/discovery/kubernetes/api_watcher.go:467  reloaded 7 objects from "https://10.68.0.1:443/api/v1/nodes"; updated=0, removed=0, added=7, resourceVersion="10448"
2021-05-26T14:52:57.648Z        info    VictoriaMetrics/lib/promscrape/config.go:69     started service discovery routines in 5.893 seconds
2021-05-26T14:52:57.648Z        info    VictoriaMetrics/lib/promscrape/scraper.go:126   SIGHUP received; reloading Prometheus configs from "/etc/vmagent/config_out/vmagent.env.yaml"
2021-05-26T14:52:57.702Z        info    VictoriaMetrics/lib/promscrape/scraper.go:133   nothing changed in "/etc/vmagent/config_out/vmagent.env.yaml"
2021-05-26T14:52:57.755Z        info    VictoriaMetrics/lib/promscrape/scraper.go:359   kubernetes_sd_configs: added targets: 129, removed targets: 0; total targets: 129
...

FYI, the bugfix related to missing SIGHUP handling has been added in vmagent v1.60.0.

Also, if the solution above didn’t help: I’ve made a build with extra logging of vmagent discovery; it should help identify the source of the issue with missing target labels. May I ask you to run it and provide the log file?

Docker image: victoriametrics/vmagent:heads-vmagent-logging-0-g6b9080c03; it is built from the branch https://github.com/VictoriaMetrics/VictoriaMetrics/tree/vmagent-discovery-logging
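One way to run such a debug image under the operator is to override the image on the VMAgent custom resource. This is only a sketch: the CR name vmagent-largeset is inferred from the Pod name, and spec.image.repository/tag is assumed to be the operator's image override field.

# point the VMAgent CR at the debug image; the operator should then roll the Pods
$ kubectl -n monitoring patch vmagent vmagent-largeset --type=merge \
    -p '{"spec":{"image":{"repository":"victoriametrics/vmagent","tag":"heads-vmagent-logging-0-g6b9080c03"}}}'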

Thanks for additional details!

It looks like vmagent didn’t re-read the -promscrape.config file on the first SIGHUP signal according to the logs from https://github.com/VictoriaMetrics/VictoriaMetrics/issues/1240#issuecomment-840977886 . This could result in an incomplete list of scrape jobs if the initial config file was missing some VictoriaMetrics-related jobs. This should be fixed in the commit c54bb73867ae0024b5acb0bcaa611a129b012abc . Could you verify whether vmagent built from this commit correctly discovers scrape targets on the first run?
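As an interim mitigation while the fix is being verified (not something suggested in the thread, just a sketch based on the flag list above), vmagent can also be made to re-check the scrape config periodically instead of relying only on SIGHUP, by adding a flag to the Pod definition; -promscrape.configCheckInterval appears in the /metrics dump above with its default of 0s, i.e. disabled:

    # re-read -promscrape.config every 30 seconds even if a SIGHUP is missed
    - -promscrape.configCheckInterval=30s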

Could you also try running vmagent from the commit 2eed410466d335598415429a57567ab6a7ac6d1d and verify whether it fixes the issue with service discovery?

Could you also share graphs for the following metrics collected from vmagent (see these docs about vmagent metrics) during successful and failed service discovery:

  • vm_promscrape_discovery_kubernetes_url_watchers
  • vm_promscrape_discovery_kubernetes_subscribers
  • vm_promscrape_discovery_kubernetes_objects
  • vm_promscrape_discovery_kubernetes_objects_added_total
  • vm_promscrape_discovery_kubernetes_objects_removed_total
  • vm_promscrape_discovery_kubernetes_objects_updated_total
  • vm_promscrape_discovery_kubernetes_stale_resource_versions_total
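For a quick look without building dashboards, the same counters can be dumped directly from the vmagent /metrics endpoint, in the same way as the flag dump above; repeating this during a failed and a successful run and comparing the numbers should be enough for a first pass:

$ kubectl -n monitoring exec vmagent-vmagent-largeset-768f7c64d5-26ckz -c vmagent -- \
    curl -s http://localhost:8429/metrics | grep '^vm_promscrape_discovery_kubernetes'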

@valyala I tried with cfd6aa2 6 times. The results were 3 successes and 3 failures.

There was a slight difference in the successful cases. Before cfd6aa2, the output of curl .../targets | jq . | grep ... started at about 40 lines and diminished to 5 lines. With cfd6aa2, curl & grep consistently returned about 50 lines. In both cases, most of the lines were reported as droppedTargets. So refreshing all scrape targets does seem to have an effect. I don’t have a concrete idea for further improvement, though.

@morimoto-cybozu , could you check whether vmagent built from the commit cfd6aa28e1dc8bbd68a7677686be7daa6008ab33 fixes the issue? This commit refreshes all the endpoints and endpointslices scrape targets once per 5 seconds, so all the changes to the dependent pod and service objects must eventually propagate to endpoints and endpointslices scrape targets in under 5 seconds.

@morimoto-cybozu I’ve prepared a docker image, victoriametrics/vmagent:heads-master-0-gd9e3872b1, built from the master branch. It seems to work correctly for me.

The bugfix has been included in vmagent v1.59.0. @morimoto-cybozu , could you verify whether it works as expected now?

@morimoto-cybozu , could you build vmagent from the commit eddba29664eb63214ca529bfa5d24426b91d70cd and verify whether it fixes the issue? See build instructions for vmagent.
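A build sketch following the repository's documented make targets (the target names are taken from the build instructions and may differ between releases, so treat them as an assumption to verify):

$ git clone https://github.com/VictoriaMetrics/VictoriaMetrics
$ cd VictoriaMetrics
$ git checkout eddba29664eb63214ca529bfa5d24426b91d70cd
# static production binary is placed under ./bin
$ make vmagent-prod
# alternatively, build a local docker image for the cluster
$ make package-vmagent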