prometheus-operator: Alertmanager peers - wrong service hostname

What did you do? Installed this chart: https://github.com/helm/charts/tree/master/stable/prometheus-operator

What did you expect to see? My one-pod Alertmanager cluster initialising without error.

This ARG:

 --cluster.peer=alertmanager-prometheus-operator-alertmanager-0.alertmanager-operated.monitoring.svc:6783

Should be:

 --cluster.peer=alertmanager-prometheus-operator-alertmanager-0.alertmanager-operated.monitoring.svc.cluster.local:6783

What did you see instead? Under which circumstances? Alertmanager fails to join its cluster. No consequence for me, but it would be a problem for anyone wanting HA.

Log Line

msg="unable to join gossip mesh" err="1 error occurred:\n\n* Failed to resolve alertmanager prometheus-operator-alertmanager-0.alertmanager-operated.monitoring.svc:6783: lookup alertmanager-prometheus-operator-alertmanager-0.alertmanager-operated.monitoring.svc on 100.64.0.10:53: no such host"

Environment N/A

  • Prometheus Operator version:

    v0.26.0

  • Kubernetes version information:

Client Version: version.Info{Major:"1", Minor:"13", GitVersion:"v1.13.2", GitCommit:"cff46ab41ff0bb44d8584413b598ad8360ec1def", GitTreeState:"clean", BuildDate:"2019-01-13T23:16:58Z", GoVersion:"go1.11.4", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"11", GitVersion:"v1.11.6", GitCommit:"b1d75deca493a24a2f87eb1efde1a569e52fc8d9", GitTreeState:"clean", BuildDate:"2018-12-16T04:30:10Z", GoVersion:"go1.10.3", Compiler:"gc", Platform:"linux/amd64"}
  • Kubernetes cluster kind:

StatefulSet ~>Pod

  • Manifests:
apiVersion: apps/v1
kind: StatefulSet
metadata:
  creationTimestamp: "2019-02-13T09:20:28Z"
  generation: 1
  labels:
    app: prometheus-operator-alertmanager
    chart: prometheus-operator-2.1.4
    heritage: Tiller
    release: prometheus-operator
  name: alertmanager-prometheus-operator-alertmanager
  namespace: monitoring
  ownerReferences:
  - apiVersion: monitoring.coreos.com/v1
    blockOwnerDeletion: true
    controller: true
    kind: Alertmanager
    name: prometheus-operator-alertmanager
    uid: 95aaa298-2f70-11e9-b491-026f3003ef94
  resourceVersion: "158616929"
  selfLink: /apis/apps/v1/namespaces/monitoring/statefulsets/alertmanager-prometheus-operator-alertmanager
  uid: 9a67ac1b-2f70-11e9-b491-026f3003ef94
spec:
  podManagementPolicy: OrderedReady
  replicas: 1
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      alertmanager: prometheus-operator-alertmanager
      app: alertmanager
  serviceName: alertmanager-operated
  template:
    metadata:
      creationTimestamp: null
      labels:
        alertmanager: prometheus-operator-alertmanager
        app: alertmanager
    spec:
      containers:
      - args:
        - --config.file=/etc/alertmanager/config/alertmanager.yaml
        - --cluster.listen-address=[$(POD_IP)]:6783
        - --storage.path=/alertmanager
        - --data.retention=120h
        - --web.listen-address=:9093
        - --web.external-url=http://prometheus-operator-alertmanager.monitoring:9093
        - --web.route-prefix=/
        - --cluster.peer=alertmanager-prometheus-operator-alertmanager-0.alertmanager-operated.monitoring.svc:6783
  • Prometheus Operator Logs (full log):
level=info ts=2019-02-13T09:20:29.789430593Z caller=main.go:174 msg="Starting Alertmanager" version="(version=0.15.3, branch=HEAD, revision=d4a7697cc90f8bce62efe7c44b63b542578ec0a1)"
level=info ts=2019-02-13T09:20:29.789539982Z caller=main.go:175 build_context="(go=go1.11.2, user=root@4ecc17c53d26, date=20181109-15:40:48)"
level=warn ts=2019-02-13T09:20:29.862066463Z caller=cluster.go:219 component=cluster msg="failed to join cluster" err="1 error occurred:\n\n* Failed to resolve alertmanager-prometheus-operator-alertmanager-0.alertmanager-operated.monitoring.svc:6783: lookup alertmanager-prometheus-operator-alertmanager-0.alertmanager-operated.monitoring.svc on 100.64.0.10:53: no such host"
level=info ts=2019-02-13T09:20:29.862189884Z caller=cluster.go:221 component=cluster msg="will retry joining cluster every 10s"
level=warn ts=2019-02-13T09:20:29.862225117Z caller=main.go:265 msg="unable to join gossip mesh" err="1 error occurred:\n\n* Failed to resolve alertmanager-prometheus-operator-alertmanager-0.alertmanager-operated.monitoring.svc:6783: lookup alertmanager-prometheus-operator-alertmanager-0.alertmanager-operated.monitoring.svc on 100.64.0.10:53: no such host"
level=info ts=2019-02-13T09:20:29.862457546Z caller=main.go:322 msg="Loading configuration file" file=/etc/alertmanager/config/alertmanager.yaml
level=info ts=2019-02-13T09:20:29.862474947Z caller=cluster.go:570 component=cluster msg="Waiting for gossip to settle..." interval=2s
level=info ts=2019-02-13T09:20:29.868174633Z caller=main.go:398 msg=Listening address=:9093
level=info ts=2019-02-13T09:20:31.862912559Z caller=cluster.go:595 component=cluster msg="gossip not settled" polls=0 before=0 now=1 elapsed=2.000228422s
level=info ts=2019-02-13T09:20:39.863777261Z caller=cluster.go:587 component=cluster msg="gossip settled; proceeding" elapsed=10.001095515s

About this issue

  • State: closed
  • Created 5 years ago
  • Reactions: 3
  • Comments: 22 (11 by maintainers)

Most upvoted comments

alertmanager-main-0.alertmanager-operated.monitoring.svc

It seems the reason was an "undefined receiver \"webhook\" used in route" error in my alertmanager-secret.yaml.

After fixing the config file, the Alertmanager pods ran well (a minimal config sketch follows the pod listing below).

$ kubectl get pods -n monitoring
NAME                                   READY   STATUS    RESTARTS   AGE
alertmanager-main-0                    3/3     Running   0          38s
alertmanager-main-1                    3/3     Running   0          38s
alertmanager-main-2                    3/3     Running   0          38s
……
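
For reference, a minimal sketch of the shape a working config needs: every receiver named in the routing tree has to be declared under receivers. The file name and webhook URL here are placeholders, not values from this thread.

# Write a minimal Alertmanager config whose route references a declared receiver
cat <<'EOF' > alertmanager.yaml
route:
  receiver: webhook   # must match a name under receivers, otherwise "undefined receiver" is raised
receivers:
- name: webhook
  webhook_configs:
  - url: http://example.invalid/webhook
EOF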

I’m still seeing this problem with v0.16.1 of alertmanager. I’m installing with helm chart version 4.3.3 on a brand new cluster in EKS. The same installation config worked on minikube, but seems to fail on EKS.

The alertmanager pod's logs show the problem mentioned in this thread.

kc logs -n ops -f alertmanager-prometheus-operator-alertmanager-0 -c alertmanager
level=info ts=2019-03-13T16:23:44.495116716Z caller=main.go:177 msg="Starting Alertmanager" version="(version=0.16.1, branch=HEAD, revision=571caec278be1f0dbadfdf5effd0bbea16562cfc)"
level=info ts=2019-03-13T16:23:44.495197073Z caller=main.go:178 build_context="(go=go1.11.5, user=root@3000aa3a06c5, date=20190131-15:05:40)"
level=warn ts=2019-03-13T16:23:44.543859344Z caller=cluster.go:226 component=cluster msg="failed to join cluster" err="1 error occurred:\n\n* Failed to resolve alertmanager-prometheus-operator-alertmanager-0.alertmanager-operated.ops.svc:6783: lookup alertmanager-prometheus-operator-alertmanager-0.alertmanager-operated.ops.svc on 172.20.0.10:53: no such host"
level=info ts=2019-03-13T16:23:44.543906246Z caller=cluster.go:228 component=cluster msg="will retry joining cluster every 10s"
level=warn ts=2019-03-13T16:23:44.544266055Z caller=main.go:268 msg="unable to join gossip mesh" err="1 error occurred:\n\n* Failed to resolve alertmanager-prometheus-operator-alertmanager-0.alertmanager-operated.ops.svc:6783: lookup alertmanager-prometheus-operator-alertmanager-0.alertmanager-operated.ops.svc on 172.20.0.10:53: no such host"
level=info ts=2019-03-13T16:23:44.544465869Z caller=cluster.go:632 component=cluster msg="Waiting for gossip to settle..." interval=2s
level=info ts=2019-03-13T16:23:44.584079723Z caller=main.go:334 msg="Loading configuration file" file=/etc/alertmanager/config/alertmanager.yaml
level=info ts=2019-03-13T16:23:44.588756986Z caller=main.go:428 msg=Listening address=:9093
level=info ts=2019-03-13T16:23:46.54470086Z caller=cluster.go:657 component=cluster msg="gossip not settled" polls=0 before=0 now=1 elapsed=2.000146659s
level=info ts=2019-03-13T16:23:54.545321816Z caller=cluster.go:649 component=cluster msg="gossip settled; proceeding" elapsed=10.000775836s
level=warn ts=2019-03-13T16:23:59.607746557Z caller=cluster.go:447 component=cluster msg=refresh result=failure addr=alertmanager-prometheus-operator-alertmanager-0.alertmanager-operated.ops.svc:6783

I tried blowing up the pod and letting it reschedule, but it didn't seem to help. I suspect the DNS lookup is failing, but I'll need to get a bastion set up in the environment so I can SSH into the worker node and run those dig commands.

Is there any other information I can provide on this? I don't think this is resolved in v0.16.1.
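
In case it saves a bastion trip, the cluster DNS side can usually be checked with kubectl alone. A sketch, assuming the conventional k8s-app=kube-dns label (used by both kube-dns and CoreDNS, including on EKS) and the default kube-dns Service name:

# Are the DNS pods running, and does the kube-dns Service have endpoints?
kubectl -n kube-system get pods -l k8s-app=kube-dns
kubectl -n kube-system get endpoints kube-dns

# Any errors from the DNS pods themselves?
kubectl -n kube-system logs -l k8s-app=kube-dns --tail=50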

Yep, apologies, that instance of alertmanager was looking up kube-dns on the wrong address. Simply bouncing the pod yielded the log below.

level=info ts=2019-02-25T15:41:00.684358725Z caller=main.go:177 msg="Starting Alertmanager" version="(version=0.16.1, branch=HEAD, revision=571caec278be1f0dbadfdf5effd0bbea16562cfc)"
level=info ts=2019-02-25T15:41:00.684522744Z caller=main.go:178 build_context="(go=go1.11.5, user=root@3000aa3a06c5, date=20190131-15:05:40)"
level=warn ts=2019-02-25T15:41:10.709881104Z caller=cluster.go:226 component=cluster msg="failed to join cluster" err="1 error occurred:\n\n* Failed to join 100.125.152.203: dial tcp 100.125.152.203:6783: i/o timeout"
level=info ts=2019-02-25T15:41:10.709931125Z caller=cluster.go:228 component=cluster msg="will retry joining cluster every 10s"
level=warn ts=2019-02-25T15:41:10.709962386Z caller=main.go:268 msg="unable to join gossip mesh" err="1 error occurred:\n\n* Failed to join 100.125.152.203: dial tcp 100.125.152.203:6783: i/o timeout"
level=info ts=2019-02-25T15:41:10.710035863Z caller=cluster.go:632 component=cluster msg="Waiting for gossip to settle..." interval=2s
level=info ts=2019-02-25T15:41:10.738633245Z caller=main.go:334 msg="Loading configuration file" file=/etc/alertmanager/config/alertmanager.yaml
level=info ts=2019-02-25T15:41:10.742198899Z caller=main.go:428 msg=Listening address=:9093
level=info ts=2019-02-25T15:41:12.710306862Z caller=cluster.go:657 component=cluster msg="gossip not settled" polls=0 before=0 now=1 elapsed=2.000177515s
level=info ts=2019-02-25T15:41:20.711124188Z caller=cluster.go:649 component=cluster msg="gossip settled; proceeding" elapsed=10.000996563s

All seems to be working as stated before. I was just querying the missing suffix on the internal A record.

Thanks
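
As background on the missing suffix question: the short ...svc form is expected to work from inside a pod, because kubelet writes the namespace's search domains into the pod's /etc/resolv.conf and the resolver appends cluster.local on its own; it only fails when cluster DNS itself is unhealthy. One way to see the search path in this setup (a sketch; assumes the Alertmanager image's busybox userland provides cat, and that the cluster domain is the default cluster.local):

# Inspect the DNS configuration the Alertmanager pod actually uses
kubectl exec -n monitoring alertmanager-prometheus-operator-alertmanager-0 -c alertmanager -- cat /etc/resolv.conf

# Typical contents (nameserver taken from the logs in this issue; yours may differ):
#   nameserver 100.64.0.10
#   search monitoring.svc.cluster.local svc.cluster.local cluster.local
#   options ndots:5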