prometheus-operator: Alertmanager peers - wrong service hostname
What did you do? Install this chart https://github.com/helm/charts/tree/master/stable/prometheus-operator
What did you expect to see? My one-pod alertmanager cluster to initialise without error.
This ARG:
--cluster.peer=alertmanager-prometheus-operator-alertmanager-0.alertmanager-operated.monitoring.svc:6783
Should be:
--cluster.peer=alertmanager-prometheus-operator-alertmanager-0.alertmanager-operated.monitoring.svc.cluster.local:6783
What did you see instead? Under which circumstances? Alertmanager fails to join its cluster. No consequence to me but would be a problem for anyone wanting HA.
Log Line
msg="unable to join gossip mesh" err="1 error occurred:\n\n* Failed to resolve alertmanager prometheus-operator-alertmanager-0.alertmanager-operated.monitoring.svc:6783: lookup alertmanager-prometheus-operator-alertmanager-0.alertmanager-operated.monitoring.svc on 100.64.0.10:53: no such host"
Environment: N/A

- Prometheus Operator version: v0.26.0
- Kubernetes version information:
  Client Version: version.Info{Major:"1", Minor:"13", GitVersion:"v1.13.2", GitCommit:"cff46ab41ff0bb44d8584413b598ad8360ec1def", GitTreeState:"clean", BuildDate:"2019-01-13T23:16:58Z", GoVersion:"go1.11.4", Compiler:"gc", Platform:"darwin/amd64"}
  Server Version: version.Info{Major:"1", Minor:"11", GitVersion:"v1.11.6", GitCommit:"b1d75deca493a24a2f87eb1efde1a569e52fc8d9", GitTreeState:"clean", BuildDate:"2018-12-16T04:30:10Z", GoVersion:"go1.10.3", Compiler:"gc", Platform:"linux/amd64"}
- Kubernetes cluster kind: StatefulSet -> Pod
- Manifests:
apiVersion: apps/v1
kind: StatefulSet
metadata:
  creationTimestamp: "2019-02-13T09:20:28Z"
  generation: 1
  labels:
    app: prometheus-operator-alertmanager
    chart: prometheus-operator-2.1.4
    heritage: Tiller
    release: prometheus-operator
  name: alertmanager-prometheus-operator-alertmanager
  namespace: monitoring
  ownerReferences:
  - apiVersion: monitoring.coreos.com/v1
    blockOwnerDeletion: true
    controller: true
    kind: Alertmanager
    name: prometheus-operator-alertmanager
    uid: 95aaa298-2f70-11e9-b491-026f3003ef94
  resourceVersion: "158616929"
  selfLink: /apis/apps/v1/namespaces/monitoring/statefulsets/alertmanager-prometheus-operator-alertmanager
  uid: 9a67ac1b-2f70-11e9-b491-026f3003ef94
spec:
  podManagementPolicy: OrderedReady
  replicas: 1
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      alertmanager: prometheus-operator-alertmanager
      app: alertmanager
  serviceName: alertmanager-operated
  template:
    metadata:
      creationTimestamp: null
      labels:
        alertmanager: prometheus-operator-alertmanager
        app: alertmanager
    spec:
      containers:
      - args:
        - --config.file=/etc/alertmanager/config/alertmanager.yaml
        - --cluster.listen-address=[$(POD_IP)]:6783
        - --storage.path=/alertmanager
        - --data.retention=120h
        - --web.listen-address=:9093
        - --web.external-url=http://prometheus-operator-alertmanager.monitoring:9093
        - --web.route-prefix=/
        - --cluster.peer=alertmanager-prometheus-operator-alertmanager-0.alertmanager-operated.monitoring.svc:6783
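For context, the $(POD_IP) reference in the args above is expanded from an environment variable that the generated StatefulSet typically injects via the downward API; the rest of the container spec is omitted in the manifest above, so this is only a standalone sketch of what those fields usually look like:

# Sketch: how POD_IP is typically provided to the container so that
# --cluster.listen-address can reference it (downward API field reference).
env:
- name: POD_IP
  valueFrom:
    fieldRef:
      apiVersion: v1
      fieldPath: status.podIP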
- Prometheus Operator Logs: Full Log
level=info ts=2019-02-13T09:20:29.789430593Z caller=main.go:174 msg="Starting Alertmanager" version="(version=0.15.3, branch=HEAD, revision=d4a7697cc90f8bce62efe7c44b63b542578ec0a1)"
level=info ts=2019-02-13T09:20:29.789539982Z caller=main.go:175 build_context="(go=go1.11.2, user=root@4ecc17c53d26, date=20181109-15:40:48)"
level=warn ts=2019-02-13T09:20:29.862066463Z caller=cluster.go:219 component=cluster msg="failed to join cluster" err="1 error occurred:\n\n* Failed to resolve alertmanager-prometheus-operator-alertmanager-0.alertmanager-operated.monitoring.svc:6783: lookup alertmanager-prometheus-operator-alertmanager-0.alertmanager-operated.monitoring.svc on 100.64.0.10:53: no such host"
level=info ts=2019-02-13T09:20:29.862189884Z caller=cluster.go:221 component=cluster msg="will retry joining cluster every 10s"
level=warn ts=2019-02-13T09:20:29.862225117Z caller=main.go:265 msg="unable to join gossip mesh" err="1 error occurred:\n\n* Failed to resolve alertmanager-prometheus-operator-alertmanager-0.alertmanager-operated.monitoring.svc:6783: lookup alertmanager-prometheus-operator-alertmanager-0.alertmanager-operated.monitoring.svc on 100.64.0.10:53: no such host"
level=info ts=2019-02-13T09:20:29.862457546Z caller=main.go:322 msg="Loading configuration file" file=/etc/alertmanager/config/alertmanager.yaml
level=info ts=2019-02-13T09:20:29.862474947Z caller=cluster.go:570 component=cluster msg="Waiting for gossip to settle..." interval=2s
level=info ts=2019-02-13T09:20:29.868174633Z caller=main.go:398 msg=Listening address=:9093
level=info ts=2019-02-13T09:20:31.862912559Z caller=cluster.go:595 component=cluster msg="gossip not settled" polls=0 before=0 now=1 elapsed=2.000228422s
level=info ts=2019-02-13T09:20:39.863777261Z caller=cluster.go:587 component=cluster msg="gossip settled; proceeding" elapsed=10.001095515s
About this issue
- Original URL
- State: closed
- Created 5 years ago
- Reactions: 3
- Comments: 22 (11 by maintainers)
It seems the reason was an "undefined receiver \"webhook\" used in route" error in my alertmanager-secret.yaml. After fixing the config file, the alertmanager pods ran fine.
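For reference, that error means the route points at a receiver name that is never declared. A minimal sketch of a config where the two match (the webhook URL is a placeholder, not taken from this issue):

# Minimal Alertmanager config sketch: the route's receiver name must match a
# declared receiver, otherwise config loading fails with "undefined receiver".
global:
  resolve_timeout: 5m
route:
  receiver: webhook
receivers:
- name: webhook
  webhook_configs:
  - url: http://example-webhook.monitoring.svc:8080/  # placeholder endpoint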
I faced a similar issue; the alertmanager config was the reason. See https://github.com/coreos/prometheus-operator/blob/master/Documentation/user-guides/alerting.md
I’m still seeing this problem with v0.16.1 of alertmanager. I’m installing with helm chart version 4.3.3 on a brand new cluster in EKS. The same installation config worked on minikube, but seems to fail on EKS.
The pod logs for the alertmanager pod show the problem mentioned in this thread.
I tried deleting the pod and letting it reschedule, but it didn’t seem to help. I suspect the DNS lookup is failing, but I will need to set up a bastion in the environment to be able to SSH into the worker node and run those dig commands.
Is there any other information I can provide on this? I don’t think this is resolved in 16.1.
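For anyone wanting to run that check, a lookup of the peer's A record against the cluster DNS service would look roughly like this; a sketch only, where the resolver IP is the one from the logs earlier in this issue and will differ per cluster:

# Query the cluster DNS service directly for the peer's fully-qualified A record.
# 100.64.0.10 is the resolver address seen in the logs above; substitute your own.
dig @100.64.0.10 alertmanager-prometheus-operator-alertmanager-0.alertmanager-operated.monitoring.svc.cluster.local +short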
Yep, apologies, that instance of alertmanager was looking up kube-dns on the wrong address. Simply bouncing the pod yielded the log below.
All seems to be working as stated before. I was just querying the missing suffix on the internal A record.
Thanks