rook: Ceph MGR: 2 modules failed on default install

Is this a bug report or feature request?

  • Bug Report

Deviation from expected behavior: Two Ceph MGR modules failed to come up, causing the Ceph cluster to report a HEALTH_ERR state.

See logs: https://gist.github.com/galexrt/3626102e96dddcef071060b71d94e280

Expected behavior: The dashboard and prometheus modules should come up and work without errors.

How to reproduce it (minimal and precise):

  1. Use the example cluster.yaml, in my case in a minikube environment on K8S 1.11.4.

Environment:

  • OS (e.g. from /etc/os-release):
```
NAME="CentOS Linux"
VERSION="7 (Core)"
ID="centos"
ID_LIKE="rhel fedora"
VERSION_ID="7"
PRETTY_NAME="CentOS Linux 7 (Core)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:centos:centos:7"
HOME_URL="https://www.centos.org/"
BUG_REPORT_URL="https://bugs.centos.org/"

CENTOS_MANTISBT_PROJECT="CentOS-7"
CENTOS_MANTISBT_PROJECT_VERSION="7"
REDHAT_SUPPORT_PRODUCT="centos"
REDHAT_SUPPORT_PRODUCT_VERSION="7"
```

  • Kernel (e.g. `uname -a`): `Linux minikube 4.15.0 #1 SMP Fri Oct 5 20:44:14 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux`
  • Cloud provider or hardware configuration:
  • Rook version (use `rook version` inside of a Rook Pod): `rook: v0.8.0-350.g18b2da5f` (freshly built from latest `master` this morning, https://github.com/rook/rook/commit/18b2da5fc5d7a303b9a48119ce55108b55af7f0e)
  • Kubernetes version (use `kubectl version`):
```
Client Version: version.Info{Major:"1", Minor:"13", GitVersion:"v1.13.0", GitCommit:"ddf47ac13c1a9483ea035a79cd7c10005ff21a6d", GitTreeState:"clean", BuildDate:"2018-12-03T21:04:45Z", GoVersion:"go1.11.2", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"11", GitVersion:"v1.11.4", GitCommit:"bf9a868e8ea3d3a8fa53cbb22f566771b3f8068b", GitTreeState:"clean", BuildDate:"2018-10-25T19:06:30Z", GoVersion:"go1.10.3", Compiler:"gc", Platform:"linux/amd64"}
```
  • Kubernetes cluster type (e.g. Tectonic, GKE, OpenShift): minikube
  • Storage backend status (e.g. for Ceph use `ceph health` in the Rook Ceph toolbox): HEALTH_ERR - 2 modules have failed

About this issue

  • State: closed
  • Created 6 years ago
  • Reactions: 3
  • Comments: 26 (12 by maintainers)

Most upvoted comments

OK, I believe I have prototyped the fix: the server_addr setting on the mgr modules needs to be set to the pod IP. By default the dashboard and prometheus modules bind to :: (all interfaces), as seen here, which is what causes the issues in the k8s clusters.

The fix can be tested by running the following commands from the toolbox:

```
# Get the IP of the mgr pod:
kubectl -n rook-ceph get pod -l app=rook-ceph-mgr -o wide

# Set the server_addr for the two modules (replacing <podIP> with the pod IP queried above):
ceph config set mgr.a mgr/prometheus/server_addr <podIP>
ceph config set mgr.a mgr/dashboard/server_addr  <podIP>

# Restart the mgr pod in one of these two ways:
# 1) The easy way is to delete the pod; however, depending on your env it may get
#    a new pod IP.
kubectl -n rook-ceph delete pod -l app=rook-ceph-mgr

# 2) Alternatively, exec into the mgr pod and kill the ceph-mgr process so the same pod
#    will simply restart:
kubectl -n rook-ceph exec -it <pod> bash
# ceph-mgr is running as pid 1
kill 1
```

Now to automate this when the mgr pod starts up…
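For illustration, a minimal sketch of what such automation might look like, assuming the pod IP is injected into the mgr container via the Kubernetes downward API as an environment variable; the `POD_IP` variable, the wrapper script, and the daemon id `a` are hypothetical, not the actual Rook implementation:

```
#!/bin/sh
# Hypothetical mgr startup wrapper (sketch only).
# Assumes POD_IP is injected via the downward API (fieldRef: status.podIP)
# and that the mgr daemon id is "a".
ceph config set mgr.a mgr/dashboard/server_addr  "${POD_IP}"
ceph config set mgr.a mgr/prometheus/server_addr "${POD_IP}"
exec ceph-mgr -f -i a
```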

Hi, if somebody is still interested: I had a similar dashboard issue with Mimic, where the mon nodes run the mgr roles. Apparently the dashboard module has some IPv6 dependency, and IPv6 should be enabled on the mon nodes during installation/configuration. I usually disable IPv6 and as a result ran into the dashboard and Prometheus issues. To fix the issues you need to do the following (a command sketch follows the list):

  1. Enable IPv6 on all monitor nodes.
  2. Check the mgr database records with `ceph config dump`.
  3. Clean them up if needed and make sure there is a record for each mon node (mon1 is a, mon2 is b, and so on): `mgr/dashboard/a/server_addr` set to that mon node's IPv4 address and `mgr/dashboard/a/server_port` set to the dashboard port, 7000.
  4. There should also be two general records for the dashboard module, without the a, b, ... suffix, referring to the IPv6 all-addresses value `::`: `mgr/dashboard/server_addr ::` and `mgr/dashboard/server_port 7000`.
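A hedged sketch of those commands, run from a mon node or the toolbox; the Mimic-style `ceph config` syntax and the placeholder values (e.g. `<mon1-IPv4>`) are assumptions based on the list above, not taken verbatim from the comment:

```
# Inspect the current mgr/dashboard records:
ceph config dump | grep dashboard

# Remove a stale per-daemon record if needed:
ceph config rm mgr mgr/dashboard/a/server_addr

# Per-daemon records, one per mon/mgr (a, b, ...), e.g. for mgr "a":
ceph config set mgr mgr/dashboard/a/server_addr <mon1-IPv4>
ceph config set mgr mgr/dashboard/a/server_port 7000

# General records without the daemon suffix, pointing at the IPv6 all-addresses value:
ceph config set mgr mgr/dashboard/server_addr ::
ceph config set mgr mgr/dashboard/server_port 7000
```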

@liejuntao001 as far as I understand from the community, we'll need to wait for Ceph to publish the 13.2.3 Docker image in order to solve the prometheus and dashboard bugs in Mimic. They are supposed to publish it around Jan 2019.

Had the same problem multiple times after starting over with a fresh setup.
Try this as a workaround:

```
kubectl -n rook-ceph exec -it rook-ceph-tools -- ceph status     # reports HEALTH_ERR
kubectl -n rook-ceph exec -it rook-ceph-tools -- ceph mgr module disable prometheus
kubectl -n rook-ceph exec -it rook-ceph-tools -- ceph mgr module disable dashboard
kubectl -n rook-ceph exec -it rook-ceph-tools -- ceph status     # reports HEALTH_OK
```
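Note that this only disables the failing modules rather than fixing the bind address; once the underlying issue is addressed they can be re-enabled the same way, e.g.:

```
kubectl -n rook-ceph exec -it rook-ceph-tools -- ceph mgr module enable prometheus
kubectl -n rook-ceph exec -it rook-ceph-tools -- ceph mgr module enable dashboard
```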

As this is a CentOS-based image, @epuertat have you seen this in your RH downstream testing?

I’ve seen that, but my common setup has been, you know, CentOS7 + custom Luminous backport of the dashboard.

Just ran a search and found that this was also happening in Luminous 12.2.5's Prometheus module and dashboard. But it was mostly fixed with https://github.com/ceph/ceph/pull/15588.

In the past I was able to work around this issue by setting the listening IP to a specific local address, instead of the default 0.0.0.0 or ::/128.
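As a rough sketch of that kind of workaround from the toolbox (the address is a placeholder, and on older releases the key may need to be set via `config-key` instead of `config`):

```
# Bind the dashboard and prometheus modules to a specific local address
# instead of the 0.0.0.0 / :: wildcard:
ceph config set mgr.a mgr/dashboard/server_addr  <local-IP>
ceph config set mgr.a mgr/prometheus/server_addr <local-IP>
```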

With master I can see a similar error with the restful module (I forced that by immediately disabling and enabling the dashboard module):

```
2018-12-14 10:33:08.550 7fab244ef700  0 mgr[restful] Traceback (most recent call last):
  File "/ceph/src/pybind/mgr/restful/module.py", line 255, in serve
    self._serve()
  File "/ceph/src/pybind/mgr/restful/module.py", line 330, in _serve
    ssl_context=(cert_fname, pkey_fname),
  File "/usr/lib/python2.7/site-packages/werkzeug/serving.py", line 486, in make_server
    passthrough_errors, ssl_context)
  File "/usr/lib/python2.7/site-packages/werkzeug/serving.py", line 410, in __init__
    HTTPServer.__init__(self, (host, int(port)), handler)
  File "/usr/lib64/python2.7/SocketServer.py", line 419, in __init__
    self.server_bind()
  File "/usr/lib64/python2.7/BaseHTTPServer.py", line 108, in server_bind
    SocketServer.TCPServer.server_bind(self)
  File "/usr/lib64/python2.7/SocketServer.py", line 430, in server_bind
    self.socket.bind(self.server_address)
  File "/usr/lib64/python2.7/socket.py", line 224, in meth
    return getattr(self._sock,name)(*args)
error: [Errno 98] Address already in use
```

And finally, https://github.com/ceph/ceph/pull/24734 was merged into 13.2.3. Do you see the same behavior with 13.2.3?