rook: 1.11.2 - IPv6 Cluster creation stuck trying to get quorum status from first MON
Is this a bug report or feature request?
- Bug Report
Deviation from expected behavior: When creating an IPv6-only cluster, the first MON comes up but the Operator fails to get quorum status, preventing creation of further MONs and the cluster.
Expected behavior: IPv6 Cluster creation should result in creation and quorum of all MONs.
How to reproduce it (minimal and precise):
- IPv6-only K8s cluster
- Use rook/ceph:v1.11.2-4.g9928f26cc
- kubectl create -f crd.yaml -f common.yaml -f operator.yaml
- Modify the default cluster.yaml with the following values:
  - ipFamily: "IPv6"
  - dualStack: false
  - requireMsgr2: true
- kubectl create -f cluster.yaml
- Observe creation of the first MON, but nothing further.
File(s) to submit:
apiVersion: ceph.rook.io/v1
kind: CephCluster
metadata:
  name: rook-ceph
  namespace: rook-ceph # namespace:cluster
spec:
  cephVersion:
    # The container image used to launch the Ceph daemon pods (mon, mgr, osd, mds, rgw).
    # v16 is Pacific, and v17 is Quincy.
    # RECOMMENDATION: In production, use a specific version tag instead of the general v17 flag, which pulls the latest release and could result in different
    # versions running within the cluster. See tags available at https://hub.docker.com/r/ceph/ceph/tags/.
    # If you want to be more precise, you can always use a timestamp tag such as quay.io/ceph/ceph:v17.2.3-20220805
    # This tag might not contain a new Ceph version, just security fixes from the underlying operating system, which will reduce vulnerabilities
    image: quay.io/ceph/ceph:v17.2.5
    # Whether to allow unsupported versions of Ceph. Currently `pacific` and `quincy` are supported.
    # Future versions such as `reef` (v18) would require this to be set to `true`.
    # Do not set to true in production.
    allowUnsupported: false
  # The path on the host where configuration files will be persisted. Must be specified.
  # Important: if you reinstall the cluster, make sure you delete this directory from each host or else the mons will fail to start on the new cluster.
  # In Minikube, the '/data' directory is configured to persist across reboots. Use "/data/rook" in Minikube environment.
  dataDirHostPath: /var/lib/rook
  # Whether or not upgrade should continue even if a check fails
  # This means Ceph's status could be degraded and we don't recommend upgrading but you might decide otherwise
  # Use at your OWN risk
  # To understand Rook's upgrade process of Ceph, read https://rook.io/docs/rook/latest/ceph-upgrade.html#ceph-version-upgrades
  skipUpgradeChecks: false
  # Whether or not continue if PGs are not clean during an upgrade
  continueUpgradeAfterChecksEvenIfNotHealthy: false
  # WaitTimeoutForHealthyOSDInMinutes defines the time (in minutes) the operator would wait before an OSD can be stopped for upgrade or restart.
  # If the timeout exceeds and OSD is not ok to stop, then the operator would skip upgrade for the current OSD and proceed with the next one
  # if `continueUpgradeAfterChecksEvenIfNotHealthy` is `false`. If `continueUpgradeAfterChecksEvenIfNotHealthy` is `true`, then operator would
  # continue with the upgrade of an OSD even if its not ok to stop after the timeout. This timeout won't be applied if `skipUpgradeChecks` is `true`.
  # The default wait timeout is 10 minutes.
  waitTimeoutForHealthyOSDInMinutes: 10
  mon:
    # Set the number of mons to be started. Generally recommended to be 3.
    # For highest availability, an odd number of mons should be specified.
    count: 3
    # The mons should be on unique nodes. For production, at least 3 nodes are recommended for this reason.
    # Mons should only be allowed on the same node for test environments where data loss is acceptable.
    allowMultiplePerNode: false
  mgr:
    # When higher availability of the mgr is needed, increase the count to 2.
    # In that case, one mgr will be active and one in standby. When Ceph updates which
    # mgr is active, Rook will update the mgr services to match the active mgr.
    count: 2
    allowMultiplePerNode: false
    modules:
      # Several modules should not need to be included in this list. The "dashboard" and "monitoring" modules
      # are already enabled by other settings in the cluster CR.
      - name: pg_autoscaler
        enabled: true
  # enable the ceph dashboard for viewing cluster status
  dashboard:
    enabled: true
    # serve the dashboard under a subpath (useful when you are accessing the dashboard via a reverse proxy)
    # urlPrefix: /ceph-dashboard
    # serve the dashboard at the given port.
    # port: 8443
    # serve the dashboard using SSL
    ssl: false
  # enable prometheus alerting for cluster
  monitoring:
    # requires Prometheus to be pre-installed
    enabled: false
  network:
    connections:
      # Whether to encrypt the data in transit across the wire to prevent eavesdropping the data on the network.
      # The default is false. When encryption is enabled, all communication between clients and Ceph daemons, or between Ceph daemons will be encrypted.
      # When encryption is not enabled, clients still establish a strong initial authentication and data integrity is still validated with a crc check.
      # IMPORTANT: Encryption requires the 5.11 kernel for the latest nbd and cephfs drivers. Alternatively for testing only,
      # you can set the "mounter: rbd-nbd" in the rbd storage class, or "mounter: fuse" in the cephfs storage class.
      # The nbd and fuse drivers are *not* recommended in production since restarting the csi driver pod will disconnect the volumes.
      encryption:
        enabled: false
      # Whether to compress the data in transit across the wire. The default is false.
      # Requires Ceph Quincy (v17) or newer. Also see the kernel requirements above for encryption.
      compression:
        enabled: false
      # Whether to require communication over msgr2. If true, the msgr v1 port (6789) will be disabled
      # and clients will be required to connect to the Ceph cluster with the v2 port (3300).
      # Requires a kernel that supports msgr v2 (kernel 5.11 or CentOS 8.4 or newer).
      requireMsgr2: true
    # enable host networking
    provider: host
    # Provide internet protocol version. IPv6, IPv4 or empty string are valid options. Empty string would mean IPv4
    ipFamily: "IPv6"
    # Ceph daemons to listen on both IPv4 and IPv6 networks
    dualStack: false
    # Enable multiClusterService to export the mon and OSD services to peer cluster.
    # This is useful to support RBD mirroring between two clusters having overlapping CIDRs.
    # Ensure that peer clusters are connected using an MCS API compatible application, like Globalnet Submariner.
    #multiClusterService:
    #  enabled: false
  # enable the crash collector for ceph daemon crash collection
  crashCollector:
    disable: false
    # Uncomment daysToRetain to prune ceph crash entries older than the
    # specified number of days.
    #daysToRetain: 30
  # enable log collector, daemons will log on files and rotate
  logCollector:
    enabled: true
    periodicity: daily # one of: hourly, daily, weekly, monthly
    maxLogSize: 500M # SUFFIX may be 'M' or 'G'. Must be at least 1M.
  # automate [data cleanup process](https://github.com/rook/rook/blob/master/Documentation/Storage-Configuration/ceph-teardown.md#delete-the-data-on-hosts) in cluster destruction.
  cleanupPolicy:
    # Since cluster cleanup is destructive to data, confirmation is required.
    # To destroy all Rook data on hosts during uninstall, confirmation must be set to "yes-really-destroy-data".
    # This value should only be set when the cluster is about to be deleted. After the confirmation is set,
    # Rook will immediately stop configuring the cluster and only wait for the delete command.
    # If the empty string is set, Rook will not destroy any data on hosts during uninstall.
    confirmation: ""
    # sanitizeDisks represents settings for sanitizing OSD disks on cluster deletion
    sanitizeDisks:
      # method indicates if the entire disk should be sanitized or simply ceph's metadata
      # in both case, re-install is possible
      # possible choices are 'complete' or 'quick' (default)
      method: quick
      # dataSource indicate where to get random bytes from to write on the disk
      # possible choices are 'zero' (default) or 'random'
      # using random sources will consume entropy from the system and will take much more time then the zero source
      dataSource: zero
      # iteration overwrite N times instead of the default (1)
      # takes an integer value
      iteration: 1
    # allowUninstallWithVolumes defines how the uninstall should be performed
    # If set to true, cephCluster deletion does not wait for the PVs to be deleted.
    allowUninstallWithVolumes: false
  # To control where various services will be scheduled by kubernetes, use the placement configuration sections below.
  # The example under 'all' would have all services scheduled on kubernetes nodes labeled with 'role=storage-node' and
  # tolerate taints with a key of 'storage-node'.
  # placement:
  #   all:
  #     nodeAffinity:
  #       requiredDuringSchedulingIgnoredDuringExecution:
  #         nodeSelectorTerms:
  #           - matchExpressions:
  #             - key: role
  #               operator: In
  #               values:
  #                 - storage-node
  #     podAffinity:
  #     podAntiAffinity:
  #     topologySpreadConstraints:
  #     tolerations:
  #     - key: storage-node
  #       operator: Exists
  # The above placement information can also be specified for mon, osd, and mgr components
  #   mon:
  # Monitor deployments may contain an anti-affinity rule for avoiding monitor
  # collocation on the same node. This is a required rule when host network is used
  # or when AllowMultiplePerNode is false. Otherwise this anti-affinity rule is a
  # preferred rule with weight: 50.
  #   osd:
  #   prepareosd:
  #   mgr:
  #   cleanup:
  annotations:
    # all:
    # mon:
    # osd:
    # cleanup:
    # prepareosd:
    # clusterMetadata annotations will be applied to only `rook-ceph-mon-endpoints` configmap and the `rook-ceph-mon` and `rook-ceph-admin-keyring` secrets.
    # And clusterMetadata annotations will not be merged with `all` annotations.
    # clusterMetadata:
    #   kubed.appscode.com/sync: "true"
    # If no mgr annotations are set, prometheus scrape annotations will be set by default.
    # mgr:
  labels:
    # all:
    # mon:
    # osd:
    # cleanup:
    # mgr:
    # prepareosd:
    # monitoring is a list of key-value pairs. It is injected into all the monitoring resources created by operator.
    # These labels can be passed as LabelSelector to Prometheus
    # monitoring:
    # crashcollector:
  resources:
    # The requests and limits set here, allow the mgr pod to use half of one CPU core and 1 gigabyte of memory
    # mgr:
    #   limits:
    #     cpu: "500m"
    #     memory: "1024Mi"
    #   requests:
    #     cpu: "500m"
    #     memory: "1024Mi"
    # The above example requests/limits can also be added to the other components
    # mon:
    # osd:
    # For OSD it also is a possible to specify requests/limits based on device class
    # osd-hdd:
    # osd-ssd:
    # osd-nvme:
    # prepareosd:
    # mgr-sidecar:
    # crashcollector:
    # logcollector:
    # cleanup:
  # The option to automatically remove OSDs that are out and are safe to destroy.
  removeOSDsIfOutAndSafeToRemove: false
  priorityClassNames:
    #all: rook-ceph-default-priority-class
    mon: system-node-critical
    osd: system-node-critical
    mgr: system-cluster-critical
    #crashcollector: rook-ceph-crashcollector-priority-class
  storage: # cluster level storage configuration and selection
    useAllNodes: true
    useAllDevices: false
    #deviceFilter:
    config:
      # crushRoot: "custom-root" # specify a non-default root label for the CRUSH map
      # metadataDevice: "md0" # specify a non-rotational storage so ceph-volume will use it as block db device of bluestore.
      # databaseSizeMB: "1024" # uncomment if the disks are smaller than 100 GB
      # journalSizeMB: "1024" # uncomment if the disks are 20 GB or smaller
      # osdsPerDevice: "1" # this value can be overridden at the node or device level
      # encryptedDevice: "true" # the default value for this option is "false"
    # Individual nodes and their config can be specified as well, but 'useAllNodes' above must be set to false. Then, only the named
    # nodes below will be used as storage resources. Each node's 'name' field should match their 'kubernetes.io/hostname' label.
    # nodes:
    #   - name: "172.17.4.201"
    #     devices: # specific devices to use for storage can be specified for each node
    #       - name: "sdb"
    #       - name: "nvme01" # multiple osds can be created on high performance devices
    #         config:
    #           osdsPerDevice: "5"
    #       - name: "/dev/disk/by-id/ata-ST4000DM004-XXXX" # devices can be specified using full udev paths
    #     config: # configuration can be specified at the node level which overrides the cluster level config
    #   - name: "172.17.4.301"
    #     deviceFilter: "^sd."
    # when onlyApplyOSDPlacement is false, will merge both placement.All() and placement.osd
    onlyApplyOSDPlacement: false
  # The section for configuring management of daemon disruptions during upgrade or fencing.
  disruptionManagement:
    # If true, the operator will create and manage PodDisruptionBudgets for OSD, Mon, RGW, and MDS daemons. OSD PDBs are managed dynamically
    # via the strategy outlined in the [design](https://github.com/rook/rook/blob/master/design/ceph/ceph-managed-disruptionbudgets.md). The operator will
    # block eviction of OSDs by default and unblock them safely when drains are detected.
    managePodBudgets: true
    # A duration in minutes that determines how long an entire failureDomain like `region/zone/host` will be held in `noout` (in addition to the
    # default DOWN/OUT interval) when it is draining. This is only relevant when `managePodBudgets` is `true`. The default value is `30` minutes.
    osdMaintenanceTimeout: 30
    # A duration in minutes that the operator will wait for the placement groups to become healthy (active+clean) after a drain was completed and OSDs came back up.
    # Operator will continue with the next drain if the timeout exceeds. It only works if `managePodBudgets` is `true`.
    # No values or 0 means that the operator will wait until the placement groups are healthy before unblocking the next drain.
    pgHealthCheckTimeout: 0
  # healthChecks
  # Valid values for daemons are 'mon', 'osd', 'status'
  healthCheck:
    daemonHealth:
      mon:
        disabled: false
        interval: 45s
      osd:
        disabled: false
        interval: 60s
      status:
        disabled: false
        interval: 60s
    # Change pod liveness probe timing or threshold values. Works for all mon,mgr,osd daemons.
    livenessProbe:
      mon:
        disabled: false
      mgr:
        disabled: false
      osd:
        disabled: false
    # Change pod startup probe timing or threshold values. Works for all mon,mgr,osd daemons.
    startupProbe:
      mon:
        disabled: false
      mgr:
        disabled: false
      osd:
        disabled: false
Logs to submit:
- Operator’s logs
2023-03-25 21:03:00.531335 D | op-mon: mons have been scheduled
2023-03-25 21:03:00.590316 I | op-mon: cleaning up canary monitor deployment "rook-ceph-mon-a-canary"
2023-03-25 21:03:00.753043 D | ceph-spec: object "rook-ceph-mon-a-canary" matched on update
2023-03-25 21:03:00.753061 D | ceph-spec: do not reconcile deployments updates
2023-03-25 21:03:00.753431 I | op-mon: cleaning up canary monitor deployment "rook-ceph-mon-b-canary"
2023-03-25 21:03:00.851776 D | ceph-spec: object "rook-ceph-mon-b-canary" matched on update
2023-03-25 21:03:00.851799 D | ceph-spec: do not reconcile deployments updates
2023-03-25 21:03:00.854251 I | op-mon: cleaning up canary monitor deployment "rook-ceph-mon-c-canary"
2023-03-25 21:03:00.898176 D | ceph-spec: object "rook-ceph-mon-a-canary" matched on update
2023-03-25 21:03:00.898194 D | ceph-spec: do not reconcile deployments updates
2023-03-25 21:03:00.945826 D | ceph-spec: object "rook-ceph-mon-b-canary" matched on update
2023-03-25 21:03:00.945850 D | ceph-spec: do not reconcile deployments updates
2023-03-25 21:03:01.030501 I | op-mon: creating mon a
2023-03-25 21:03:01.030521 I | op-mon: setting mon "a" endpoints for hostnetwork mode
2023-03-25 21:03:01.030737 D | ceph-spec: object "rook-ceph-mon-c-canary" matched on update
2023-03-25 21:03:01.030796 D | ceph-spec: do not reconcile deployments updates
2023-03-25 21:03:01.432075 D | ceph-spec: object "rook-ceph-mon-c-canary" matched on update
2023-03-25 21:03:01.432097 D | ceph-spec: do not reconcile deployments updates
2023-03-25 21:03:01.739096 D | op-mon: updating config map rook-ceph-mon-endpoints that already exists
2023-03-25 21:03:02.466470 I | op-mon: saved mon endpoints to config map map[csi-cluster-config-json:[{"clusterID":"rook-ceph","monitors":["[fd07:aaaa:bbbb:cccc::11]:3300"],"namespace":""}] data:a=[fd07:aaaa:bbbb:cccc::11]:3300 mapping:{"node":{"a":{"Name":"node01","Hostname":"node01","Address":"fd07:aaaa:bbbb:cccc::11"},"b":{"Name":"node02","Hostname":"node02","Address":"fd07:aaaa:bbbb:cccc::12"},"c":{"Name":"node00","Hostname":"node00","Address":"fd07:aaaa:bbbb:cccc::10"}}} maxMonId:-1 outOfQuorum:]
2023-03-25 21:03:02.466525 D | ceph-spec: object "rook-ceph-mon-endpoints" matched on update
2023-03-25 21:03:02.466576 D | ceph-spec: do not reconcile on configmap that is not "rook-config-override"
2023-03-25 21:03:02.466585 D | op-mon: mons were added or removed from the endpoints cm
2023-03-25 21:03:02.466594 I | op-mon: monitor endpoints changed, updating the bootstrap peer token
2023-03-25 21:03:02.466683 D | op-mon: mons were added or removed from the endpoints cm
2023-03-25 21:03:02.466704 I | op-mon: monitor endpoints changed, updating the bootstrap peer token
2023-03-25 21:03:03.843771 D | op-config: updating config secret "rook-ceph-config"
2023-03-25 21:03:04.163154 D | ceph-spec: object "rook-ceph-config" matched on update
2023-03-25 21:03:04.163217 D | ceph-spec: do not reconcile on "rook-ceph-config" secret changes
2023-03-25 21:03:05.023042 D | ceph-spec: object "rook-ceph-mon-a-canary" matched on update
2023-03-25 21:03:05.023068 D | ceph-spec: do not reconcile deployments updates
2023-03-25 21:03:05.985801 D | ceph-spec: object "rook-ceph-mon-c-canary" matched on update
2023-03-25 21:03:05.985824 D | ceph-spec: do not reconcile deployments updates
2023-03-25 21:03:06.220555 D | ceph-spec: object "rook-ceph-mon-b-canary" matched on update
2023-03-25 21:03:06.220572 D | ceph-spec: do not reconcile deployments updates
2023-03-25 21:03:11.544311 D | cephclient: No ceph configuration override to merge as "rook-config-override" configmap is empty
2023-03-25 21:03:11.544342 I | cephclient: writing config file /var/lib/rook/rook-ceph/rook-ceph.config
2023-03-25 21:03:11.575846 I | cephclient: generated admin config in /var/lib/rook/rook-ceph
2023-03-25 21:03:11.575878 D | ceph-csi: evaluating mon "a" for msgr1 on endpoint "[fd07:aaaa:bbbb:cccc::11]:3300"
2023-03-25 21:03:11.575895 D | ceph-csi: using "rook-ceph" for csi configmap namespace
2023-03-25 21:03:12.217116 D | ceph-spec: object "rook-ceph-mon-c-canary" matched on update
2023-03-25 21:03:12.217137 D | ceph-spec: do not reconcile deployments updates
2023-03-25 21:03:12.227888 I | op-mon: 0 of 1 expected mons are ready. creating or updating deployments without checking quorum in attempt to achieve a healthy mon cluster
2023-03-25 21:03:12.227977 D | op-mon: monConfig: &{ResourceName:rook-ceph-mon-a DaemonName:a PublicIP:fd07:aaaa:bbbb:cccc::11 Port:3300 Zone: NodeName: DataPathMap:0xc0017e5a10 UseHostNetwork:true}
2023-03-25 21:03:12.228100 D | ceph-spec: setting periodicity to "daily". Supported periodicity are hourly, daily, weekly and monthly
2023-03-25 21:03:12.416104 D | op-mon: adding host path volume source to mon deployment rook-ceph-mon-a
2023-03-25 21:03:12.416127 D | op-mon: Starting mon: rook-ceph-mon-a
2023-03-25 21:03:12.899727 D | ceph-spec: object "rook-ceph-mon-c-canary" did not match on delete
2023-03-25 21:03:12.899749 D | ceph-spec: object "rook-ceph-mon-c-canary" did not match on delete
2023-03-25 21:03:12.899768 D | ceph-spec: object "rook-ceph-mon-c-canary" did not match on delete
2023-03-25 21:03:12.899778 D | ceph-spec: object "rook-ceph-mon-c-canary" did not match on delete
2023-03-25 21:03:12.899785 D | ceph-spec: object "rook-ceph-mon-c-canary" did not match on delete
2023-03-25 21:03:12.899793 D | ceph-spec: do not reconcile "rook-ceph-mon-c-canary" on monitor canary deployments
2023-03-25 21:03:13.223633 I | op-mon: updating maxMonID from -1 to 0
2023-03-25 21:03:13.403333 D | ceph-spec: object "rook-ceph-mon-endpoints" matched on update
2023-03-25 21:03:13.403371 D | ceph-spec: do not reconcile on configmap that is not "rook-config-override"
2023-03-25 21:03:13.414013 D | ceph-spec: object "rook-ceph-mon-a" matched on update
2023-03-25 21:03:13.414029 D | ceph-spec: do not reconcile deployments updates
2023-03-25 21:03:13.783491 D | ceph-spec: object "rook-ceph-mon-a" matched on update
2023-03-25 21:03:13.783521 D | ceph-spec: do not reconcile deployments updates
2023-03-25 21:03:13.793732 D | op-mon: updating config map rook-ceph-mon-endpoints that already exists
2023-03-25 21:03:13.919764 I | op-mon: saved mon endpoints to config map map[csi-cluster-config-json:[{"clusterID":"rook-ceph","monitors":["[fd07:aaaa:bbbb:cccc::11]:3300"],"namespace":""}] data:a=[fd07:aaaa:bbbb:cccc::11]:3300 mapping:{"node":{"a":{"Name":"node01","Hostname":"node01","Address":"fd07:aaaa:bbbb:cccc::11"},"b":{"Name":"node02","Hostname":"node02","Address":"fd07:aaaa:bbbb:cccc::12"},"c":{"Name":"node00","Hostname":"node00","Address":"fd07:aaaa:bbbb:cccc::10"}}} maxMonId:0 outOfQuorum:]
2023-03-25 21:03:13.919789 I | op-mon: waiting for mon quorum with [a]
2023-03-25 21:03:13.919829 D | ceph-spec: object "rook-ceph-mon-a" matched on update
2023-03-25 21:03:13.919844 D | ceph-spec: do not reconcile deployments updates
2023-03-25 21:03:13.938648 I | op-mon: mons running: [a]
2023-03-25 21:03:13.938675 D | exec: Running command: ceph quorum_status --connect-timeout=15 --cluster=rook-ceph --conf=/var/lib/rook/rook-ceph/rook-ceph.config --name=client.admin --keyring=/var/lib/rook/rook-ceph/client.admin.keyring --format json
2023-03-25 21:03:14.971001 D | ceph-nodedaemon-controller: "rook-ceph-mon-a-68bc4db5f8-6nxwx" is a ceph pod!
2023-03-25 21:03:14.971106 D | ceph-nodedaemon-controller: reconciling node: "node01"
2023-03-25 21:03:14.971514 D | ceph-nodedaemon-controller: secret "rook-ceph-crash-collector-keyring" in namespace "rook-ceph" not found. retrying in "30s". Secret "rook-ceph-crash-collector-keyring" not found
2023-03-25 21:03:15.188717 D | ceph-spec: object "rook-ceph-mon-a-canary" matched on update
2023-03-25 21:03:15.188743 D | ceph-spec: do not reconcile deployments updates
2023-03-25 21:03:15.297744 D | ceph-spec: object "rook-ceph-mon-a-canary" did not match on delete
2023-03-25 21:03:15.297780 D | ceph-spec: object "rook-ceph-mon-a-canary" did not match on delete
2023-03-25 21:03:15.297797 D | ceph-spec: do not reconcile "rook-ceph-mon-a-canary" on monitor canary deployments
2023-03-25 21:03:15.297812 D | ceph-spec: object "rook-ceph-mon-a-canary" did not match on delete
2023-03-25 21:03:15.297823 D | ceph-spec: object "rook-ceph-mon-a-canary" did not match on delete
2023-03-25 21:03:15.297844 D | ceph-spec: object "rook-ceph-mon-a-canary" did not match on delete
2023-03-25 21:03:29.196891 D | op-mon: failed to get quorum_status. mon quorum status failed: exit status 1
2023-03-25 21:03:34.622280 I | op-mon: mons running: [a]
2023-03-25 21:03:34.622315 D | exec: Running command: ceph quorum_status --connect-timeout=15 --cluster=rook-ceph --conf=/var/lib/rook/rook-ceph/rook-ceph.config --name=client.admin --keyring=/var/lib/rook/rook-ceph/client.admin.keyring --format json
2023-03-25 21:03:39.365173 D | ceph-spec: object "rook-ceph-mon-a" matched on update
2023-03-25 21:03:39.365223 D | ceph-spec: do not reconcile deployments updates
2023-03-25 21:03:44.972072 D | ceph-nodedaemon-controller: reconciling node: "node01"
2023-03-25 21:03:44.972542 D | ceph-nodedaemon-controller: secret "rook-ceph-crash-collector-keyring" in namespace "rook-ceph" not found. retrying in "30s". Secret "rook-ceph-crash-collector-keyring" not found
2023-03-25 21:03:49.873908 D | op-mon: failed to get quorum_status. mon quorum status failed: exit status 1
2023-03-25 21:03:52.426930 D | ceph-spec: object "rook-ceph-mon-b-canary" matched on update
2023-03-25 21:03:52.426952 D | ceph-spec: do not reconcile deployments updates
2023-03-25 21:03:52.491993 D | ceph-spec: object "rook-ceph-mon-b-canary" did not match on delete
2023-03-25 21:03:52.492010 D | ceph-spec: object "rook-ceph-mon-b-canary" did not match on delete
2023-03-25 21:03:52.492019 D | ceph-spec: do not reconcile "rook-ceph-mon-b-canary" on monitor canary deployments
2023-03-25 21:03:52.492027 D | ceph-spec: object "rook-ceph-mon-b-canary" did not match on delete
2023-03-25 21:03:52.492043 D | ceph-spec: object "rook-ceph-mon-b-canary" did not match on delete
2023-03-25 21:03:52.492051 D | ceph-spec: object "rook-ceph-mon-b-canary" did not match on delete
2023-03-25 21:03:54.893200 I | op-mon: mons running: [a]
2023-03-25 21:03:54.893234 D | exec: Running command: ceph quorum_status --connect-timeout=15 --cluster=rook-ceph --conf=/var/lib/rook/rook-ceph/rook-ceph.config --name=client.admin --keyring=/var/lib/rook/rook-ceph/client.admin.keyring --format json
2023-03-25 21:04:10.145461 D | op-mon: failed to get quorum_status. mon quorum status failed: exit status 1
2023-03-25 21:04:14.973439 D | ceph-nodedaemon-controller: reconciling node: "node01"
2023-03-25 21:04:14.973875 D | ceph-nodedaemon-controller: secret "rook-ceph-crash-collector-keyring" in namespace "rook-ceph" not found. retrying in "30s". Secret "rook-ceph-crash-collector-keyring" not found
2023-03-25 21:04:15.163256 I | op-mon: mons running: [a]
2023-03-25 21:04:15.163288 D | exec: Running command: ceph quorum_status --connect-timeout=15 --cluster=rook-ceph --conf=/var/lib/rook/rook-ceph/rook-ceph.config --name=client.admin --keyring=/var/lib/rook/rook-ceph/client.admin.keyring --format json
2023-03-25 21:04:30.425915 D | op-mon: failed to get quorum_status. mon quorum status failed: exit status 1
- MON pod logs
debug 2023-03-25T21:03:23.508+0000 7f3991cbc880 0 mon.a does not exist in monmap, will attempt to join an existing cluster
debug 2023-03-25T21:03:23.508+0000 7f3991cbc880 0 using public_addr v2:[fd07:aaaa:bbbb:cccc::11]:0/0 -> [v2:[fd07:aaaa:bbbb:cccc::11]:3300/0,v1:[fd07:aaaa:bbbb:cccc::11]:6789/0]
debug 2023-03-25T21:03:23.511+0000 7f3991cbc880 0 starting mon.a rank -1 at public addrs [v2:[fd07:aaaa:bbbb:cccc::11]:3300/0,v1:[fd07:aaaa:bbbb:cccc::11]:6789/0] at bind addrs [v2:[fd07:aaaa:bbbb:cccc::11]:3300/0,v1:[fd07:aaaa:bbbb:cccc::11]:6789/0] mon_data /var/lib/ceph/mon/ceph-a fsid c9287f97-61b0-4848-a5e0-6119776fc750
debug 2023-03-25T21:03:23.511+0000 7f3991cbc880 1 mon.a@-1(???) e0 preinit fsid c9287f97-61b0-4848-a5e0-6119776fc750
debug 2023-03-25T21:03:23.511+0000 7f3991cbc880 1 mon.a@-1(???) e0 initial_members a, filtering seed monmap
debug 2023-03-25T21:03:23.515+0000 7f3991cbc880 0 -- [v2:[fd07:aaaa:bbbb:cccc::11]:3300/0,v1:[fd07:aaaa:bbbb:cccc::11]:6789/0] send_to message mon_probe(probe c9287f97-61b0-4848-a5e0-6119776fc750 name a leader -1 new mon_release quincy) v8 with empty dest
debug 2023-03-25T21:03:25.515+0000 7f3987727700 0 -- [v2:[fd07:aaaa:bbbb:cccc::11]:3300/0,v1:[fd07:aaaa:bbbb:cccc::11]:6789/0] send_to message mon_probe(probe c9287f97-61b0-4848-a5e0-6119776fc750 name a leader -1 new mon_release quincy) v8 with empty dest
debug 2023-03-25T21:03:27.518+0000 7f3987727700 0 -- [v2:[fd07:aaaa:bbbb:cccc::11]:3300/0,v1:[fd07:aaaa:bbbb:cccc::11]:6789/0] send_to message mon_probe(probe c9287f97-61b0-4848-a5e0-6119776fc750 name a leader -1 new mon_release quincy) v8 with empty dest
debug 2023-03-25T21:03:29.518+0000 7f3987727700 0 -- [v2:[fd07:aaaa:bbbb:cccc::11]:3300/0,v1:[fd07:aaaa:bbbb:cccc::11]:6789/0] send_to message mon_probe(probe c9287f97-61b0-4848-a5e0-6119776fc750 name a leader -1 new mon_release quincy) v8 with empty dest
debug 2023-03-25T21:03:31.518+0000 7f3987727700 0 -- [v2:[fd07:aaaa:bbbb:cccc::11]:3300/0,v1:[fd07:aaaa:bbbb:cccc::11]:6789/0] send_to message mon_probe(probe c9287f97-61b0-4848-a5e0-6119776fc750 name a leader -1 new mon_release quincy) v8 with empty dest
debug 2023-03-25T21:03:33.518+0000 7f3987727700 0 -- [v2:[fd07:aaaa:bbbb:cccc::11]:3300/0,v1:[fd07:aaaa:bbbb:cccc::11]:6789/0] send_to message mon_probe(probe c9287f97-61b0-4848-a5e0-6119776fc750 name a leader -1 new mon_release quincy) v8 with empty dest
debug 2023-03-25T21:03:35.518+0000 7f3987727700 0 -- [v2:[fd07:aaaa:bbbb:cccc::11]:3300/0,v1:[fd07:aaaa:bbbb:cccc::11]:6789/0] send_to message mon_probe(probe c9287f97-61b0-4848-a5e0-6119776fc750 name a leader -1 new mon_release quincy) v8 with empty dest
debug 2023-03-25T21:03:35.835+0000 7f398b12a700 0 log_channel(audit) log [DBG] : from='admin socket' entity='admin socket' cmd='mon_status' args=[]: dispatch
debug 2023-03-25T21:03:35.835+0000 7f398b12a700 0 log_channel(audit) log [DBG] : from='admin socket' entity='admin socket' cmd=mon_status args=[]: finished
debug 2023-03-25T21:03:37.518+0000 7f3987727700 0 -- [v2:[fd07:aaaa:bbbb:cccc::11]:3300/0,v1:[fd07:aaaa:bbbb:cccc::11]:6789/0] send_to message mon_probe(probe c9287f97-61b0-4848-a5e0-6119776fc750 name a leader -1 new mon_release quincy) v8 with empty dest
debug 2023-03-25T21:03:39.518+0000 7f3987727700 0 -- [v2:[fd07:aaaa:bbbb:cccc::11]:3300/0,v1:[fd07:aaaa:bbbb:cccc::11]:6789/0] send_to message mon_probe(probe c9287f97-61b0-4848-a5e0-6119776fc750 name a leader -1 new mon_release quincy) v8 with empty dest
debug 2023-03-25T21:03:41.518+0000 7f3987727700 0 -- [v2:[fd07:aaaa:bbbb:cccc::11]:3300/0,v1:[fd07:aaaa:bbbb:cccc::11]:6789/0] send_to message mon_probe(probe c9287f97-61b0-4848-a5e0-6119776fc750 name a leader -1 new mon_release quincy) v8 with empty dest
debug 2023-03-25T21:03:43.518+0000 7f3987727700 0 -- [v2:[fd07:aaaa:bbbb:cccc::11]:3300/0,v1:[fd07:aaaa:bbbb:cccc::11]:6789/0] send_to message mon_probe(probe c9287f97-61b0-4848-a5e0-6119776fc750 name a leader -1 new mon_release quincy) v8 with empty dest
debug 2023-03-25T21:03:45.518+0000 7f3987727700 0 -- [v2:[fd07:aaaa:bbbb:cccc::11]:3300/0,v1:[fd07:aaaa:bbbb:cccc::11]:6789/0] send_to message mon_probe(probe c9287f97-61b0-4848-a5e0-6119776fc750 name a leader -1 new mon_release quincy) v8 with empty dest
debug 2023-03-25T21:03:45.675+0000 7f398b12a700 0 log_channel(audit) log [DBG] : from='admin socket' entity='admin socket' cmd='mon_status' args=[]: dispatch
debug 2023-03-25T21:03:45.675+0000 7f398b12a700 0 log_channel(audit) log [DBG] : from='admin socket' entity='admin socket' cmd=mon_status args=[]: finished
debug 2023-03-25T21:03:47.518+0000 7f3987727700 0 -- [v2:[fd07:aaaa:bbbb:cccc::11]:3300/0,v1:[fd07:aaaa:bbbb:cccc::11]:6789/0] send_to message mon_probe(probe c9287f97-61b0-4848-a5e0-6119776fc750 name a leader -1 new mon_release quincy) v8 with empty dest
debug 2023-03-25T21:03:49.518+0000 7f3987727700 0 -- [v2:[fd07:aaaa:bbbb:cccc::11]:3300/0,v1:[fd07:aaaa:bbbb:cccc::11]:6789/0] send_to message mon_probe(probe c9287f97-61b0-4848-a5e0-6119776fc750 name a leader -1 new mon_release quincy) v8 with empty dest
debug 2023-03-25T21:03:51.518+0000 7f3987727700 0 -- [v2:[fd07:aaaa:bbbb:cccc::11]:3300/0,v1:[fd07:aaaa:bbbb:cccc::11]:6789/0] send_to message mon_probe(probe c9287f97-61b0-4848-a5e0-6119776fc750 name a leader -1 new mon_release quincy) v8 with empty dest
debug 2023-03-25T21:03:53.518+0000 7f3987727700 0 -- [v2:[fd07:aaaa:bbbb:cccc::11]:3300/0,v1:[fd07:aaaa:bbbb:cccc::11]:6789/0] send_to message mon_probe(probe c9287f97-61b0-4848-a5e0-6119776fc750 name a leader -1 new mon_release quincy) v8 with empty dest
debug 2023-03-25T21:03:55.518+0000 7f3987727700 0 -- [v2:[fd07:aaaa:bbbb:cccc::11]:3300/0,v1:[fd07:aaaa:bbbb:cccc::11]:6789/0] send_to message mon_probe(probe c9287f97-61b0-4848-a5e0-6119776fc750 name a leader -1 new mon_release quincy) v8 with empty dest
debug 2023-03-25T21:03:55.658+0000 7f398b12a700 0 log_channel(audit) log [DBG] : from='admin socket' entity='admin socket' cmd='mon_status' args=[]: dispatch
debug 2023-03-25T21:03:55.658+0000 7f398b12a700 0 log_channel(audit) log [DBG] : from='admin socket' entity='admin socket' cmd=mon_status args=[]: finished
debug 2023-03-25T21:03:57.518+0000 7f3987727700 0 -- [v2:[fd07:aaaa:bbbb:cccc::11]:3300/0,v1:[fd07:aaaa:bbbb:cccc::11]:6789/0] send_to message mon_probe(probe c9287f97-61b0-4848-a5e0-6119776fc750 name a leader -1 new mon_release quincy) v8 with empty dest
debug 2023-03-25T21:03:59.518+0000 7f3987727700 0 -- [v2:[fd07:aaaa:bbbb:cccc::11]:3300/0,v1:[fd07:aaaa:bbbb:cccc::11]:6789/0] send_to message mon_probe(probe c9287f97-61b0-4848-a5e0-6119776fc750 name a leader -1 new mon_release quincy) v8 with empty dest
debug 2023-03-25T21:04:01.519+0000 7f3987727700 0 -- [v2:[fd07:aaaa:bbbb:cccc::11]:3300/0,v1:[fd07:aaaa:bbbb:cccc::11]:6789/0] send_to message mon_probe(probe c9287f97-61b0-4848-a5e0-6119776fc750 name a leader -1 new mon_release quincy) v8 with empty dest
debug 2023-03-25T21:04:03.519+0000 7f3987727700 0 -- [v2:[fd07:aaaa:bbbb:cccc::11]:3300/0,v1:[fd07:aaaa:bbbb:cccc::11]:6789/0] send_to message mon_probe(probe c9287f97-61b0-4848-a5e0-6119776fc750 name a leader -1 new mon_release quincy) v8 with empty dest
debug 2023-03-25T21:04:05.522+0000 7f3987727700 0 -- [v2:[fd07:aaaa:bbbb:cccc::11]:3300/0,v1:[fd07:aaaa:bbbb:cccc::11]:6789/0] send_to message mon_probe(probe c9287f97-61b0-4848-a5e0-6119776fc750 name a leader -1 new mon_release quincy) v8 with empty dest
debug 2023-03-25T21:04:05.642+0000 7f398b12a700 0 log_channel(audit) log [DBG] : from='admin socket' entity='admin socket' cmd='mon_status' args=[]: dispatch
debug 2023-03-25T21:04:05.642+0000 7f398b12a700 0 log_channel(audit) log [DBG] : from='admin socket' entity='admin socket' cmd=mon_status args=[]: finished
debug 2023-03-25T21:04:07.522+0000 7f3987727700 0 -- [v2:[fd07:aaaa:bbbb:cccc::11]:3300/0,v1:[fd07:aaaa:bbbb:cccc::11]:6789/0] send_to message mon_probe(probe c9287f97-61b0-4848-a5e0-6119776fc750 name a leader -1 new mon_release quincy) v8 with empty dest
debug 2023-03-25T21:04:08.519+0000 7f3987727700 -1 mon.a@-1(probing) e0 get_health_metrics reporting 2 slow ops, oldest is log(1 entries from seq 1 at 2023-03-25T21:03:35.837543+0000)
debug 2023-03-25T21:04:09.522+0000 7f3987727700 0 -- [v2:[fd07:aaaa:bbbb:cccc::11]:3300/0,v1:[fd07:aaaa:bbbb:cccc::11]:6789/0] send_to message mon_probe(probe c9287f97-61b0-4848-a5e0-6119776fc750 name a leader -1 new mon_release quincy) v8 with empty dest
debug 2023-03-25T21:04:11.522+0000 7f3987727700 0 -- [v2:[fd07:aaaa:bbbb:cccc::11]:3300/0,v1:[fd07:aaaa:bbbb:cccc::11]:6789/0] send_to message mon_probe(probe c9287f97-61b0-4848-a5e0-6119776fc750 name a leader -1 new mon_release quincy) v8 with empty dest
debug 2023-03-25T21:04:13.519+0000 7f3987727700 -1 mon.a@-1(probing) e0 get_health_metrics reporting 2 slow ops, oldest is log(1 entries from seq 1 at 2023-03-25T21:03:35.837543+0000)
debug 2023-03-25T21:04:13.522+0000 7f3987727700 0 -- [v2:[fd07:aaaa:bbbb:cccc::11]:3300/0,v1:[fd07:aaaa:bbbb:cccc::11]:6789/0] send_to message mon_probe(probe c9287f97-61b0-4848-a5e0-6119776fc750 name a leader -1 new mon_release quincy) v8 with empty dest
debug 2023-03-25T21:04:15.522+0000 7f3987727700 0 -- [v2:[fd07:aaaa:bbbb:cccc::11]:3300/0,v1:[fd07:aaaa:bbbb:cccc::11]:6789/0] send_to message mon_probe(probe c9287f97-61b0-4848-a5e0-6119776fc750 name a leader -1 new mon_release quincy) v8 with empty dest
debug 2023-03-25T21:04:15.665+0000 7f398b12a700 0 log_channel(audit) log [DBG] : from='admin socket' entity='admin socket' cmd='mon_status' args=[]: dispatch
debug 2023-03-25T21:04:15.665+0000 7f398b12a700 0 log_channel(audit) log [DBG] : from='admin socket' entity='admin socket' cmd=mon_status args=[]: finished
debug 2023-03-25T21:04:17.522+0000 7f3987727700 0 -- [v2:[fd07:aaaa:bbbb:cccc::11]:3300/0,v1:[fd07:aaaa:bbbb:cccc::11]:6789/0] send_to message mon_probe(probe c9287f97-61b0-4848-a5e0-6119776fc750 name a leader -1 new mon_release quincy) v8 with empty dest
debug 2023-03-25T21:04:18.519+0000 7f3987727700 -1 mon.a@-1(probing) e0 get_health_metrics reporting 4 slow ops, oldest is log(1 entries from seq 1 at 2023-03-25T21:03:35.837543+0000)
debug 2023-03-25T21:04:19.522+0000 7f3987727700 0 -- [v2:[fd07:aaaa:bbbb:cccc::11]:3300/0,v1:[fd07:aaaa:bbbb:cccc::11]:6789/0] send_to message mon_probe(probe c9287f97-61b0-4848-a5e0-6119776fc750 name a leader -1 new mon_release quincy) v8 with empty dest
debug 2023-03-25T21:04:21.522+0000 7f3987727700 0 -- [v2:[fd07:aaaa:bbbb:cccc::11]:3300/0,v1:[fd07:aaaa:bbbb:cccc::11]:6789/0] send_to message mon_probe(probe c9287f97-61b0-4848-a5e0-6119776fc750 name a leader -1 new mon_release quincy) v8 with empty dest
debug 2023-03-25T21:04:23.519+0000 7f3987727700 -1 mon.a@-1(probing) e0 get_health_metrics reporting 4 slow ops, oldest is log(1 entries from seq 1 at 2023-03-25T21:03:35.837543+0000)
debug 2023-03-25T21:04:23.522+0000 7f3987727700 0 -- [v2:[fd07:aaaa:bbbb:cccc::11]:3300/0,v1:[fd07:aaaa:bbbb:cccc::11]:6789/0] send_to message mon_probe(probe c9287f97-61b0-4848-a5e0-6119776fc750 name a leader -1 new mon_release quincy) v8 with empty dest
debug 2023-03-25T21:04:25.522+0000 7f3987727700 0 -- [v2:[fd07:aaaa:bbbb:cccc::11]:3300/0,v1:[fd07:aaaa:bbbb:cccc::11]:6789/0] send_to message mon_probe(probe c9287f97-61b0-4848-a5e0-6119776fc750 name a leader -1 new mon_release quincy) v8 with empty dest
debug 2023-03-25T21:04:25.652+0000 7f398b12a700 0 log_channel(audit) log [DBG] : from='admin socket' entity='admin socket' cmd='mon_status' args=[]: dispatch
debug 2023-03-25T21:04:25.652+0000 7f398b12a700 0 log_channel(audit) log [DBG] : from='admin socket' entity='admin socket' cmd=mon_status args=[]: finished
Environment:
- Kernel (e.g. uname -a): 6.1.7
- Cloud provider or hardware configuration: Bare Metal, 3 nodes
- Rook version (use rook version inside of a Rook Pod): rook/ceph:v1.11.2-4.g9928f26cc
- Kubernetes version (use kubectl version): 1.26.1
- Kubernetes cluster type (e.g. Tectonic, GKE, OpenShift): Bare Metal
About this issue
- Original URL
- State: closed
- Created a year ago
- Comments: 34 (34 by maintainers)
I have 3 mons now, all in quorum, and it looks like the rest of the cluster is being brought online. I think that solved the issue!
I believe when `requireMsgr2: false` is set, it results in the conditional failing, and both v2 and v1 endpoints are generated by the `else:`.

That shows what the result of using `net.JoinHostPort(monIP, strconv.Itoa(int(Msgr2port)))` would be. However, just to verify, I am currently trying to set up a quay.io account to push my test build.

Okay, this might take me a few minutes to get situated. This will be my first time pushing an image upstream somewhere. The build seems to be going fine so far, so I'll get a quay account set up while that's building.
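For illustration only (this is not the operator's actual code, just a minimal sketch of the standard library behavior being discussed): `net.JoinHostPort` brackets the host when it contains a colon, so an IPv6 mon address formatted this way comes out as `[host]:port`. The `monIP` value and `Msgr2port = 3300` below are assumptions taken from the mon endpoint shown in the logs.

package main

import (
	"fmt"
	"net"
	"strconv"
)

func main() {
	// Hypothetical values for illustration: the mon's IPv6 address from the
	// endpoints configmap and the msgr2 port (3300, per the cluster.yaml comment).
	monIP := "fd07:aaaa:bbbb:cccc::11"
	var Msgr2port uint16 = 3300

	// JoinHostPort adds square brackets around hosts containing a colon,
	// producing a valid IPv6 endpoint string.
	endpoint := net.JoinHostPort(monIP, strconv.Itoa(int(Msgr2port)))
	fmt.Println(endpoint) // prints: [fd07:aaaa:bbbb:cccc::11]:3300
}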
If I’m not mistaken, the mgrs are usually created after initial mon creation and quorum, right? So we would not expect any mgrs to have been created yet, since this is the bootstrapping of the first mon in the cluster creation process?