rook: PGs are stuck in "inactive" state after cluster initialisation

Is this a bug report or feature request?

  • Bug Report

Deviation from expected behavior: After initialising a test single-node bare-metal cluster, ceph health detail reports an unhealthy cluster state:

HEALTH_WARN Reduced data availability: 1 pgs inactive
[WRN] PG_AVAILABILITY: Reduced data availability: 1 pgs inactive
    pg 1.0 is stuck inactive for 1h, current state unknown, last acting []

Further attempts to create a StorageClass and a PVC result in the PVC being stuck in Pending.
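
For reference, the stuck claim is visible directly from kubectl (the claim name below is a placeholder, not the one from my manifests):

# list claims; the PVC stays in STATUS "Pending" and never binds a volume
kubectl get pvc --all-namespaces
# inspect provisioning events for the stuck claim
kubectl describe pvc <pvc-name>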

Expected behavior:

Healthy cluster with functioning PVCs.

How to reproduce it (minimal and precise):

# initialise Rook on fresh k8s cluster as per documentation:
kubectl create -f rook/cluster/examples/kubernetes/ceph/common.yaml
kubectl create -f rook/cluster/examples/kubernetes/ceph/operator.yaml
kubectl create -f rook/cluster/examples/kubernetes/ceph/cluster-test.yaml
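# optional extra check: confirm the operator, mon, mgr and OSD pods reach Running before continuing
kubectl -n rook-ceph get pods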

# deploy toolbox:
kubectl apply -f rook/cluster/examples/kubernetes/ceph/toolbox.yaml
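# optional: wait for the toolbox pod to become Ready before exec'ing into it
kubectl -n rook-ceph wait --for=condition=Ready pod -l app=rook-ceph-tools --timeout=120s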

# log in to the toolbox and query cluster health:
kubectl -n rook-ceph exec -it $(kubectl -n rook-ceph get pod -l "app=rook-ceph-tools" -o jsonpath='{.items[0].metadata.name}') bash

# note: when I checked right after creating the cluster, 1 inactive PG was reported;
# after leaving the cluster for a few hours, there are 33 of them
> ceph -s
  cluster:
    id:     bd9c4d9d-7fcc-4771-82e5-aca2dd144575
    health: HEALTH_WARN
            Reduced data availability: 33 pgs inactive

  services:
    mon: 1 daemons, quorum a (age 5h)
    mgr: a(active, since 5h)
    osd: 1 osds: 1 up (since 5h), 1 in (since 5h)

  data:
    pools:   2 pools, 33 pgs
    objects: 0 objects, 0 B
    usage:   0 B used, 0 B / 0 B avail
    pgs:     100.000% pgs unknown
             33 unknown
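
Additional toolbox commands that may help narrow down why the PGs stay unknown (standard Ceph CLI, listed for reference; output omitted):

# show each pool's PG count, replicated size and CRUSH rule
ceph osd pool ls detail
# show the OSD/host topology CRUSH can place replicas on
ceph osd tree
# list the stuck PGs explicitly
ceph pg dump_stuck inactive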

Environment:

  • OS (e.g. from /etc/os-release):
NAME="CentOS Linux"
VERSION="8 (Core)"
ID="centos"
ID_LIKE="rhel fedora"
VERSION_ID="8"
PLATFORM_ID="platform:el8"
PRETTY_NAME="CentOS Linux 8 (Core)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:centos:centos:8"
HOME_URL="https://www.centos.org/"
BUG_REPORT_URL="https://bugs.centos.org/"

CENTOS_MANTISBT_PROJECT="CentOS-8"
CENTOS_MANTISBT_PROJECT_VERSION="8"
REDHAT_SUPPORT_PRODUCT="centos"
REDHAT_SUPPORT_PRODUCT_VERSION="8"
  • Kernel (e.g. uname -a): Linux kmaster 4.18.0-193.19.1.el8_2.x86_64 #1 SMP Mon Sep 14 14:37:00 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
  • Cloud provider or hardware configuration: Bare-metal single-node cluster based on Intel NUC 10 (please let me know if further details are required). Single hard drive, raw partition created for Ceph.
  • Rook version (use rook version inside of a Rook Pod):
rook: v1.4.5
go: go1.13.8
  • Storage backend version (e.g. for ceph do ceph -v): ceph version 15.2.4 (7447c15c6ff58d7fce91843b705a268a1917325c) octopus (stable)
  • Kubernetes version (use kubectl version):
Client Version: version.Info{Major:"1", Minor:"19", GitVersion:"v1.19.3", GitCommit:"1e11e4a2108024935ecfcb2912226cedeafd99df", GitTreeState:"clean", BuildDate:"2020-10-14T12:50:19Z", GoVersion:"go1.15.2", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"19", GitVersion:"v1.19.3", GitCommit:"1e11e4a2108024935ecfcb2912226cedeafd99df", GitTreeState:"clean", BuildDate:"2020-10-14T12:41:49Z", GoVersion:"go1.15.2", Compiler:"gc", Platform:"linux/amd64"}
  • Kubernetes cluster type (e.g. Tectonic, GKE, OpenShift): Bare metal
  • Storage backend status (e.g. for Ceph use ceph health in the Rook Ceph toolbox): see output above

About this issue

  • Original URL
  • State: closed
  • Created 4 years ago
  • Reactions: 2
  • Comments: 15 (5 by maintainers)

Most upvoted comments

@Juriy Have you sorted this out? I’m a bit confused by the outputs:

  • first there are 2 pools, with 33 PGs total
  • then there is "1 pgs inactive"
  • then a single pool with 1 PG

I see you have a rule that looks correct to me, but I’m not sure what’s going on.
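
For anyone trying to reconcile those numbers, a couple of generic checks (standard Ceph CLI; nothing here is output from this thread, and the pool name is a placeholder) can show which pool the inactive PG(s) belong to and which CRUSH rule applies:

# dump the CRUSH rule(s) to double-check the failure domain actually in use
ceph osd crush rule dump
# list PGs for a given pool to see where the inactive PG(s) live (pool name is a placeholder)
ceph pg ls-by-pool <pool-name>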