rook: PGs are stuck in "inactive" state after cluster initialisation

Is this a bug report or feature request?

  • Bug Report

Deviation from expected behavior: After initialising a test single-node bare-metal cluster, ceph health detail reports an unhealthy cluster state:

HEALTH_WARN Reduced data availability: 1 pgs inactive
[WRN] PG_AVAILABILITY: Reduced data availability: 1 pgs inactive
    pg 1.0 is stuck inactive for 1h, current state unknown, last acting []

Further attempts to create a StorageClass and a PVC result in the PVC being stuck in Pending.
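
For reference, the stuck claim is visible directly from kubectl (the claim name below is a placeholder, not the one from my manifests):

# list claims; the PVC stays in STATUS "Pending" and never binds a volume
kubectl get pvc --all-namespaces
# inspect provisioning events for the stuck claim
kubectl describe pvc <pvc-name>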

Expected behavior:

Healthy cluster with functioning PVCs.

How to reproduce it (minimal and precise):

# initialise Rook on fresh k8s cluster as per documentation:
kubectl create -f rook/cluster/examples/kubernetes/ceph/common.yaml
kubectl create -f rook/cluster/examples/kubernetes/ceph/operator.yaml
kubectl create -f rook/cluster/examples/kubernetes/ceph/cluster-test.yaml
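# optional extra check: confirm the operator, mon, mgr and OSD pods reach Running before continuing
kubectl -n rook-ceph get pods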

# deploy toolbox:
kubectl apply -f rook/cluster/examples/kubernetes/ceph/toolbox.yaml
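# optional: wait for the toolbox pod to become Ready before exec'ing into it
kubectl -n rook-ceph wait --for=condition=Ready pod -l app=rook-ceph-tools --timeout=120s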

# log in to the toolbox and query cluster health:
kubectl -n rook-ceph exec -it $(kubectl -n rook-ceph get pod -l "app=rook-ceph-tools" -o jsonpath='{.items[0].metadata.name}') bash

# note: when I checked right after creating the cluster, 1 inactive PG was reported;
# after leaving the cluster for a few hours, there are 33 of them
> ceph -s
  cluster:
    id:     bd9c4d9d-7fcc-4771-82e5-aca2dd144575
    health: HEALTH_WARN
            Reduced data availability: 33 pgs inactive

  services:
    mon: 1 daemons, quorum a (age 5h)
    mgr: a(active, since 5h)
    osd: 1 osds: 1 up (since 5h), 1 in (since 5h)

  data:
    pools:   2 pools, 33 pgs
    objects: 0 objects, 0 B
    usage:   0 B used, 0 B / 0 B avail
    pgs:     100.000% pgs unknown
             33 unknown
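
Additional toolbox commands that may help narrow down why the PGs stay unknown (standard Ceph CLI, listed for reference; output omitted):

# show each pool's PG count, replicated size and CRUSH rule
ceph osd pool ls detail
# show the OSD/host topology CRUSH can place replicas on
ceph osd tree
# list the stuck PGs explicitly
ceph pg dump_stuck inactive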

Environment:

  • OS (e.g. from /etc/os-release):
NAME="CentOS Linux"
VERSION="8 (Core)"
ID="centos"
ID_LIKE="rhel fedora"
VERSION_ID="8"
PLATFORM_ID="platform:el8"
PRETTY_NAME="CentOS Linux 8 (Core)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:centos:centos:8"
HOME_URL="https://www.centos.org/"
BUG_REPORT_URL="https://bugs.centos.org/"

CENTOS_MANTISBT_PROJECT="CentOS-8"
CENTOS_MANTISBT_PROJECT_VERSION="8"
REDHAT_SUPPORT_PRODUCT="centos"
REDHAT_SUPPORT_PRODUCT_VERSION="8"
  • Kernel (e.g. uname -a): Linux kmaster 4.18.0-193.19.1.el8_2.x86_64 #1 SMP Mon Sep 14 14:37:00 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
  • Cloud provider or hardware configuration: Bare-metal single-node cluster based on Intel NUC 10 (please let me know if further details are required). Single hard drive, raw partition created for Ceph.
  • Rook version (use rook version inside of a Rook Pod):
rook: v1.4.5
go: go1.13.8
  • Storage backend version (e.g. for ceph do ceph -v): ceph version 15.2.4 (7447c15c6ff58d7fce91843b705a268a1917325c) octopus (stable)
  • Kubernetes version (use kubectl version):
Client Version: version.Info{Major:"1", Minor:"19", GitVersion:"v1.19.3", GitCommit:"1e11e4a2108024935ecfcb2912226cedeafd99df", GitTreeState:"clean", BuildDate:"2020-10-14T12:50:19Z", GoVersion:"go1.15.2", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"19", GitVersion:"v1.19.3", GitCommit:"1e11e4a2108024935ecfcb2912226cedeafd99df", GitTreeState:"clean", BuildDate:"2020-10-14T12:41:49Z", GoVersion:"go1.15.2", Compiler:"gc", Platform:"linux/amd64"}
  • Kubernetes cluster type (e.g. Tectonic, GKE, OpenShift): Bare metal
  • Storage backend status (e.g. for Ceph use ceph health in the Rook Ceph toolbox): see output above

About this issue

  • Original URL
  • State: closed
  • Created 4 years ago
  • Reactions: 2
  • Comments: 15 (5 by maintainers)

Most upvoted comments

@Juriy Have you sorted this out? I’m a bit confused by the outputs:

  • first there are 2 pools, with 33 PGs total
  • then there is "1 pgs inactive"
  • then a single pool with 1 PG

I see you have a rule that looks correct to me, but I’m not sure what’s going on.
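
For anyone trying to reconcile those numbers, a couple of generic checks (standard Ceph CLI; nothing here is output from this thread, and the pool name is a placeholder) can show which pool the inactive PG(s) belong to and which CRUSH rule applies:

# dump the CRUSH rule(s) to double-check the failure domain actually in use
ceph osd crush rule dump
# list PGs for a given pool to see where the inactive PG(s) live (pool name is a placeholder)
ceph pg ls-by-pool <pool-name>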