rook: NetworkFences did not show all nodes with taints applied

Is this a bug report or feature request?

  • Bug Report

Deviation from expected behavior:

I had two nodes down (k3s02 and k3s05), both with taints applied. However, only one of the two shows up:

$ kubectl get networkfences.csiaddons.openshift.io
NAME    DRIVER                       CIDRS                   FENCESTATE   AGE    RESULT
k3s05   rook-ceph.rbd.csi.ceph.com   ["192.168.10.215/32"]   Fenced       2d9h   Succeeded
  • If CIDRS is supposed to contain the network address of the fenced node, it's showing the wrong one: k3s05 is 192.168.10.219/32 and k3s02 is 192.168.10.216/32 (see the comparison below).
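
For reference, one way to compare a node's InternalIP against the fence's CIDRs directly (standard kubectl jsonpath queries; the spec.cidrs path assumes the csi-addons v1alpha1 NetworkFence API):

$ kubectl get node k3s05 -o jsonpath='{.status.addresses[?(@.type=="InternalIP")].address}'   # node's InternalIP
$ kubectl get networkfences.csiaddons.openshift.io k3s05 -o jsonpath='{.spec.cidrs}'          # CIDRs on the fence CR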

Expected behavior:

I was expecting both nodes to be listed.
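
For context, the missing fence for k3s02 would presumably look roughly like the CR below (a sketch against the csi-addons v1alpha1 NetworkFence API; the secret name and clusterID are illustrative assumptions, not values taken from this cluster):

apiVersion: csiaddons.openshift.io/v1alpha1
kind: NetworkFence
metadata:
  name: k3s02
spec:
  driver: rook-ceph.rbd.csi.ceph.com
  fenceState: Fenced
  cidrs:
    - "192.168.10.216/32"            # k3s02's InternalIP
  secret:
    name: rook-csi-rbd-provisioner   # assumed secret name
    namespace: rook-ceph
  parameters:
    clusterID: rook-ceph             # assumed clusterID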


This shows OSDs on k3s02 and k3s05 are down:

# ceph osd tree
ID   CLASS  WEIGHT   TYPE NAME       STATUS  REWEIGHT  PRI-AFF
 -1         6.89755  root default                             
 -5         0.68359      host k3s01                           
  0   nvme  0.68359          osd.0       up   1.00000  1.00000
 -3         0.97659      host k3s02                           
  1   nvme  0.97659          osd.1     down         0  1.00000
 -7         0.68359      host k3s03                           
  2   nvme  0.68359          osd.2       up   1.00000  1.00000
 -9         1.36719      host k3s04                           
  3   nvme  0.68359          osd.3       up   1.00000  1.00000
  7    ssd  0.68359          osd.7       up   1.00000  1.00000
-11         1.36719      host k3s05                           
  4   nvme  0.68359          osd.4     down         0  1.00000
  8    ssd  0.68359          osd.8     down         0  1.00000
-19         1.81940      host k3s06                           
  5    ssd  0.90970          osd.5       up   1.00000  1.00000
  6    ssd  0.90970          osd.6       up   1.00000  1.00000

Upon bringing k3s02 back online, the OSD.1 pod correctly failed to schedule due to the taints:

OSD.1 FailedScheduling
0/6 nodes are available: 1 node(s) had untolerated taint {node.kubernetes.io/out-of-service: nodeshutdown}, 1 node(s) were unschedulable, 4 node(s) didn't match Pod's node affinity/selector. preemption: 0/6 nodes are available: 6 Preemption is not helpful for scheduling..
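
For completeness: node.kubernetes.io/out-of-service is the standard Kubernetes non-graceful node shutdown taint. The taints had been applied while the node was down with commands equivalent to these (the mirror image of the removal shown further below):

$ kubectl taint nodes k3s02 node.kubernetes.io/out-of-service=nodeshutdown:NoExecute
$ kubectl taint nodes k3s02 node.kubernetes.io/out-of-service=nodeshutdown:NoSchedule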

Confirmed taints are in place:

$ k describe node k3s02
...
Taints:             node.kubernetes.io/out-of-service=nodeshutdown:NoExecute
                    node.kubernetes.io/out-of-service=nodeshutdown:NoSchedule
Unschedulable:      false
Lease:
  HolderIdentity:  k3s02
  AcquireTime:     <unset>
  RenewTime:       Mon, 18 Sep 2023 21:13:59 -0400
Conditions:
  Type             Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message
  ----             ------  -----------------                 ------------------                ------                       -------
  MemoryPressure   False   Mon, 18 Sep 2023 21:13:40 -0400   Mon, 18 Sep 2023 21:03:27 -0400   KubeletHasSufficientMemory   kubelet has sufficient memory available
  DiskPressure     False   Mon, 18 Sep 2023 21:13:40 -0400   Mon, 18 Sep 2023 21:03:27 -0400   KubeletHasNoDiskPressure     kubelet has no disk pressure
  PIDPressure      False   Mon, 18 Sep 2023 21:13:40 -0400   Mon, 18 Sep 2023 21:03:27 -0400   KubeletHasSufficientPID      kubelet has sufficient PID available
  Ready            True    Mon, 18 Sep 2023 21:13:40 -0400   Mon, 18 Sep 2023 21:03:27 -0400   KubeletReady                 kubelet is posting ready status. AppArmor enabled
Addresses:
  InternalIP:  192.168.10.216
  Hostname:    k3s02
...

I removed the taints:

$ kubectl taint nodes k3s02 node.kubernetes.io/out-of-service=nodeshutdown:NoExecute-
node/k3s02 untainted
$ kubectl taint nodes k3s02 node.kubernetes.io/out-of-service=nodeshutdown:NoSchedule-
node/k3s02 untainted

This made no change to the NetworkFence list:

$ kubectl get networkfences.csiaddons.openshift.io
NAME    DRIVER                       CIDRS                   FENCESTATE   AGE     RESULT
k3s05   rook-ceph.rbd.csi.ceph.com   ["192.168.10.215/32"]   Fenced       2d10h   Succeeded

OSD.1 could now be scheduled, but the logs show it was damaged (it was then deleted and rebuilt):

OSD.1 logs:
 -1070> 2023-09-19T01:19:46.206+0000 7f96386c0880 -1 bluestore::NCB::__restore_allocator::No Valid allocation info on disk (empty file)
 -1069> 2023-09-19T01:19:47.858+0000 7f96386c0880 -1 /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.2.6/rpm/el8/BUILD/ceph-17.2.6/src/os/bluestore/BlueStore.cc: In function 'virtual void BlueStore::ExtentDecoderPartial::consume_blobid(BlueStore::Extent*, bool, uint64_t)' thread 7f96386c0880 time 2023-09-19T01:19:47.860533+0000

/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.2.6/rpm/el8/BUILD/ceph-17.2.6/src/os/bluestore/BlueStore.cc: 18940: FAILED ceph_assert(it != map.end())

 ceph version 17.2.6 (d7ff0d10654d2280e08f1ab989c7cdf3064446a5) quincy (stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x135) [0x7f9636d07489]
 2: /usr/lib64/ceph/libceph-common.so.2(+0x26a64f) [0x7f9636d0764f]
 3: (BlueStore::ExtentDecoderPartial::consume_blobid(BlueStore::Extent*, bool, unsigned long)+0x242) [0x55bd8a303e42]
 4: (BlueStore::ExtentMap::ExtentDecoder::decode_extent(BlueStore::Extent*, unsigned char, ceph::buffer::v15_2_0::ptr::iterator_impl<true>&, BlueStore::Collection*)+0xe1) [0x55bd8a2e8ea1]
 5: (BlueStore::ExtentMap::ExtentDecoder::decode_some(ceph::buffer::v15_2_0::list const&, BlueStore::Collection*)+0x120) [0x55bd8a2e92f0]
 6: (BlueStore::read_allocation_from_onodes(SimpleBitmap*, BlueStore::read_alloc_stats_t&)+0x1af0) [0x55bd8a300d90]
 7: (BlueStore::reconstruct_allocations(SimpleBitmap*, BlueStore::read_alloc_stats_t&)+0x5a) [0x55bd8a3019da]
 8: (BlueStore::read_allocation_from_drive_on_startup()+0x105) [0x55bd8a332485]
 9: (BlueStore::_init_alloc(std::map<unsigned long, unsigned long, std::less<unsigned long>, std::allocator<std::pair<unsigned long const, unsigned long> > >*)+0xb44) [0x55bd8a333254]
 10: (BlueStore::_open_db_and_around(bool, bool)+0x4d7) [0x55bd8a360bb7]
 11: (BlueStore::expand_devices(std::ostream&)+0x3a) [0x55bd8a363fba]
 12: main()
 13: __libc_start_main()
 14: _start()
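
Side note: before deleting and rebuilding an OSD that asserts like this, a consistency check can be attempted with ceph-bluestore-tool while the OSD is stopped; the data path here is the conventional one and may differ inside a Rook OSD pod:

$ ceph-bluestore-tool fsck --path /var/lib/ceph/osd/ceph-1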

Cluster status:

  • Output of krew commands:
$ kubectl rook-ceph health
Info: Checking if at least three mon pods are running on different nodes
rook-ceph-mon-o-5fff6fc4cc-jssgc        Running rook-ceph       k3s03
rook-ceph-mon-s-5d5944869b-d54t8        Running rook-ceph       k3s06
rook-ceph-mon-v-74b98cc67c-tznxr        Running rook-ceph       k3s04

Info: Checking mon quorum and ceph health details
Info: HEALTH_OK

Info: Checking if at least three osd pods are running on different nodes
rook-ceph-osd-0-bf7c94d97-ddsfb Running rook-ceph       k3s01
rook-ceph-osd-1-6f9dfdd9b7-jbj7r        Running rook-ceph       k3s02
rook-ceph-osd-2-5cf49dd98d-28942        Running rook-ceph       k3s03
rook-ceph-osd-3-77945cc7ff-jq27t        Running rook-ceph       k3s04
rook-ceph-osd-4-75cc6f786f-4snzx        Pending rook-ceph
rook-ceph-osd-5-7cd9c9877f-dmtcn        Running rook-ceph       k3s06
rook-ceph-osd-6-665469c65-d72rs Running rook-ceph       k3s06
rook-ceph-osd-7-d4d6cc496-ttmmg Running rook-ceph       k3s04
rook-ceph-osd-8-86b65b584-tswxq Pending rook-ceph

Info: Pods that are in 'Running' or `Succeeded` status
csi-cephfsplugin-6b7z4   Running         rook-ceph       k3s01
csi-cephfsplugin-b2ngd   Running         rook-ceph       k3s02
csi-cephfsplugin-l7mpk   Running         rook-ceph       k3s03
csi-cephfsplugin-ncz6t   Running         rook-ceph       k3s06
csi-cephfsplugin-provisioner-569b6d6cbd-jdmmh    Running         rook-ceph       k3s01
csi-cephfsplugin-provisioner-569b6d6cbd-sgzzj    Running         rook-ceph       k3s04
csi-cephfsplugin-qmwpg   Running         rook-ceph       k3s04
csi-rbdplugin-29xmg      Running         rook-ceph       k3s03
csi-rbdplugin-4pj5x      Running         rook-ceph       k3s06
csi-rbdplugin-ksgtx      Running         rook-ceph       k3s02
csi-rbdplugin-m2hwg      Running         rook-ceph       k3s01
csi-rbdplugin-provisioner-666699b494-6t25x       Running         rook-ceph       k3s04
csi-rbdplugin-provisioner-666699b494-qxmr4       Running         rook-ceph       k3s01
csi-rbdplugin-qggw8      Running         rook-ceph       k3s04
rook-ceph-crashcollector-k3s01-5bc67549fc-nbcsz          Running         rook-ceph       k3s01
rook-ceph-crashcollector-k3s02-9d694c5dd-dpnr8   Running         rook-ceph       k3s02
rook-ceph-crashcollector-k3s03-858794b685-bjr9c          Running         rook-ceph       k3s03
rook-ceph-crashcollector-k3s04-5db575bfc4-kvg9j          Running         rook-ceph       k3s04
rook-ceph-crashcollector-k3s06-647997b9f6-dlcqm          Running         rook-ceph       k3s06
rook-ceph-crashcollector-pruner-28251600-b6svz   Succeeded       rook-ceph       k3s01
rook-ceph-mds-ceph-filesystem-a-5cfd7fb857-gtst9         Running         rook-ceph       k3s04
rook-ceph-mds-ceph-filesystem-b-77499c7cd4-8hd6n         Running         rook-ceph       k3s04
rook-ceph-mgr-a-78679477d-fwcfd          Running         rook-ceph       k3s04
rook-ceph-mgr-b-7d4776bd8c-wv4s5         Running         rook-ceph       k3s06
rook-ceph-mon-o-5fff6fc4cc-jssgc         Running         rook-ceph       k3s03
rook-ceph-mon-s-5d5944869b-d54t8         Running         rook-ceph       k3s06
rook-ceph-mon-v-74b98cc67c-tznxr         Running         rook-ceph       k3s04
rook-ceph-operator-f89f679d9-jjgjs       Running         rook-ceph       k3s02
rook-ceph-osd-0-bf7c94d97-ddsfb          Running         rook-ceph       k3s01
rook-ceph-osd-1-6f9dfdd9b7-jbj7r         Running         rook-ceph       k3s02
rook-ceph-osd-2-5cf49dd98d-28942         Running         rook-ceph       k3s03
rook-ceph-osd-3-77945cc7ff-jq27t         Running         rook-ceph       k3s04
rook-ceph-osd-5-7cd9c9877f-dmtcn         Running         rook-ceph       k3s06
rook-ceph-osd-6-665469c65-d72rs          Running         rook-ceph       k3s06
rook-ceph-osd-7-d4d6cc496-ttmmg          Running         rook-ceph       k3s04
rook-ceph-osd-prepare-k3s01-v5rxw        Succeeded       rook-ceph       k3s01
rook-ceph-osd-prepare-k3s02-8gw8g        Succeeded       rook-ceph       k3s02
rook-ceph-osd-prepare-k3s03-hgs95        Succeeded       rook-ceph       k3s03
rook-ceph-osd-prepare-k3s04-s6rzh        Succeeded       rook-ceph       k3s04
rook-ceph-osd-prepare-k3s06-6lnxp        Succeeded       rook-ceph       k3s06
rook-ceph-rgw-ceph-objectstore-a-7c49497cfd-j9ptl        Running         rook-ceph       k3s01
rook-ceph-rgw-ceph-objectstore-a-7c49497cfd-x7654        Running         rook-ceph       k3s06
rook-ceph-tools-d447c8f95-h7gwq          Running         rook-ceph       k3s06
rook-discover-6xr5g      Running         rook-ceph       k3s04
rook-discover-d7s6q      Running         rook-ceph       k3s01
rook-discover-jcf47      Running         rook-ceph       k3s03
rook-discover-qdmzz      Running         rook-ceph       k3s06
rook-discover-xx99c      Running         rook-ceph       k3s02

Warning: Pods that are 'Not' in 'Running' status
rook-ceph-osd-4-75cc6f786f-4snzx         Pending         rook-ceph       
rook-ceph-osd-8-86b65b584-tswxq          Pending         rook-ceph       

Info: Checking placement group status
Info:   PgState: active+clean, PgCount: 193

Info: Checking if at least one mgr pod is running
rook-ceph-mgr-a-78679477d-fwcfd Running rook-ceph       k3s04
rook-ceph-mgr-b-7d4776bd8c-wv4s5        Running rook-ceph       k3s06
$ kubectl rook-ceph ceph status
Info: running 'ceph' command with args: [status]
  cluster:
    id:     cb82340a-2eaf-4597-b83e-cc0e62a9d019
    health: HEALTH_OK
 
  services:
    mon: 3 daemons, quorum o,s,v (age 8d)
    mgr: b(active, since 8d), standbys: a
    mds: 1/1 daemons up, 1 hot standby
    osd: 9 osds: 7 up (since 11h), 7 in (since 11h)
    rgw: 2 daemons active (2 hosts, 1 zones)
 
  data:
    volumes: 1/1 healthy
    pools:   12 pools, 193 pgs
    objects: 117.63k objects, 305 GiB
    usage:   958 GiB used, 4.6 TiB / 5.5 TiB avail
    pgs:     193 active+clean
 
  io:
    client:   852 B/s rd, 207 KiB/s wr, 1 op/s rd, 7 op/s wr

Environment:

  • OS (e.g. from /etc/os-release): Ubuntu 22.04.3 LTS
  • Kernel (e.g. uname -a): Linux k3s02 5.15.0-79-generic #86-Ubuntu SMP Mon Jul 10 16:07:21 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
  • Cloud provider or hardware configuration: Bare Metal
  • Rook version (use rook version inside of a Rook Pod): rook: v1.12.3 / go: go1.21.0
  • Storage backend version (e.g. for ceph do ceph -v): ceph version 17.2.6 (d7ff0d10654d2280e08f1ab989c7cdf3064446a5) quincy (stable)
  • Kubernetes version (use kubectl version): Server Version: v1.27.5+k3s1
  • Kubernetes cluster type (e.g. Tectonic, GKE, OpenShift): K3S
  • Storage backend status (e.g. for Ceph use ceph health in the Rook Ceph toolbox): HEALTH_OK

About this issue

  • State: closed
  • Created 9 months ago
  • Comments: 20 (9 by maintainers)

Most upvoted comments

@reefland then your previous output was expected: since there is no volumeInUse with Ceph RBD, no NetworkFence CR will be created.

Thank you for explaining that.

So now it’s unclear to me why that volume could not be freed and remounted while the fence was applied.
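
For anyone retracing this, the in-use state that gates NetworkFence creation can be inspected with standard Kubernetes fields (nothing Rook-specific assumed here):

$ kubectl get node k3s02 -o jsonpath='{.status.volumesInUse}'   # volumes the kubelet still reports as in use
$ kubectl get volumeattachments | grep k3s02                    # CSI attachments recorded for the node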