rook: NetworkFences did not show all nodes with taints applied
Is this a bug report or feature request?
- Bug Report
Deviation from expected behavior:
I had 2 nodes down (k3s02 and k3s05), both with taints applied. However, the NetworkFence list only shows one of the two:
$ kubectl get networkfences.csiaddons.openshift.io
NAME DRIVER CIDRS FENCESTATE AGE RESULT
k3s05 rook-ceph.rbd.csi.ceph.com ["192.168.10.215/32"] Fenced 2d9h Succeeded
- If `CIDRS` is supposed to contain the network address of the fenced node, it's showing the wrong one: `k3s05` is `192.168.10.219/32` and `k3s02` is `192.168.10.216/32`.
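For reference, the recorded CIDR can be pulled straight from the CR spec (to the best of my knowledge, `cidrs` is the field name in the csi-addons NetworkFence CRD; it may differ slightly by version):
$ kubectl get networkfences.csiaddons.openshift.io k3s05 -o jsonpath='{.spec.cidrs}'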
Expected behavior:
I was expecting both nodes to be listed.
This shows OSDs on k3s02 and k3s05 are down:
# ceph osd tree
ID CLASS WEIGHT TYPE NAME STATUS REWEIGHT PRI-AFF
-1 6.89755 root default
-5 0.68359 host k3s01
0 nvme 0.68359 osd.0 up 1.00000 1.00000
-3 0.97659 host k3s02
1 nvme 0.97659 osd.1 down 0 1.00000
-7 0.68359 host k3s03
2 nvme 0.68359 osd.2 up 1.00000 1.00000
-9 1.36719 host k3s04
3 nvme 0.68359 osd.3 up 1.00000 1.00000
7 ssd 0.68359 osd.7 up 1.00000 1.00000
-11 1.36719 host k3s05
4 nvme 0.68359 osd.4 down 0 1.00000
8 ssd 0.68359 osd.8 down 0 1.00000
-19 1.81940 host k3s06
5 ssd 0.90970 osd.5 up 1.00000 1.00000
6 ssd 0.90970 osd.6 up 1.00000 1.00000
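(For a quicker cross-check from the toolbox, the down OSDs can also be listed directly; these are standard Ceph commands, shown only for convenience:)
# ceph osd tree down
# ceph health detail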
Upon bringing k3s02 back online, OSD.1 correctly failed to schedule due to taints:
OSD.1 FailedScheduling
0/6 nodes are available: 1 node(s) had untolerated taint {node.kubernetes.io/out-of-service: nodeshutdown}, 1 node(s) were unschedulable, 4 node(s) didn't match Pod's node affinity/selector. preemption: 0/6 nodes are available: 6 Preemption is not helpful for scheduling..
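(For reference, I believe scheduling events like the one above can be listed directly with a field selector, e.g.:)
$ kubectl get events -n rook-ceph --field-selector reason=FailedScheduling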
Confirmed taints are in place:
$ k describe node k3s02
...
Taints: node.kubernetes.io/out-of-service=nodeshutdown:NoExecute
node.kubernetes.io/out-of-service=nodeshutdown:NoSchedule
Unschedulable: false
Lease:
HolderIdentity: k3s02
AcquireTime: <unset>
RenewTime: Mon, 18 Sep 2023 21:13:59 -0400
Conditions:
Type Status LastHeartbeatTime LastTransitionTime Reason Message
---- ------ ----------------- ------------------ ------ -------
MemoryPressure False Mon, 18 Sep 2023 21:13:40 -0400 Mon, 18 Sep 2023 21:03:27 -0400 KubeletHasSufficientMemory kubelet has sufficient memory available
DiskPressure False Mon, 18 Sep 2023 21:13:40 -0400 Mon, 18 Sep 2023 21:03:27 -0400 KubeletHasNoDiskPressure kubelet has no disk pressure
PIDPressure False Mon, 18 Sep 2023 21:13:40 -0400 Mon, 18 Sep 2023 21:03:27 -0400 KubeletHasSufficientPID kubelet has sufficient PID available
Ready True Mon, 18 Sep 2023 21:13:40 -0400 Mon, 18 Sep 2023 21:03:27 -0400 KubeletReady kubelet is posting ready status. AppArmor enabled
Addresses:
InternalIP: 192.168.10.216
Hostname: k3s02
...
I removed the taints:
$ kubectl taint nodes k3s02 node.kubernetes.io/out-of-service=nodeshutdown:NoExecute-
node/k3s02 untainted
$ kubectl taint nodes k3s02 node.kubernetes.io/out-of-service=nodeshutdown:NoSchedule-
node/k3s02 untainted
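(To double-check the taints were really gone before re-checking the fence, printing the field directly works; jsonpath is just one way to do it:)
$ kubectl get node k3s02 -o jsonpath='{.spec.taints}'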
No change to this:
$ kubectl get networkfences.csiaddons.openshift.io
NAME DRIVER CIDRS FENCESTATE AGE RESULT
k3s05 rook-ceph.rbd.csi.ceph.com ["192.168.10.215/32"] Fenced 2d10h Succeeded
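(For completeness: my understanding is that a fence is normally lifted by flipping `spec.fenceState` on the CR to `Unfenced`, e.g. with a patch like the one below. I did not run this here; it is only a sketch of the usual unfence step:)
$ kubectl patch networkfences.csiaddons.openshift.io k3s05 --type merge -p '{"spec":{"fenceState":"Unfenced"}}'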
OSD.1 could now be scheduled, but the logs show it was damaged (it was then deleted and rebuilt):
OSD.1 logs:
-1070> 2023-09-19T01:19:46.206+0000 7f96386c0880 -1 bluestore::NCB::__restore_allocator::No Valid allocation info on disk (empty file)
-1069> 2023-09-19T01:19:47.858+0000 7f96386c0880 -1 /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.2.6/rpm/el8/BUILD/ceph-17.2.6/src/os/bluestore/BlueStore.cc: In function 'virtual void BlueStore::ExtentDecoderPartial::consume_blobid(BlueStore::Extent*, bool, uint64_t)' thread 7f96386c0880 time 2023-09-19T01:19:47.860533+0000
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.2.6/rpm/el8/BUILD/ceph-17.2.6/src/os/bluestore/BlueStore.cc: 18940: FAILED ceph_assert(it != map.end())
ceph version 17.2.6 (d7ff0d10654d2280e08f1ab989c7cdf3064446a5) quincy (stable)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x135) [0x7f9636d07489]
2: /usr/lib64/ceph/libceph-common.so.2(+0x26a64f) [0x7f9636d0764f]
3: (BlueStore::ExtentDecoderPartial::consume_blobid(BlueStore::Extent*, bool, unsigned long)+0x242) [0x55bd8a303e42]
4: (BlueStore::ExtentMap::ExtentDecoder::decode_extent(BlueStore::Extent*, unsigned char, ceph::buffer::v15_2_0::ptr::iterator_impl<true>&, BlueStore::Collection*)+0xe1) [0x55bd8a2e8ea1]
5: (BlueStore::ExtentMap::ExtentDecoder::decode_some(ceph::buffer::v15_2_0::list const&, BlueStore::Collection*)+0x120) [0x55bd8a2e92f0]
6: (BlueStore::read_allocation_from_onodes(SimpleBitmap*, BlueStore::read_alloc_stats_t&)+0x1af0) [0x55bd8a300d90]
7: (BlueStore::reconstruct_allocations(SimpleBitmap*, BlueStore::read_alloc_stats_t&)+0x5a) [0x55bd8a3019da]
8: (BlueStore::read_allocation_from_drive_on_startup()+0x105) [0x55bd8a332485]
9: (BlueStore::_init_alloc(std::map<unsigned long, unsigned long, std::less<unsigned long>, std::allocator<std::pair<unsigned long const, unsigned long> > >*)+0xb44) [0x55bd8a333254]
10: (BlueStore::_open_db_and_around(bool, bool)+0x4d7) [0x55bd8a360bb7]
11: (BlueStore::expand_devices(std::ostream&)+0x3a) [0x55bd8a363fba]
12: main()
13: __libc_start_main()
14: _start()
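(For anyone hitting the same assert: the delete-and-rebuild was the usual OSD replacement flow. Roughly, via the krew plugin, assuming the purge-osd subcommand is available in your plugin version, with the disk then zapped so the operator could recreate the OSD:)
$ kubectl rook-ceph rook purge-osd 1 --force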
Cluster Status to submit:
- Output of krew commands, if necessary
$ kubectl rook-ceph health
Info: Checking if at least three mon pods are running on different nodes
rook-ceph-mon-o-5fff6fc4cc-jssgc Running rook-ceph k3s03
rook-ceph-mon-s-5d5944869b-d54t8 Running rook-ceph k3s06
rook-ceph-mon-v-74b98cc67c-tznxr Running rook-ceph k3s04
Info: Checking mon quorum and ceph health details
Info: HEALTH_OK
Info: Checking if at least three osd pods are running on different nodes
rook-ceph-osd-0-bf7c94d97-ddsfb Running rook-ceph k3s01
rook-ceph-osd-1-6f9dfdd9b7-jbj7r Running rook-ceph k3s02
rook-ceph-osd-2-5cf49dd98d-28942 Running rook-ceph k3s03
rook-ceph-osd-3-77945cc7ff-jq27t Running rook-ceph k3s04
rook-ceph-osd-4-75cc6f786f-4snzx Pending rook-ceph
rook-ceph-osd-5-7cd9c9877f-dmtcn Running rook-ceph k3s06
rook-ceph-osd-6-665469c65-d72rs Running rook-ceph k3s06
rook-ceph-osd-7-d4d6cc496-ttmmg Running rook-ceph k3s04
rook-ceph-osd-8-86b65b584-tswxq Pending rook-ceph
Info: Pods that are in 'Running' or `Succeeded` status
csi-cephfsplugin-6b7z4 Running rook-ceph k3s01
csi-cephfsplugin-b2ngd Running rook-ceph k3s02
csi-cephfsplugin-l7mpk Running rook-ceph k3s03
csi-cephfsplugin-ncz6t Running rook-ceph k3s06
csi-cephfsplugin-provisioner-569b6d6cbd-jdmmh Running rook-ceph k3s01
csi-cephfsplugin-provisioner-569b6d6cbd-sgzzj Running rook-ceph k3s04
csi-cephfsplugin-qmwpg Running rook-ceph k3s04
csi-rbdplugin-29xmg Running rook-ceph k3s03
csi-rbdplugin-4pj5x Running rook-ceph k3s06
csi-rbdplugin-ksgtx Running rook-ceph k3s02
csi-rbdplugin-m2hwg Running rook-ceph k3s01
csi-rbdplugin-provisioner-666699b494-6t25x Running rook-ceph k3s04
csi-rbdplugin-provisioner-666699b494-qxmr4 Running rook-ceph k3s01
csi-rbdplugin-qggw8 Running rook-ceph k3s04
rook-ceph-crashcollector-k3s01-5bc67549fc-nbcsz Running rook-ceph k3s01
rook-ceph-crashcollector-k3s02-9d694c5dd-dpnr8 Running rook-ceph k3s02
rook-ceph-crashcollector-k3s03-858794b685-bjr9c Running rook-ceph k3s03
rook-ceph-crashcollector-k3s04-5db575bfc4-kvg9j Running rook-ceph k3s04
rook-ceph-crashcollector-k3s06-647997b9f6-dlcqm Running rook-ceph k3s06
rook-ceph-crashcollector-pruner-28251600-b6svz Succeeded rook-ceph k3s01
rook-ceph-mds-ceph-filesystem-a-5cfd7fb857-gtst9 Running rook-ceph k3s04
rook-ceph-mds-ceph-filesystem-b-77499c7cd4-8hd6n Running rook-ceph k3s04
rook-ceph-mgr-a-78679477d-fwcfd Running rook-ceph k3s04
rook-ceph-mgr-b-7d4776bd8c-wv4s5 Running rook-ceph k3s06
rook-ceph-mon-o-5fff6fc4cc-jssgc Running rook-ceph k3s03
rook-ceph-mon-s-5d5944869b-d54t8 Running rook-ceph k3s06
rook-ceph-mon-v-74b98cc67c-tznxr Running rook-ceph k3s04
rook-ceph-operator-f89f679d9-jjgjs Running rook-ceph k3s02
rook-ceph-osd-0-bf7c94d97-ddsfb Running rook-ceph k3s01
rook-ceph-osd-1-6f9dfdd9b7-jbj7r Running rook-ceph k3s02
rook-ceph-osd-2-5cf49dd98d-28942 Running rook-ceph k3s03
rook-ceph-osd-3-77945cc7ff-jq27t Running rook-ceph k3s04
rook-ceph-osd-5-7cd9c9877f-dmtcn Running rook-ceph k3s06
rook-ceph-osd-6-665469c65-d72rs Running rook-ceph k3s06
rook-ceph-osd-7-d4d6cc496-ttmmg Running rook-ceph k3s04
rook-ceph-osd-prepare-k3s01-v5rxw Succeeded rook-ceph k3s01
rook-ceph-osd-prepare-k3s02-8gw8g Succeeded rook-ceph k3s02
rook-ceph-osd-prepare-k3s03-hgs95 Succeeded rook-ceph k3s03
rook-ceph-osd-prepare-k3s04-s6rzh Succeeded rook-ceph k3s04
rook-ceph-osd-prepare-k3s06-6lnxp Succeeded rook-ceph k3s06
rook-ceph-rgw-ceph-objectstore-a-7c49497cfd-j9ptl Running rook-ceph k3s01
rook-ceph-rgw-ceph-objectstore-a-7c49497cfd-x7654 Running rook-ceph k3s06
rook-ceph-tools-d447c8f95-h7gwq Running rook-ceph k3s06
rook-discover-6xr5g Running rook-ceph k3s04
rook-discover-d7s6q Running rook-ceph k3s01
rook-discover-jcf47 Running rook-ceph k3s03
rook-discover-qdmzz Running rook-ceph k3s06
rook-discover-xx99c Running rook-ceph k3s02
Warning: Pods that are 'Not' in 'Running' status
rook-ceph-osd-4-75cc6f786f-4snzx Pending rook-ceph
rook-ceph-osd-8-86b65b584-tswxq Pending rook-ceph
Info: Checking placement group status
Info: PgState: active+clean, PgCount: 193
Info: Checking if at least one mgr pod is running
rook-ceph-mgr-a-78679477d-fwcfd Running rook-ceph k3s04
rook-ceph-mgr-b-7d4776bd8c-wv4s5 Running rook-ceph k3s06
$ kubectl rook-ceph ceph status
Info: running 'ceph' command with args: [status]
cluster:
id: cb82340a-2eaf-4597-b83e-cc0e62a9d019
health: HEALTH_OK
services:
mon: 3 daemons, quorum o,s,v (age 8d)
mgr: b(active, since 8d), standbys: a
mds: 1/1 daemons up, 1 hot standby
osd: 9 osds: 7 up (since 11h), 7 in (since 11h)
rgw: 2 daemons active (2 hosts, 1 zones)
data:
volumes: 1/1 healthy
pools: 12 pools, 193 pgs
objects: 117.63k objects, 305 GiB
usage: 958 GiB used, 4.6 TiB / 5.5 TiB avail
pgs: 193 active+clean
io:
client: 852 B/s rd, 207 KiB/s wr, 1 op/s rd, 7 op/s wr
Environment:
- OS (e.g. from /etc/os-release): Ubuntu 22.04.3 LTS
- Kernel (e.g. `uname -a`): Linux k3s02 5.15.0-79-generic #86-Ubuntu SMP Mon Jul 10 16:07:21 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
- Cloud provider or hardware configuration: Bare Metal
- Rook version (use `rook version` inside of a Rook Pod): rook: v1.12.3 / go: go1.21.0
- Storage backend version (e.g. for ceph do `ceph -v`): ceph version 17.2.6 (d7ff0d10654d2280e08f1ab989c7cdf3064446a5) quincy (stable)
- Kubernetes version (use `kubectl version`): Server Version: v1.27.5+k3s1
- Kubernetes cluster type (e.g. Tectonic, GKE, OpenShift): K3S
- Storage backend status (e.g. for Ceph use `ceph health` in the Rook Ceph toolbox): HEALTH_OK
About this issue
- State: closed
- Created 9 months ago
- Comments: 20 (9 by maintainers)
@reefland then your previous output was expected: since there is no volumeInUse with Ceph RBD, no NetworkFence CR will be created.
Thank you for explaining that.
So now it’s unclear to me why that volume was unable to be freed and mounted when the fence was applied.
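(If it helps narrow that down, these are the checks I can gather next time it happens; `volumesInUse` is the node field the explanation above refers to, and the pool/image names in the last command are placeholders:)
$ kubectl get node k3s02 -o jsonpath='{.status.volumesInUse}'   # what the fencing logic keys off
$ kubectl get volumeattachments | grep k3s02                    # is the attachment still bound to the dead node?
$ kubectl rook-ceph ceph osd blocklist ls                       # was the node's client IP actually blocklisted?
$ kubectl rook-ceph rbd status <pool>/<image>                   # placeholder names; shows any lingering watchers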