piraeus: kernel crashes on Oracle Linux 8
- Kubernetes v1.27.5
- Bare-metal nodes
- LVM thin pool
- piraeus-operator v2.4.1
- Oracle Linux 8, kernel 5.15.0-204.147.6.2.el8uek.x86_64, with the default DRBD image drbd9-jammy
- Also reproduced with kernel 4.18 and the DRBD image drbd9-almalinux8
How to reproduce: Create, attach, and subsequently delete a number of volumes. I tested with about 8 PVCs and pods and ran roughly 20 cycles of creating and then deleting them. At random points the server reboots because of a kernel crash. Most often this happened during volume deletion, but it was also reproduced while creating a new PVC.
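A simplified sketch of such a cycle is below (the repro-* names, the 1Gi size and the busybox image are only placeholders; the StorageClass is the piraeus-storage-replicated-lvm class defined further down):

#!/usr/bin/env bash
# Simplified sketch of the create/attach/delete cycle (placeholder names and sizes).
set -euo pipefail

for run in $(seq 1 20); do
  # Create a batch of PVCs, each mounted by a pod.
  for i in $(seq 1 8); do
    kubectl apply -f - <<EOF
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: repro-pvc-$i
spec:
  accessModes: ["ReadWriteOnce"]
  storageClassName: piraeus-storage-replicated-lvm
  resources:
    requests:
      storage: 1Gi
---
apiVersion: v1
kind: Pod
metadata:
  name: repro-pod-$i
spec:
  containers:
    - name: app
      image: busybox
      command: ["sleep", "3600"]
      volumeMounts:
        - name: data
          mountPath: /data
  volumes:
    - name: data
      persistentVolumeClaim:
        claimName: repro-pvc-$i
EOF
  done

  # Wait until all volumes are provisioned and attached.
  for i in $(seq 1 8); do
    kubectl wait --for=condition=Ready "pod/repro-pod-$i" --timeout=10m
  done

  # Delete everything again; the crash most often hits around here.
  for i in $(seq 1 8); do
    kubectl delete pod "repro-pod-$i" --wait=true
    kubectl delete pvc "repro-pvc-$i" --wait=true
  done
done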
The UEK kernel Makefile (/usr/src/kernels/5.15.0-204.147.6.2.el8uek.x86_64/Makefile) was patched so that DRBD can be built against it:
--- Makefile 2024-01-15 12:24:44.452296691 +0000
+++ Makefile 2024-01-15 12:25:36.325543428 +0000
@@ -853,18 +853,18 @@
endif
# Initialize all stack variables with a 0xAA pattern.
-ifdef CONFIG_INIT_STACK_ALL_PATTERN
-KBUILD_CFLAGS += -ftrivial-auto-var-init=pattern
-endif
+#ifdef CONFIG_INIT_STACK_ALL_PATTERN
+#KBUILD_CFLAGS += -ftrivial-auto-var-init=pattern
+#endif
# Initialize all stack variables with a zero value.
-ifdef CONFIG_INIT_STACK_ALL_ZERO
-KBUILD_CFLAGS += -ftrivial-auto-var-init=zero
-ifdef CONFIG_CC_HAS_AUTO_VAR_INIT_ZERO_ENABLER
+#ifdef CONFIG_INIT_STACK_ALL_ZERO
+#KBUILD_CFLAGS += -ftrivial-auto-var-init=zero
+#ifdef CONFIG_CC_HAS_AUTO_VAR_INIT_ZERO_ENABLER
# https://github.com/llvm/llvm-project/issues/44842
-KBUILD_CFLAGS += -enable-trivial-auto-var-init-zero-knowing-it-will-be-removed-from-clang
-endif
-endif
+#KBUILD_CFLAGS += -enable-trivial-auto-var-init-zero-knowing-it-will-be-removed-from-clang
+#endif
+#endif
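With this patch in place the out-of-tree DRBD module builds again. A quick way to check by hand, as a sketch (it assumes an unpacked drbd-9.2.8 source tree on the node and relies on the DRBD makefiles honouring KDIR to point at the patched headers):

# Build the module against the patched UEK headers (assumes the drbd-9.2.8
# sources from LINBIT are unpacked in the current directory).
cd drbd-9.2.8
make -j"$(nproc)" KDIR=/usr/src/kernels/5.15.0-204.147.6.2.el8uek.x86_64
find . -name 'drbd*.ko'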
apiVersion: piraeus.io/v1
kind: LinstorSatelliteConfiguration
metadata:
  name: piraeus-storage-pool
spec:
  storagePools:
    - name: piraeus-storage-pool-lvmthin
      lvmThinPool:
        volumeGroup: lvmvgthin
        thinPool: thinpool_piraeus
  podTemplate:
    spec:
      hostNetwork: true
  nodeAffinity:
    nodeSelectorTerms:
      - matchExpressions:
          - key: node-role.kubernetes.io/control-plane
            operator: DoesNotExist
          - key: piraeus
            operator: In
            values:
              - enabled
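The lvmvgthin volume group and the thinpool_piraeus thin pool referenced above have to exist on every satellite node beforehand; roughly like this (the /dev/sdb device and the 500G size are placeholders in this sketch):

# On each storage node (device and size are illustrative):
pvcreate /dev/sdb
vgcreate lvmvgthin /dev/sdb
lvcreate -L 500G -T lvmvgthin/thinpool_piraeus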
---
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: piraeus-storage-replicated-lvm
provisioner: linstor.csi.linbit.com
allowVolumeExpansion: true
volumeBindingMode: WaitForFirstConsumer
reclaimPolicy: Delete
parameters:
  # https://linbit.com/drbd-user-guide/linstor-guide-1_0-en/#s-kubernetes-sc-parameters
  ## CSI related parameters
  csi.storage.k8s.io/fstype: ext4
  ## LINSTOR parameters
  linstor.csi.linbit.com/storagePool: piraeus-storage-pool-lvmthin
  linstor.csi.linbit.com/placementCount: "2"
  linstor.csi.linbit.com/mountOpts: noatime,discard
  property.linstor.csi.linbit.com/DrbdOptions/Net/max-buffers: "11000"
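With placementCount set to "2", each volume should get two replicas. One way to check where they actually land and whether the DRBD resources are healthy (assuming the default piraeus-datastore namespace and linstor-controller deployment of piraeus-operator v2):

# Show LINSTOR resources and volumes across nodes (namespace/deployment
# names are the piraeus-operator v2 defaults).
kubectl -n piraeus-datastore exec deploy/linstor-controller -- linstor resource list
kubectl -n piraeus-datastore exec deploy/linstor-controller -- linstor volume list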
---
apiVersion: piraeus.io/v1
kind: LinstorCluster
metadata:
  name: linstorcluster
spec:
  nodeAffinity:
    nodeSelectorTerms:
      - matchExpressions:
          - key: node-role.kubernetes.io/control-plane
            operator: DoesNotExist
  # https://linbit.com/drbd-user-guide/linstor-guide-1_0-en/#s-autoplace-linstor
  properties:
    - name: DrbdOptions/Net/max-buffers # controller level
      value: "10000"
    - name: Autoplacer/Weights/MaxFreeSpace
      value: "0" # 1 by default
    - name: Autoplacer/Weights/MinReservedSpace
      value: "10" # prefer nodes with minimal reserved space in the thin pool
    - name: Autoplacer/Weights/MinRscCount
      value: "0"
    # - name: Autoplacer/Weights/MaxThroughput
    #   value: "0" # COOL but not today
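To confirm that these controller-level properties were picked up, they can be read back from the controller (again assuming the default piraeus-datastore namespace):

# Read back the controller-level properties set through the LinstorCluster resource.
kubectl -n piraeus-datastore exec deploy/linstor-controller -- linstor controller list-properties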
cat /proc/drbd
version: 9.2.8 (api:2/proto:86-122)
GIT-hash: e163b05a76254c0f51f999970e861d72bb16409a build by @srvh52.example.com, 2024-03-28 15:13:48
Transports (api:20): tcp (9.2.8) lb-tcp (9.2.8) rdma (9.2.8)
[ 4083.197349] Call Trace:
[ 4083.208990] <TASK>
[ 4083.220334] ? show_trace_log_lvl+0x1d6/0x2f9
[ 4083.231532] ? show_trace_log_lvl+0x1d6/0x2f9
[ 4083.242553] ? drbd_free_peer_req+0x99/0x210 [drbd]
[ 4083.253383] ? __die_body.cold+0x8/0xa
[ 4083.263954] ? page_fault_oops+0x16d/0x1ac
[ 4083.274325] ? exc_page_fault+0x68/0x13b
[ 4083.284460] ? asm_exc_page_fault+0x22/0x27
[ 4083.294360] ? _raw_spin_lock_irq+0x13/0x58
[ 4083.303995] drbd_free_peer_req+0x99/0x210 [drbd]
[ 4083.313482] drbd_finish_peer_reqs+0xc0/0x180 [drbd]
[ 4083.322880] drain_resync_activity+0x25b/0x43a [drbd]
[ 4083.332060] conn_disconnect+0xf4/0x650 [drbd]
[ 4083.341017] drbd_receiver+0x53/0x60 [drbd]
[ 4083.349787] drbd_thread_setup+0x77/0x1df [drbd]
[ 4083.358332] ? drbd_reclaim_path+0x90/0x90 [drbd]
[ 4083.366677] kthread+0x127/0x144
[ 4083.374961] ? set_kthread_struct+0x60/0x52
[ 4083.382938] ret_from_fork+0x22/0x2d
[ 4083.390678] </TASK>
About this issue
- Original URL
- State: closed
- Created 3 months ago
- Comments: 19 (9 by maintainers)
Fixed on the DRBD side with https://github.com/LINBIT/drbd/commit/857db82c989b36993ff7a3df3944c9862db1408d and https://github.com/LINBIT/drbd/commit/343e077e9664b203e5ebf8146dacc5c869b80e30.
Just wanted to let you know that we think we have tracked down the issue. There is no fix yet, but we should have something ready for the next DRBD release.
Yes, there will be a 1.5.1 with that. We still intend to fix the issue in DRBD, too.