piraeus: kernel crashes on Oracle Linux 8

Environment:

- Kubernetes v1.27.5, bare-metal nodes
- LVM thin pool
- piraeus-operator v2.4.1
- Oracle Linux 8, kernel 5.15.0-204.147.6.2.el8uek.x86_64 + default DRBD image drbd9-jammy
- Also reproduced with kernel 4.18 + DRBD image drbd9-almalinux8

How to reproduce: create a number of volumes, attach them to pods, and then delete them. I tested with about 8 PVCs and pods and ran roughly 20 create/delete cycles. At random points the server reboots because of a kernel crash. Most often this happened during volume deletion, but it was also reproduced during creation of a new PVC.
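
Roughly, the cycle looked like this (a sketch; the manifest file name and pod label are illustrative, and the PVCs use the piraeus-storage-replicated-lvm StorageClass defined below):

```sh
# pvc-pods.yaml is assumed to hold ~8 PVC+Pod pairs; each pod mounts
# one PVC backed by the piraeus-storage-replicated-lvm StorageClass.
for i in $(seq 1 20); do
  kubectl apply -f pvc-pods.yaml
  kubectl wait --for=condition=Ready pod -l app=repro --timeout=5m
  kubectl delete -f pvc-pods.yaml --wait
done
```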

The UEK kernel Makefile (`/usr/src/kernels/5.15.0-204.147.6.2.el8uek.x86_64/Makefile`) was patched to be able to build DRBD:

```diff
--- Makefile	2024-01-15 12:24:44.452296691 +0000
+++ Makefile	2024-01-15 12:25:36.325543428 +0000
@@ -853,18 +853,18 @@
 endif
 
 # Initialize all stack variables with a 0xAA pattern.
-ifdef CONFIG_INIT_STACK_ALL_PATTERN
-KBUILD_CFLAGS	+= -ftrivial-auto-var-init=pattern
-endif
+#ifdef CONFIG_INIT_STACK_ALL_PATTERN
+#KBUILD_CFLAGS	+= -ftrivial-auto-var-init=pattern
+#endif
 
 # Initialize all stack variables with a zero value.
-ifdef CONFIG_INIT_STACK_ALL_ZERO
-KBUILD_CFLAGS	+= -ftrivial-auto-var-init=zero
-ifdef CONFIG_CC_HAS_AUTO_VAR_INIT_ZERO_ENABLER
+#ifdef CONFIG_INIT_STACK_ALL_ZERO
+#KBUILD_CFLAGS	+= -ftrivial-auto-var-init=zero
+#ifdef CONFIG_CC_HAS_AUTO_VAR_INIT_ZERO_ENABLER
 # https://github.com/llvm/llvm-project/issues/44842
-KBUILD_CFLAGS	+= -enable-trivial-auto-var-init-zero-knowing-it-will-be-removed-from-clang
-endif
-endif
+#KBUILD_CFLAGS	+= -enable-trivial-auto-var-init-zero-knowing-it-will-be-removed-from-clang
+#endif
+#endif
```
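
For reference, one way to apply it (a sketch; the patch file name is illustrative):

```sh
cd /usr/src/kernels/5.15.0-204.147.6.2.el8uek.x86_64
# -b keeps a Makefile.orig backup of the unpatched file
patch -b Makefile < /tmp/uek-makefile.patch
```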

The LINSTOR satellite, storage class, and cluster configuration in use:

```yaml
apiVersion: piraeus.io/v1
kind: LinstorSatelliteConfiguration
metadata:
  name: piraeus-storage-pool
spec:
  storagePools:
    - name: piraeus-storage-pool-lvmthin
      lvmThinPool:
        volumeGroup: lvmvgthin
        thinPool: thinpool_piraeus
  podTemplate:
    spec:
      hostNetwork: true
  nodeAffinity:
    nodeSelectorTerms:
    - matchExpressions:
      - key: node-role.kubernetes.io/control-plane
        operator: DoesNotExist
      - key: piraeus
        operator: In
        values:
        - enabled
---
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: piraeus-storage-replicated-lvm
provisioner: linstor.csi.linbit.com
allowVolumeExpansion: true
volumeBindingMode: WaitForFirstConsumer
reclaimPolicy: Delete
parameters:
  # https://linbit.com/drbd-user-guide/linstor-guide-1_0-en/#s-kubernetes-sc-parameters
  ## CSI related parameters
  csi.storage.k8s.io/fstype: ext4
  ## LINSTOR parameters
  linstor.csi.linbit.com/storagePool: piraeus-storage-pool-lvmthin
  linstor.csi.linbit.com/placementCount: "2"
  linstor.csi.linbit.com/mountOpts: noatime,discard
  property.linstor.csi.linbit.com/DrbdOptions/Net/max-buffers: "11000"
---
apiVersion: piraeus.io/v1
kind: LinstorCluster
metadata:
  name: linstorcluster
spec:
  nodeAffinity:
    nodeSelectorTerms:
    - matchExpressions:
      - key: node-role.kubernetes.io/control-plane
        operator: DoesNotExist
  # https://linbit.com/drbd-user-guide/linstor-guide-1_0-en/#s-autoplace-linstor
  properties:
    - name: DrbdOptions/Net/max-buffers # controller level
      value: "10000"
    - name: Autoplacer/Weights/MaxFreeSpace
      value: "0" # 1 default
    - name: Autoplacer/Weights/MinReservedSpace
      value: "10" # preferr nodes with minimal reserved space on thin pool
    - name: Autoplacer/Weights/MinRscCount
      value: "0"
    # - name: Autoplacer/Weights/MaxThroughput
    #   value: "0" # COOL but not today

The DRBD version built and loaded on the nodes:

```
cat /proc/drbd
version: 9.2.8 (api:2/proto:86-122)
GIT-hash: e163b05a76254c0f51f999970e861d72bb16409a build by @srvh52.example.com, 2024-03-28 15:13:48
Transports (api:20): tcp (9.2.8) lb-tcp (9.2.8) rdma (9.2.8)
```

Kernel call trace from the crash:

```
[ 4083.197349] Call Trace:
[ 4083.208990]  <TASK>
[ 4083.220334]  ? show_trace_log_lvl+0x1d6/0x2f9
[ 4083.231532]  ? show_trace_log_lvl+0x1d6/0x2f9
[ 4083.242553]  ? drbd_free_peer_req+0x99/0x210 [drbd]
[ 4083.253383]  ? __die_body.cold+0x8/0xa
[ 4083.263954]  ? page_fault_oops+0x16d/0x1ac
[ 4083.274325]  ? exc_page_fault+0x68/0x13b
[ 4083.284460]  ? asm_exc_page_fault+0x22/0x27
[ 4083.294360]  ? _raw_spin_lock_irq+0x13/0x58
[ 4083.303995]  drbd_free_peer_req+0x99/0x210 [drbd]
[ 4083.313482]  drbd_finish_peer_reqs+0xc0/0x180 [drbd]
[ 4083.322880]  drain_resync_activity+0x25b/0x43a [drbd]
[ 4083.332060]  conn_disconnect+0xf4/0x650 [drbd]
[ 4083.341017]  drbd_receiver+0x53/0x60 [drbd]
[ 4083.349787]  drbd_thread_setup+0x77/0x1df [drbd]
[ 4083.358332]  ? drbd_reclaim_path+0x90/0x90 [drbd]
[ 4083.366677]  kthread+0x127/0x144
[ 4083.374961]  ? set_kthread_struct+0x60/0x52
[ 4083.382938]  ret_from_fork+0x22/0x2d
[ 4083.390678]  </TASK>
```

About this issue

  • State: closed
  • Created 3 months ago
  • Comments: 19 (9 by maintainers)

Most upvoted comments

Just wanted to let you know that we think we have tracked down the issue. There is no fix yet, but we should have something ready for the next DRBD release.

Yes, there will be a 1.5.1 with that. We still intend to fix the issue in DRBD, too.