Flatcar: Stable 3227.2.2 randomly causes processes to hang on I/O related operations

Description

We’ve seen multiple nodes (across different regions and environments) stalling completely (unresponsive kubelet and containerd processes, failing journald and timesyncd units, no login possible, …) on the 3227.2.2 release. This mostly seems to happen on the reboot after the update, but we have also seen it occur at random.

Impact

This causes Kubernetes nodes to become NotReady before they can be drained, which means volumes cannot be moved to other nodes and services are interrupted.

Environment and steps to reproduce

  1. Set-up: Flatcar Stable 3227.2.2, OpenStack, ESXi hypervisors
  2. Task: on node start, but also during normal operation
  3. Action(s): no clue if anything specific causes this
  4. Error:
  • we get “task blocked for more than 120 seconds” errors and the related call stacks, see the attached screenshot
  • CPU usage is very high when the issue occurs
  • the journal gets corrupted when this issue occurs
  • rebooting once more brings the node back
  • once the node is up again, partitions and filesystems appear healthy
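
For anyone trying to capture more evidence on an affected node while a console still responds, the standard kernel hung-task knobs are useful. A rough sketch (requires root, and sysrq must be enabled; these are generic kernel facilities, not Flatcar-specific):

```shell
# Dump stacks of all uninterruptible (D-state) tasks into the kernel ring buffer
echo w > /proc/sysrq-trigger

# Report hung tasks sooner than the default 120 seconds
sysctl kernel.hung_task_timeout_secs=30

# Save the resulting traces for the bug report
dmesg > hung-tasks.txt
```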

Expected behavior

The nodes do not stall completely.

Additional information

We have the feeling that we may be hitting a kernel bug, as we only see this on the 3227.2.2 release, where basically only the kernel was changed. Do you have any ideas on how we can diagnose this further? Thanks.

cc @databus23 @defo89 @jknipper

About this issue

  • State: closed
  • Created 2 years ago
  • Reactions: 15
  • Comments: 59 (24 by maintainers)

Most upvoted comments

We’ll make sure the fix (either 5.15.87 or the backport) is part of the next stable release, scheduled for 2023-01-23.

There’s a fix available now (https://lore.kernel.org/linux-ext4/20221122115715.kxqhsk2xs4nrofyb@quack3/T/#ma65f0113b4b2f9259b8f7d8af16b8cb351728d20) and I’ve tested it on 5.15 with the reproducer in this thread; I can no longer reproduce the hang.

This should now get merged upstream and subsequently backported. Then it’ll be included in a Flatcar stable release.

So the update is that the issue is also present in upstream 6.0 kernels. It appears to be a logic bug in this commit: https://github.com/torvalds/linux/commit/65f8b80053a1b2fd602daa6814e62d6fa90e5e9b, which results in processes getting stuck trying to reuse a block that has too many references.

I’ve provided more debug output to the kernel mailing list, now that the problem has been identified I hope this will lead to a fix soon.

Okay, I think the patch provided above fixes the problem. I’ve deployed two nodes, one running stock 5.15.63 and a second running with the patch. On the regular node without the patch, I could recreate this reliably by spawning 20-30 pods, each pulling a different image. Every test run would result in SSH connections dying, and the node would become completely unresponsive with most of the pods stuck in a ContainerCreating state. On the node with the patch I ran the pull a couple of times, pruning the images between each run. No problems were encountered and the pods all entered a Running state successfully; SSH connections were also responsive during this time.

Didn’t mean to close this. It will land in stable in 1-2 release cycles (next time beta promotes), or when the bugfix lands in the stable trees.

If you’re hitting this issue and have been holding off on updates - do update to the beta release that will come this week.

State of our research so far:

The suspicious commit has not been reverted yet, and is still in main (therefore, also in the 5.15 LTS kernel series). We will continue to closely monitor the situation.

Kudos go to @vbatts and to @jepio for doing the leg work.

No, it’s added to beta in 3432.1.0. So as long as stable is on a smaller version it hasn’t trickled down to stable.

Correct. The fix is not in Stable yet, but only in Beta & Alpha.

Good news: the ext4 deadlock fix was recently merged into the mainline kernel, though it is not yet included in any kernel release. Looks like it needs a little more time. Anyway, most of the Flatcar maintainers have already left for the holidays, so the next release should happen in January next year.

Flatcar Stable 3374.2.3 was released with the bugfix, Kernel 5.15.86 with the backport. Please try it out and see if the bug is gone.

The fix is now released with Kernel 6.1.4

Yesterday’s release did not include the bugfix for this! Was that intentional?

I am really waiting for this! If it will take longer, I would need to temporarily switch to the beta channel, but switching everything over is a lot of work 😦

We have been testing the new patch [1] which we have applied against 5.15.80 in a custom build and it is currently looking stable again on our development clusters with Flatcar 3227.2.2

[1] https://lore.kernel.org/linux-ext4/20221122174807.GA9658@linuxonhyperv3.guj3yctzbm1etfxqx2vob5hsef.xx.internal.cloudapp.net/2-0001-ext4-Fix-deadlock-due-to-mbcache-en.patch

That’s too bad. On a positive note - I’m now able to recreate this with the reproducer you provided 😃

Turns out looking at the hung tasks only reveals half the picture, because there are also some processes stuck running somewhere in ext4 code. I’ll post an update as soon as I understand it better.

This is definitely a vanilla kernel issue, as Flatcar runs a vanilla kernel (the 6 patches we carry are arm64 specific or build system tweaks).

We agree this is pretty serious, but since none of us are filesystem developers, we don’t just want to go around reverting filesystem code patches and risk the integrity of users’ data. We’re waiting on upstream kernel developers. In the meantime I suggest you freeze the Flatcar release you’re using to one that is unaffected (an earlier Stable or LTS).
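
In case it helps with freezing nodes in the meantime: on a running node, automatic updates can be held back via update_engine’s configuration. A minimal sketch based on the documented /etc/flatcar/update.conf mechanism (double-check against the Flatcar update docs for your setup):

```shell
# Persistently disable update checks on this node
echo "SERVER=disabled" | sudo tee -a /etc/flatcar/update.conf

# Apply the change
sudo systemctl restart update-engine
```

Remove the SERVER line again once a fixed release is out to resume updates.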

@jepio I think I have managed to get some of the list of hung tasks. I can’t be sure how complete the list is, but I think it should be a good starting point nonetheless: hung-tasks.txt

We tried to reproduce the case again today in our test cluster. We installed one node with Flatcar version 3227.2.1 and the other with 3227.2.2.

Both nodes have the following hardware:

  • 128 GB DDR4 ECC
  • 2 x 1.92 TB NVME
  • AMD Ryzen 9 3900 (24 cores)
  • K8S v1.24.5

We used dd, stress, as well as stress-ng - unsuccessfully! We got a high load, but the 3227.2.2 node did not crash.

Then we crafted the following K8s resource with a small Go tool, and this actually managed to crash the 3227.2.2 node after about 5 minutes. SSH and other services were no longer accessible.

On the 3227.2.1 node the script ran until the filesystem ran out of inodes, but the node remained reachable.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: io-test
  labels: 
    app: io-test
spec:
  replicas: 2
  selector:
    matchLabels:
      app: io-test
  template:
    metadata:
      labels:
        app: io-test
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: kubernetes.io/hostname
                    operator: In
                    values:
                      - <select-a-node>
      containers:
        - name: io-test
          image: golang:1.19.1
          command:
          - bash
          - -c
          - |
            apt update && apt install time -y
            echo '
              package main

              import (
                "fmt"
                "log"
                "os"
              )

              func main() {
                suffix := ""
                if len(os.Args) > 1 {
                  suffix = os.Args[1]
                }

                err := run(suffix)
                if err != nil {
                  log.Fatal(err)
                }
              }

              func run(suffix string) error {
                for i := 0; i < 100_000; i++ {
                  if i%10_000 == 0 {
                    log.Printf("file #%d", i)
                  }

                  path := fmt.Sprintf("file_%d_%s", i, suffix)
                  content := path
                  err := os.WriteFile(path, []byte(content), 0644)
                  if err != nil {
                    return err
                  }
                }

                return nil
              }
            ' >> /test/main.go
            cd /test
            go build main.go

            while true
            do
              time ./main $(date +%s)
            done
          volumeMounts:
          - name: io-test
            mountPath: "/test"
      restartPolicy: Always
      volumes:
        - name: io-test
          emptyDir: {}

When the 3227.2.2 node crashed, the kubelet process was marked as a zombie, and in the kernel ring buffer we saw the following errors:

[ 2091.827035] INFO: task jbd2/nvme0n1p9-:755 blocked for more than 122 seconds.
[ 2091.834299]       Not tainted 5.15.63-flatcar #1
[ 2091.839081] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 2091.847099] task:jbd2/nvme0n1p9- state:D stack:    0 pid:  755 ppid:     2 flags:0x00004000
[ 2091.855634] Call Trace:
[ 2091.858241]  <TASK>
[ 2091.860505]  __schedule+0x2eb/0x8d0
[ 2091.864153]  schedule+0x5b/0xd0
[ 2091.867456]  jbd2_journal_commit_transaction+0x301/0x2850 [jbd2]
[ 2091.873628]  ? sugov_update_single_freq+0xb9/0x190
[ 2091.878582]  ? sugov_exit+0xa0/0xa0
[ 2091.882229]  ? wait_woken+0x70/0x70
[ 2091.885878]  ? lock_timer_base+0x61/0x80
[ 2091.889962]  jbd2_journal_check_available_features+0x1ab/0x3f0 [jbd2]
[ 2091.896565]  ? wait_woken+0x70/0x70
[ 2091.900214]  ? jbd2_journal_check_available_features+0x100/0x3f0 [jbd2]
[ 2091.906982]  kthread+0x127/0x150
[ 2091.910379]  ? set_kthread_struct+0x50/0x50
[ 2091.914720]  ret_from_fork+0x22/0x30
[ 2091.918455]  </TASK>
[ 2091.920803] INFO: task systemd-journal:988 blocked for more than 122 seconds.
[ 2091.928104]       Not tainted 5.15.63-flatcar #1
[ 2091.932876] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 2091.940891] task:systemd-journal state:D stack:    0 pid:  988 ppid:     1 flags:0x00000224
[ 2091.949428] Call Trace:
[ 2091.952038]  <TASK>
[ 2091.954301]  __schedule+0x2eb/0x8d0
[ 2091.957949]  schedule+0x5b/0xd0
[ 2091.961260]  jbd2_journal_free_reserved+0xca/0x9d0 [jbd2]
[ 2091.966815]  ? wait_woken+0x70/0x70
[ 2091.970461]  jbd2_journal_free_reserved+0x2d9/0x9d0 [jbd2]
[ 2091.976104]  ? call_rcu+0xa2/0x330
[ 2091.979667]  jbd2_journal_free_reserved+0x5ab/0x9d0 [jbd2]
[ 2091.985311]  ? step_into+0x47c/0x7b0
[ 2091.989052]  ? __cond_resched+0x16/0x50
[ 2091.993047]  jbd2__journal_start+0xfb/0x1e0 [jbd2]
[ 2091.997998]  __ext4_journal_start_sb+0xf8/0x110 [ext4]
[ 2092.003302]  ext4_dirty_inode+0x35/0x80 [ext4]
[ 2092.007902]  __mark_inode_dirty+0x147/0x320
[ 2092.012255]  touch_atime+0x13c/0x150
[ 2092.015991]  filemap_read+0x308/0x320
[ 2092.019810]  ? path_lookupat+0xa2/0x1c0
[ 2092.023815]  new_sync_read+0x119/0x1b0
[ 2092.027735]  ? 0xffffffffb6000000
[ 2092.031207]  vfs_read+0xf6/0x190
[ 2092.034596]  __x64_sys_pread64+0x91/0xc0
[ 2092.038678]  do_syscall_64+0x3b/0x90
[ 2092.042413]  entry_SYSCALL_64_after_hwframe+0x61/0xcb
[ 2092.047622] RIP: 0033:0x7f7b5d090b17
[ 2092.051356] RSP: 002b:00007ffc6dedc4c0 EFLAGS: 00000293 ORIG_RAX: 0000000000000011
[ 2092.059113] RAX: ffffffffffffffda RBX: 00007ffc6dedc570 RCX: 00007f7b5d090b17
[ 2092.066401] RDX: 0000000000000040 RSI: 00007ffc6dedc4f0 RDI: 0000000000000018
[ 2092.073701] RBP: 0000000000f25b78 R08: 0000000000000000 R09: 00007ffc6dedc700
[ 2092.080998] R10: 0000000000f25b78 R11: 0000000000000293 R12: 000055d55b175310
[ 2092.088284] R13: 0000000000000000 R14: 00007ffc6dedc4f0 R15: 000055d55b175310
[ 2092.095576]  </TASK>
[ 2092.097944] INFO: task systemd-timesyn:1140 blocked for more than 123 seconds.
[ 2092.105360]       Not tainted 5.15.63-flatcar #1
[ 2092.110132] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 2092.118139] task:systemd-timesyn state:D stack:    0 pid: 1140 ppid:     1 flags:0x00000220
[ 2092.126678] Call Trace:
[ 2092.129287]  <TASK>
[ 2092.131546]  __schedule+0x2eb/0x8d0
[ 2092.135199]  schedule+0x5b/0xd0
[ 2092.138500]  jbd2_journal_free_reserved+0xca/0x9d0 [jbd2]
[ 2092.144063]  ? wait_woken+0x70/0x70
[ 2092.147721]  jbd2_journal_free_reserved+0x2d9/0x9d0 [jbd2]
[ 2092.153361]  ? __fget_files+0x79/0xb0
[ 2092.157193]  jbd2_journal_free_reserved+0x5ab/0x9d0 [jbd2]
[ 2092.162846]  ? nd_jump_link+0x4d/0xc0
[ 2092.166681]  ? __cond_resched+0x16/0x50
[ 2092.170677]  jbd2__journal_start+0xfb/0x1e0 [jbd2]
[ 2092.175627]  __ext4_journal_start_sb+0xf8/0x110 [ext4]
[ 2092.180921]  ext4_dirty_inode+0x35/0x80 [ext4]
[ 2092.185533]  __mark_inode_dirty+0x147/0x320
[ 2092.189877]  ext4_setattr+0x1fb/0x9a0 [ext4]
[ 2092.194311]  notify_change+0x3c4/0x540
[ 2092.198221]  ? vfs_utimes+0x139/0x220
[ 2092.202043]  vfs_utimes+0x139/0x220
[ 2092.205692]  do_utimes+0xb4/0x120
[ 2092.209165]  __x64_sys_utimensat+0x70/0xb0
[ 2092.213431]  ? syscall_trace_enter.constprop.0+0x143/0x1c0
[ 2092.219073]  do_syscall_64+0x3b/0x90
[ 2092.222815]  entry_SYSCALL_64_after_hwframe+0x61/0xcb
[ 2092.228036] RIP: 0033:0x7f26ccdc801f
[ 2092.231770] RSP: 002b:00007ffc7da41a98 EFLAGS: 00000206 ORIG_RAX: 0000000000000118
[ 2092.239533] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f26ccdc801f
[ 2092.246824] RDX: 0000000000000000 RSI: 00007ffc7da41aa0 RDI: 00000000ffffff9c
[ 2092.254130] RBP: 00007ffc7da41aa0 R08: 0000000000000000 R09: 00007ffc7da41930
[ 2092.261427] R10: 0000000000000000 R11: 0000000000000206 R12: 0000000000000000
[ 2092.268715] R13: 00000000ffffffff R14: ffffffffffffffff R15: 00000000ffffffff
[ 2092.276013]  </TASK>
[ 2092.280018] INFO: task containerd:1257 blocked for more than 123 seconds.
[ 2092.286959]       Not tainted 5.15.63-flatcar #1
[ 2092.291735] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 2092.299750] task:containerd      state:D stack:    0 pid: 1257 ppid:     1 flags:0x00000000
[ 2092.308287] Call Trace:
[ 2092.310903]  <TASK>
[ 2092.313175]  __schedule+0x2eb/0x8d0
[ 2092.316821]  ? newidle_balance+0x12a/0x400
[ 2092.321086]  schedule+0x5b/0xd0
[ 2092.324399]  jbd2_journal_free_reserved+0xca/0x9d0 [jbd2]
[ 2092.329962]  ? wait_woken+0x70/0x70
[ 2092.333617]  jbd2_journal_free_reserved+0x2d9/0x9d0 [jbd2]
[ 2092.339260]  ? select_task_rq_fair+0x130/0xf90
[ 2092.343871]  jbd2_journal_free_reserved+0x5ab/0x9d0 [jbd2]
[ 2092.349531]  ? ttwu_queue_wakelist+0xf2/0x110
[ 2092.354056]  ? __cond_resched+0x16/0x50
[ 2092.358056]  jbd2__journal_start+0xfb/0x1e0 [jbd2]
[ 2092.363007]  __ext4_journal_start_sb+0xf8/0x110 [ext4]
[ 2092.368311]  ext4_dirty_inode+0x35/0x80 [ext4]
[ 2092.372920]  __mark_inode_dirty+0x147/0x320
[ 2092.377274]  generic_update_time+0x6c/0xd0
[ 2092.381527]  file_update_time+0x127/0x140
[ 2092.385704]  ? generic_write_checks+0x61/0xc0
[ 2092.390229]  ext4_llseek+0x3fa/0x1cb0 [ext4]
[ 2092.394666]  new_sync_write+0x11c/0x1b0
[ 2092.398660]  ? intel_get_event_constraints+0x300/0x390
[ 2092.403964]  vfs_write+0x1de/0x270
[ 2092.407536]  ksys_write+0x5f/0xe0
[ 2092.411010]  do_syscall_64+0x3b/0x90
[ 2092.414746]  entry_SYSCALL_64_after_hwframe+0x61/0xcb
[ 2092.419954] RIP: 0033:0x55d3b77a2edb
[ 2092.423698] RSP: 002b:000000c0005228d0 EFLAGS: 00000202 ORIG_RAX: 0000000000000001
[ 2092.431454] RAX: ffffffffffffffda RBX: 000000c000054000 RCX: 000055d3b77a2edb
[ 2092.438751] RDX: 0000000000000117 RSI: 000000c002fe3500 RDI: 0000000000000255
[ 2092.446053] RBP: 000000c000522920 R08: 0000000000000201 R09: 000000c003776120
[ 2092.453346] R10: 0000000000000117 R11: 0000000000000202 R12: 000000c0005229b0
[ 2092.460644] R13: 0000000000000000 R14: 000000c002c9cea0 R15: 0000000000000032
[ 2092.467945]  </TASK>
[ 2092.470307] INFO: task containerd:4089 blocked for more than 123 seconds.
[ 2092.477260]       Not tainted 5.15.63-flatcar #1
[ 2092.482042] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 2092.490059] task:containerd      state:D stack:    0 pid: 4089 ppid:     1 flags:0x00000000
[ 2092.498597] Call Trace:
[ 2092.501212]  <TASK>
[ 2092.503475]  __schedule+0x2eb/0x8d0
[ 2092.507131]  schedule+0x5b/0xd0
[ 2092.510437]  jbd2_journal_free_reserved+0xca/0x9d0 [jbd2]
[ 2092.515997]  ? wait_woken+0x70/0x70
[ 2092.519647]  jbd2_journal_free_reserved+0x2d9/0x9d0 [jbd2]
[ 2092.525289]  ? select_task_rq_fair+0x130/0xf90
[ 2092.529900]  jbd2_journal_free_reserved+0x5ab/0x9d0 [jbd2]
[ 2092.535550]  ? ttwu_queue_wakelist+0xf2/0x110
[ 2092.540073]  ? __cond_resched+0x16/0x50
[ 2092.544068]  jbd2__journal_start+0xfb/0x1e0 [jbd2]
[ 2092.549019]  __ext4_journal_start_sb+0xf8/0x110 [ext4]
[ 2092.554321]  ext4_dirty_inode+0x35/0x80 [ext4]
[ 2092.558932]  __mark_inode_dirty+0x147/0x320
[ 2092.563282]  generic_update_time+0x6c/0xd0
[ 2092.567537]  file_update_time+0x127/0x140
[ 2092.571708]  ? generic_write_checks+0x61/0xc0
[ 2092.576233]  ext4_llseek+0x3fa/0x1cb0 [ext4]
[ 2092.580668]  new_sync_write+0x11c/0x1b0
[ 2092.584672]  ? intel_get_event_constraints+0x300/0x390
[ 2092.589977]  vfs_write+0x1de/0x270
[ 2092.593539]  ksys_write+0x5f/0xe0
[ 2092.597015]  do_syscall_64+0x3b/0x90
[ 2092.600747]  entry_SYSCALL_64_after_hwframe+0x61/0xcb
[ 2092.605957] RIP: 0033:0x55d3b77a2edb
[ 2092.609701] RSP: 002b:000000c0025e18d0 EFLAGS: 00000202 ORIG_RAX: 0000000000000001
[ 2092.617456] RAX: ffffffffffffffda RBX: 000000c00004a000 RCX: 000055d3b77a2edb
[ 2092.624747] RDX: 00000000000000ac RSI: 000000c000b760c0 RDI: 00000000000000d2
[ 2092.632044] RBP: 000000c0025e1920 R08: 0000000000000001 R09: 000000c001ca3c80
[ 2092.639343] R10: 00000000000000ac R11: 0000000000000202 R12: 000000c0025e19b0
[ 2092.646639] R13: 0000000000000000 R14: 000000c001504d00 R15: 0000000000000035
[ 2092.653938]  </TASK>
[ 2092.656293] INFO: task containerd:6560 blocked for more than 123 seconds.
[ 2092.663246]       Not tainted 5.15.63-flatcar #1
[ 2092.668027] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 2092.676044] task:containerd      state:D stack:    0 pid: 6560 ppid:     1 flags:0x00000000
[ 2092.684581] Call Trace:
[ 2092.687188]  <TASK>
[ 2092.689450]  __schedule+0x2eb/0x8d0
[ 2092.693100]  ? newidle_balance+0x12a/0x400
[ 2092.697354]  schedule+0x5b/0xd0
[ 2092.700659]  jbd2_journal_free_reserved+0xca/0x9d0 [jbd2]
[ 2092.706224]  ? wait_woken+0x70/0x70
[ 2092.709870]  jbd2_journal_free_reserved+0x2d9/0x9d0 [jbd2]
[ 2092.715521]  ? select_task_rq_fair+0x130/0xf90
[ 2092.720124]  jbd2_journal_free_reserved+0x5ab/0x9d0 [jbd2]
[ 2092.725773]  ? ttwu_queue_wakelist+0xf2/0x110
[ 2092.730288]  ? __cond_resched+0x16/0x50
[ 2092.734284]  jbd2__journal_start+0xfb/0x1e0 [jbd2]
[ 2092.739231]  __ext4_journal_start_sb+0xf8/0x110 [ext4]
[ 2092.744539]  ext4_dirty_inode+0x35/0x80 [ext4]
[ 2092.749148]  __mark_inode_dirty+0x147/0x320
[ 2092.753500]  generic_update_time+0x6c/0xd0
[ 2092.757754]  file_update_time+0x127/0x140
[ 2092.761931]  ? generic_write_checks+0x61/0xc0
[ 2092.766456]  ext4_llseek+0x3fa/0x1cb0 [ext4]
[ 2092.770895]  new_sync_write+0x11c/0x1b0
[ 2092.774897]  ? intel_get_event_constraints+0x300/0x390
[ 2092.780199]  vfs_write+0x1de/0x270
[ 2092.783763]  ksys_write+0x5f/0xe0
[ 2092.787245]  do_syscall_64+0x3b/0x90
[ 2092.790981]  entry_SYSCALL_64_after_hwframe+0x61/0xcb
[ 2092.796200] RIP: 0033:0x55d3b77a2edb
[ 2092.799942] RSP: 002b:000000c00237a8d0 EFLAGS: 00000202 ORIG_RAX: 0000000000000001
[ 2092.807698] RAX: ffffffffffffffda RBX: 000000c00004f000 RCX: 000055d3b77a2edb
[ 2092.814989] RDX: 0000000000000097 RSI: 000000c001e1c540 RDI: 0000000000000227
[ 2092.822287] RBP: 000000c00237a920 R08: 0000000000000201 R09: 000000c0019f21e0
[ 2092.829581] R10: 0000000000000097 R11: 0000000000000202 R12: 000000c00237a9b0
[ 2092.836881] R13: 0000000000000000 R14: 000000c001cc8d00 R15: 0000000000000034
[ 2092.844177]  </TASK>
[ 2092.846531] INFO: task containerd:7559 blocked for more than 123 seconds.
[ 2092.853470]       Not tainted 5.15.63-flatcar #1
[ 2092.858243] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 2092.866258] task:containerd      state:D stack:    0 pid: 7559 ppid:     1 flags:0x00000000
[ 2092.874794] Call Trace:
[ 2092.877406]  <TASK>
[ 2092.879666]  __schedule+0x2eb/0x8d0
[ 2092.883314]  ? newidle_balance+0x12a/0x400
[ 2092.887569]  schedule+0x5b/0xd0
[ 2092.890873]  jbd2_journal_free_reserved+0xca/0x9d0 [jbd2]
[ 2092.896436]  ? wait_woken+0x70/0x70
[ 2092.900085]  jbd2_journal_free_reserved+0x2d9/0x9d0 [jbd2]
[ 2092.905736]  ? select_task_rq_fair+0x130/0xf90
[ 2092.910347]  jbd2_journal_free_reserved+0x5ab/0x9d0 [jbd2]
[ 2092.915997]  ? ttwu_queue_wakelist+0xf2/0x110
[ 2092.920521]  ? __cond_resched+0x16/0x50
[ 2092.924524]  jbd2__journal_start+0xfb/0x1e0 [jbd2]
[ 2092.929473]  __ext4_journal_start_sb+0xf8/0x110 [ext4]
[ 2092.934777]  ext4_dirty_inode+0x35/0x80 [ext4]
[ 2092.939386]  __mark_inode_dirty+0x147/0x320
[ 2092.943741]  generic_update_time+0x6c/0xd0
[ 2092.947995]  file_update_time+0x127/0x140
[ 2092.952170]  ? generic_write_checks+0x61/0xc0
[ 2092.956693]  ext4_llseek+0x3fa/0x1cb0 [ext4]
[ 2092.961124]  new_sync_write+0x11c/0x1b0
[ 2092.965129]  ? intel_get_event_constraints+0x300/0x390
[ 2092.970424]  vfs_write+0x1de/0x270
[ 2092.973984]  ksys_write+0x5f/0xe0
[ 2092.977459]  do_syscall_64+0x3b/0x90
[ 2092.981195]  entry_SYSCALL_64_after_hwframe+0x61/0xcb
[ 2092.986412] RIP: 0033:0x55d3b77a2edb
[ 2092.990148] RSP: 002b:000000c00285c8d0 EFLAGS: 00000202 ORIG_RAX: 0000000000000001
[ 2092.997903] RAX: ffffffffffffffda RBX: 000000c000074800 RCX: 000055d3b77a2edb
[ 2093.005203] RDX: 0000000000000088 RSI: 000000c000a13500 RDI: 0000000000000236
[ 2093.012502] RBP: 000000c00285c920 R08: 0000000000000201 R09: 000000c002024f00
[ 2093.019796] R10: 00007ffc79f7e080 R11: 0000000000000202 R12: 000000c00285c810
[ 2093.027095] R13: 0000000000000000 R14: 000000c001d89520 R15: 000000000000002f
[ 2093.034390]  </TASK>
[ 2093.036749] INFO: task containerd:7563 blocked for more than 124 seconds.
[ 2093.043690]       Not tainted 5.15.63-flatcar #1
[ 2093.048463] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 2093.056472] task:containerd      state:D stack:    0 pid: 7563 ppid:     1 flags:0x00000000
[ 2093.065011] Call Trace:
[ 2093.067619]  <TASK>
[ 2093.069881]  __schedule+0x2eb/0x8d0
[ 2093.073537]  ? newidle_balance+0x12a/0x400
[ 2093.077793]  schedule+0x5b/0xd0
[ 2093.081095]  jbd2_journal_free_reserved+0xca/0x9d0 [jbd2]
[ 2093.086651]  ? wait_woken+0x70/0x70
[ 2093.090308]  jbd2_journal_free_reserved+0x2d9/0x9d0 [jbd2]
[ 2093.095959]  ? select_task_rq_fair+0x130/0xf90
[ 2093.100569]  jbd2_journal_free_reserved+0x5ab/0x9d0 [jbd2]
[ 2093.106209]  ? ttwu_queue_wakelist+0xf2/0x110
[ 2093.112364]  ? __cond_resched+0x16/0x50
[ 2093.116360]  jbd2__journal_start+0xfb/0x1e0 [jbd2]
[ 2093.121309]  __ext4_journal_start_sb+0xf8/0x110 [ext4]
[ 2093.126613]  ext4_dirty_inode+0x35/0x80 [ext4]
[ 2093.131214]  __mark_inode_dirty+0x147/0x320
[ 2093.135557]  generic_update_time+0x6c/0xd0
[ 2093.139810]  file_update_time+0x127/0x140
[ 2093.143979]  ? generic_write_checks+0x61/0xc0
[ 2093.148494]  ext4_llseek+0x3fa/0x1cb0 [ext4]
[ 2093.152923]  new_sync_write+0x11c/0x1b0
[ 2093.156920]  ? intel_get_event_constraints+0x300/0x390
[ 2093.162222]  vfs_write+0x1de/0x270
[ 2093.165785]  ksys_write+0x5f/0xe0
[ 2093.169267]  do_syscall_64+0x3b/0x90
[ 2093.173004]  entry_SYSCALL_64_after_hwframe+0x61/0xcb
[ 2093.178213] RIP: 0033:0x55d3b77a2edb
[ 2093.181947] RSP: 002b:000000c0027fa8d0 EFLAGS: 00000202 ORIG_RAX: 0000000000000001
[ 2093.189704] RAX: ffffffffffffffda RBX: 000000c000074800 RCX: 000055d3b77a2edb
[ 2093.196997] RDX: 00000000000001f5 RSI: 000000c0001e22c0 RDI: 000000000000013b
[ 2093.204289] RBP: 000000c0027fa920 R08: 0000000000000101 R09: 000000c003288360
[ 2093.211592] R10: 00000000000001f5 R11: 0000000000000202 R12: 000000c0027fa9b0
[ 2093.218875] R13: 0000000000000000 R14: 000000c00117c820 R15: 000000000000002f
[ 2093.226176]  </TASK>
[ 2093.228524] INFO: task containerd:9227 blocked for more than 124 seconds.
[ 2093.235475]       Not tainted 5.15.63-flatcar #1
[ 2093.240257] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 2093.248273] task:containerd      state:D stack:    0 pid: 9227 ppid:     1 flags:0x00000000
[ 2093.256810] Call Trace:
[ 2093.259420]  <TASK>
[ 2093.261682]  __schedule+0x2eb/0x8d0
[ 2093.265330]  ? newidle_balance+0x12a/0x400
[ 2093.269586]  schedule+0x5b/0xd0
[ 2093.272886]  jbd2_journal_free_reserved+0xca/0x9d0 [jbd2]
[ 2093.278451]  ? wait_woken+0x70/0x70
[ 2093.282099]  jbd2_journal_free_reserved+0x2d9/0x9d0 [jbd2]
[ 2093.287740]  ? cpumask_next+0x1e/0x30
[ 2093.291562]  ? select_task_rq_fair+0x130/0xf90
[ 2093.296165]  jbd2_journal_free_reserved+0x5ab/0x9d0 [jbd2]
[ 2093.301809]  ? ttwu_queue_wakelist+0xf2/0x110
[ 2093.306331]  ? __cond_resched+0x16/0x50
[ 2093.310327]  jbd2__journal_start+0xfb/0x1e0 [jbd2]
[ 2093.315276]  __ext4_journal_start_sb+0xf8/0x110 [ext4]
[ 2093.320579]  ext4_dirty_inode+0x35/0x80 [ext4]
[ 2093.325192]  __mark_inode_dirty+0x147/0x320
[ 2093.329530]  generic_update_time+0x6c/0xd0
[ 2093.333789]  file_update_time+0x127/0x140
[ 2093.337965]  ? generic_write_checks+0x61/0xc0
[ 2093.342488]  ext4_llseek+0x3fa/0x1cb0 [ext4]
[ 2093.346917]  new_sync_write+0x11c/0x1b0
[ 2093.350921]  ? intel_get_event_constraints+0x300/0x390
[ 2093.356224]  vfs_write+0x1de/0x270
[ 2093.359788]  ksys_write+0x5f/0xe0
[ 2093.363270]  do_syscall_64+0x3b/0x90
[ 2093.367007]  entry_SYSCALL_64_after_hwframe+0x61/0xcb
[ 2093.372215] RIP: 0033:0x55d3b77a2edb
[ 2093.375949] RSP: 002b:000000c000fa88d0 EFLAGS: 00000202 ORIG_RAX: 0000000000000001
[ 2093.383696] RAX: ffffffffffffffda RBX: 000000c00004a000 RCX: 000055d3b77a2edb
[ 2093.390987] RDX: 0000000000000153 RSI: 000000c000e96000 RDI: 000000000000003f
[ 2093.398276] RBP: 000000c000fa8920 R08: 0000000000000001 R09: 000000c00084d3e0
[ 2093.405571] R10: 0000000000000153 R11: 0000000000000202 R12: 000000c000fa89b0
[ 2093.412860] R13: 0000000000000000 R14: 000000c001a471e0 R15: 0000000000000031
[ 2093.420160]  </TASK>
[ 2093.422514] INFO: task containerd:9229 blocked for more than 124 seconds.
[ 2093.429459]       Not tainted 5.15.63-flatcar #1
[ 2093.434232] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 2093.442248] task:containerd      state:D stack:    0 pid: 9229 ppid:     1 flags:0x00000000
[ 2093.450785] Call Trace:
[ 2093.453406]  <TASK>
[ 2093.455664]  __schedule+0x2eb/0x8d0
[ 2093.459315]  schedule+0x5b/0xd0
[ 2093.462617]  jbd2_journal_free_reserved+0xca/0x9d0 [jbd2]
[ 2093.468171]  ? wait_woken+0x70/0x70
[ 2093.471819]  jbd2_journal_free_reserved+0x2d9/0x9d0 [jbd2]
[ 2093.477469]  ? select_task_rq_fair+0x130/0xf90
[ 2093.482073]  jbd2_journal_free_reserved+0x5ab/0x9d0 [jbd2]
[ 2093.487726]  ? ttwu_queue_wakelist+0xf2/0x110
[ 2093.492247]  ? __cond_resched+0x16/0x50
[ 2093.496244]  jbd2__journal_start+0xfb/0x1e0 [jbd2]
[ 2093.501200]  __ext4_journal_start_sb+0xf8/0x110 [ext4]
[ 2093.506504]  ext4_dirty_inode+0x35/0x80 [ext4]
[ 2093.511114]  __mark_inode_dirty+0x147/0x320
[ 2093.515458]  generic_update_time+0x6c/0xd0
[ 2093.519714]  file_update_time+0x127/0x140
[ 2093.523879]  ? generic_write_checks+0x61/0xc0
[ 2093.528403]  ext4_llseek+0x3fa/0x1cb0 [ext4]
[ 2093.532832]  new_sync_write+0x11c/0x1b0
[ 2093.536836]  ? intel_get_event_constraints+0x300/0x390
[ 2093.542140]  vfs_write+0x1de/0x270
[ 2093.545702]  ksys_write+0x5f/0xe0
[ 2093.549176]  do_syscall_64+0x3b/0x90
[ 2093.552913]  entry_SYSCALL_64_after_hwframe+0x61/0xcb
[ 2093.558133] RIP: 0033:0x55d3b77a2edb
[ 2093.561864] RSP: 002b:000000c00245f8d0 EFLAGS: 00000202 ORIG_RAX: 0000000000000001
[ 2093.569624] RAX: ffffffffffffffda RBX: 000000c000054000 RCX: 000055d3b77a2edb
[ 2093.576918] RDX: 0000000000001158 RSI: 000000c0013a9900 RDI: 00000000000002e3
[ 2093.584221] RBP: 000000c00245f920 R08: 0000000000000201 R09: 000000c002d243c0
[ 2093.591506] R10: 0000000000001158 R11: 0000000000000202 R12: 000000c00245f9b0
[ 2093.598793] R13: 0000000000000000 R14: 000000c004509860 R15: 0000000000000033
[ 2093.606086]  </TASK>

cc @damoon

I assume that this issue was resolved in Stable 3374.2.3. Please create new issues for other bugs. Thanks.

Any update on the upstream merging? Is this progressing? I’m not familiar with kernel development, and so far it seems to me nobody has responded to the patch from the ext4 developer, and nothing seems to be happening.

Unfortunately there hasn’t been any response from the ext4 maintainer, I’ve poked them this week.

In the meantime we’ll likely take the backport into the next alpha/beta Flatcar release next week.

@Champ-Goblem thanks for all the help with testing and reproducing; it’s good to have a data point confirming that no new issues pop up with the patch.

I have rebuilt Flatcar with the above PR; below are the logs from the system during the failure.

[ 1599.005306] INFO: task jbd2/sda9-8:702 blocked for more than 122 seconds.
[ 1599.012290]       Not tainted 5.15.63-flatcar #1
[ 1599.017128] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1599.025100] task:jbd2/sda9-8     state:D stack:    0 pid:  702 ppid:     2 flags:0x00004000
[ 1599.033579] Call Trace:
[ 1599.036144]  <TASK>
[ 1599.038354]  __schedule+0x2eb/0x8d0
[ 1599.042109]  schedule+0x5b/0xd0
[ 1599.045372]  jbd2_journal_commit_transaction+0x301/0x18e0 [jbd2]
[ 1599.051518]  ? wait_woken+0x70/0x70
[ 1599.055127]  ? lock_timer_base+0x61/0x80
[ 1599.059181]  kjournald2+0xab/0x270 [jbd2]
[ 1599.063317]  ? wait_woken+0x70/0x70
[ 1599.066923]  ? load_superblock.part.0+0xb0/0xb0 [jbd2]
[ 1599.072200]  kthread+0x124/0x150
[ 1599.075543]  ? set_kthread_struct+0x50/0x50
[ 1599.079849]  ret_from_fork+0x1f/0x30
[ 1599.083538]  </TASK>
[ 1599.085835] INFO: task kworker/u32:13:732 blocked for more than 122 seconds.
[ 1599.093010]       Not tainted 5.15.63-flatcar #1
[ 1599.097739] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1599.105688] task:kworker/u32:13  state:D stack:    0 pid:  732 ppid:     2 flags:0x00004000
[ 1599.114160] Workqueue: writeback wb_workfn (flush-8:0)
[ 1599.119418] Call Trace:
[ 1599.121976]  <TASK>
[ 1599.124192]  __schedule+0x2eb/0x8d0
[ 1599.127797]  schedule+0x5b/0xd0
[ 1599.131051]  wait_transaction_locked+0x8a/0xd0 [jbd2]
[ 1599.136227]  ? wait_woken+0x70/0x70
[ 1599.139829]  add_transaction_credits+0xd9/0x2b0 [jbd2]
[ 1599.145091]  ? find_get_pages_range+0x197/0x200
[ 1599.149832]  start_this_handle+0xfb/0x520 [jbd2]
[ 1599.154591]  ? mpage_release_unused_pages+0x1c7/0x1e0 [ext4]
[ 1599.160524]  ? __cond_resched+0x16/0x50
[ 1599.164478]  jbd2__journal_start+0xfb/0x1e0 [jbd2]
[ 1599.169393]  __ext4_journal_start_sb+0xf8/0x110 [ext4]
[ 1599.174669]  ext4_writepages+0x302/0xfd0 [ext4]
[ 1599.179328]  ? __find_get_block+0xb3/0x2c0
[ 1599.183539]  ? __cond_resched+0x16/0x50
[ 1599.187518]  ? __getblk_gfp+0x27/0x60
[ 1599.191307]  ? cpumask_next_and+0x1f/0x30
[ 1599.195433]  ? update_sd_lb_stats.constprop.0+0xff/0x8a0
[ 1599.200888]  do_writepages+0xce/0x200
[ 1599.204788]  ? _raw_spin_unlock_irqrestore+0xa/0x30
[ 1599.209796]  ? percpu_counter_add_batch+0x5b/0x70
[ 1599.214627]  ? fprop_reflect_period_percpu.isra.0+0x7b/0xc0
[ 1599.220439]  __writeback_single_inode+0x39/0x290
[ 1599.225183]  writeback_sb_inodes+0x20d/0x490
[ 1599.229569]  __writeback_inodes_wb+0x4c/0xe0
[ 1599.233951]  wb_writeback+0x1c0/0x280
[ 1599.237727]  wb_workfn+0x29f/0x4d0
[ 1599.241244]  process_one_work+0x223/0x3c0
[ 1599.245371]  worker_thread+0x50/0x410
[ 1599.249146]  ? process_one_work+0x3c0/0x3c0
[ 1599.253460]  kthread+0x124/0x150
[ 1599.256811]  ? set_kthread_struct+0x50/0x50
[ 1599.261110]  ret_from_fork+0x1f/0x30
[ 1599.264820]  </TASK>
[ 1599.267171] INFO: task systemd-journal:1098 blocked for more than 123 seconds.
[ 1599.274538]       Not tainted 5.15.63-flatcar #1
[ 1599.279282] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1599.287256] task:systemd-journal state:D stack:    0 pid: 1098 ppid:     1 flags:0x00000224
[ 1599.295753] Call Trace:
[ 1599.298343]  <TASK>
[ 1599.300570]  __schedule+0x2eb/0x8d0
[ 1599.304178]  schedule+0x5b/0xd0
[ 1599.307434]  wait_transaction_locked+0x8a/0xd0 [jbd2]
[ 1599.312814]  ? wait_woken+0x70/0x70
[ 1599.316431]  add_transaction_credits+0xd9/0x2b0 [jbd2]
[ 1599.321722]  start_this_handle+0xfb/0x520 [jbd2]
[ 1599.326460]  ? __cond_resched+0x16/0x50
[ 1599.330414]  jbd2__journal_start+0xfb/0x1e0 [jbd2]
[ 1599.335331]  __ext4_journal_start_sb+0xf8/0x110 [ext4]
[ 1599.340615]  ext4_truncate+0x167/0x480 [ext4]
[ 1599.345134]  ext4_setattr+0x59a/0x9a0 [ext4]
[ 1599.349567]  ? virtnet_poll+0x31b/0x45b [virtio_net]
[ 1599.354663]  ? common_interrupt+0xf/0xa0
[ 1599.358705]  notify_change+0x3c1/0x540
[ 1599.362591]  ? do_truncate+0x7d/0xd0
[ 1599.366292]  do_truncate+0x7d/0xd0
[ 1599.369836]  do_sys_ftruncate+0xc9/0x150
[ 1599.373882]  do_syscall_64+0x38/0x90
[ 1599.377581]  entry_SYSCALL_64_after_hwframe+0x61/0xcb
[ 1599.382757] RIP: 0033:0x7fc405986757
[ 1599.386449] RSP: 002b:00007ffdf6776af8 EFLAGS: 00000202 ORIG_RAX: 000000000000004d
[ 1599.394126] RAX: ffffffffffffffda RBX: 00007ffdf6776b40 RCX: 00007fc405986757
[ 1599.401371] RDX: 0000557b78aa8f40 RSI: 0000000001000000 RDI: 0000000000000015
[ 1599.408615] RBP: 0000557b78aac520 R08: 0000000000000001 R09: 0000557b78aac5ac
[ 1599.415860] R10: 0000000000000000 R11: 0000000000000202 R12: 0000557b78a9c600
[ 1599.423119] R13: 00007ffdf6776b38 R14: 0000000000000003 R15: 0000000000000000
[ 1599.430406]  </TASK>
[ 1599.432754] INFO: task systemd-timesyn:1277 blocked for more than 123 seconds.
[ 1599.440097]       Not tainted 5.15.63-flatcar #1
[ 1599.444842] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1599.452799] task:systemd-timesyn state:D stack:    0 pid: 1277 ppid:     1 flags:0x00000224
[ 1599.461268] Call Trace:
[ 1599.463836]  <TASK>
[ 1599.466046]  __schedule+0x2eb/0x8d0
[ 1599.469684]  schedule+0x5b/0xd0
[ 1599.472946]  wait_transaction_locked+0x8a/0xd0 [jbd2]
[ 1599.478133]  ? wait_woken+0x70/0x70
[ 1599.481839]  add_transaction_credits+0xd9/0x2b0 [jbd2]
[ 1599.487387]  ? __fget_files+0x79/0xb0
[ 1599.491185]  start_this_handle+0xfb/0x520 [jbd2]
[ 1599.495963]  ? nd_jump_link+0x4d/0xc0
[ 1599.499752]  ? __cond_resched+0x16/0x50
[ 1599.503708]  jbd2__journal_start+0xfb/0x1e0 [jbd2]
[ 1599.508620]  __ext4_journal_start_sb+0xf8/0x110 [ext4]
[ 1599.513897]  ext4_dirty_inode+0x35/0x80 [ext4]
[ 1599.518477]  __mark_inode_dirty+0x144/0x320
[ 1599.522795]  ext4_setattr+0x1fb/0x9a0 [ext4]
[ 1599.527200]  notify_change+0x3c1/0x540
[ 1599.531071]  ? vfs_utimes+0x139/0x220
[ 1599.534845]  vfs_utimes+0x139/0x220
[ 1599.538446]  do_utimes+0xb4/0x120
[ 1599.541874]  __x64_sys_utimensat+0x70/0xb0
[ 1599.546132]  ? syscall_trace_enter.constprop.0+0x143/0x1c0
[ 1599.551751]  do_syscall_64+0x38/0x90
[ 1599.555443]  entry_SYSCALL_64_after_hwframe+0x61/0xcb
[ 1599.560628] RIP: 0033:0x7fce21af901f
[ 1599.564858] RSP: 002b:00007ffedd36c8c8 EFLAGS: 00000202 ORIG_RAX: 0000000000000118
[ 1599.572559] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fce21af901f
[ 1599.579806] RDX: 0000000000000000 RSI: 00007ffedd36c8d0 RDI: 00000000ffffff9c
[ 1599.587062] RBP: 00007ffedd36c8d0 R08: 0000000000000000 R09: 00007ffedd36c760
[ 1599.594313] R10: 0000000000000000 R11: 0000000000000202 R12: 0000000000000000
[ 1599.601558] R13: 00000000ffffffff R14: ffffffffffffffff R15: 00000000ffffffff
[ 1599.608806]  </TASK>
[ 1599.611124] INFO: task bash:1925 blocked for more than 123 seconds.
[ 1599.617618]       Not tainted 5.15.63-flatcar #1
[ 1599.622347] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1599.630283] task:bash            state:D stack:    0 pid: 1925 ppid:  1924 flags:0x00000004
[ 1599.638750] Call Trace:
[ 1599.641309]  <TASK>
[ 1599.643528]  __schedule+0x2eb/0x8d0
[ 1599.647140]  schedule+0x5b/0xd0
[ 1599.650399]  wait_transaction_locked+0x8a/0xd0 [jbd2]
[ 1599.655571]  ? wait_woken+0x70/0x70
[ 1599.659188]  add_transaction_credits+0xd9/0x2b0 [jbd2]
[ 1599.664444]  ? pagecache_get_page+0x28b/0x470
[ 1599.668914]  start_this_handle+0xfb/0x520 [jbd2]
[ 1599.673664]  ? __cond_resched+0x16/0x50
[ 1599.677615]  jbd2__journal_start+0xfb/0x1e0 [jbd2]
[ 1599.682525]  __ext4_journal_start_sb+0xf8/0x110 [ext4]
[ 1599.687801]  __ext4_new_inode+0x73f/0x1710 [ext4]
[ 1599.692639]  ext4_create+0x115/0x1d0 [ext4]
[ 1599.696968]  path_openat+0xf48/0x1280
[ 1599.700751]  ? _raw_spin_unlock_irqrestore+0xa/0x30
[ 1599.705742]  ? __wake_up_common_lock+0x8a/0xc0
[ 1599.710299]  do_filp_open+0xa9/0x150
[ 1599.713990]  ? vfs_statx+0x74/0x130
[ 1599.717615]  ? __check_object_size+0x146/0x160
[ 1599.722179]  do_sys_openat2+0x9b/0x160
[ 1599.726057]  __x64_sys_openat+0x54/0xa0
[ 1599.730003]  do_syscall_64+0x38/0x90
[ 1599.733691]  entry_SYSCALL_64_after_hwframe+0x61/0xcb
[ 1599.738870] RIP: 0033:0x7f138fdb5337
[ 1599.742558] RSP: 002b:00007ffdc49dcaa0 EFLAGS: 00000246 ORIG_RAX: 0000000000000101
[ 1599.750249] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f138fdb5337
[ 1599.757497] RDX: 0000000000000241 RSI: 00005583f2829d70 RDI: 00000000ffffff9c
[ 1599.764747] RBP: 00005583f2829d70 R08: 0000000000000000 R09: 0000000000000000
[ 1599.771998] R10: 0000000000000180 R11: 0000000000000246 R12: 0000000000000241
[ 1599.779258] R13: 0000000000000000 R14: 00005583f1f45534 R15: 0000000000000000
[ 1599.786510]  </TASK>
[ 1599.788906] INFO: task MVStore backgro:8970 blocked for more than 123 seconds.
[ 1599.796262]       Not tainted 5.15.63-flatcar #1
[ 1599.800992] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1599.808931] task:MVStore backgro state:D stack:    0 pid: 8970 ppid:  8062 flags:0x00000000
[ 1599.817398] Call Trace:
[ 1599.819968]  <TASK>
[ 1599.822205]  __schedule+0x2eb/0x8d0
[ 1599.825816]  schedule+0x5b/0xd0
[ 1599.829073]  wait_transaction_locked+0x8a/0xd0 [jbd2]
[ 1599.834250]  ? wait_woken+0x70/0x70
[ 1599.837852]  add_transaction_credits+0xd9/0x2b0 [jbd2]
[ 1599.843108]  start_this_handle+0xfb/0x520 [jbd2]
[ 1599.847859]  ? __cond_resched+0x16/0x50
[ 1599.851810]  jbd2__journal_start+0xfb/0x1e0 [jbd2]
[ 1599.856715]  __ext4_journal_start_sb+0xf8/0x110 [ext4]
[ 1599.861992]  ext4_dirty_inode+0x35/0x80 [ext4]
[ 1599.866585]  __mark_inode_dirty+0x144/0x320
[ 1599.870892]  generic_update_time+0x6c/0xd0
[ 1599.875118]  file_update_time+0x127/0x140
[ 1599.879242]  ? generic_write_checks+0x61/0xc0
[ 1599.883721]  ext4_buffered_write_iter+0x5a/0x180 [ext4]
[ 1599.889103]  do_iter_readv_writev+0x14f/0x1b0
[ 1599.893582]  do_iter_write+0x80/0x1c0
[ 1599.897370]  ovl_write_iter+0x2d3/0x450 [overlay]
[ 1599.902195]  new_sync_write+0x119/0x1b0
[ 1599.906151]  ? intel_get_event_constraints+0x300/0x390
[ 1599.911403]  vfs_write+0x1de/0x270
[ 1599.914917]  __x64_sys_pwrite64+0x91/0xc0
[ 1599.919039]  do_syscall_64+0x38/0x90
[ 1599.922727]  entry_SYSCALL_64_after_hwframe+0x61/0xcb
[ 1599.927968] RIP: 0033:0x7f2f532424a3
[ 1599.931659] RSP: 002b:00007f2f2cc8aa78 EFLAGS: 00000246 ORIG_RAX: 0000000000000012
[ 1599.939352] RAX: ffffffffffffffda RBX: 00007f2f2cc8bb38 RCX: 00007f2f532424a3
[ 1599.946606] RDX: 0000000000001000 RSI: 00007f2f2c769d90 RDI: 0000000000000014
[ 1599.953875] RBP: 00007f2f2cc8aaf0 R08: 0000000000000000 R09: 0000000000000000
[ 1599.961127] R10: 00000000000a3000 R11: 0000000000000246 R12: 0000000000000012
[ 1599.968527] R13: 00007f2f2c769d90 R14: 00000000000a3000 R15: 00007f2f2f94b800
[ 1599.975775]  </TASK>
[ 1599.978135] INFO: task k3s-server:12051 blocked for more than 123 seconds.
[ 1599.985142]       Not tainted 5.15.63-flatcar #1
[ 1599.989873] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1599.997816] task:k3s-server      state:D stack:    0 pid:12051 ppid:     1 flags:0x00000000
[ 1600.006283] Call Trace:
[ 1600.008862]  <TASK>
[ 1600.011083]  __schedule+0x2eb/0x8d0
[ 1600.014779]  schedule+0x5b/0xd0
[ 1600.018034]  rwsem_down_write_slowpath+0x220/0x4f0
[ 1600.022947]  chown_common+0x152/0x250
[ 1600.026740]  ? __do_sys_newfstat+0x57/0x60
[ 1600.030950]  ? __fget_files+0x79/0xb0
[ 1600.034749]  ksys_fchown+0x74/0xb0
[ 1600.038283]  __x64_sys_fchown+0x16/0x20
[ 1600.042230]  do_syscall_64+0x38/0x90
[ 1600.045922]  entry_SYSCALL_64_after_hwframe+0x61/0xcb
[ 1600.051091] RIP: 0033:0x3f8045f
[ 1600.054346] RSP: 002b:00007f749fe0a670 EFLAGS: 00000202 ORIG_RAX: 000000000000005d
[ 1600.062050] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 0000000003f8045f
[ 1600.069297] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000188
[ 1600.076557] RBP: 00007f749dfd2498 R08: 0000000000000188 R09: 000000000572a6fb
[ 1600.083808] R10: 0000000000000000 R11: 0000000000000202 R12: 0000000000000188
[ 1600.091051] R13: 00007f749f5416cd R14: 0000000000080006 R15: 00000000000001a4
[ 1600.098312]  </TASK>
[ 1600.100611] INFO: task k3s-server:12052 blocked for more than 123 seconds.
[ 1600.107601]       Not tainted 5.15.63-flatcar #1
[ 1600.112344] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1600.120281] task:k3s-server      state:D stack:    0 pid:12052 ppid:     1 flags:0x00000000
[ 1600.128758] Call Trace:
[ 1600.131315]  <TASK>
[ 1600.133536]  __schedule+0x2eb/0x8d0
[ 1600.137136]  schedule+0x5b/0xd0
[ 1600.140408]  wait_transaction_locked+0x8a/0xd0 [jbd2]
[ 1600.145574]  ? wait_woken+0x70/0x70
[ 1600.149190]  add_transaction_credits+0xd9/0x2b0 [jbd2]
[ 1600.154442]  ? __cond_resched+0x16/0x50
[ 1600.158395]  ? dput+0x32/0x310
[ 1600.161559]  start_this_handle+0xfb/0x520 [jbd2]
[ 1600.166290]  ? asm_sysvec_apic_timer_interrupt+0x15/0x20
[ 1600.171713]  ? __cond_resched+0x16/0x50
[ 1600.175660]  jbd2__journal_start+0xfb/0x1e0 [jbd2]
[ 1600.180568]  __ext4_journal_start_sb+0xf8/0x110 [ext4]
[ 1600.185846]  ext4_dirty_inode+0x35/0x80 [ext4]
[ 1600.190441]  __mark_inode_dirty+0x144/0x320
[ 1600.194737]  ext4_setattr+0x1fb/0x9a0 [ext4]
[ 1600.199134]  notify_change+0x3c1/0x540
[ 1600.203002]  ? chown_common+0x168/0x250
[ 1600.206957]  chown_common+0x168/0x250
[ 1600.210731]  ? __fget_files+0x79/0xb0
[ 1600.214502]  ksys_fchown+0x74/0xb0
[ 1600.218017]  __x64_sys_fchown+0x16/0x20
[ 1600.221965]  do_syscall_64+0x38/0x90
[ 1600.225674]  entry_SYSCALL_64_after_hwframe+0x61/0xcb
[ 1600.230851] RIP: 0033:0x3f8045f
[ 1600.234114] RSP: 002b:00007f749fde7670 EFLAGS: 00000202 ORIG_RAX: 000000000000005d
[ 1600.241794] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 0000000003f8045f
[ 1600.249050] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000145
[ 1600.256388] RBP: 00007f749e2213c8 R08: 0000000000000145 R09: 000000000572a6fb
[ 1600.263636] R10: 0000000000000000 R11: 0000000000000202 R12: 0000000000000145
[ 1600.270890] R13: 00007f749dfbef0d R14: 0000000000080006 R15: 00000000000001a4
[ 1600.278138]  </TASK>
[ 1600.280436] INFO: task k3s-server:12055 blocked for more than 124 seconds.
[ 1600.287449]       Not tainted 5.15.63-flatcar #1
[ 1600.292184] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1600.300127] task:k3s-server      state:D stack:    0 pid:12055 ppid:     1 flags:0x00000000
[ 1600.308590] Call Trace:
[ 1600.311151]  <TASK>
[ 1600.313358]  __schedule+0x2eb/0x8d0
[ 1600.316957]  schedule+0x5b/0xd0
[ 1600.320208]  rwsem_down_write_slowpath+0x220/0x4f0
[ 1600.325113]  chown_common+0x152/0x250
[ 1600.328890]  ? __do_sys_newfstat+0x57/0x60
[ 1600.333107]  ? __fget_files+0x79/0xb0
[ 1600.336885]  ksys_fchown+0x74/0xb0
[ 1600.340400]  __x64_sys_fchown+0x16/0x20
[ 1600.344346]  do_syscall_64+0x38/0x90
[ 1600.348035]  entry_SYSCALL_64_after_hwframe+0x61/0xcb
[ 1600.353218] RIP: 0033:0x3f8045f
[ 1600.356471] RSP: 002b:00007f749fcae670 EFLAGS: 00000202 ORIG_RAX: 000000000000005d
[ 1600.364149] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 0000000003f8045f
[ 1600.371397] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 000000000000019c
[ 1600.378648] RBP: 00007f749f539888 R08: 000000000000019c R09: 000000000572a6fb
[ 1600.385895] R10: 0000000000000000 R11: 0000000000000202 R12: 000000000000019c
[ 1600.393155] R13: 00007f749e2e5a3d R14: 0000000000080006 R15: 00000000000001a4
[ 1600.400407]  </TASK>
[ 1600.402761] INFO: task k3s-server:12057 blocked for more than 124 seconds.
[ 1600.409779]       Not tainted 5.15.63-flatcar #1
[ 1600.414512] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1600.422470] task:k3s-server      state:D stack:    0 pid:12057 ppid:     1 flags:0x00000000
[ 1600.430934] Call Trace:
[ 1600.433510]  <TASK>
[ 1600.435722]  __schedule+0x2eb/0x8d0
[ 1600.439328]  schedule+0x5b/0xd0
[ 1600.442583]  rwsem_down_write_slowpath+0x220/0x4f0
[ 1600.447495]  chown_common+0x152/0x250
[ 1600.451274]  ? __do_sys_newfstat+0x57/0x60
[ 1600.455483]  ? __fget_files+0x79/0xb0
[ 1600.459258]  ksys_fchown+0x74/0xb0
[ 1600.462773]  __x64_sys_fchown+0x16/0x20
[ 1600.466724]  do_syscall_64+0x38/0x90
[ 1600.470416]  entry_SYSCALL_64_after_hwframe+0x61/0xcb
[ 1600.475604] RIP: 0033:0x3f8045f
[ 1600.478857] RSP: 002b:00007f749fc28670 EFLAGS: 00000202 ORIG_RAX: 000000000000005d
[ 1600.486539] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 0000000003f8045f
[ 1600.493788] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000196
[ 1600.501050] RBP: 00007f749f64dac8 R08: 0000000000000196 R09: 000000000572a6fb
[ 1600.508318] R10: 0000000000000000 R11: 0000000000000202 R12: 0000000000000196
[ 1600.515569] R13: 00007f749eadef0d R14: 0000000000080006 R15: 00000000000001a4
[ 1600.522817]  </TASK>
[ 1629.084561] systemd[1]: systemd-journald.service: Processes still around after SIGKILL. Ignoring.

Awesome, thank you very much for testing! We’re discussing this issue with Jan Kara, an ext4 maintainer, upstream (https://www.spinics.net/lists/linux-ext4/msg85417.html ff.; messages from today will become available in the archive tomorrow). Unfortunately, there seems to be an issue with the stack traces we provided (from this very issue); we’re currently looking into possible causes and will follow up with the maintainer.

I have already started a build on our infra for the patched image, so that should be fine; it will probably take an hour and a bit to build. I’ll look at getting it rolled out and will start testing by the afternoon, so hopefully I’ll have some results either later today or some time tomorrow. I’ll be off from Friday till Monday, so it may also be worth you doing some testing on your end in case I’m unable to get any results by then.

I was not able to reproduce it with the reproducer in my own testing, but an infrastructure VM that we have on the stable channel has also hit this (14 days of uptime, no particular I/O load):

log:
[1282119.153912] INFO: task jbd2/sda9-8:544 blocked for more than 122 seconds.
[1282119.157088]       Not tainted 5.15.63-flatcar #1
[1282119.159281] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[1282119.162723] task:jbd2/sda9-8     state:D stack:    0 pid:  544 ppid:     2 flags:0x00004000
[1282119.166441] Call Trace:
[1282119.167640]  <TASK>
[1282119.168675]  __schedule+0x2eb/0x8d0
[1282119.170341]  schedule+0x5b/0xd0
[1282119.171806]  jbd2_journal_commit_transaction+0x301/0x2850 [jbd2]
[1282119.175448]  ? wait_woken+0x70/0x70
[1282119.177174]  ? lock_timer_base+0x61/0x80
[1282119.179015]  jbd2_journal_check_available_features+0x1ab/0x3f0 [jbd2]
[1282119.181922]  ? wait_woken+0x70/0x70
[1282119.183533]  ? jbd2_journal_check_available_features+0x100/0x3f0 [jbd2]
[1282119.186566]  kthread+0x127/0x150
[1282119.188087]  ? set_kthread_struct+0x50/0x50
[1282119.190346]  ret_from_fork+0x22/0x30
[1282119.192027]  </TASK>
[1282119.193081] INFO: task systemd-journal:748 blocked for more than 122 seconds.
[1282119.196255]       Not tainted 5.15.63-flatcar #1
[1282119.198321] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[1282119.201776] task:systemd-journal state:D stack:    0 pid:  748 ppid:     1 flags:0x00000004
[1282119.205604] Call Trace:
[1282119.206773]  <TASK>
[1282119.207794]  __schedule+0x2eb/0x8d0
[1282119.209410]  schedule+0x5b/0xd0
[1282119.210887]  jbd2_journal_free_reserved+0xca/0x9d0 [jbd2]
[1282119.213342]  ? wait_woken+0x70/0x70
[1282119.214946]  jbd2_journal_free_reserved+0x2d9/0x9d0 [jbd2]
[1282119.217482]  ? call_rcu+0xa2/0x330
[1282119.219070]  jbd2_journal_free_reserved+0x5ab/0x9d0 [jbd2]
[1282119.221672]  ? step_into+0x47c/0x7b0
[1282119.223372]  ? __cond_resched+0x16/0x50
[1282119.225175]  jbd2__journal_start+0xfb/0x1e0 [jbd2]
[1282119.227342]  __ext4_journal_start_sb+0xf8/0x110 [ext4]
[1282119.229688]  ext4_dirty_inode+0x35/0x80 [ext4]
[1282119.231735]  __mark_inode_dirty+0x147/0x320
[1282119.233712]  touch_atime+0x13c/0x150
[1282119.235429]  filemap_read+0x308/0x320
[1282119.238370]  ? may_delete+0x2a0/0x2f0
[1282119.240286]  ? do_filp_open+0xa9/0x150
[1282119.242071]  new_sync_read+0x119/0x1b0
[1282119.244052]  ? 0xffffffffad000000
[1282119.245736]  vfs_read+0xf6/0x190
[1282119.247362]  __x64_sys_pread64+0x91/0xc0
[1282119.249365]  do_syscall_64+0x3b/0x90
[1282119.251173]  entry_SYSCALL_64_after_hwframe+0x61/0xcb
[1282119.253772] RIP: 0033:0x7f59ebb32b17
[1282119.255634] RSP: 002b:00007ffefe2b6520 EFLAGS: 00000293 ORIG_RAX: 0000000000000011
[1282119.259426] RAX: ffffffffffffffda RBX: 00007ffefe2b65d0 RCX: 00007f59ebb32b17
[1282119.262876] RDX: 0000000000000040 RSI: 00007ffefe2b6550 RDI: 0000000000000020
[1282119.266269] RBP: 000000000210fdb0 R08: 0000000000000000 R09: 00007ffefe2b6760
[1282119.270866] R10: 000000000210fdb0 R11: 0000000000000293 R12: 000055705e025030
[1282119.274357] R13: 0000000000000000 R14: 00007ffefe2b6550 R15: 000055705e025030
[1282119.277808]  </TASK>
[1282119.279102] INFO: task python:1117 blocked for more than 123 seconds.
[1282119.282317]       Not tainted 5.15.63-flatcar #1
[1282119.284531] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[1282119.288819] task:python          state:D stack:    0 pid: 1117 ppid:  1010 flags:0x00000000
[1282119.292823] Call Trace:
[1282119.294282]  <TASK>
[1282119.295429]  __schedule+0x2eb/0x8d0
[1282119.297126]  schedule+0x5b/0xd0
[1282119.298734]  jbd2_journal_free_reserved+0xca/0x9d0 [jbd2]
[1282119.301713]  ? wait_woken+0x70/0x70
[1282119.303438]  jbd2_journal_free_reserved+0x2d9/0x9d0 [jbd2]
[1282119.306313]  ? hrtimer_try_to_cancel.part.0+0x50/0xd0
[1282119.308836]  jbd2_journal_free_reserved+0x5ab/0x9d0 [jbd2]
[1282119.311496]  ? mlx5e_select_queue+0x3c/0x2d0 [mlx5_core]
[1282119.314170]  ? __cond_resched+0x16/0x50
[1282119.315956]  jbd2__journal_start+0xfb/0x1e0 [jbd2]
[1282119.318816]  __ext4_journal_start_sb+0xf8/0x110 [ext4]
[1282119.321540]  ext4_setattr+0x3a8/0x9a0 [ext4]
[1282119.323842]  notify_change+0x3c4/0x540
[1282119.325750]  ? ext4_es_delayed_clu+0x170/0x430 [ext4]
[1282119.328375]  ? ext4_es_delayed_clu+0x1ef/0x430 [ext4]
[1282119.330894]  ? do_truncate+0x7d/0xd0
[1282119.332698]  do_truncate+0x7d/0xd0
[1282119.334571]  path_openat+0x24d/0x1280
[1282119.336461]  do_filp_open+0xa9/0x150
[1282119.338190]  ? __check_object_size+0x146/0x160
[1282119.340409]  do_sys_openat2+0x9b/0x160
[1282119.342289]  __x64_sys_openat+0x54/0xa0
[1282119.344180]  do_syscall_64+0x3b/0x90
[1282119.346050]  entry_SYSCALL_64_after_hwframe+0x61/0xcb
[1282119.348496] RIP: 0033:0x7f47adf699e4
[1282119.352234] RSP: 002b:00007ffd95f911e0 EFLAGS: 00000293 ORIG_RAX: 0000000000000101
[1282119.357348] RAX: ffffffffffffffda RBX: 00007f47ac34a558 RCX: 00007f47adf699e4
[1282119.360865] RDX: 0000000000080241 RSI: 00007f47acb90b48 RDI: 00000000ffffff9c
[1282119.364381] RBP: 00007f47acb90b48 R08: 0000000000000000 R09: 0000561f71986760
[1282119.369496] R10: 00000000000001b6 R11: 0000000000000293 R12: 0000000000080241
[1282119.374035] R13: 00007f47acce1a80 R14: 00007f47adc43248 R15: 0000000000000000
[1282119.377852]  </TASK>
[1282119.379101] INFO: task VM Periodic Tas:11249 blocked for more than 123 seconds.
[1282119.383214]       Not tainted 5.15.63-flatcar #1
[1282119.385459] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[1282119.389306] task:VM Periodic Tas state:D stack:    0 pid:11249 ppid:  8007 flags:0x00000000
[1282119.393418] Call Trace:
[1282119.394722]  <TASK>
[1282119.395852]  __schedule+0x2eb/0x8d0
[1282119.397882]  schedule+0x5b/0xd0
[1282119.399608]  jbd2_journal_free_reserved+0xca/0x9d0 [jbd2]
[1282119.402324]  ? wait_woken+0x70/0x70
[1282119.404155]  jbd2_journal_free_reserved+0x2d9/0x9d0 [jbd2]
[1282119.406854]  ? update_load_avg+0x7a/0x5e0
[1282119.408822]  jbd2_journal_free_reserved+0x5ab/0x9d0 [jbd2]
[1282119.411679]  ? psi_task_switch+0x1e0/0x200
[1282119.414387]  ? __cond_resched+0x16/0x50
[1282119.416760]  jbd2__journal_start+0xfb/0x1e0 [jbd2]
[1282119.419236]  __ext4_journal_start_sb+0xf8/0x110 [ext4]
[1282119.421878]  ext4_dirty_inode+0x35/0x80 [ext4]
[1282119.424391]  __mark_inode_dirty+0x147/0x320
[1282119.426430]  generic_update_time+0x6c/0xd0
[1282119.428570]  file_update_time+0x127/0x140
[1282119.430744]  ext4_page_mkwrite+0x86/0xc50 [ext4]
[1282119.433058]  do_page_mkwrite+0x55/0xc0
[1282119.434861]  do_wp_page+0x20a/0x300
[1282119.436643]  ? task_tick_fair+0x7c/0x370
[1282119.438526]  __handle_mm_fault+0xb7a/0x1470
[1282119.440623]  handle_mm_fault+0xcf/0x2b0
[1282119.442491]  do_user_addr_fault+0x1be/0x670
[1282119.444588]  exc_page_fault+0x68/0x140
[1282119.446694]  asm_exc_page_fault+0x21/0x30
[1282119.448720] RIP: 0033:0x7f08a7c0b29e
[1282119.450453] RSP: 002b:00007f0879ffed30 EFLAGS: 00010206
[1282119.453058] RAX: 00048c1340635bb2 RBX: 00007f08a66851b8 RCX: 0000000000000018
[1282119.456630] RDX: 0000000000000000 RSI: 0000000000138faa RDI: 0000000000000001
[1282119.460183] RBP: 00007f0879ffed40 R08: 0000000000000000 R09: 0000000000000032
[1282119.467273] R10: 00007ffe86f54080 R11: 00007ffe86f54090 R12: 0000000000000008
[1282119.470823] R13: 00007f08a0303b40 R14: 0000000000000002 R15: 0000000000000001
[1282119.474328]  </TASK>
[1282119.475620] INFO: task jenkins.util.Ti:11283 blocked for more than 123 seconds.
[1282119.479241]       Not tainted 5.15.63-flatcar #1
[1282119.481472] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[1282119.485306] task:jenkins.util.Ti state:D stack:    0 pid:11283 ppid:  8007 flags:0x00000000
[1282119.489061] Call Trace:
[1282119.490271]  <TASK>
[1282119.491341]  __schedule+0x2eb/0x8d0
[1282119.492969]  schedule+0x5b/0xd0
[1282119.494583]  jbd2_journal_free_reserved+0xca/0x9d0 [jbd2]
[1282119.497032]  ? wait_woken+0x70/0x70
[1282119.498657]  jbd2_journal_free_reserved+0x2d9/0x9d0 [jbd2]
[1282119.501246]  ? pagecache_get_page+0x28b/0x470
[1282119.503242]  jbd2_journal_free_reserved+0x5ab/0x9d0 [jbd2]
[1282119.505761]  ? __cond_resched+0x16/0x50
[1282119.507555]  jbd2__journal_start+0xfb/0x1e0 [jbd2]
[1282119.509870]  __ext4_journal_start_sb+0xf8/0x110 [ext4]
[1282119.512310]  __ext4_new_inode+0x73f/0x1710 [ext4]
[1282119.514499]  ext4_insert_dentry+0x1c75/0x1d30 [ext4]
[1282119.517080]  path_openat+0xf4b/0x1280
[1282119.518780]  do_filp_open+0xa9/0x150
[1282119.520453]  ? vfs_statx+0x74/0x130
[1282119.522096]  ? __check_object_size+0x146/0x160
[1282119.524139]  do_sys_openat2+0x9b/0x160
[1282119.525976]  __x64_sys_openat+0x54/0xa0
[1282119.527774]  do_syscall_64+0x3b/0x90
[1282119.529493]  entry_SYSCALL_64_after_hwframe+0x61/0xcb
[1282119.531871] RIP: 0033:0x7f08a89067e4
[1282119.533542] RSP: 002b:00007f08079fa410 EFLAGS: 00000293 ORIG_RAX: 0000000000000101
[1282119.536954] RAX: ffffffffffffffda RBX: 00007f08079fa588 RCX: 00007f08a89067e4
[1282119.540170] RDX: 00000000000000c2 RSI: 00007f084c02c500 RDI: 00000000ffffff9c
[1282119.543579] RBP: 00007f084c02c500 R08: 0000000000000000 R09: 00000007acb8f77e
[1282119.546883] R10: 00000000000001b6 R11: 0000000000000293 R12: 00000000000000c2
[1282119.550079] R13: 00000000000000c2 R14: 00007f084c02c500 R15: 00007f08240f4000
[1282119.553255]  </TASK>
[1282119.554327] INFO: task SCM polling for:13607 blocked for more than 123 seconds.
[1282119.557600]       Not tainted 5.15.63-flatcar #1
[1282119.559902] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[1282119.563425] task:SCM polling for state:D stack:    0 pid:13607 ppid:  8007 flags:0x00000000
[1282119.567152] Call Trace:
[1282119.568335]  <TASK>
[1282119.569377]  __schedule+0x2eb/0x8d0
[1282119.571001]  ? recalibrate_cpu_khz+0x10/0x10
[1282119.572959]  schedule+0x5b/0xd0
[1282119.575701]  jbd2_journal_free_reserved+0xca/0x9d0 [jbd2]
[1282119.578165]  ? wait_woken+0x70/0x70
[1282119.579791]  jbd2_journal_free_reserved+0x2d9/0x9d0 [jbd2]
[1282119.582297]  jbd2_journal_free_reserved+0x5ab/0x9d0 [jbd2]
[1282119.584770]  ? __update_load_avg_cfs_rq+0x28c/0x310
[1282119.587014]  ? __cond_resched+0x16/0x50
[1282119.588776]  jbd2__journal_start+0xfb/0x1e0 [jbd2]
[1282119.591102]  __ext4_journal_start_sb+0xf8/0x110 [ext4]
[1282119.593532]  ext4_setattr+0x3a8/0x9a0 [ext4]
[1282119.595540]  notify_change+0x3c4/0x540
[1282119.597319]  ? ext4_es_delayed_clu+0x170/0x430 [ext4]
[1282119.599641]  ? ext4_es_delayed_clu+0x1ef/0x430 [ext4]
[1282119.601964]  ? do_truncate+0x7d/0xd0
[1282119.603622]  do_truncate+0x7d/0xd0
[1282119.605236]  path_openat+0x24d/0x1280
[1282119.606999]  do_filp_open+0xa9/0x150
[1282119.608658]  ? __fput+0xff/0x250
[1282119.610171]  ? __check_object_size+0x146/0x160
[1282119.612351]  do_sys_openat2+0x9b/0x160
[1282119.614152]  __x64_sys_openat+0x54/0xa0
[1282119.615983]  do_syscall_64+0x3b/0x90
[1282119.617649]  entry_SYSCALL_64_after_hwframe+0x61/0xcb
[1282119.619993] RIP: 0033:0x7f08a89067e4
[1282119.621775] RSP: 002b:00007f07f7faa2e0 EFLAGS: 00000293 ORIG_RAX: 0000000000000101
[1282119.625933] RAX: ffffffffffffffda RBX: 000000071bca3a90 RCX: 00007f08a89067e4
[1282119.629183] RDX: 0000000000000241 RSI: 00007f0840094c40 RDI: 00000000ffffff9c
[1282119.632397] RBP: 00007f0840094c40 R08: 0000000000000000 R09: 0000000000000051
[1282119.635645] R10: 00000000000001b6 R11: 0000000000000293 R12: 0000000000000241
[1282119.638964] R13: 0000000000000241 R14: 00007f0840094c40 R15: 00007f081402cb48
[1282119.642228]  </TASK>
[1282119.643335] INFO: task Handling GET /l:830503 blocked for more than 123 seconds.
[1282119.646702]       Not tainted 5.15.63-flatcar #1
[1282119.648846] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[1282119.652383] task:Handling GET /l state:D stack:    0 pid:830503 ppid:  8007 flags:0x00000000
[1282119.656380] Call Trace:
[1282119.657559]  <TASK>
[1282119.658594]  __schedule+0x2eb/0x8d0
[1282119.660229]  ? trigger_load_balance+0x60/0x300
[1282119.662271]  schedule+0x5b/0xd0
[1282119.663751]  rwsem_down_read_slowpath+0x32c/0x380
[1282119.665950]  ? hw_breakpoint_exceptions_notify+0x30/0xf0
[1282119.668366]  ? ktime_get+0x38/0xa0
[1282119.670018]  do_user_addr_fault+0x42f/0x670
[1282119.672059]  ? read_hv_sched_clock_tsc+0x5/0x40
[1282119.674158]  exc_page_fault+0x68/0x140
[1282119.675939]  asm_exc_page_fault+0x21/0x30
[1282119.677791] RIP: 0033:0x7f08a77a4d3c
[1282119.679642] RSP: 002b:00007f07f521bff0 EFLAGS: 00010202
[1282119.682089] RAX: 00007f08a6683d28 RBX: 00007f08a00402d0 RCX: 000000009c700000
[1282119.685259] RDX: 000000007ce00000 RSI: 000000007ce00000 RDI: 00007f08a0195bb0
[1282119.691672] RBP: 00007f07f521bff0 R08: 00007f08a850987c R09: 00007f0835091f20
[1282119.694867] R10: ffffffffffffffff R11: 0000000000000000 R12: 00007f083c9b61d0
[1282119.698137] R13: 0000000000000000 R14: 00007f08a76e57b0 R15: 0000000000000800
[1282119.701408]  </TASK>
[1282119.702576] INFO: task org.jenkinsci.p:875804 blocked for more than 123 seconds.
[1282119.705988]       Not tainted 5.15.63-flatcar #1
[1282119.708152] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[1282119.711696] task:org.jenkinsci.p state:D stack:    0 pid:875804 ppid:  8007 flags:0x00000000
[1282119.715495] Call Trace:
[1282119.716676]  <TASK>
[1282119.717757]  __schedule+0x2eb/0x8d0
[1282119.719495]  ? update_load_avg+0x7a/0x5e0
[1282119.721348]  schedule+0x5b/0xd0
[1282119.722885]  jbd2_journal_free_reserved+0xca/0x9d0 [jbd2]
[1282119.725377]  ? wait_woken+0x70/0x70
[1282119.727048]  jbd2_journal_free_reserved+0x2d9/0x9d0 [jbd2]
[1282119.729522]  ? timerqueue_del+0x2a/0x50
[1282119.731304]  jbd2_journal_free_reserved+0x5ab/0x9d0 [jbd2]
[1282119.733797]  ? hrtimer_cancel+0x1d/0x40
[1282119.735798]  ? futex_wait+0x21a/0x240
[1282119.737584]  ? __cond_resched+0x16/0x50
[1282119.739363]  jbd2__journal_start+0xfb/0x1e0 [jbd2]
[1282119.741573]  __ext4_journal_start_sb+0xf8/0x110 [ext4]
[1282119.743920]  ext4_dirty_inode+0x35/0x80 [ext4]
[1282119.745987]  __mark_inode_dirty+0x147/0x320
[1282119.748024]  generic_update_time+0x6c/0xd0
[1282119.749950]  file_update_time+0x127/0x140
[1282119.751906]  ? generic_write_checks+0x61/0xc0
[1282119.753905]  ext4_llseek+0x3fa/0x1cb0 [ext4]
[1282119.755906]  new_sync_write+0x11c/0x1b0
[1282119.757693]  ? intel_get_event_constraints+0x300/0x390
[1282119.760046]  vfs_write+0x1de/0x270
[1282119.761659]  ksys_write+0x5f/0xe0
[1282119.763218]  do_syscall_64+0x3b/0x90
[1282119.764911]  entry_SYSCALL_64_after_hwframe+0x61/0xcb
[1282119.767352] RIP: 0033:0x7f08a8905fef
[1282119.769060] RSP: 002b:00007f07f1ade4c0 EFLAGS: 00000293 ORIG_RAX: 0000000000000001
[1282119.772473] RAX: ffffffffffffffda RBX: 00007f0814407348 RCX: 00007f08a8905fef
[1282119.775715] RDX: 000000000000005e RSI: 00007f07f1ade540 RDI: 0000000000000338
[1282119.779100] RBP: 00007f07f1ade510 R08: 0000000000000000 R09: 000000076ae08648
[1282119.782445] R10: 0000000000178e34 R11: 0000000000000293 R12: 000000000000005e
[1282119.785816] R13: 00007f07f1ade540 R14: 0000000000000338 R15: 000000000000005e
[1282119.789069]  </TASK>
[1282119.790140] INFO: task kworker/u8:2:874634 blocked for more than 123 seconds.
[1282119.793381]       Not tainted 5.15.63-flatcar #1
[1282119.795512] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[1282119.800485] task:kworker/u8:2    state:D stack:    0 pid:874634 ppid:     2 flags:0x00004000
[1282119.804354] Workqueue: writeback wb_workfn (flush-8:0)
[1282119.806726] Call Trace:
[1282119.807936]  <TASK>
[1282119.809026]  __schedule+0x2eb/0x8d0
[1282119.810692]  schedule+0x5b/0xd0
[1282119.812181]  jbd2_journal_free_reserved+0xca/0x9d0 [jbd2]
[1282119.814641]  ? wait_woken+0x70/0x70
[1282119.816909]  jbd2_journal_free_reserved+0x2d9/0x9d0 [jbd2]
[1282119.819409]  ? find_get_pages_range+0x197/0x200
[1282119.821507]  jbd2_journal_free_reserved+0x5ab/0x9d0 [jbd2]
[1282119.823988]  ? ext4_convert_inline_data+0xb07/0x2020 [ext4]
[1282119.826555]  ? __cond_resched+0x16/0x50
[1282119.828352]  jbd2__journal_start+0xfb/0x1e0 [jbd2]
[1282119.830542]  __ext4_journal_start_sb+0xf8/0x110 [ext4]
[1282119.833049]  __ext4_mark_inode_dirty+0x502/0x1880 [ext4]
[1282119.835474]  ? __cond_resched+0x16/0x50
[1282119.837281]  ? __getblk_gfp+0x27/0x60
[1282119.838986]  ? __ext4_handle_dirty_metadata+0x56/0x19b0 [ext4]
[1282119.841629]  ? ext4_mark_iloc_dirty+0x56a/0xaf0 [ext4]
[1282119.843979]  do_writepages+0xd1/0x200
[1282119.845682]  __writeback_single_inode+0x39/0x290
[1282119.847884]  writeback_sb_inodes+0x20d/0x490
[1282119.849861]  __writeback_inodes_wb+0x4c/0xe0
[1282119.851844]  wb_writeback+0x1c0/0x280
[1282119.853561]  wb_workfn+0x29f/0x4d0
[1282119.855195]  ? psi_task_switch+0x1e0/0x200
[1282119.857128]  process_one_work+0x226/0x3c0
[1282119.859031]  worker_thread+0x50/0x410
[1282119.860747]  ? process_one_work+0x3c0/0x3c0
[1282119.862674]  kthread+0x127/0x150
[1282119.864307]  ? set_kthread_struct+0x50/0x50
[1282119.867863]  ret_from_fork+0x22/0x30
[1282119.869538]  </TASK>

We are still facing the issue with the current stable release 3374.2.3.

Beta 3432.1.0 is working fine for us.

No more tears with beta 3432.1.0.

Looking forward to the next stable channel release to solve this problem.

Is this still not in stable? I thought it already was; I updated to 3374.2.1 and still get processes hanging.

@databus23 We have worked around this by changing the root FS from ext4 to XFS. Our tests look good so far, but we haven’t rolled it out to production yet. In case you need to upgrade Flatcar for some reason, this might be an alternative for you until the ext4 fix has made its way into a Flatcar stable release.
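For reference, reformatting the root filesystem on first boot can be driven from the provisioning config. A minimal Butane sketch along the lines of what the Flatcar docs describe for changing the root filesystem type (untested here; adapt to your own pipeline, and note it wipes the root partition, so it is only suitable for freshly provisioned nodes):

```yaml
# Sketch: reformat the ROOT partition as XFS on first boot.
# wipe_filesystem destroys existing root-fs contents.
variant: flatcar
version: 1.0.0
storage:
  filesystems:
    - device: /dev/disk/by-partlabel/ROOT
      format: xfs
      wipe_filesystem: true
      label: ROOT
```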

@jepio @t-lo I will try and have a look at this over the next 2 days

We also experience this. Any idea if and when this will be solved? Is it certain that this is a vanilla kernel issue? If so, other distros should have the same problem as well. We’re already deciding to go back to LTS because of this… but what if it reaches LTS too? This is pretty serious. 😕

@t-lo Yep I can give this a go, although I may not be able to provide results till some time next week

We can reproduce it as soon as we generate a lot of IO on the hard disk.

Anything specific you executed for that? I shuffled around some 100 GBs using dd and could not trigger the issue.
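For what it’s worth, dd mostly produces large sequential data writes, which put little pressure on the jbd2 journal; the hangs above all sit in `start_this_handle`, which metadata-heavy operations (create, truncate, fsync) exercise far more. A hypothetical stress sketch along those lines (paths and counts are made up; run it only on a disposable ext4 filesystem):

```shell
#!/usr/bin/env bash
# Hypothetical metadata stress: each iteration truncates a file, writes a
# small block, and syncs it, forcing a fresh jbd2 handle per operation.
set -euo pipefail

DIR="${1:-./jbd2-stress}"   # assumed to live on the ext4 filesystem under test
COUNT="${2:-200}"           # files per pass

mkdir -p "$DIR"
for i in $(seq 1 "$COUNT"); do
  f="$DIR/f$i"
  : > "$f"                          # truncate (starts a journal handle)
  printf 'x%.0s' {1..4096} > "$f"   # small data write
  sync -f "$f"                      # flush the containing filesystem
done
echo "created $COUNT files in $DIR"
```

Running several instances of this in parallel alongside normal workload I/O should come much closer to the journal contention visible in the traces than dd alone.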

The servers in our company have the same symptoms after an update to 3227.2.2.

We can reproduce it as soon as we generate a lot of IO on the hard disk.

Unfortunately I don’t see any errors in systemd or the kernel ring buffer, only File /var/log/journal/<some-id>/system.journal corrupted or uncleanly shut down, renaming and replacing. Shortly after that, the system stops responding.

As a temporary solution we did a rollback to 3227.2.1 for now.

We are still seeing lock-ups of nodes with 3227.2.2 on a daily basis in our fleet, but still fail to reproduce the error even somewhat consistently. Affected nodes hang; login via SSH or console just hangs as well. We always see stack traces of “hung task timeout” errors that are I/O- or filesystem-related.
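For triaging a node that is starting to stall, one option is to list the tasks stuck in uninterruptible sleep (state D), which are exactly the tasks the hung-task detector reports. A small sketch:

```shell
#!/usr/bin/env bash
# List tasks in uninterruptible sleep (state D) together with the kernel
# function they are blocked in (wchan). On affected nodes these should show
# jbd2/ext4 wait functions such as wait_transaction_locked.
ps -eo pid,stat,wchan:32,comm | awk 'NR == 1 || $2 ~ /^D/'
# For full stacks, 'echo w > /proc/sysrq-trigger' (as root) dumps all blocked
# tasks to the kernel log, which can then be collected with dmesg.
```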