Flatcar: Stable 3227.2.2 randomly causes processes to hang on I/O related operations
Description
We’ve seen multiple nodes (different regions and environments) stalling completely (unresponsive kubelet and containerd, journald and timesyncd units failing, no login possible, …) on the 3227.2.2 release. This seems to happen mostly on the reboot after the update, but it has also occurred at random.
Impact
This causes Kubernetes nodes to become `NotReady` before they can be drained, which means volumes cannot be moved off the node, leading to service interruptions.
Environment and steps to reproduce
- Set-up: Flatcar Stable 3227.2.2, OpenStack, ESXi hypervisors
- Task: on node start, but also during normal operation
- Action(s): no specific trigger identified
- Error:
- we get `task blocked for more than 120 seconds` errors and related call stacks (see the screenshot)
- CPU usage is very high when the issue occurs
- the journal gets corrupted when this issue occurs
- rebooting once more brings the node back
- once the node is up again, partitions and filesystems appear healthy
Expected behavior
The nodes do not stall completely.
Additional information
We have the feeling that we may be hitting a kernel bug, as we only see this on the 3227.2.2 release, where basically only the kernel was changed. Do you have ideas on how we can diagnose this further? Thanks.
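As a general note on diagnosing this further: the `task blocked for more than 120 seconds` warnings come from the kernel’s hung-task detector, and the magic SysRq interface can dump the stacks of all blocked tasks on demand. These are stock Linux facilities rather than anything Flatcar-specific; a minimal sketch, assuming a still-responsive console or SSH session:

```
# Hung-task detector and magic SysRq are standard kernel interfaces (not Flatcar-specific).
sysctl kernel.hung_task_timeout_secs    # 120s by default, matching the warnings above
echo 1 > /proc/sys/kernel/sysrq         # enable all SysRq functions
echo w > /proc/sysrq-trigger            # dump stack traces of all blocked (D-state) tasks
dmesg | tail -n 300                     # the traces land in the kernel ring buffer
```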
About this issue
- Original URL
- State: closed
- Created 2 years ago
- Reactions: 15
- Comments: 59 (24 by maintainers)
We’ll make sure the fix (either 5.15.87 or the backport) is part of the next Stable release, scheduled for 2023-01-23.
There’s a fix available now (https://lore.kernel.org/linux-ext4/20221122115715.kxqhsk2xs4nrofyb@quack3/T/#ma65f0113b4b2f9259b8f7d8af16b8cb351728d20) and I’ve tested it on 5.15 with the reproducer in this thread; I can no longer reproduce the hang.
This should now get merged upstream and subsequently backported. Then it’ll be included in a Flatcar stable release.
And I’m working on getting it into 5.15.x.
https://lore.kernel.org/stable/1672844851195248@kroah.com/ https://lore.kernel.org/stable/1672927319-22912-1-git-send-email-jpiotrowski@linux.microsoft.com/
So the update is that the issue is also present in upstream 6.0 kernels. It appears to be a logic bug in this commit: https://github.com/torvalds/linux/commit/65f8b80053a1b2fd602daa6814e62d6fa90e5e9b, which results in processes getting stuck trying to reuse a block that has too many references.
I’ve provided more debug output to the kernel mailing list, now that the problem has been identified I hope this will lead to a fix soon.
Okay, I think the patch provided above fixes the problem. I’ve deployed two nodes, one running stock 5.15.63 and a second running with the patch. On the regular node without the patch, I could recreate this reliably by spawning 20-30 pods, each pulling a different image. Every test run would result in SSH connections dying and the node becoming completely unresponsive, with most of the pods stuck in a `ContainerCreating` state. On the node with the patch I ran the pull test a couple of times, pruning the images between each run. No problems were encountered, the pods all entered a `Running` state successfully, and SSH connections remained responsive throughout.
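For reference, a rough sketch of the kind of load test described above; the image list, pod names, and count here are illustrative assumptions, not the exact set used in this test:

```
# Spawn a batch of pods, each pulling a distinct image, so containerd unpacks many
# layers in parallel; extend the list to 20-30 distinct images for a comparable load.
# (Pin the pods to the node under test with a nodeSelector if needed.)
images=(nginx:1.23 redis:7 postgres:15 python:3.11 golang:1.19 node:18 httpd:2.4)
for i in "${!images[@]}"; do
  kubectl run "pull-test-$i" --image="${images[$i]}" --restart=Never -- sleep 3600
done
kubectl get pods --watch    # on an affected node, pods stall in ContainerCreating
# Between runs: delete the pods and prune the cached images on the node, e.g.
#   kubectl delete $(kubectl get pods -o name | grep pull-test)
#   crictl rmi --prune      # run on the node itself
```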
Didn’t mean to close this. It will land in Stable in 1-2 release cycles (next time Beta promotes), or when the bugfix lands in the stable trees.
If you’re hitting this issue and have been holding off on updates - do update to the beta release that will come this week.
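For anyone who wants to move a node to Beta for this, a sketch of the documented Flatcar channel-switch procedure (double-check it against the docs for your setup):

```
# Override the release channel baked into the image, then let update-engine pick it up.
echo "GROUP=beta" | sudo tee /etc/flatcar/update.conf
sudo systemctl restart update-engine
update_engine_client -update    # trigger the update now instead of waiting for the timer
```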
State of our research so far:
The suspicious commit has not been reverted yet, and is still in main (therefore, also in the 5.15 LTS kernel series). We will continue to closely monitor the situation.
Kudos go to @vbatts and to @jepio for doing the leg work.
The fix is queued up for 5.15.87: https://lore.kernel.org/stable/5035a51d-2fb3-9044-7b70-1df33af37e5f@linuxfoundation.org/T/#m39683920478da269a295cc907332a5f20e6122f5
Correct. The fix is not in Stable yet, but only in Beta & Alpha.
Good news: the ext4 deadlock fix was recently merged into the mainline kernel, though it is not yet included in any kernel release. Looks like it needs a little more time. Anyway, most of the Flatcar maintainers have already left for the holidays, so the next release should happen in January next year.
Flatcar Stable 3374.2.3 was released with the bugfix (Kernel 5.15.86 with the backport). Please try it out and see if the bug is gone.
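A quick way to confirm a node actually picked up that release and kernel (the expected version strings below are taken from the comment above):

```
update_engine_client -update        # force an immediate update check/apply, then reboot
# after the reboot:
grep "^VERSION=" /etc/os-release    # expect 3374.2.3
uname -r                            # expect 5.15.86-flatcar
```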
The fix is now released with Kernel 6.1.4
Yesterday’s release did not include the bugfix for this! Was that intentional? I am really waiting for this. If it will take longer, I would need to temporarily switch to the Beta channel, but switching everything over is a lot of work 😦
We have been testing the new patch [1], which we have applied against 5.15.80 in a custom build, and it is currently looking stable again on our development clusters with Flatcar 3227.2.2.
[1] https://lore.kernel.org/linux-ext4/20221122174807.GA9658@linuxonhyperv3.guj3yctzbm1etfxqx2vob5hsef.xx.internal.cloudapp.net/2-0001-ext4-Fix-deadlock-due-to-mbcache-en.patch
That’s too bad. On a positive note - I’m now able to recreate this with the reproducer you provided 😃
Turns out looking at the hung tasks only reveals half the picture, because there are also some processes stuck running somewhere in ext4 code. I’ll post an update as soon as I understand it better.
This is definitely a vanilla kernel issue, as Flatcar runs a vanilla kernel (the 6 patches we carry are arm64 specific or build system tweaks).
We agree this is pretty serious, but since none of us are filesystem developers, we don’t want to just go around reverting filesystem patches and risk the integrity of users’ data. We’re waiting on the upstream kernel developers. In the meantime I suggest you freeze the Flatcar release that you’re using to one that is unaffected (an earlier Stable or LTS).
@jepio I think I have managed to get some of the list of hung tasks. I can’t be sure how complete the list is, but I think it should be a good starting point nonetheless: hung-tasks.txt
We tried to reproduce the case again today in our test cluster. We installed one node with Flatcar version 3227.2.1 and the other with 3227.2.2.
Both nodes have the following hardware:
We used dd, stress, and stress-ng, without success: we got a high load, but the 3227.2.2 node did not crash.
Then we crafted the following K8s resource with a small Golang tool, and this actually managed to crash the 3227.2.2 node after about 5 minutes. SSH and other services were no longer accessible.
On the 3227.2.1 node the script ran until the filesystem ran out of inodes, but the node was still reachable.
When the 3227.2.2 node crashed, the kubelet process was marked as a zombie, and in the kernel ring buffer we saw the following errors:
cc (@damoon)
I assume that this issue was resolved in Stable 3374.2.3. Please create new issues for other bugs. Thanks.
Unfortunately there hasn’t been any response from the ext4 maintainer; I’ve poked them this week.
In the meantime we’ll likely take the backport into the next alpha/beta Flatcar release next week.
@Champ-Goblem thanks for all the help with testing and reproducing, it’s good to have a datapoint confirming that no new issues pop up with the patch.
I have rebuilt Flatcar with the above PR; below should be the logs from the system during the failure.
Awesome, thank you very much for testing! We’re discussing this issue upstream with Jan Kara, an ext4 maintainer (https://www.spinics.net/lists/linux-ext4/msg85417.html ff.; messages from today will become available in the archive tomorrow). Unfortunately, there seems to be an issue with the stack traces we provided (from this very issue); we’re currently looking into possible causes and will follow up with the maintainer.
I have already started a build on our infra for the patched image, so that should be fine; it will probably take an hour and a bit to build. I’ll look at getting it rolled out and start testing by the afternoon, so hopefully I’ll have some results either later today or tomorrow. I’ll be off from Friday till Monday, so it may also be worth you doing some testing on your end in case I am unable to get any results by then.
I was not able to repro with the reproducer in my own testing, but an infrastructure VM that we have on the Stable channel has also hit this (14 days of uptime, no particular I/O load):
log:
We are still facing the issue with the current stable release 3374.2.3:
Beta 3432.1.0 is working fine for us.
No more tears with beta 3432.1.0.
Looking forward to the next stable channel release to solve this problem.
Is this still not in Stable? I thought it already was, updated to `3374.2.1`, and still get processes hanging.
@databus23 We have worked around this by changing the root FS from ext4 to XFS. Our tests look good so far, but we haven’t rolled it out to production yet. In case you need to upgrade Flatcar for some reason, this might be an alternative for you until the ext4 fix has made its way into a Flatcar Stable release.
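For anyone considering the same workaround: below is a sketch of re-creating the root filesystem as XFS at provisioning time with a Butane config (flatcar variant), following Flatcar’s root-filesystem documentation. Treat it as an assumption to verify against the docs for your release; it wipes and re-formats the ROOT filesystem on first boot, so it only applies to freshly provisioned nodes.

```yaml
# Butane config (flatcar variant); transpile with the butane tool into Ignition JSON.
variant: flatcar
version: 1.0.0
storage:
  filesystems:
    - device: /dev/disk/by-label/ROOT
      format: xfs
      wipe_filesystem: true   # re-creates the ROOT filesystem as XFS on first boot
      label: ROOT
```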
@jepio @t-lo I will try and have a look at this over the next 2 days
We are also experiencing this. Any idea if and when this will be solved? Is it certain that this is a vanilla kernel issue? If so, other distros should have the same problem as well. We’re already making the decision to go back to LTS because of this… but what if it reaches LTS too? This is pretty serious. 😕
@t-lo Yep I can give this a go, although I may not be able to provide results till some time next week
Anything specific you executed for that? I shuffled around some 100 GBs using `dd` and could not trigger the issue.
The servers in our company have the same symptoms after an update to 3227.2.2.
We can reproduce it as soon as we generate a lot of IO on the hard disk.
Unfortunately I don’t see any errors in systemd / the kernel ring buffer, only `File /var/log/journal/<some-id>/system.journal corrupted or uncleanly shut down, renaming and replacing.` shortly after which the system stops responding.
As a temporary solution we did a rollback to 3227.2.1 for now.
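Given the ext4 mbcache/xattr deadlock identified elsewhere in this thread, one plausible way to approximate “a lot of IO on the hard disk” for testing is a metadata-heavy small-file workload that shares extended-attribute blocks. This is a hypothetical sketch (it assumes setfattr is available, e.g. inside a container), not the load this commenter used:

```
# Create many small files carrying an identical extended attribute, so ext4 keeps
# allocating and sharing xattr blocks; run several of these loops in parallel.
mkdir -p /var/tmp/io-load && cd /var/tmp/io-load
for i in $(seq 1 100000); do
  head -c 512 /dev/urandom > "f-$i"
  setfattr -n user.test -v "identical-value-on-every-file" "f-$i"
done
```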
We are still seeing lock-ups of nodes with 3227.2.2 on a daily basis in our fleet, but still fail to reproduce the error with any consistency. Affected nodes are hanging; login via SSH or console just hangs. We always see stack traces of “hung task timeouts” that are I/O- and filesystem-related.