rook: unexpected partition on disks >= 1TB (atari partitions)

Is this a bug report or feature request?

  • Bug Report

Deviation from expected behavior:

vdb    252:16   0    3T  0 disk 
├─vdb2 252:18   0   48G  0 part 
└─vdb3 252:19   0  6.1M  0 part 

Expected behavior:

vdb    ceph_bluestore

How to reproduce it (minimal and precise):

  1. An empty data disk >= 1TiB
  2. Prepare the disk using the script at https://github.com/rook/rook/blob/master/Documentation/ceph-teardown.md#zapping-devices. At this point vdb still has no child partitions, which looks good.
  3. Follow Quickstart.md
  4. There will then be 2 OSDs on this disk: one on vdb, and another on vdb2, whose size is 48GiB.

Workaround [added by @BlaineEXE]

If users wish to work around this issue, my latest suggestion is to partition any device greater than 875GB with a GPT partition table and create a single partition spanning the whole device. The GPT partition at the beginning of the disk should stop Linux from observing any “phantom” Atari partitions, and users will not have to destroy raw disk OSDs later to be re-created as LVM.
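
For illustration, the workaround above could be scripted roughly as follows. This is only a sketch. It assumes `parted` and `blockdev` are installed, reads "875GB" as 875 * 10^9 bytes (an assumption about the units), and uses hypothetical function names. By default it only prints the command. Creating a new partition table destroys whatever is on the disk, so double-check the device name before setting `DRY_RUN=0`.

```shell
#!/bin/sh
# Sketch only: print (or run) a parted command that creates a GPT label and
# one partition spanning the whole device.
# Assumptions: the disk is empty, and "875GB" means 875 * 10^9 bytes.

THRESHOLD_BYTES=$((875 * 1000 * 1000 * 1000))

# needs_gpt_workaround SIZE_BYTES
# Succeeds if the size is above the threshold from the workaround advice.
needs_gpt_workaround() {
  [ "$1" -gt "$THRESHOLD_BYTES" ]
}

# gpt_single_partition DEVICE [SIZE_BYTES]
# SIZE_BYTES is read with blockdev when it is not given.
gpt_single_partition() {
  local dev="$1" size="${2:-}"
  if [ -z "$size" ]; then
    size=$(blockdev --getsize64 "$dev") || return 1
  fi
  if ! needs_gpt_workaround "$size"; then
    echo "skip $dev"
    return 0
  fi
  if [ "${DRY_RUN:-1}" = "1" ]; then
    # Dry run: show what would be executed.
    echo "parted --script $dev mklabel gpt mkpart primary 0% 100%"
  else
    parted --script "$dev" mklabel gpt mkpart primary 0% 100%
  fi
}

# Example (dry run): gpt_single_partition /dev/vdb
```

Rook would then be pointed at the resulting partition (for example `/dev/vdb1`) rather than at the raw disk.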

Environment:

  • OS (e.g. from /etc/os-release): Ubuntu 20.04
  • Kernel (e.g. uname -a): Linux master-11 5.4.0-73-generic #82-Ubuntu SMP Wed Apr 14 17:39:42 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
  • Rook version (use rook version inside of a Rook Pod): chart version 1.6.2
  • Storage backend version (e.g. for ceph do ceph -v): v15.2.11
  • Kubernetes version (use kubectl version): k3s+1.19

About this issue

  • State: closed
  • Created 3 years ago
  • Reactions: 5
  • Comments: 39 (22 by maintainers)

Most upvoted comments

Hi there 👋

I’m facing the same issue, but with an on-premise deployment. (Obviously I can’t reduce the size of my physical disks.)

After debugging for days, I think I’ve found the root cause.

I saw this message in my kernel.log:

[  208.192028]  sdh: AHDI sdh2 sdh3 sdh4

And I found this in the kernel source:

root@wanda:~/kernel/linux-5.4.0# grep -rni 'AHDI' *
block/partitions/atari.c:43:	int part_fmt = 0; /* 0:unknown, 1:AHDI, 2:ICD/Supra */
block/partitions/atari.c:73:	strlcat(state->pp_buf, " AHDI", PAGE_SIZE);

It seems that our partitions are recognized as “ATARI” partitions by the kernel.
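
To check other machines for the same symptom, something like the following could pull the affected device names out of the kernel log. It is a rough sketch: `ahdi_devices` is a hypothetical helper, and it assumes log lines shaped like the `sdh: AHDI sdh2 sdh3 sdh4` message quoted above.

```shell
#!/bin/sh
# Sketch only: read kernel log lines on stdin and print each device whose
# partition table the kernel parsed as AHDI (Atari).
ahdi_devices() {
  awk '{
    for (i = 1; i < NF; i++) {
      # Look for a "<device>:" field directly followed by "AHDI".
      if ($(i + 1) == "AHDI" && $i ~ /:$/) {
        d = $i
        sub(/:$/, "", d)
        print d
      }
    }
  }'
}

# Example: dmesg | ahdi_devices
```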

You can check whether ATARI partitions are supported on your servers/instances with the command below:

$ cat /boot/config-$(uname -r) | grep -i atari
CONFIG_ATARI_PARTITION=y
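
The same check can be wrapped in a small function for use across several hosts. This is a sketch with a hypothetical helper name. It assumes the kernel config is available as a plain-text file such as `/boot/config-$(uname -r)`, and it reports "unknown" when the file is unreadable or does not mention the option.

```shell
#!/bin/sh
# Sketch only: report whether a kernel config file enables Atari partition
# parsing. Prints "enabled", "disabled", or "unknown".
atari_partition_status() {
  cfg="$1"
  if [ ! -r "$cfg" ]; then
    echo unknown
    return 0
  fi
  if grep -q '^CONFIG_ATARI_PARTITION=y$' "$cfg"; then
    echo enabled
  elif grep -q '^# CONFIG_ATARI_PARTITION is not set$' "$cfg"; then
    echo disabled
  else
    echo unknown
  fi
}

# Example: atari_partition_status "/boot/config-$(uname -r)"
```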

After recompiling my kernel with this option disabled, the bug seems to be gone 🎉

I saw this too: https://bugs.launchpad.net/ubuntu/+source/linux-aws/+bug/1908264

That means that if you have this issue and you’re running on a cloud provider, a kernel with ATARI support disabled will be released soon.

On CentOS (7 here), ATARI support seems to be disabled:

$ cat /boot/config-3.10.0-1160.25.1.el7.x86_64 | grep -i ATARI
# CONFIG_ATARI_PARTITION is not set

Update on this: 3 of the fixes I have tried have failed. Even if Rook doesn’t allow Ceph to provision the “phantom” partitions with OSDs, ceph-volume still gets confused and reports only the phantom partitions in ceph-volume raw list. These phantom partitions are almost always 48GB in size (smaller than the actual disk), and they have malformed information. This will be a problem if any of the users’ disks change names (e.g., /dev/sda becomes /dev/sdc after a reboot), because the init container will not be able to recover and find the original disk.

This also means that any users who have been affected by the “phantom” Atari partition behavior should destroy and re-create their OSDs once the fix is in place in v1.6.8 so they get stable LVM OSDs back.

The fix I will implement will be to disable raw OSD provisioning on disks, reverting to Rook’s behavior in v1.5. Raw provisioning will still happen on partitions since partitions are not affected by this bug. Update: the fix disables raw mode on host-based OSDs entirely, to keep the fix simple and make sure it doesn’t introduce other bugs.

@Krast76 @dalingng @preslavgerchev @zzswang @sethjones @Mutantt @hbahadorzadeh @linkvt please see my latest workaround advice above. Thank you all for your patience while I’ve been trying to mitigate this bug.

After two failed attempts to fix this issue, I decided to do a deep dive analysis:

I looked at the linux code (header & code) as well as the Atari AHDI disk spec (see PART II) and tracked down the source of our woes with a hexdump from an affected disk. The analysis is posted below.

Atari-Analysis.pdf

For us in Rook, I believe this means that we only have to worry about this scenario if the OSD is running on a disk (not a partition). If the OSD is running on a partition, the user will have to have created a partition table on the disk beforehand, and that will not be detected as Atari. We should be able to detect this case pretty easily: if a partition’s parent device has a bluestore header, then we should ignore the child. I am running a test soon to validate that Linux won’t detect Atari partitions on large partitions, but there is no reason for me to believe it will.
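
The detection idea in the previous paragraph could look roughly like this. It is a sketch, not Rook’s actual implementation: the function names are hypothetical, and it assumes a bluestore device starts with the text `bluestore block device` (the human-readable prefix Ceph writes at the start of the label).

```shell
#!/bin/sh
# Sketch only: skip a child partition when its parent disk already carries
# a bluestore label.

BLUESTORE_MAGIC='bluestore block device'

# has_bluestore_header DEVICE_OR_FILE
# Succeeds if the first 22 bytes match the assumed bluestore label prefix.
has_bluestore_header() {
  [ "$(head -c 22 "$1" 2>/dev/null | tr -d '\0')" = "$BLUESTORE_MAGIC" ]
}

# should_skip_child PARTITION
# Succeeds if the partition has a parent device with a bluestore header.
should_skip_child() {
  parent=$(lsblk -no PKNAME "$1" 2>/dev/null | head -n 1)
  [ -n "$parent" ] && has_bluestore_header "/dev/$parent"
}

# Example: should_skip_child /dev/vdb2 && echo "ignore /dev/vdb2"
```

With a check like this, a “phantom” child such as vdb2 would be ignored because vdb itself already holds an OSD, while real partitions on a disk with a normal partition table are left alone.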

@Krast76 @dalingng @preslavgerchev @zzswang @linkvt @brian-fa @dalingng

Thank you all for your patience with and help uncovering this issue. It has been really bizarre. I believe #8200 will solve the issue, and we will backport it to have it released with Rook v1.6.7. I’ll keep you updated here once we have a test build for you to try and hopefully verify for us since we’ve been having trouble reproducing the issue ourselves.

I’m seeing this issue as well. There is a lot more data and context in this slack thread.

In my testing, the additional partitions show up during the prepare step. I know that doesn’t mean that it’s the prepare job or ceph-volume creating these partitions, but there is definitely a correlation.

I’m running Rook/Ceph in AWS. Each time I collected data, it was on a brand new EC2 instance with a brand new EBS volume. And like @zzswang, this only happens with disks >=1TB in size.

Also, I don’t think it was stated explicitly above: these extra 48GB partitions getting added to the cluster result in a broken cluster. I’m not able to mount PVCs while these 48GB partitions/OSDs are in the cluster.

Thanks for any help on this! I went to production with Rook/Ceph with 900GB disks to work around this issue, so I hope we can get to the bottom of this. Let me know what other data would be helpful to track this down.

Sorry @Mutantt and @sethjones. I think the correct workaround should be deviceFilter: ^sd[a-z]+$ to ensure it stops matching after the alpha characters. I updated the workaround recommendations posted previously.
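
A quick way to sanity-check a filter like this before putting it in the cluster spec is to try it against a few device names. The sketch below uses `grep -E`, which is an assumption: Rook evaluates the filter itself rather than with grep, but a pattern this simple should behave the same way. `matches_device_filter` is a hypothetical helper.

```shell
#!/bin/sh
# Sketch only: test device names against a deviceFilter-style regex.

DEVICE_FILTER='^sd[a-z]+$'

# matches_device_filter NAME [REGEX]
# Succeeds if NAME matches REGEX (defaults to DEVICE_FILTER).
matches_device_filter() {
  printf '%s\n' "$1" | grep -Eq "${2:-$DEVICE_FILTER}"
}

# Example:
#   for d in sda sdab sda2 sdh3; do
#     matches_device_filter "$d" && echo "$d: match" || echo "$d: no match"
#   done
```

The trailing `$` is what excludes names such as sda2. Without it, a pattern like `^sd[a-z]+` would still match the “phantom” partitions.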

@Krast76 @dalingng @preslavgerchev @zzswang @linkvt @brian-fa @dalingng

It would be good to verify the fix (now built into the rook/ceph:v1.6.0.508.g037945a image) is working on the real-world error. I did my best to test this using manually-created atari partitions, but things can always slip through.

Hi @BlaineEXE, thanks for the update and thanks to the others here for finding the issue!

I have only one question that I didn’t see answered in the PR: what should I do with my existing OSDs right now? Should I scratch them and reinstall them with Rook 1.6.7 running? I set up a device filter to exclude partitions, which prevented the additional partitions of this bug, but I’m not sure if my existing OSDs are now good to go for the future.

If you are using that workaround, you should not need to scratch the OSDs. Only if Rook has created OSDs on both the main disk/partition and one of the “phantom” partitions will you need to scratch those OSDs.