harvester: [BUG] Installation failed: "Could find partition device path for partition 6"

Describe the bug: The installer reports a fatal error during installation (screenshot: screen_shot_2021-11-23_at_10 31 23_am).

To Reproduce: Install Harvester with all default options. This issue is easier to reproduce on bare-metal machines.

Expected behavior

Support bundle

Environment:

Harvester ISO version: master
Underlying Infrastructure (e.g. Baremetal with Dell PowerEdge R630): Fremont 1.46

Additional context: This is the error message from the installer saying that it failed to partition the storage device. The main cause is that yip, the tool that performs disk partitioning, sometimes can't synchronize the latest partition layout; the inconsistency is detected, which stops the whole process.

To verify this issue, you can run yip on its own to see whether it partitions the device properly and consistently:

Boot the Harvester ISO and proceed with the installation until you have finished configuring networking. Having network access to the machine makes things easier.
SSH into the machine with the credentials rancher/rancher. You could also switch to another virtual terminal using Ctrl-Alt-F2, but that makes things a bit harder, as we later need to copy data into the machine.
Switch to the root user: sudo -s

Copy the following data into a file named part-layout.yaml. It’s a partitioning layout for yip to execute:

stages:
  partitioning:
  - layout:
      add_partitions:
      - fsLabel: COS_OEM
        size: 50
        pLabel: oem
        filesystem: ext4
      - fsLabel: COS_STATE
        size: 15360
        pLabel: state
        filesystem: ext4
      - fsLabel: COS_RECOVERY
        size: 8192
        pLabel: recovery
        filesystem: ext4
      - fsLabel: COS_PERSISTENT
        size: 102400
        pLabel: persistent
        filesystem: ext4
      - fsLabel: HARV_LH_DEFAULT
        pLabel: longhorn
        filesystem: ext4
      device:
        path: /dev/sda  # Change this line if you have a different disk
    name: Part layout

Run this command to wipe the disk /dev/sda first, then partition the disk with yip using the layout from the previous step:
```
# wipefs -af /dev/sda && yip -s partitioning part-layout.yaml
```
If partitioning succeeds, you should see messages like this:
```
INFO[0007] Finished yip file execution                   stage=partitioning stages=1 success=true
```
If it fails, you will see messages similar to those in the screenshot.
You can run the wipefs and yip command above repeatedly, or write a bash script to try to reproduce the issue.
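A minimal sketch of such a reproduction script follows. It is not part of Harvester or yip: the variable names DEVICE, LAYOUT, and MAX_RUNS, the function names, and the `--run` argument are all made up for this example. It simply repeats the wipe-and-partition command and stops at the first failing run.

```bash
#!/usr/bin/env bash
# Repeat the wipe-and-partition command until it fails or MAX_RUNS is reached.
# DEVICE, LAYOUT and MAX_RUNS are hypothetical knobs, not yip options.
DEVICE="${DEVICE:-/dev/sda}"          # WARNING: this disk gets wiped
LAYOUT="${LAYOUT:-part-layout.yaml}"
MAX_RUNS="${MAX_RUNS:-30}"

# One attempt: the same command as in the manual step.
partition_once() {
    wipefs -af "$DEVICE" && yip -s partitioning "$LAYOUT"
}

# repro_loop MAX: run partition_once up to MAX times, stop at the first failure.
repro_loop() {
    local max="$1" i=1
    while [ "$i" -le "$max" ]; do
        if ! partition_once; then
            echo "failed on run $i of $max"
            return 1
        fi
        i=$((i + 1))
    done
    echo "all $max runs succeeded"
}

# Destructive, so the loop only starts when asked for explicitly.
if [ "${1:-}" = "--run" ]; then
    repro_loop "$MAX_RUNS"
fi
```

Saved under a name of your choice (say, repro.sh) and started as root with `bash repro.sh --run`, it leaves the yip output of the failing run on screen for inspection.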

About this issue

State: closed
Created 3 years ago
Comments: 16 (11 by maintainers)

Most upvoted comments

Positive test: yip version 0.9.25, verified 30 times, all succeeded.
Negative test: yip version 0.9.18, failure reproduced on the 6th run.

Test Information

Environment: qemu/KVM, SCSI bus, 150Gb disk
Harvester Version: master-7a54d362-head(positive test), master-d05a6cc7-head(negative test)

Verify Steps:

Follow the steps under Additional context in https://github.com/harvester/harvester/issues/1583#issue-1062085248

lanfon72 on Dec 1, 2021

Please check this possibility:

In short: if the partition is not found after creation, try waiting a few seconds and reading again.

The main steps for creating a new partition are sgdisk, partprobe, and lsblk (see https://github.com/mudler/yip/blob/master/pkg/plugins/layout.go):

 console.Run(fmt.Sprintf("sgdisk %s", strings.Join(opts[:], " ")))

 console.Run(fmt.Sprintf("partprobe %s", dev))

 console.Run(fmt.Sprintf("lsblk -ltnpo name,type %s", dev))

All of them run in the same goroutine, with no delay in between.

partprobe returns silently when the device is not ready, so we never find out.

https://github.com/bcl/parted/blob/master/partprobe/partprobe.c

static int
process_dev (PedDevice* dev)
{
	PedDiskType*	disk_type;
	PedDisk*	disk;

	if (!ped_device_open (dev))
		return 0;  /* no error is returned */

Note that if ped_device_open (dev) fails, no error is returned. The new device may not be ready on the kernel side.

A similar case is discussed at https://unix.stackexchange.com/a/521858:

..
Welcome to fdisk (util-linux 2.23.2).
..
WARNING: Re-reading the partition table failed with error 16: Device or resource busy. 
The kernel still uses the old table. The new table will be used at the next reboot or 
after you run partprobe(8) or kpartx(8) Syncing disks.

..

The author said:


"My suspicion was that my piped-in command stream was surfacing a timing issue in fdisk (that wouldn't be triggered by slower/manual input) so I started sprinkling sleep commands to delay various inputs until the error went away. The problem in my case was that the w was happening too soon after the new partition was defined. A sleep 5 before the w results in consistent success."
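The "delay and read again" idea could be sketched in shell roughly as follows. This is only an illustration and not code from yip: the function names wait_for_partition and list_partitions, and the retry count and delay defaults, are assumptions. Only the lsblk invocation is taken from the snippet quoted earlier.

```bash
# List the device and its partitions; kept separate so it can be stubbed.
list_partitions() {
    lsblk -ltnpo name,type "$1"
}

# wait_for_partition DEVICE PARTITION [RETRIES] [DELAY_SECONDS]
# Returns 0 as soon as PARTITION shows up with type "part", 1 after RETRIES tries.
wait_for_partition() {
    local dev="$1" part="$2" retries="${3:-5}" delay="${4:-1}"
    local attempt=1
    while [ "$attempt" -le "$retries" ]; do
        # awk exits 0 only if a line "<PARTITION> part" was seen.
        if list_partitions "$dev" |
            awk -v p="$part" '$1 == p && $2 == "part" { found = 1 } END { exit !found }'; then
            return 0
        fi
        sleep "$delay"
        attempt=$((attempt + 1))
    done
    return 1
}
```

For example, `wait_for_partition /dev/sda /dev/sda6 5 2` would check up to five times, two seconds apart, before giving up.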

w13915984028 on Nov 24, 2021

@lanfon72 Please see the Additional Context in the description.

johnliu55tw on Dec 1, 2021

@johnliu55tw As we can't reproduce this on Provo's bare-metal machines, would you please provide positive/negative test cases so that we can verify this bug is fixed?

lanfon72 on Dec 1, 2021

If there is nothing in the udev queue, it returns 0 instantly; if there is something in the queue, it waits until those events are handled, in this case probably until the partitions are refreshed. That shouldn't take long when there are events: I tested this about 700 times in an automated script and there was no appreciable difference against an unpatched version. udev events should be handled pretty quickly, which is why this is difficult to reproduce and happens about 1 out of 5 times on one specific machine (we could not reproduce it in qemu or vbox with either slow or fast HDDs!).
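The comment does not name the call it describes, but the behavior matches `udevadm settle`, which returns immediately when the udev event queue is empty and otherwise waits for pending events. Under that assumption, the sequence could be sketched as below. The function name refresh_and_list and the 10-second timeout are made up for this example, and this is not the actual yip patch.

```bash
# refresh_and_list DEVICE
# Re-read the partition table, wait for udev to finish, then list partitions.
refresh_and_list() {
    local dev="$1"
    partprobe "$dev" || return 1
    # Assumption: udevadm settle is the "wait for the udev queue" step.
    # It returns at once when the queue is empty.
    udevadm settle --timeout=10 || return 1
    lsblk -ltnpo name,type "$dev"
}
```

Because the settle step costs almost nothing when no events are pending, adding it should not slow down the common case.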

Itxaka on Nov 29, 2021

yip 0.9.25 has been released which hopefully fully resolves this! https://github.com/mudler/yip/releases/tag/0.9.25

Itxaka on Nov 29, 2021