stratisd: missing stratis pool after update to Fedora 39: thin_repair out of metadata space

Hello, I’ve just updated to Fedora 39 and noticed that one of my three Stratis pools has disappeared from Stratis management. How can I recover this data?

Running stratis versions:

stratisd-3.6.3-1.fc39.x86_64
stratis-cli-3.6.0-1.fc39.noarch

The missing pool is net.ejjohnson.home, listed under partially_constructed_pools in the stratis report output below.

$ stratis report
{
    "name_to_pool_uuid_map": {
        "net.ejjohnson.home": "7e18ddcd-9924-4c92-b926-100a7498630b"
    },
    "partially_constructed_pools": [
        {
            "devices": [
                {
                    "device_uuid": "5335c8c3-df0e-4f29-a241-edc58f384e21",
                    "devnode": "/dev/sdc",
                    "major": 8,
                    "minor": 32,
                    "pool_uuid": "7e18ddcd-9924-4c92-b926-100a7498630b"
                }
            ],
            "pool_uuid": "7e18ddcd-9924-4c92-b926-100a7498630b"
        }
    ],
    "path_to_ids_map": {
        "/dev/sdc": [
            "7e18ddcd-9924-4c92-b926-100a7498630b",
            "5335c8c3-df0e-4f29-a241-edc58f384e21"
        ]
    },
    "pools": [
        {
            "available_actions": "fully_operational",
            "blockdevs": {
                "cachedevs": [],
                "datadevs": [
                    {
                        "blksizes": "base: BLKSSSZGET: 512 bytes, BLKPBSZGET: 4096 bytes, crypt: None",
                        "in_use": true,
                        "path": "/dev/sdb",
                        "size": "7814037168 sectors",
                        "uuid": "5c9a2e88-cac0-4f38-a988-d72347894123"
                    },
                    {
                        "blksizes": "base: BLKSSSZGET: 512 bytes, BLKPBSZGET: 4096 bytes, crypt: None",
                        "in_use": true,
                        "path": "/dev/sda",
                        "size": "7814037168 sectors",
                        "uuid": "ec2d0e73-64fd-437a-ac4b-f5800248f44a"
                    }
                ]
            },
            "filesystems": [
                {
                    "name": "fs_raw",
                    "size": "4294967296 sectors",
                    "size_limit": "Not set",
                    "used": "885516140544 bytes",
                    "uuid": "e8071df3-346a-4753-bda1-524c84d037f9"
                }
            ],
            "fs_limit": 100,
            "name": "io.vos",
            "uuid": "8d86f3f6-8666-490b-99b0-b5c6a5fc7986"
        },
        {
            "available_actions": "fully_operational",
            "blockdevs": {
                "cachedevs": [],
                "datadevs": [
                    {
                        "blksizes": "base: BLKSSSZGET: 512 bytes, BLKPBSZGET: 512 bytes, crypt: None",
                        "in_use": true,
                        "path": "/dev/sdd",
                        "size": "250069680 sectors",
                        "uuid": "a3928e05-964a-4e65-8b76-5ea557ee6ff0"
                    }
                ]
            },
            "filesystems": [
                {
                    "name": "tmp",
                    "size": "2147483648 sectors",
                    "size_limit": "Not set",
                    "used": "2157969408 bytes",
                    "uuid": "8bec6004-cfe7-4820-99c3-1a827d37ba7f"
                }
            ],
            "fs_limit": 100,
            "name": "local.volatile",
            "uuid": "c70a5b86-fa15-45b3-a71e-4a9f78d1e340"
        }
    ],
    "stopped_pools": []
}

Looking at the block devices, the missing pool should be on sdc:

$ lsblk
NAME                                                                                        MAJ:MIN RM   SIZE RO TYPE    MOUNTPOINTS
sda                                                                                           8:0    0   3.6T  0 disk    
└─stratis-1-private-8d86f3f68666490b99b0b5c6a5fc7986-physical-originsub                     253:12   0   7.3T  0 stratis 
  ├─stratis-1-private-8d86f3f68666490b99b0b5c6a5fc7986-flex-thinmeta                        253:13   0   7.4G  0 stratis 
  │ └─stratis-1-private-8d86f3f68666490b99b0b5c6a5fc7986-thinpool-pool                      253:15   0   7.3T  0 stratis 
  │   └─stratis-1-8d86f3f68666490b99b0b5c6a5fc7986-thin-fs-e8071df3346a4753bda1524c84d037f9 253:17   0     2T  0 stratis /mnt/io.vos_raw
  ├─stratis-1-private-8d86f3f68666490b99b0b5c6a5fc7986-flex-thindata                        253:14   0   7.3T  0 stratis 
  │ └─stratis-1-private-8d86f3f68666490b99b0b5c6a5fc7986-thinpool-pool                      253:15   0   7.3T  0 stratis 
  │   └─stratis-1-8d86f3f68666490b99b0b5c6a5fc7986-thin-fs-e8071df3346a4753bda1524c84d037f9 253:17   0     2T  0 stratis /mnt/io.vos_raw
  └─stratis-1-private-8d86f3f68666490b99b0b5c6a5fc7986-flex-mdv                             253:16   0    16M  0 stratis 
sdb                                                                                           8:16   0   3.6T  0 disk    
└─stratis-1-private-8d86f3f68666490b99b0b5c6a5fc7986-physical-originsub                     253:12   0   7.3T  0 stratis 
  ├─stratis-1-private-8d86f3f68666490b99b0b5c6a5fc7986-flex-thinmeta                        253:13   0   7.4G  0 stratis 
  │ └─stratis-1-private-8d86f3f68666490b99b0b5c6a5fc7986-thinpool-pool                      253:15   0   7.3T  0 stratis 
  │   └─stratis-1-8d86f3f68666490b99b0b5c6a5fc7986-thin-fs-e8071df3346a4753bda1524c84d037f9 253:17   0     2T  0 stratis /mnt/io.vos_raw
  ├─stratis-1-private-8d86f3f68666490b99b0b5c6a5fc7986-flex-thindata                        253:14   0   7.3T  0 stratis 
  │ └─stratis-1-private-8d86f3f68666490b99b0b5c6a5fc7986-thinpool-pool                      253:15   0   7.3T  0 stratis 
  │   └─stratis-1-8d86f3f68666490b99b0b5c6a5fc7986-thin-fs-e8071df3346a4753bda1524c84d037f9 253:17   0     2T  0 stratis /mnt/io.vos_raw
  └─stratis-1-private-8d86f3f68666490b99b0b5c6a5fc7986-flex-mdv                             253:16   0    16M  0 stratis 
sdc                                                                                           8:32   0   2.7T  0 disk    
└─stratis-1-private-7e18ddcd99244c92b926100a7498630b-physical-originsub                     253:3    0   2.7T  0 stratis 
  ├─stratis-1-private-7e18ddcd99244c92b926100a7498630b-flex-thinmeta                        253:4    0   2.8G  0 stratis 
  └─stratis-1-private-7e18ddcd99244c92b926100a7498630b-flex-thinmetaspare                   253:5    0    16M  0 stratis 
sdd                                                                                           8:48   0 119.2G  0 disk    
└─stratis-1-private-c70a5b86fa1545b3a71e4a9f78d1e340-physical-originsub                     253:6    0 119.2G  0 stratis 
  ├─stratis-1-private-c70a5b86fa1545b3a71e4a9f78d1e340-flex-thinmeta                        253:7    0   112M  0 stratis 
  │ └─stratis-1-private-c70a5b86fa1545b3a71e4a9f78d1e340-thinpool-pool                      253:9    0 119.1G  0 stratis 
  │   └─stratis-1-c70a5b86fa1545b3a71e4a9f78d1e340-thin-fs-8bec6004cfe7482099c31a827d37ba7f 253:11   0     1T  0 stratis /opt/volatile/tmp
  ├─stratis-1-private-c70a5b86fa1545b3a71e4a9f78d1e340-flex-thindata                        253:8    0 119.1G  0 stratis 
  │ └─stratis-1-private-c70a5b86fa1545b3a71e4a9f78d1e340-thinpool-pool                      253:9    0 119.1G  0 stratis 
  │   └─stratis-1-c70a5b86fa1545b3a71e4a9f78d1e340-thin-fs-8bec6004cfe7482099c31a827d37ba7f 253:11   0     1T  0 stratis /opt/volatile/tmp
  └─stratis-1-private-c70a5b86fa1545b3a71e4a9f78d1e340-flex-mdv                             253:10   0    16M  0 stratis 
sde                                                                                           8:64   1  57.3G  0 disk    /run/media/erick/Sandisk-Ultra
zram0                                                                                       252:0    0     8G  0 disk    [SWAP]
nvme0n1                                                                                     259:0    0 465.8G  0 disk    
├─nvme0n1p1                                                                                 259:1    0     1G  0 part    /boot
└─nvme0n1p2                                                                                 259:2    0 464.8G  0 part    
  └─luks-5f8f4e1c-e8f2-4329-bc24-f56bf78cf515                                               253:0    0 464.8G  0 crypt   
    ├─fedora-root                                                                           253:1    0   445G  0 lvm     /
    └─fedora-swap                                                                           253:2    0  15.7G  0 lvm     [SWAP]

About this issue

  • State: closed
  • Created 6 months ago
  • Comments: 47 (24 by maintainers)

Most upvoted comments

@erickj I’m pleased to hear that your pool is back up.

Regarding question (2), I believe it will be safe for you to reinstall the current version of stratisd.

Regarding question (1), there were really three issues that affected you, in sequence:

  1. There were stray zeros in a particular region of the thin metadata, on this pool only.
  2. The new version of thin_check detected these stray zeros, which it had not previously done.
  3. When stratisd ran thin_repair, the target device on your pool was too small.

I cannot guess why those stray zeros appeared, and it may be very hard to discover that. Whether thin_check should report an error on this condition I am uncertain; @mingnus is best able to make that decision. Regarding the third problem, that the thin meta spare device is too small to be usable as a repair target: we are working on developing a remediation that will be safe and well tested, and also a way of identifying this problem for any other users.
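
For readers checking whether their own pools are affected by the third issue, one way to read the spare device’s size is with blockdev. This is a sketch, not part of the original exchange; the device name follows the naming pattern in the lsblk output above, so substitute your own pool UUID:

$ sudo blockdev --getsize64 /dev/mapper/stratis-1-private-7e18ddcd99244c92b926100a7498630b-flex-thinmetaspare

A 16 MiB spare, as shown for sdc in the lsblk output above, would report 16777216 bytes here.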

I expect we will close this issue in about a week, assuming your pool continues to work well. I’ve opened a new issue[1] for the remediation task.

Thanks for your patience and clear communication around all of this.

[1] https://github.com/stratis-storage/project/issues/683

@mulkieran apologies for the late reply; I was unavailable yesterday.

The good news is that your suggestions above seem to have worked.

$ sudo thin_dump /dev/dm-16 > dm-16.xml                    # dump the current thin metadata to XML
$ sudo thin_dump --repair /dev/dm-16 > repaired.dm-16.xml  # dump again, with repairs applied on the fly
$ diff dm-16.xml repaired.dm-16.xml                        # produces no diff as you thought
$ sudo thin_restore -i repaired.dm-16.xml -o /dev/dm-16    # write the repaired metadata back to the device
$ sudo stratis pool stop --uuid="7e18ddcd-9924-4c92-b926-100a7498630b"
$ sudo stratis pool start --uuid="7e18ddcd-9924-4c92-b926-100a7498630b"   # restart so stratisd sets the pool up again
$ stratis pool list --uuid 7e18ddcd-9924-4c92-b926-100a7498630b
UUID: 7e18ddcd-9924-4c92-b926-100a7498630b
Name: net.ejjohnson.home
Alerts: 1
     WS001: All devices fully allocated
Actions Allowed: fully_operational
Cache: No
Filesystem Limit: 100
Allows Overprovisioning: Yes
Key Description: unencrypted
Clevis Configuration: unencrypted
Space Usage:
Fully Allocated: Yes
    Size: 2.73 TiB
    Allocated: 2.73 TiB
    Used: 2.50 TiB

Remounting the filesystem has succeeded and the drive is accessible again. Thank you very very much for the help with this issue 🙏

Just a few remaining questions:

  1. Is there any other data (or anything else) that I can provide which would be of use to you to prevent this issue from affecting other users?
  2. Is it safe to update again to the current mainline stratisd version, 3.6.3-1.fc39?
  3. If the issue does reappear (your comments mentioned that stability may still be a concern until patches can be released), are the journalctl logs previously analyzed (collected as sketched below) sufficient to diagnose an identical occurrence, with the same repair steps expected to work? Or would it be best to reopen a ticket to diagnose any future instability?
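
For anyone needing to gather the same logs, a sketch of the collection command (assuming stratisd runs under its default systemd unit name on Fedora):

$ journalctl -b -u stratisd > stratisd.log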

@mingnus re:

so I would like to rule out the possibility of userland tools, unless you had tried any informal builds of your own.

No, AFAIR no other tools have been used to manipulate the filesystem.

Thanks for uploading the data. I filed a PR upstream to address the missing long options; thanks for mentioning that.

Thank you for the follow-up, @mulkieran.

The thinmeta.pack.tar.gz file is attached (I needed to tar.gz it to upload a GitHub-supported file type).

Additionally, as an aside: I see something odd with the thin_metadata_pack command. --input is documented in the man page as you’ve given in your example, but it is rejected, and I needed to use the short flag form, -i:

$ sudo thin_metadata_pack --input /dev/mapper/stratis-1-private-7e18ddcd99244c92b926100a7498630b-flex-thinmeta --output thinmeta.pack
[sudo] password for erick: 
error: unexpected argument '--input' found

Usage: thin_metadata_pack [OPTIONS] -i <DEV> -o <FILE>

For more information, try '--help'.
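
For reference, the short-flag form that succeeded (same device and output file as in the failing command above):

$ sudo thin_metadata_pack -i /dev/mapper/stratis-1-private-7e18ddcd99244c92b926100a7498630b-flex-thinmeta -o thinmeta.pack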

@erickj The COPR package is masquerading as a pre-release of 3.6.3, so I believe if you update the package you will get the regular released package back. But it is quite acceptable to keep running with this test package: except for the change we’re taking advantage of, its behavior is indistinguishable from the regularly released package.

@erickj What is happening is that the thin_check call failed, and consequently a thin_repair action was initiated. The thin_repair action was making use of your backup metadata device, and thin_repair seems to have reported that device as too small, almost certainly because it is too small (16 MiB).
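
To make that sequence concrete, here is a sketch of the underlying thin-provisioning-tools calls; the exact flags stratisd passes may differ, and the device names follow the lsblk output above:

$ sudo thin_check /dev/mapper/stratis-1-private-7e18ddcd99244c92b926100a7498630b-flex-thinmeta      # fails on the stray zeros
$ sudo thin_repair -i /dev/mapper/stratis-1-private-7e18ddcd99244c92b926100a7498630b-flex-thinmeta -o /dev/mapper/stratis-1-private-7e18ddcd99244c92b926100a7498630b-flex-thinmetaspare

The second step fails when the output device (here, the 16 MiB spare) is smaller than the repaired metadata requires.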