grub-devel
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

RE: [PATCH] getroot: Correctly handle missing btrfs device identifiers.


From: James Johnston
Subject: RE: [PATCH] getroot: Correctly handle missing btrfs device identifiers.
Date: Fri, 18 Mar 2016 04:48:12 -0000

Hi Andrei,

Thank you for considering and reviewing my patch.  I did a little more checking 
of the actual btrfs sources in kernel, and also userspace btrfs tools to try to 
address your concerns, since the ioctl interface is basically undocumented.

The scenario I am trying to address is a system such as my own, which reports 
the following:

$ btrfs fi show /
Label: 'SystemRoot'  uuid: 5c756382-8dc5-4afe-ac87-a604e9e7a9a2
        Total devices 2 FS bytes used 2.65GiB
        devid    2 size 5.60GiB used 3.53GiB path /dev/mapper/System1RootCrypt
        devid    3 size 5.60GiB used 3.53GiB path /dev/mapper/System0RootCrypt

btrfs-progs v4.0

As you can see: (1) there are only two devices, (2) they are numbered with 
devids #2 and #3.  There is no #1 device and if you ask for it with 
BTRFS_IOC_DEV_INFO, you get ENODEV.  (How did my system get this way?  I 
converted all the disk partitions to use LVM, one mirror at a time - this 
involved deleting/adding btrfs disks.  Somehow after this whole process I 
realized I had this new numbering scheme.)

Yet because there is no device #1, grub_find_root_devices_from_btrfs returns 
NULL.  This means that "grub-probe --target=device /" only prints one device, 
not two (and the one device is printed by fallback code that is called for when 
grub_find_root_devices_from_btrfs returns NULL).  That is the behavior I am 
trying to correct here.

> -----Original Message-----
> From: Andrei Borzenkov
> Sent: Thursday, March 17, 2016 17:12
> 
> 17.03.2016 18:41, James Johnston пишет:
> > From 97bcb9eb8c34329a8ca1ec88fd89296e273ceb24 Mon Sep 17 00:00:00 2001
> > From: James Johnston <address@hidden>
> > Date: Thu, 17 Mar 2016 15:10:49 +0000
> > Subject: [PATCH] getroot: Correctly handle missing btrfs device identifiers.
> >
> > The btrfs driver's BTRFS_IOC_FS_INFO ioctl only tells you the maximum
> > device ID and the number of devices.  It does not tell you which
> > device IDs may no longer be allocated - for example, if a device is
> > later deleted from the file system, its device ID won't be allocated
> > any more.
> >
> > You must then probe each device ID with BTRFS_IOC_DEV_INFO to see if
> > it is a valid device or not.  It is not an error condition for this
> > ioctl to return ENODEV: it seems to mean that the device was deleted
> > by the user.
> >
> 
> Kernel returns ENODEV if device was not found; we do not know why it was
> not found. We need some other way to verify that all devices are
> actually present to preserve current behavior.
> 
> Actually we probably need to handle redundant layout where non-essential
> devices are missing but that is another topic.

So, I did some investigating to see where the kernel returns ENODEV and it 
seems that it just returns this if there just isn't a btrfs device with that ID 
in the driver's list of devices.  Since the ioctl is not terribly helpful 
beyond ENODEV, we just have to dig into the sources to find out why. 

Note that it seems the driver list contains physically missing btrfs devices - 
if a btrfs device is physically missing, it is not cause to return ENODEV.

Here is the BTRFS_IOC_DEV_INFO ioctl:

https://github.com/torvalds/linux/blob/82b666eee71618b7ca812ee529af116582617dec/fs/btrfs/ioctl.c#L2746

As you can see, the function only returns ENODEV when btrfs_find_device returns 
NULL.  Let's examine that and find out:

https://github.com/torvalds/linux/blob/82b666eee71618b7ca812ee529af116582617dec/fs/btrfs/volumes.c#L6113

As we can see it just examines all devices for the file system and sees if any 
match the given devid (since we aren't comparing UUIDs).  Seems obvious enough.

What about if a device is missing?  Note that a device will still be in the 
list even if it is physically missing.  It just won't have a device path (i.e. 
null).

btrfs_find_device_missing_or_by_path function specifically returns devices 
marked missing:
https://github.com/torvalds/linux/blob/82b666eee71618b7ca812ee529af116582617dec/fs/btrfs/volumes.c#L2073

add_missing_dev does the obvious:
https://github.com/torvalds/linux/blob/82b666eee71618b7ca812ee529af116582617dec/fs/btrfs/volumes.c#L6133

You can search that file for the word "missing" to try to follow along and see 
that the driver does track missing devices.

In the userspace tools, we can look at btrfs-progs to see how they enumerate 
physical devices.  The load_device_info function enumerates devices in 
basically the same way I do in my patch - I tried to imitate the userspace 
tools, figuring I can't go wrong with that:
http://git.kernel.org/cgit/linux/kernel/git/kdave/btrfs-progs.git/tree/cmds-fi-usage.c?h=v4.4.1#n511
In this loop you can see that ENODEV is not considered a real error.  They just 
"continue" the loop to the next possible devid.  Also you see they loop to 
"fi_args.max_id" like how grub does, and just try all the possible IDs.

Noteworthy in the above btrfs userspace load_device_info function is that they 
handle physically missing devices:

                if (!dev_info.path[0]) {
                        strcpy(info[ndevs].path, "missing");
                } else {
                        strcpy(info[ndevs].path, (char *)dev_info.path);

You can see it when running "btrfs fi show /":

Data,RAID1: Size:3.00GiB, Used:2.53GiB
   /dev/mapper/System0RootCrypt    3.00GiB
   missing         3.00GiB

My patch does no such special check, because the original grub code didn't do a 
special check either.  In that regard, your request of "preserving current 
behavior" has been retained.  When grub-probe encounters a missing btrfs device 
(as opposed to an unallocated devid), I get this for "grub-probe -v 
--target=device /" after taking System1RootCrypt device offline:

./grub-probe: info: cannot open `/boot/grub/device.map': No such file or 
directory.
./grub-probe: error: failed to get canonical path of `'.

That's because our grub_find_root_devices_from_btrfs function didn't do any 
special handling for the null string, which signifies a missing device.

I'm not sure if this is really a desirable error to have happen.  But that's 
another issue for another patch I think.  (How would you like to see it 
handled?)

In summary:

1.  If the device is physically present, you get no error and the path is 
returned normally.
2.  If the device is absent (e.g. the file system was mounted degraded), you 
get a zero length path.
3.  If the device ID is not part of the file system to begin with, you get 
ENODEV.

If I'm getting any of the above wrong, please let me know. :)

> Why this temporary variable? Why you cannot check errno directly?
> 
> > +       {
> > +         grub_util_warn(_("btrfs device info could not be read "
> > +                          "for %s, device %d: errno %d"), dir, i,
> > +                          last_error);

Because I wanted to log it if there was a problem; this log message will 
hopefully help the next guy who has some problem with this function.

The "_" is presumably a call to gettext function and I don't know if gettext 
might do anything that would reset errno.  The safe thing is just to save it to 
a temporary and then call gettext via "_".

Best regards,

James Johnston





reply via email to

[Prev in Thread] Current Thread [Next in Thread]