
Re: [Evms] EVMS Conference call (04/10/01) minutes


From: Andrew Clausen
Subject: Re: [Evms] EVMS Conference call (04/10/01) minutes
Date: Wed, 25 Apr 2001 16:34:24 +1000

Andreas Dilger wrote:
> 
> Mark writes:
> > *) "Compatibility" volumes
> > After being unable to resolve a method of handling EVMS
> > metadata for ALL cases involving compatibility or
> > legacy volumes, the EVMS design has been changed with
> > regard to legacy volumes. Legacy volumes will now
> > not be required to have any additional metadata stored
> > on the media. EVMS will deal with them much the same
> > way Linux deals with them today, and they will still
> > have the same problems as they do in Linux today, the
> > main problem being non-persistent device names. Only
> > EVMS volumes, with their metadata, will have naming
> > persistence.
> 
> One possibility to reduce the number of "pure compatibility"
> volumes you need to handle is to shrink the filesystem slightly
> to allow an EVMS metadata block to be placed at the end.  Only
> in strange circumstances (for filesystems at least) is the
> filesystem 100% full and can't be shrunk at least a little bit.

For FAT (on Windows), the size of the partition must equal
the size of the file system.  Otherwise, Windows will silently
destroy your file system.

Also, it means you have "2 versions" of the MSDOS partition
table, etc.  We can assume that most tools won't recognise
our new version, and I think this will lead to lots of big
problems.  (Remember we were talking about support before...
hehe)

> > EVMS will map all volumes under its major number.
> > Nodes for legacy volumes will be created to match
> > the name space from which they originated. For
> > example, a Windows volume may be mapped to
> > /dev/windows/c or /dev/windows/d. Standard Linux
> > volumes might be mapped to /dev/hda4 or /dev/sdb5.
> 
> Does this mean that compatibility volumes, despite having
> a name like /dev/hda4, will have a major number under EVMS?

I presume we will get rid of /dev/hda4.

> That may confuse software, and will also make us run out of
> device numbers faster.

We have plenty of numbers... (Linus is giving us more!)

> > However, since the block identifiers, for now, are
> > major/minor numbers, and these are not persistent
> > and can & will change under certain drive/partition
> > configuration changes, the nodes in the dev tree
> > may require updates when the configuration changes
> > occur.
> 
> One mistake that the current LVM has made (IMHO) is that
> even for LVM volumes, the major/minor may still change
> (even after a reboot!) because it regenerates the device
> nodes _each time_ vgscan is called.  If at all possible,
> the VG/LV major/minor numbers should _not_ change except
> after a major catastrophe.  Where LVM goes wrong is that
> in order to refresh VG configuration, you run "vgscan"
> which rebuilds all of the devices.  The proper way to do
> it (IMHO) is for something like "pvscan" to detect new
> PVs/partitions/disks/etc and then pass only the new ones
> to the configuration tools, leaving all of the existing
> VG/LV configuration alone.  There is no reason that we
> need to have a packed device number space.

Why is major/minor an issue?  Who uses major/minor?
I would have thought that /dev/XXX is the only thing that
needs to be persistent (via devfs or whatever).

> > *) Recoverability
> > The topic of volume data and metadata recoverability
> > was discussed. One member of the EVMS team was concerned
> > that decentralizing the location of each feature's
> > metadata made the process of recoverability more
> > difficult. One idea that was proposed was to have a
> > userspace program that could back up and restore a
> > volume's metadata. Another approach was to provide
> > redundancy and distribute redundant copies throughout the
> > volume to increase the chances of finding a readable copy
> > of the metadata.
> >
> > The group felt that both suggestions were good and
> > ideally would probably want both methods available.
> 
> If at all possible, having backup metadata at the device level
> is best.  The current LVM scheme only has backups inside the
> root filesystem, and this has been a constant source of user
> problems.  First of all, if the rootvg has a problem, the backup
> is not available.  Also, users need to do the restoration themselves,
> usually after a debugging session on the linux-lvm mailing list.
> Also a problem is that while the current LVM code has _some_
> redundancy of metadata at the device level, it doesn't check to
> see if it is consistent, leading to other sorts of problems.

We were talking about 2 types of "backup":
* redundancy: each feature, etc. would be responsible for
this.
* backup to a "file".

You could make that file external (recover via nfs...), or
a partition, or whatever.

(perhaps this is an argument for LVM-on-partitions...?)

> AIX LVM did this totally correctly:
> - Each PV in a VG has a full copy of the VGDA (if only 1 PV in the
>   VG, it has 2 copies).  A quorum of VGDA copies must agree before
>   automatic VG activation is possible.
> - Each VGDA has a timestamp at the beginning and end (possibly even
>   at the beginning and end of each major data struct).  This ensures
>   that the data is known to be invalid if the beginning and end time
>   stamps don't match.  In this case, we _always_ have at least one
>   other copy (updates made synchronously in sequence) which is known
>   good.

What do the timestamps represent?  Last access, or something?
(If it wasn't part of the quorum, it doesn't get used, or something?)

> - The PVID (identifier) is never changed after pvcreate, so that it
>   is less likely to be corrupted.  The Linux LVM writes (bad, IMHO)
>   information like the _real_ device name and major/minor into the
>   PV struct on disk (NB: this is _not_ a symbolic name or LVM
>   major/minor number, but rather /dev/hda1 and 0x0301)!
> - For PVs missing from a VG (with no mirrored backup of the LE), you
>   could still use the rest of the LVs that were not affected by the
>   missing PV (because we had a full backup of the metadata on each disk).
> 
> If the IBM folks have access to the full AIX LVM metadata layout on disk
> (and are allowed to use it for EVMS mind you, watch for patents!), this
> would be an excellent starting point on how to set up EVMS.

Agreed.  (I suspect the *code* would be largely useless, due to
different APIs, etc.)

> > *) Metadata stored at the front vs back of a volume
> > Everyone agreed that storing metadata at the back of
> > a volume was most desirable from a resizing
> > standpoint. However, not everyone agreed that it should
> > be the policy to enforce this.
> 
> Wouldn't it be the other way around?  Storing metadata at the _start_
> is better from a resizing POV.  This way you don't need to re-write
> the metadata when you resize a volume.  Storing it at the end is
> better from a compatibility standpoint (easier to shrink a filesystem
> slightly and then make it an EVMS volume).  This would also make it
> slightly safer from an "I did a bad thing and re-formatted/partitioned/
> dd if=/dev/zero of=/dev/hdX my raw device" standpoint, because these
> all destroy the start of the partition and not the end.

The problem is when the size of the metadata is dependent on the
size of the volume.  (Or, when you want to add features).

When you grow, you also have to grow the metadata, which means
the actual data needs to get moved.  Since it is easier to
write file system resizers that can't move the start, there
will probably always be file systems that only have such
resizers.  (I'm only aware of my FAT resizer being able to
resize-the-start)

Actually, the data might not need to be moved, because with
LVM, it should be possible to "insert" more space at arbitrary
locations in the volume.  But, the granularity of the inserted
space is usually the size of a physical extent... (although
this could probably be hacked, by discarding part of it...
but this probably complicates things too much)

So, I still think we need resize-the-start, for metadata
at the front.

> In the end, you want to be able to handle both.

Agreed.  I think this is fairly easy to do.  It's just that
some things will require resize-the-start support.  I think
the bureaucracy around this is fairly easy, via Parted's
constraint solver.

> > *) UNDO capability
> > Everyone agreed that virtual create is required and
> > that UNDO should be an implementation goal.
> 
> This is purely a user-interface issue, IMHO.

I don't think so.

UIs should show the user:
(1) what operations will be done
(2) how the system looks at the moment
(3) how the system will appear after the operations
(as they stand at the moment) are committed

Also, you need to "replay" the operations physically afterwards.
You shouldn't have to ask the same questions twice, etc.

Note: since it's REALLY [i.e. NP] hard to go from (2) to
(3) directly (in the case of partitions), the user needs to
have control over (1).  For LVM, I think it's straightforward
to go from (2) to (3).  But there still needs to be
a representation of the operations, as the GUI (or some
intermediate layer) will need to order operations, because
there are some dependencies, for resizing, for example.

(Basically, this should be an ordering of commits on
different objects.)

This is all in-memory stuff... so you have an in-memory
representation of what the state of the disks will become,
and then provide a mechanism for committing the in-memory
representation to disk (via operations).

BTW: this stuff only applies to metadata.  If there is
on-disk DATA (eg: an existing file system with files, etc.)
then you hack a reference to this somehow.

So, representing the disk means understanding the disk,
which means it is no longer a front-end issue.

To implement undo, you basically have 2 options:  

OPTION 1 - inverse operations
--------
For each operation you apply, you store its inverse
operation, to revert to the previous state.  (For example,
when deleting a logical volume, you store an operation
to create the logical volume)
        I think it's extremely difficult to come up with the
inverse operations, and it means your API has to have certain
properties (which are perhaps nice to have, but still hard
to design/implement):

* there must be an operation that allows complete
construction of an object (a file system, logical volume,
whatever).  We don't have this at the moment... for example,
you can't pass a UUID to mke2fs.

* handling references is a PITA.  If you delete two LVs that
were linked (linear RAID), then how do you resurrect these
references?

* the operations must be completely non-interactive (eg,
no questions like "do you want to convert from fat16 -> fat32"),
because you need to apply each operation twice (first
"in-memory", then on disk)
        This probably isn't a disaster, because stuff
like this can be passed as parameters.  (And if you NEED
to make a choice, then you can fail, and ask them to
provide better parameters)


OPTION 2 - Checkpoints
--------
Each operation is implemented as a "change state of object
in memory" / commit().  So, you could do things like:

        ped_file_system_resize (fs, new_geom);   /* in-memory only */
        ped_geometry_set_end (new_geom, new_geom->end + 1024);
        ped_file_system_resize (fs, new_geom);   /* replaces the first resize */
        ped_file_system_commit (fs);             /* the one real resize happens here */

And only one resize would be executed.
ped_file_system_resize() would only modify the in-memory
super-block, or whatever.

Behind the scenes, you should save the state of the old
file system (or partition table, VG, or whatever), and
the new one.  So, you have the "on disk checkpoint",
and the "in memory" checkpoint.  (In the case of
ped_file_system_create(), you have no on-disk checkpoint)

However, when it comes to commit(), you may wish to
commit in steps.  This is important when doing partition
shuffles, for example.  So, you should be able to
push checkpoints, and be able to commit up to a checkpoint.

Also, undo would just be removing checkpoints from the end.

This all sounds very nice in theory, but handling the
references (this checkpoint of file system X exists on
checkpoint Y of logical volume Z) gets hard to manage.
It may be doable... but it's a hard problem.  (Things
like ref-counting, etc...)

Anyway, back to your original point... the front end
needs to have pretty solid representations of what's
going on for undo.  It may be able to do this "on top"
of an EVMS API, but I think it is easier to do it in the
EVMS API.  In any case, if it were to be done "on top",
the EVMS API would still have to have the "nice
properties" required for inverse operations to work.

Andrew Clausen


