Re: [Linux-NTFS-Dev] NTFS resizer


From: Anton Altaparmakov
Subject: Re: [Linux-NTFS-Dev] NTFS resizer
Date: Thu, 16 Aug 2001 20:57:08 +0100

At 09:25 16/08/01, Andrew Clausen wrote:
On Thu, Aug 09, 2001 at 02:04:37AM +0100, Anton Altaparmakov wrote:
> At 01:18 09/08/2001, Andrew Clausen wrote:
> >On Wed, Aug 08, 2001 at 12:45:22PM +0100, Anton Altaparmakov wrote:
> > > You have a structure: the mft record. It contains all attributes nicely
> > > sorted.
> >
> >It contains the on-disk attributes, not ntfs_attr.
>
> That's my point. There should be no such thing as ntfs_attr. The on disk
> representation is entirely sufficient to do everything.

I guess this is ok with lots of accessor functions, etc.
(These are necessary for endianness, and for making things like
accessing the name easier.)

Obviously all accesses have to be wrapped with leXY_to_cpu() and vice versa, but that is nothing unusual. After all, they just expand to nothing when compiled on a little-endian architecture.
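To make that concrete, here is a minimal sketch of such wrappers (the kernel and libntfs have their own definitions; this is just to show the idea that the conversion compiles away on little-endian hosts):

#include <stdint.h>

/* Sketch only: a no-op on a little-endian host, a byte swap otherwise. */
#if defined(__BYTE_ORDER__) && __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
#define le16_to_cpu(x) ((uint16_t)(x))
#define le32_to_cpu(x) ((uint32_t)(x))
#else
#define le16_to_cpu(x) ((uint16_t)((((uint16_t)(x) & 0x00ffu) << 8) | \
                                   (((uint16_t)(x) & 0xff00u) >> 8)))
#define le32_to_cpu(x) ((uint32_t)((((uint32_t)(x) & 0x000000ffu) << 24) | \
                                   (((uint32_t)(x) & 0x0000ff00u) <<  8) | \
                                   (((uint32_t)(x) & 0x00ff0000u) >>  8) | \
                                   (((uint32_t)(x) & 0xff000000u) >> 24)))
#endif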

> >I think attributes need to be abstracted from MFT records, because
> >they may need to be shuffled around between MFTs.  (Eg: an attribute
> >gets inserted, or a run-list grows...)
>
> Then space is made for this attribute and it is inserted. mkntfs even goes
> as far as CREATING the attributes in their correct place in the mft record
> itself, i.e. no memcpy involved for that.
>
> This is the only way to catch when attributes become too big and need to be
> made non-resident or when they have to be moved out to other mft records.

Why "the only way"?

Well, if you are not working within the mft record (i.e. you have copied the attributes into memory structures), you don't know how much space is in use in the mft record. You could argue that you don't care about this until you reach the stage of writing out to disk, but that doesn't really work, because you have to differentiate resident vs. non-resident attributes in memory, too: for resident ones your in-memory attribute contains the value of the attribute, while for non-resident ones the value is stored on disk and your in-memory copy only holds the run list. You can't keep non-resident attribute values in memory, because an attribute value can easily be hundreds of megabytes or even gigabytes worth of data... Nobody has that much RAM, or even that much swap.

(Note that the current libntfs has a big problem here, as it always loads the whole attribute value at once, which is of course madness, but that has been ok for me so far because the library is only used by ntfsfix, which doesn't do anything too fancy with attribute values except for loading the whole of $LogFile, and that is limited to 4MiB maximum size (IIRC) in normal circumstances, so it is not too bad...)
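To make the resident/non-resident asymmetry concrete, an in-memory attribute would have to look something like this (a rough sketch with made-up names, not the libntfs types):

#include <stdint.h>

/* Sketch only; names are made up for illustration. */
struct sketch_run {
        int64_t vcn;            /* virtual cluster number  */
        int64_t lcn;            /* logical cluster number  */
        int64_t length;         /* run length in clusters  */
};

struct sketch_attr {
        uint32_t type;          /* attribute type, e.g. $DATA        */
        int      resident;
        /* resident: the value itself is held in memory              */
        void     *value;
        uint32_t value_length;
        /* non-resident: only the run list is held in memory, the
         * value stays on disk (it may be gigabytes in size)         */
        struct sketch_run *runlist;
        int      runlist_entries;
};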

BTW: when you need to move an attribute into
another mft record... you might be able to move it into a record with
lots of free space, etc.

When you are moving an attribute out of an mft record, you move it to an empty mft record. It might be possible to have two attributes in the same extension mft record, but I have never seen that happen in Windows. Each attribute extent is in a separate mft record, so unless someone observes otherwise, I would operate on the assumption that the Windows NTFS driver will choke if you put two extents into the same mft record.

I would have thought this would be easy to do at "clean" time.

> >I was thinking this shuffling should happen when the file is synced
> >to disk.  (ntfs_file_sync()?)
>
> This is a possible approach and is indeed what the old NTFS driver does
> (very badly so it trashes your fs most of the time...). I hate that
> approach (but I would accept it, if it were written properly, which I think
> is extremely hard to do because you are creating one single mammoth
> function which you will have a lot of fun to debug or alternatively you
> will have millions of if/else or switch statement to capture all the
> special cases).

I like this approach for exactly the opposite reason: it has fewer
special cases, and is much more elegant (almost trivial!)

Not a single mammoth, but a single miniature function...

Maybe I'm not understanding something.

Well, you can look at the kernel NTFS driver (just pick any 2.4.x kernel out there, it doesn't really matter, the concept is the same in all of them) if you want to see how complex it is, and how it still doesn't manage to get it right at all.

Alternatively, try to implement it! Perhaps my view is too biased against this approach and I am not objective. But AFAICS this cannot be done in a single miniature function. If you manage to do that, I will take my hat off and bow down in awe... (((-;

> - It also has nasty side effect of resetting the attribute
> sequence numbers and even reshuffling them completely which is plain WRONG
> but considering we overwrite the journal with 0xff bytes not a too big
> problem.

why?  (BTW: the update is completely atomic)

Because every attribute in an mft record is assigned a sequence number, unique within that mft record, at the time the attribute is created. It has nothing to do with the order of the attributes in the mft record; it's just about order of creation. If you keep the $LogFile (and $UsnJrnl for that matter, but see below), the values in the journal(s) will conflict with the values you wrote out, and if the journal gets replayed or rolled back you will end up with corruption. But never mind that: if the journal is still present you will end up with corruption anyway...

(Why is shuffling wrong?)  I'm new to all of this!

It only matters because of journalling. So in fact, as long as there is no journal left over, it should be ok.

> btw. We really need to delete the $Extend\$UsnJrnl file when
> writing to the partition or that could screw us badly, too, but deletion is
> not implemented yet at all.

Why?

$UsnJrnl is very similar to $LogFile, but it doesn't log data changes. If you like, it is a lightweight version of $LogFile for use by applications: backup programs and antivirus programs, for example, can look at it to see whether a file has changed since a certain time, and then they know whether they need to back up/scan the file again or not. That's just an example of what it can be used for. Obviously, if you write to the partition, the contents of the log are out of date, and to ensure data integrity you have to deactivate the log. This happens in a different way from $LogFile (which you cannot delete, as it is an essential system file): you simply delete the file to deactivate the $UsnJrnl. Windows will reactivate it on reboot (or the program wanting to use it will).

> For example if you are doing a new layout of all attributes from scratch
> you will need to check every single attribute for being a certain type and
> for whether it is allowed to be resident or non-resident or both, then you
> need to check whether there is still enough space in the mft record to add
> it and if not you have to start from scratch again or you have to edit what
> you have done so far to make space.

Ah, the knapsack problem strikes again (I've run into this a bit in
partitioning!).  Starting "from scratch" is no worse.  It's NP-hard either
way.

'NP'?

> - Now if you start editing what you
> have already created you end up having _all_ the functionality required to
> handle attributes in place in their mft records AND because you are doing
> the over the top flush function you have almost all the code duplicated
> there for doing all the checks etc.

I don't see why... it still seems rather trivial to me.

You consider editing mft records trivial? Maybe our perspectives on what counts as trivial differ... I guess the concept is trivial, but to implement all the functions required you are talking about a lot of lines of code. (The ntfs.sys binary in Windows is >500kiB for a reason... and it uses other drivers/kernel DLLs extensively, too.)

It gets especially interesting when you start using extension mft records, because you then need the attribute list attribute to be present. But to build that attribute you already need to know where all the attributes are and how many extents each spans; otherwise you don't know what the size of the attribute list attribute's value will be, so you can't reserve space for it in the mft record, and in turn you can't write the other attributes if you are not going to be editing the mft record afterwards. (Note that the attribute list attribute is type 0x20, which means it sorts between the standard information and the file name attributes, so most attributes are written after it.) That sounds like a Catch-22 situation to me if you are trying to write out all attributes into an mft record in one go.

This also raises the question of whether you will maintain an attribute list attribute in memory, or whether you will create it at write-to-disk time, when all the other attributes are generated and written to the mft record. I would like to know what concept you will follow to handle attribute list attributes. I don't see any straightforward way to do it if you hold all the attributes in separate in-memory structures rather than in the mft record...

For NTFS TNG I am thinking of making attribute list attribute handling part of a slow code path, because they occur very infrequently. On the other hand, it is most often $MFT itself that uses attribute lists (due to growing fragmentation with increasing age of the volume), and $MFT needs to be accessible very fast, so to be honest I am not actually too sure how they will get implemented eventually. I am ignoring the problem for the moment until I get a working driver without support for those beasts, and I will think about them then...

> What about the mft record then? I mean when you are writing back which mft
> record will you write to? The same one (you have to otherwise you would
> have to release the previous one and allocate a new one...)? How will you
> know which one that was?

No problem since when writing to the attribute, there is no allocation.
So, when you rearrange the MFT's MFT records, they just move, but this
doesn't hinder writing the MFT's MFT records.

That is not that easy. Your "just move" is a complex set of operations. Each mft record is allocated in the mft bitmap (a bitmap with one bit for each existing mft record: if a bit is 1, the corresponding mft record is in use; if it is 0, the record is free). So to move an mft record you need to deallocate the old one and allocate the new one. Further, if you move a file from one mft record to another, all directory entries pointing to this file have to be updated with the new mft record, and the same goes for all extension mft records, which need to point back to the base mft record. Thus, IMHO, moving a file's mft record about is a bad idea unless you absolutely have to do so.

Add to that that Unix/Linux utilities expect inode numbers to be persistent across mounts, which moving mft records about on every write would break, so utilities like tar, for example, would not work properly.
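Just the mft bitmap part of such a "move" already looks like this (a hypothetical sketch; the directory entry and extension record updates are the expensive bits and are not shown):

#include <stdint.h>

static void bitmap_clear_bit(uint8_t *bmp, uint64_t bit)
{
        bmp[bit >> 3] &= ~(uint8_t)(1u << (bit & 7));
}

static void bitmap_set_bit(uint8_t *bmp, uint64_t bit)
{
        bmp[bit >> 3] |= (uint8_t)(1u << (bit & 7));
}

/* Hypothetical helper: mark record old_no free and new_no in use in an
 * in-memory copy of the $MFT bitmap.  The caller still has to rewrite
 * every directory entry and every extension record referring to old_no. */
void move_mft_record_in_bitmap(uint8_t *mft_bitmap, uint64_t old_no, uint64_t new_no)
{
        bitmap_clear_bit(mft_bitmap, old_no);
        bitmap_set_bit(mft_bitmap, new_no);
}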

Also, when writing to an attribute, while there is no allocation of mft records, there will be allocation of disk clusters for non-resident attributes. Again, you can't just keep the non-resident attributes in memory due to their possibly huge size. [The maximum attribute size for the $DATA attribute of any mft record is 2TiB...]

> Also, surely parted will not be working at file level but much deeper below
> in the inode/mft record level? Or will it not treat files as opaque
> structures and use them to access the underlying mft records?

Well inode == file NOT mft, IMHO.  It should work at the inode level, yes.

Terminology I have used so far is:

file = an abstract file system object which has a name (a dentry) connected to an inode. Programs and processes work with files, and the kernel uses them to keep track of what files are open, etc.

inode = the kernel object underneath the concept of files. Several files can point to the same inode (hard links, where multiple dentries point to the same inode), and in NTFS the reverse is also true: several inodes can make up one file. One specific inode is, in my definition, the in-memory equivalent of one specific mft record of an NTFS volume.

Of course if you define a file to be an inode then it becomes clear we were talking past each other...

Basically your definition of inode is my definition of file. (-;

And in your definition you take away the intermediate object between my file and the mft record (i.e. my inode).

But if you do that, how can you work with your definition of files? I would have thought that parted is a low-level program which doesn't open/read/write/close files but instead operates on the on-disk structures themselves. At least in my understanding, a high-level interface like open/read/write/close does not allow you to specify where a file's data is written on disk.

> For example if the resize requires some data to be moved because it would
> be left in unallocated space otherwise, how would you do that? You need low
> level control of cluster allocations, file level access is useless in this
> context.

Well, in the first pass (traversing all inodes), it marks all blocks
(in a big in-memory bitmap) that need to be copied/relocated.  (This
includes the above-mentioned blocks, and also the MFT, for doing the
atomic update trick)

Be very careful here. The in-memory bitmap, which has to hold one bit for each cluster of a large volume, can itself be very large. We are talking 2.5MiB for my 10GiB NTFS partition at home, and recently I have seen people use NTFS partitions of up to 80GiB, which would mean a bitmap size of 20MiB. Next year (or whenever) I would not be surprised to see people with 800GiB partitions (requiring a bitmap of 200MiB).
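For the record, the figures above correspond to 512-byte clusters; with larger clusters (e.g. 4kiB) the bitmap shrinks proportionally. A sketch of the arithmetic:

#include <stdint.h>

/* One bit per cluster: 10GiB / 512B = 20,971,520 clusters
 * -> 2,621,440 bytes = 2.5MiB of bitmap; 80GiB -> 20MiB; 800GiB -> 200MiB. */
static uint64_t cluster_bitmap_bytes(uint64_t volume_bytes, uint32_t cluster_bytes)
{
        uint64_t clusters = volume_bytes / cluster_bytes;
        return (clusters + 7) / 8;
}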

So, when copying blocks to free space, we need to allocate clusters, yes.
No big deal.

A file-based interface (you said file == inode) is not compatible with cluster allocation... Again, I might be misunderstanding you. If you define what your file/inode will look like, it will be easier for me to understand what you have in mind. For me, accessing a file is done like this:

f = open(filename);
seek(f);
read(f);
write(f);
close(f);

And there is nothing else you can do with it. At least that is how Unix/Linux sees files...

If this is not the interface we are talking about but something like:

i = read_inode(mft_record_number);
relocate_inode(i);
(de)allocate_inode(i);
write_inode(i);
etc., then I agree, that's the right level to do it.

> Also you will need to rewrite every single run list on the volume by
> adding/subtracting a fixed delta to every LCN value. - You can't do this at
> a file level access either.

File == inode.  Files/Inodes have attributes, not MFT records.

That is wrong. MFT records have attributes. The mft record is the fundamental file system unit. If this weren't the case, it would not make sense to localize the attributes of one file in the same mft record (apart from speed of access, obviously).
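For reference, this is roughly what an mft record looks like on disk (a sketch following the Linux-NTFS project's layout documentation; all multi-byte fields are little-endian, and the attributes are packed one after another starting at attrs_offset inside the record):

#include <stdint.h>

/* Sketch of the on-disk mft (FILE) record header. */
struct sketch_mft_record_header {
        uint8_t  magic[4];           /* "FILE"                                  */
        uint16_t usa_ofs;            /* update sequence array offset            */
        uint16_t usa_count;          /* update sequence array size              */
        uint64_t lsn;                /* $LogFile sequence number                */
        uint16_t sequence_number;    /* incremented when the record is reused   */
        uint16_t link_count;         /* number of directory entries (hard links)*/
        uint16_t attrs_offset;       /* first attribute starts here             */
        uint16_t flags;              /* in use / is directory                   */
        uint32_t bytes_in_use;       /* space currently used in the record      */
        uint32_t bytes_allocated;    /* size of the record, typically 1kiB      */
        uint64_t base_mft_record;    /* 0 for a base record, otherwise a
                                        reference back to the base record       */
        uint16_t next_attr_instance; /* next attribute instance number          */
} __attribute__((packed));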

> This is why I don't understand why you want to work on a file level...
>
> My getfile would look like:
> {
>          is buffer in cache? -> yes: return buffer

what is buffer?

I would like to keep a cache in memory storing mft records so I don't have to keep reading/writing them from/to disk all the time. So when you open a file, you would check whether the base mft record corresponding to the file is already in this cache or not. In this context, "buffer" would be the mft record read from disk into a memory buffer. Admittedly, in user space you do not necessarily need to do this, because the kernel is already caching device accesses via the buffer cache (unless you are using raw I/O).
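In rough code, just to pin down what "buffer" means here (everything below is a hypothetical sketch, not NTFS TNG or libntfs code; a real cache would hash and evict properly):

#include <stdint.h>

#define MFT_RECORD_SIZE 1024
#define CACHE_SLOTS     64

struct cached_mft {
        uint64_t mft_no;
        int      valid;
        uint8_t  data[MFT_RECORD_SIZE];
};

static struct cached_mft cache[CACHE_SLOTS];

/* Assumed to exist elsewhere; returns 0 on success. */
extern int read_mft_record_from_disk(uint64_t mft_no, uint8_t *buf);

/* Return the cached buffer for mft record mft_no, reading it from disk
 * only if it is not in the cache yet. */
uint8_t *get_mft_record(uint64_t mft_no)
{
        struct cached_mft *c = &cache[mft_no % CACHE_SLOTS];

        if (c->valid && c->mft_no == mft_no)
                return c->data;                 /* buffer is in the cache */
        if (read_mft_record_from_disk(mft_no, c->data) < 0)
                return NULL;
        c->mft_no = mft_no;
        c->valid = 1;
        return c->data;
}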

> my file sync would look like:
>
>          for (all mft records owned by file) {
>                  lock mft record cached copy()
>                  pre write mst_fixup buffer()
>                  write to disk()
>                  fast post write mft fixup buffer()
>                  unlock buffer()
>          }
>
> Simple, only 6 lines of code (minus error handling).

But doesn't handle run lists overflowing.

Run lists cannot overflow; I am not sure what you mean. In my scheme the run list is already compressed into the mapping pairs array inside the file's mft records, so when they are flushed to disk, the compressed run list is flushed to disk with the mft record it is stored in.

The mapping pairs array is updated every time the run list is changed in memory. When you change the run list, it means that you must have allocated clusters on disk (clusters are what the run list points to, after all), and that obviously involved updating the cluster bitmap of the partition (the $Bitmap system file). If you don't modify the run list / mapping pairs array at the same time, and your cluster bitmap is committed to disk and then the computer crashes, you end up with clusters which are allocated on disk but not associated with anything, which is not a good idea.
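To spell out the terminology: in memory a run list is just a list of extents, while on disk the same information is stored compressed as the mapping pairs array inside the attribute record itself, which is why it cannot "overflow" separately from the mft record holding it. A sketch (made-up names):

#include <stdint.h>

/* In memory: one extent of an attribute. */
struct sketch_run {
        int64_t vcn;     /* virtual cluster number within the attribute */
        int64_t lcn;     /* logical cluster number on the volume        */
        int64_t length;  /* run length in clusters                      */
};

/* On disk the same information lives in the mapping pairs array of the
 * attribute record: each pair starts with a header byte whose low nibble
 * gives the byte count of the length field and whose high nibble gives
 * the byte count of the signed LCN delta that follows; a zero header
 * byte terminates the array. */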

> >I'm not convinced we want this [un]map() thing on records.  Records
> >aren't accessed on their own, except in the context of a file.  Files
> >are the atomic pieces... So, I think we should just have {read,write}
> >on records, and [un]map() on files, although I've called it get()
> >here.  (unmap() is just free() on a clean file)
>
> They are in my implementation... Files have nothing to do with mft records.
> Directories are mft records, too.

Directories are files.

Sorry, that is what I meant to write.

Maybe this is the source of our misunderstanding.
I'm saying file == inode == "set of MFT records".

Ok. Understand now.

I'll call it inode if that sounds better.  (Just, I thought that was
NTFS terminology... sorry!)

In NTFS terminology it gets very confusing, because an MFT record is a FILE record (that's the magic identifier of an MFT record). Thus, when you talk about files, it is not clear whether you are talking about actual files (in my definition) or about mft records. I have hence personally adopted the Unix/Linux terminology of what a file and an inode are, and have equated an inode with an mft record for simplicity.

Technically, you are correct that a file represents one inode, which == "set of MFT records", but in my twisted mind I like to think of it as: a file represents one "base" inode (the base mft record), which can point to "helper inodes" (extension mft records), each being ONE mft record.

I only do this because it makes working with mft records easier, at least in my NTFS TNG implementation. Basically, I integrate MFT records very tightly with in-memory inodes, which means that my (un)map_mft_record functions also modify the struct inode of the mft record (for locking purposes, for example); thus my implementation requires an inode object for each mft record. I just ignore the fact that all those "helper inodes" will not be used by the kernel for anything at all, i.e. you can't open files with those inodes and things like that... Whether this approach is sensible/efficient or not remains to be seen, as I haven't implemented attribute lists, and hence the use of extension mft records, yet... So my ideas might turn out to be wrong; only time will tell. (-;

There is another thing: parted doesn't really need to have a full NTFS implementation. The relocation of clusters, and possibly of the mft records and other system files, could be implemented optimized for this specific use. For example, you don't actually need to be able to create or delete files. You just need to be able to move mft records about, and the same for clusters. At the same time you need to be able to update run lists, directory entries and other attributes. I think it would be much easier for parted to, say, read an mft record, modify the attributes to reflect its new position, and then write it to its new place, updating all references to this file along the way to reflect the changes. To move random clusters about, you just need to copy over the data unmodified, update the bitmap for the volume and update the run list of the data attribute of the owning file.

So you could get away with a much simpler implementation of something like an integrated libNTFSresize/libNTFSmove which would be extremely specific to parted. That would be a lot less code than a full-blown libntfs that can do anything and everything on the planet. Just an idea of how the development of parted for ntfs could be speeded up a lot...
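A sketch of what the cluster-moving half of such a parted-specific library boils down to (all helper names here are hypothetical, just restating the steps above):

#include <stdint.h>

/* Hypothetical helpers, assumed to exist elsewhere; each returns 0 on success. */
extern int copy_clusters(int64_t from_lcn, int64_t to_lcn, int64_t count);
extern int bitmap_update_run(int64_t lcn, int64_t count, int in_use);      /* $Bitmap        */
extern int update_runlist(uint64_t mft_no, int64_t old_lcn, int64_t new_lcn,
                          int64_t count);                                  /* mapping pairs  */

/* Move 'count' clusters belonging to mft record 'mft_no' from old_lcn to
 * new_lcn: copy the raw data, update the volume bitmap, then rewrite the
 * owning attribute's run list. */
int relocate_clusters(uint64_t mft_no, int64_t old_lcn, int64_t new_lcn,
                      int64_t count)
{
        if (copy_clusters(old_lcn, new_lcn, count) < 0)
                return -1;
        if (bitmap_update_run(new_lcn, count, 1) < 0)   /* allocate new run */
                return -1;
        if (bitmap_update_run(old_lcn, count, 0) < 0)   /* free the old run */
                return -1;
        return update_runlist(mft_no, old_lcn, new_lcn, count);
}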

Anton



