Re: [Linux-NTFS-Dev] NTFS resizer


From: Anton Altaparmakov
Subject: Re: [Linux-NTFS-Dev] NTFS resizer
Date: Thu, 16 Aug 2001 20:57:08 +0100

At 09:25 16/08/01, Andrew Clausen wrote:
On Thu, Aug 09, 2001 at 02:04:37AM +0100, Anton Altaparmakov wrote:
> At 01:18 09/08/2001, Andrew Clausen wrote:
> >On Wed, Aug 08, 2001 at 12:45:22PM +0100, Anton Altaparmakov wrote:
> > > You have a structure: the mft record. It contains all attributes nicely
> > > sorted.
> >
> >It contains the on-disk attributes, not ntfs_attr.
>
> That's my point. There should be no such thing as ntfs_attr. The on disk
> representation is entirely sufficient to do everything.

I guess this is ok with lots of accessor functions, etc.
(These are necessary for endianness, and for making things like
accessing the name easier.)

Obviously all accesses have to be wrapped with leXY_to_cpu() and vice versa, but that is nothing unusual. After all, they just expand to nothing when compiled on a little-endian architecture.
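To make that concrete, here is a minimal sketch of such wrappers (the kernel and libntfs have their own definitions; this is just to show the idea that the conversion compiles away on little-endian hosts):

#include <stdint.h>

/* Sketch only: a no-op on a little-endian host, a byte swap otherwise. */
#if defined(__BYTE_ORDER__) && __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
#define le16_to_cpu(x) ((uint16_t)(x))
#define le32_to_cpu(x) ((uint32_t)(x))
#else
#define le16_to_cpu(x) ((uint16_t)((((uint16_t)(x) & 0x00ffu) << 8) | \
                                   (((uint16_t)(x) & 0xff00u) >> 8)))
#define le32_to_cpu(x) ((uint32_t)((((uint32_t)(x) & 0x000000ffu) << 24) | \
                                   (((uint32_t)(x) & 0x0000ff00u) <<  8) | \
                                   (((uint32_t)(x) & 0x00ff0000u) >>  8) | \
                                   (((uint32_t)(x) & 0xff000000u) >> 24)))
#endif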

> >I think attributes need to be abstracted from MFT records, because
> >they may need to be shuffled around between MFTs.  (Eg: an attribute
> >gets inserted, or a run-list grows...)
>
> Then space is made for this attribute and it is inserted. mkntfs even goes
> as far as CREATING the attributes in their correct place in the mft record
> itself, i.e. no memcpy involved for that.
>
> This is the only way to catch when attributes become too big and need to be
> made non-resident or when they have to be moved out to other mft records.

Why "the only way"?

Well, if you are not working within the mft record (i.e. you have copied the attributes into memory structures), you don't know how much space is in use in the mft record. You could argue that you don't care about this until you reach the stage of writing out to disk, but that doesn't really work, because you have to differentiate resident vs. non-resident attributes in memory, too: for resident ones your in-memory attribute contains the value of the attribute, while for non-resident ones the value is stored on disk and your in-memory copy only holds the run list. You can't keep non-resident attribute values in memory, because an attribute value can easily be hundreds of megabytes or even gigabytes worth of data... Nobody has that much RAM, or even that much swap.

(Note that the current libntfs has a big problem here, as it always loads the whole attribute value at once, which is of course madness, but that has been ok for me so far because the library is only used by ntfsfix, which doesn't do anything too fancy with attribute values except for loading the whole of $LogFile, and that is limited to 4MiB maximum size (IIRC) in normal circumstances, so it is not too bad...)
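To make the resident/non-resident asymmetry concrete, an in-memory attribute would have to look something like this (a rough sketch with made-up names, not the libntfs types):

#include <stdint.h>

/* Sketch only; names are made up for illustration. */
struct sketch_run {
        int64_t vcn;            /* virtual cluster number  */
        int64_t lcn;            /* logical cluster number  */
        int64_t length;         /* run length in clusters  */
};

struct sketch_attr {
        uint32_t type;          /* attribute type, e.g. $DATA        */
        int      resident;
        /* resident: the value itself is held in memory              */
        void     *value;
        uint32_t value_length;
        /* non-resident: only the run list is held in memory, the
         * value stays on disk (it may be gigabytes in size)         */
        struct sketch_run *runlist;
        int      runlist_entries;
};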

BTW: when you need to move an attribute into
another mft record... you might be able to move it into a record with
lots of free space, etc.

When you are moving an attribute out of an mft record, you move it to an empty mft record. It might be possible to have two attributes in the same extension mft record, but I have never seen that happen in Windows. Each attribute extent is in a separate mft record, so unless someone observes otherwise, I would operate on the assumption that the Windows NTFS driver will choke if you put two extents into the same mft record.

I would have thought this would be easy to do at "clean" time.

> >I was thinking this shuffling should happen when the file is synced
> >to disk.  (ntfs_file_sync()?)
>
> This is a possible approach and is indeed what the old NTFS driver does
> (very badly so it trashes your fs most of the time...). I hate that
> approach (but I would accept it, if it were written properly, which I think
> is extremely hard to do because you are creating one single mammoth
> function which you will have a lot of fun to debug or alternatively you
> will have millions of if/else or switch statement to capture all the
> special cases).

I like this approach for exactly the opposite reason: it has fewer
special cases, and is much more elegant (almost trivial!)

Not a single mammoth, but a single miniature function...

Maybe I'm not understanding something.

Well, you can look at the kernel NTFS driver (just pick any 2.4.x kernel out there, it doesn't really matter, the concept is the same in all of them) if you want to see how complex it is, and how it still doesn't manage to get it right at all.

Alternatively, try to implement it! Perhaps my view is too biased against this approach and I am not objective. But AFAICS this cannot be done in a single miniature function. If you manage to do that, I will take my hat off and bow down in awe... (((-;

> - It also has nasty side effect of resetting the attribute
> sequence numbers and even reshuffling them completely which is plain WRONG
> but considering we overwrite the journal with 0xff bytes not a too big
> problem.

why?  (BTW: the update is completely atomic)

Because every attribute in an mft record is assigned a sequence number, unique within that mft record, at the time the attribute is created. It has nothing to do with the order of the attributes in the mft record; it's just about order of creation. If you keep the $LogFile (and $UsnJrnl for that matter, but see below), the values in the journal(s) will conflict with the values you wrote out, and if the journal gets replayed or rolled back you will end up with corruption. But never mind that: if the journal is still present you will end up with corruption anyway...

(Why is shuffling wrong?)  I'm new to all of this!

It only matters because of journalling. So in fact, as long as there is no journal left over, it should be ok.

> btw. We really need to delete the $Extend\$UsnJrnl file when
> writing to the partition or that could screw us badly, too, but deletion is
> not implemented yet at all.

Why?

$UsnJrnl is very similar to $LogFile, but it doesn't log data changes. If you like, it is a lightweight version of $LogFile for use by applications: backup programs and antivirus programs, for example, can look at it to see whether a file has changed since a certain time, and then they know whether they need to back up/scan the file again or not. That's just an example of what it can be used for. Obviously, if you write to the partition, the contents of the log are out of date, and to ensure data integrity you have to deactivate the log. This happens in a different way from $LogFile (which you cannot delete, as it is an essential system file): you simply delete the file to deactivate the $UsnJrnl. Windows will reactivate it on reboot (or the program wanting to use it will).

> For example if you are doing a new layout of all attributes from scratch
> you will need to check every single attribute for being a certain type and
> for whether it is allowed to be resident or non-resident or both, then you
> need to check whether there is still enough space in the mft record to add
> it and if not you have to start from scratch again or you have to edit what
> you have done so far to make space.

Ah, the knapsack problem strikes again (I've run into this a bit in
partitioning!).  Starting "from scratch" is no worse.  It's NP-hard either
way.

'NP'?

> - Now if you start editing what you
> have already created you end up having _all_ the functionality required to
> handle attributes in place in their mft records AND because you are doing
> the over the top flush function you have almost all the code duplicated
> there for doing all the checks etc.

I don't see why... it still seems rather trivial to me.

You consider editing mft records trivial? Maybe our perspectives on what counts as trivial differ... I guess the concept is trivial, but to implement all the functions required you are talking about a lot of lines of code. (The ntfs.sys binary in Windows is >500kiB for a reason... and it uses other drivers/kernel DLLs extensively, too.)

It gets especially interesting when you start using extension mft records, because you then need the attribute list attribute to be present. But to build that attribute you already need to know where all the attributes are and how many extents each spans; otherwise you don't know what the size of the attribute list attribute's value will be, so you can't reserve space for it in the mft record, and in turn you can't write the other attributes if you are not going to be editing the mft record afterwards. (Note that the attribute list attribute is type 0x20, which means it sorts between the standard information and the file name attributes, so most attributes are written after it.) That sounds like a Catch-22 situation to me if you are trying to write out all attributes into an mft record in one go.

This also raises the question of whether you will maintain an attribute list attribute in memory, or whether you will create it at write-to-disk time, when all the other attributes are generated and written to the mft record. I would like to know what concept you will follow to handle attribute list attributes. I don't see any straightforward way to do it if you hold all the attributes in separate in-memory structures rather than in the mft record...

For NTFS TNG I am thinking of making attribute list attribute handling part of a slow code path, because they occur very infrequently. On the other hand, it is most often $MFT itself that uses attribute lists (due to growing fragmentation with increasing age of the volume), and $MFT needs to be accessible very fast, so to be honest I am not actually too sure how they will get implemented eventually. I am ignoring the problem for the moment until I get a working driver without support for those beasts, and I will think about them then...

> What about the mft record then? I mean when you are writing back which mft
> record will you write to? The same one (you have to otherwise you would
> have to release the previous one and allocate a new one...)? How will you
> know which one that was?

No problem since when writing to the attribute, there is no allocation.
So, when you rearrange the MFT's MFT records, they just move, but this
doesn't hinder writing the MFT's MFT records.

That is not that easy. Your "just move" is a complex set of operations. Each mft record is allocated in the mft bitmap (a bitmap with one bit for each existing mft record: if a bit is 1, the corresponding mft record is in use; if it is 0, the record is free). So to move an mft record you need to deallocate the old one and allocate the new one. Further, if you move a file from one mft record to another, all directory entries pointing to this file have to be updated with the new mft record, and the same goes for all extension mft records, which need to point back to the base mft record. Thus, IMHO, moving a file's mft record about is a bad idea unless you absolutely have to do so.

Add to that that Unix/Linux utilities expect inode numbers to be persistent across mounts, which moving mft records about on every write would break, so utilities like tar, for example, would not work properly.
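Just the mft bitmap part of such a "move" already looks like this (a hypothetical sketch; the directory entry and extension record updates are the expensive bits and are not shown):

#include <stdint.h>

static void bitmap_clear_bit(uint8_t *bmp, uint64_t bit)
{
        bmp[bit >> 3] &= ~(uint8_t)(1u << (bit & 7));
}

static void bitmap_set_bit(uint8_t *bmp, uint64_t bit)
{
        bmp[bit >> 3] |= (uint8_t)(1u << (bit & 7));
}

/* Hypothetical helper: mark record old_no free and new_no in use in an
 * in-memory copy of the $MFT bitmap.  The caller still has to rewrite
 * every directory entry and every extension record referring to old_no. */
void move_mft_record_in_bitmap(uint8_t *mft_bitmap, uint64_t old_no, uint64_t new_no)
{
        bitmap_clear_bit(mft_bitmap, old_no);
        bitmap_set_bit(mft_bitmap, new_no);
}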

Also, when writing to an attribute, while there is no allocation of mft records, there will be allocation of disk clusters for non-resident attributes. Again, you can't just keep the non-resident attributes in memory due to their possibly huge size. [The maximum attribute size for the $DATA attribute of any mft record is 2TiB...]

> Also, surely parted will not be working at file level but much deeper below
> in the inode/mft record level? Or will it not treat files as opaque
> structures and use them to access the underlying mft records?

Well inode == file NOT mft, IMHO.  It should work at the inode level, yes.

Terminology I have used so far is:

file = an abstract file system object which has a name (a dentry) connected to an inode. Programs and processes work with files, and the kernel uses them to keep track of what files are open, etc.

inode = the kernel object underneath the concept of files. Several files can point to the same inode (hard links, where multiple dentries point to the same inode), and in NTFS the reverse is also true: several inodes can make up one file. One specific inode is, in my definition, the in-memory equivalent of one specific mft record of an NTFS volume.

Of course if you define a file to be an inode then it becomes clear we were talking past each other...

Basically your definition of inode is my definition of file. (-;

And in your definition you take away the intermediate object between my file and the mft record (i.e. my inode).

But if you do that, how can you work with your definition of files? I would have thought that parted is a low-level program which doesn't open/read/write/close files but instead operates on the on-disk structures themselves. At least in my understanding, a high-level interface like open/read/write/close does not allow you to specify where a file's data is written on disk.

> For example if the resize requires some data to be moved because it would
> be left in unallocated space otherwise, how would you do that? You need low
> level control of cluster allocations, file level access is useless in this
> context.

Well, in the first pass (traversing all inodes), it marks all blocks
(in a big in-memory bitmap) that need to be copied/relocated.  (This
includes the above-mentioned blocks, and also the MFT, for doing the
atomic update trick)

Be very careful here. The in-memory bitmap, which has to hold one bit for each cluster of a large volume, can itself be very large. We are talking 2.5MiB for my 10GiB NTFS partition at home, and recently I have seen people use NTFS partitions of up to 80GiB, which would mean a bitmap size of 20MiB. Next year (or whenever) I would not be surprised to see people with 800GiB partitions (requiring a bitmap of 200MiB).
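For the record, the figures above correspond to 512-byte clusters; with larger clusters (e.g. 4kiB) the bitmap shrinks proportionally. A sketch of the arithmetic:

#include <stdint.h>

/* One bit per cluster: 10GiB / 512B = 20,971,520 clusters
 * -> 2,621,440 bytes = 2.5MiB of bitmap; 80GiB -> 20MiB; 800GiB -> 200MiB. */
static uint64_t cluster_bitmap_bytes(uint64_t volume_bytes, uint32_t cluster_bytes)
{
        uint64_t clusters = volume_bytes / cluster_bytes;
        return (clusters + 7) / 8;
}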

So, when copying blocks to free space, we need to allocate clusters, yes.
No big deal.

A file-based interface (you said file == inode) is not compatible with cluster allocation... Again, I might be misunderstanding you. If you define what your file/inode will look like, it will be easier for me to understand what you have in mind. For me, accessing a file is done like this:

f = open(filename);
seek(f);
read(f);
write(f);
close(f);

And there is nothing else you can do with it. At least that is how Unix/Linux sees files...

If this is not the interface we are talking about but something like:

i = read_inode(mft_record_number);
relocate_inode(i);
(de)allocate_inode(i);
write_inode(i);
etc., then I agree, that's the right level to do it.

> Also you will need to rewrite every single run list on the volume by
> adding/subtracting a fixed delta to every LCN value. - You can't do this at
> a file level access either.

File == inode.  Files/Inodes have attributes, not MFT records.

That is wrong. MFT records have attributes. The mft record is the fundamental file system unit. If this weren't the case, it would not make sense to localize the attributes of one file in the same mft record (apart from speed of access, obviously).
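For reference, this is roughly what an mft record looks like on disk (a sketch following the Linux-NTFS project's layout documentation; all multi-byte fields are little-endian, and the attributes are packed one after another starting at attrs_offset inside the record):

#include <stdint.h>

/* Sketch of the on-disk mft (FILE) record header. */
struct sketch_mft_record_header {
        uint8_t  magic[4];           /* "FILE"                                  */
        uint16_t usa_ofs;            /* update sequence array offset            */
        uint16_t usa_count;          /* update sequence array size              */
        uint64_t lsn;                /* $LogFile sequence number                */
        uint16_t sequence_number;    /* incremented when the record is reused   */
        uint16_t link_count;         /* number of directory entries (hard links)*/
        uint16_t attrs_offset;       /* first attribute starts here             */
        uint16_t flags;              /* in use / is directory                   */
        uint32_t bytes_in_use;       /* space currently used in the record      */
        uint32_t bytes_allocated;    /* size of the record, typically 1kiB      */
        uint64_t base_mft_record;    /* 0 for a base record, otherwise a
                                        reference back to the base record       */
        uint16_t next_attr_instance; /* next attribute instance number          */
} __attribute__((packed));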

> This is why I don't understand why you want to work on a file level...
>
> My getfile would look like:
> {
>          is buffer in cache? -> yes: return buffer

what is buffer?

I would like to keep a cache in memory storing mft records so I don't have to keep reading/writing them from/to disk all the time. So when you open a file, you would check whether the base mft record corresponding to the file is already in this cache or not. In this context, "buffer" would be the mft record read from disk into a memory buffer. Admittedly, in user space you do not necessarily need to do this, because the kernel is already caching device accesses via the buffer cache (unless you are using raw I/O).
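In rough code, just to pin down what "buffer" means here (everything below is a hypothetical sketch, not NTFS TNG or libntfs code; a real cache would hash and evict properly):

#include <stdint.h>

#define MFT_RECORD_SIZE 1024
#define CACHE_SLOTS     64

struct cached_mft {
        uint64_t mft_no;
        int      valid;
        uint8_t  data[MFT_RECORD_SIZE];
};

static struct cached_mft cache[CACHE_SLOTS];

/* Assumed to exist elsewhere; returns 0 on success. */
extern int read_mft_record_from_disk(uint64_t mft_no, uint8_t *buf);

/* Return the cached buffer for mft record mft_no, reading it from disk
 * only if it is not in the cache yet. */
uint8_t *get_mft_record(uint64_t mft_no)
{
        struct cached_mft *c = &cache[mft_no % CACHE_SLOTS];

        if (c->valid && c->mft_no == mft_no)
                return c->data;                 /* buffer is in the cache */
        if (read_mft_record_from_disk(mft_no, c->data) < 0)
                return NULL;
        c->mft_no = mft_no;
        c->valid = 1;
        return c->data;
}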

> my file sync would look like:
>
>          for (all mft records owned by file) {
>                  lock mft record cached copy()
>                  pre write mst_fixup buffer()
>                  write to disk()
>                  fast post write mft fixup buffer()
>                  unlock buffer()
>          }
>
> Simple, only 6 lines of code (minus error handling).

But doesn't handle run lists overflowing.

Run lists cannot overflow; I am not sure what you mean. In my scheme the run list is already compressed into the mapping pairs array inside the file's mft records, so when they are flushed to disk, the compressed run list is flushed to disk with the mft record it is stored in.

The mapping pairs array is updated every time the run list is changed in memory. When you change the run list, it means that you must have allocated clusters on disk (clusters are what the run list points to, after all), and that obviously involved updating the cluster bitmap of the partition (the $Bitmap system file). If you don't modify the run list / mapping pairs array at the same time, and your cluster bitmap is committed to disk and then the computer crashes, you end up with clusters which are allocated on disk but not associated with anything, which is not a good idea.
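To spell out the terminology: in memory a run list is just a list of extents, while on disk the same information is stored compressed as the mapping pairs array inside the attribute record itself, which is why it cannot "overflow" separately from the mft record holding it. A sketch (made-up names):

#include <stdint.h>

/* In memory: one extent of an attribute. */
struct sketch_run {
        int64_t vcn;     /* virtual cluster number within the attribute */
        int64_t lcn;     /* logical cluster number on the volume        */
        int64_t length;  /* run length in clusters                      */
};

/* On disk the same information lives in the mapping pairs array of the
 * attribute record: each pair starts with a header byte whose low nibble
 * gives the byte count of the length field and whose high nibble gives
 * the byte count of the signed LCN delta that follows; a zero header
 * byte terminates the array. */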

> >I'm not convinced we want this [un]map() thing on records.  Records
> >aren't accessed on their own, except in the context of a file.  Files
> >are the atomic pieces... So, I think we should just have {read,write}
> >on records, and [un]map() on files, although I've called it get()
> >here.  (unmap() is just free() on a clean file)
>
> They are in my implementation... Files have nothing to do with mft records.
> Directories are mft records, too.

Directories are files.

Sorry, that is what I meant to write.

Maybe this is the source of our misunderstanding.
I'm saying file == inode == "set of MFT records".

Ok. Understand now.

I'll call it inode if that sounds better.  (Just, I thought that was
NTFS terminology... sorry!)

In NTFS terminology it gets very confusing, because an MFT record is a FILE record (that's the magic identifier of an MFT record). Thus, when you talk about files, it is not clear whether you are talking about actual files (in my definition) or about mft records. I have hence personally adopted the Unix/Linux terminology of what a file and an inode are, and have equated an inode with an mft record for simplicity.

Technically, you are correct that a file represents one inode, which == "set of MFT records", but in my twisted mind I like to think of it as: a file represents one "base" inode (the base mft record), which can point to "helper inodes" (extension mft records), each being ONE mft record.

I only do this because it makes working with mft records easier, at least in my NTFS TNG implementation. Basically, I integrate MFT records very tightly with in-memory inodes, which means that my (un)map_mft_record functions also modify the struct inode of the mft record (for locking purposes, for example); thus my implementation requires an inode object for each mft record. I just ignore the fact that all those "helper inodes" will not be used by the kernel for anything at all, i.e. you can't open files with those inodes and things like that... Whether this approach is sensible/efficient or not remains to be seen, as I haven't implemented attribute lists, and hence the use of extension mft records, yet... So my ideas might turn out to be wrong; only time will tell. (-;

There is another thing: parted doesn't really need to have a full NTFS implementation. The relocation of clusters, and possibly of the mft records and other system files, could be implemented optimized for this specific use. For example, you don't actually need to be able to create or delete files. You just need to be able to move mft records about, and the same for clusters. At the same time you need to be able to update run lists, directory entries and other attributes. I think it would be much easier for parted to, say, read an mft record, modify the attributes to reflect its new position, and then write it to its new place, updating all references to this file along the way to reflect the changes. To move random clusters about, you just need to copy over the data unmodified, update the bitmap for the volume and update the run list of the data attribute of the owning file.

So you could get away with a much simpler implementation of something like an integrated libNTFSresize/libNTFSmove which would be extremely specific to parted. That would be a lot less code than a full-blown libntfs that can do anything and everything on the planet. Just an idea of how the development of parted for ntfs could be speeded up a lot...
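A sketch of what the cluster-moving half of such a parted-specific library boils down to (all helper names here are hypothetical, just restating the steps above):

#include <stdint.h>

/* Hypothetical helpers, assumed to exist elsewhere; each returns 0 on success. */
extern int copy_clusters(int64_t from_lcn, int64_t to_lcn, int64_t count);
extern int bitmap_update_run(int64_t lcn, int64_t count, int in_use);      /* $Bitmap        */
extern int update_runlist(uint64_t mft_no, int64_t old_lcn, int64_t new_lcn,
                          int64_t count);                                  /* mapping pairs  */

/* Move 'count' clusters belonging to mft record 'mft_no' from old_lcn to
 * new_lcn: copy the raw data, update the volume bitmap, then rewrite the
 * owning attribute's run list. */
int relocate_clusters(uint64_t mft_no, int64_t old_lcn, int64_t new_lcn,
                      int64_t count)
{
        if (copy_clusters(old_lcn, new_lcn, count) < 0)
                return -1;
        if (bitmap_update_run(new_lcn, count, 1) < 0)   /* allocate new run */
                return -1;
        if (bitmap_update_run(old_lcn, count, 0) < 0)   /* free the old run */
                return -1;
        return update_runlist(mft_no, old_lcn, new_lcn, count);
}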

Anton



