
repost of discussion about not mapping the whole disk


From: Marcus Brinkmann
Subject: repost of discussion about not mapping the whole disk
Date: Mon, 25 Mar 2002 00:49:46 +0100
User-agent: Mutt/1.3.27i

Hi,

this is the whole thread that I could come up with.  Somehow I thought there
was more :) but the other messages were about LFS, and those were reposted
here before by Roland.

Rather than trying to summarize it, and maybe doing a bad job and losing a
lot of time over it, I am posting each mail in full, in the thread order.

Thanks,
Marcus

------------------------------------------------------------------------------
From: Roland McGrath <address@hidden>
Subject: making the filesystems not map the whole disk

Thomas has previously described how to go about adjusting the current
implementations of ufs and ext2fs to still use the whole-disk pager, but
make temporary mappings (with some caching of mappings) as needed for each
piece of metadata.  I have a different plan.

A while back I did some off-the-cuff calculations of the ext2fs metadata.
I don't recall the numbers, but I satisfied myself that a workable amount
of metadata (i.e. under a gigabyte or so) would cover up to terabytes of
total filesystem data.  The structure of the ufs metadata is fairly similar
and I think the proportions of metadata to filesystem data are comparable
to ext2fs, so the same thinking applies to both.

My plan is to still map all of the metadata into memory all at once.  But
rather than mapping the whole disk flat, I want to use multiple, more
intelligent pagers.  The code that uses metadata can continue to use simple
pointer variables as it does now for each contiguous chunk of metadata.
Based on the details of metadata layout, we pick how many different
mappings are required; each mapping can be from a different pager.  Rather
than providing whole pages directly off the disk, these pagers will collect
together the metadata blocks.  In cases where there is double-indirection
going on (I don't recall exactly what, but I remember ext2fs has some), the
indirections can be looked up by the pager's read/write routines instead of
by the code doing the metadata access.  So in fact the access patterns on
the front end can be made simpler, and the computations involved in
locating the metadata on the disk are cached implicitly via the pager.

We'll have to look in detail, but it might well be that the layout of most
or all of the metadata is simple enough that we can describe it as short or
repetitive store_runs and just use a common store-based pager
implementation for all of it.
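
As a rough illustration of what that could mean (a sketch only: the run
structure is assumed to look like libstore's start/length pairs, and every
other name here is made up), each block group's inode table could be handed
to a common store-based pager as one run:

  #include <stdint.h>

  /* Assumed shape of a contiguous run of store blocks, in the style of
     libstore's store_run.  */
  struct store_run
  {
    int64_t start, length;        /* in store blocks */
  };

  /* Hypothetical: describe the per-group inode tables as one run each, so
     a common store-based pager can present them as one flat object.  */
  static void
  fill_inode_table_runs (struct store_run *runs, unsigned ngroups,
                         const uint32_t *inode_table_block,
                         uint32_t blocks_per_inode_table)
  {
    for (unsigned g = 0; g < ngroups; g++)
      {
        runs[g].start = inode_table_block[g];
        runs[g].length = blocks_per_inode_table;
      }
  }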

------------------------------------------------------------------------------
Date: Fri, 2 Mar 2001 23:07:48 +0100
From: Marcus Brinkmann <address@hidden>

On Sun, Jan 07, 2001 at 01:59:25AM -0500, Roland McGrath wrote:
> Thomas has previously described how to go about adjusting the current
> implementations of ufs and ext2fs to still use the whole-disk pager, but
> make temporary mappings (with some caching of mappings) as needed for each
> piece of metadata.  I have a different plan.
> 
> A while back I did some off-the-cuff calculations of the ext2fs metadata.
> I don't recall the numbers, but I satisfied myself that a workable amount
> of metadata (i.e. under a gigabyte or so) would cover up to terabytes of
> total filesystem data.  The structure of the ufs metadata is fairly similar
> and I think the proportions of metadata to filesystem data are comparable
> to ext2fs, so the same thinking applies to both.
>
> My plan is to still map all of the metadata into memory all at once.  But
> rather than mapping the whole disk flat, I want to use multiple, more
> intelligent pagers.  The code that uses metadata can continue to use simple
> pointer variables as it does now for each contiguous chunk of metadata.
> Based on the details of metadata layout, we pick how many different
> mappings are required; each mapping can be from a different pager.  Rather
> than providing whole pages directly off the disk, these pagers will collect
> together the metadata blocks.

I'm trying to evaluate this, but I can't find enough information on ext2fs.
But I think we have the superblock (which might be located at several
positions, so we probably need one pager just for that, or a fancy pager
special-casing it), and one pager for the inode area (which I assume to
be contiguous; if not...)

> In cases where there is double-indirection
> going on (I don't recall exactly what, but I remember ext2fs has some), the
> indirections can be looked up by the pager's read/write routines instead of
> by the code doing the metadata access.  So in fact the access patterns on
> the front end can be made simpler, and the computations involved in
> locating the metadata on the disk are cached implicitly via the pager.

I don't understand the "implicitly".  Currently, we call getblk for each
file pager read and write.  Surely the pager code will cache the metadata
somewhere?  When you say "cached implicitly by the pager", do you mean not
as paged content, but somewhere else?

Here is a list of the indirections in ext2fs (from the undeletion howto);
a worked size calculation follows the list.  For large files, there are
quite a number of indirect blocks.  How does an indirect pager for each
file pager compare with "implicit caching" (assuming I understood you
correctly)?

  ·  The block numbers of the first 12 data blocks are stored directly
     in the inode; these are sometimes referred to as the direct blocks.

  ·  The inode contains the block number of an indirect block.  An
     indirect block contains the block numbers of 256 additional data
     blocks.

  ·  The inode contains the block number of a doubly indirect block.  A
     doubly indirect block contains the block numbers of 256 additional
     indirect blocks.

  ·  The inode contains the block number of a triply indirect block.  A
     triply indirect block contains the block numbers of 256 additional
     doubly indirect blocks.
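
To put a number on it, the four cases above add up as follows (assuming
the 256 entries per indirect block quoted here, i.e. 1k blocks and 4-byte
block numbers):

  #include <stdio.h>
  #include <stdint.h>

  int
  main (void)
  {
    /* With 1k blocks, an indirect block holds 1024/4 = 256 block
       numbers, matching the figures quoted above.  */
    uint64_t per_indirect = 256;
    uint64_t blocks = 12                                /* direct */
      + per_indirect                                    /* single indirect */
      + per_indirect * per_indirect                     /* double indirect */
      + per_indirect * per_indirect * per_indirect;     /* triple indirect */
    /* 16843020 blocks, i.e. roughly 16 GB of file at 1k per block.  */
    printf ("max addressable blocks: %llu (~%llu GB at 1k/block)\n",
            (unsigned long long) blocks,
            (unsigned long long) (blocks >> 20));
    return 0;
  }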

Thanks,
Marcus

------------------------------------------------------------------------------
Date: Fri, 2 Mar 2001 19:05:38 -0500 (EST)
From: Roland McGrath <address@hidden>

> I'm trying to evaluate this, but I can't find enough information on ext2fs.

I didn't know anything about it before I started reading the source. :-)

> But I think we have the superblock (which might be located at several
> positions, so we probably need one pager just for that, or a fancy pager
> special-casing it), and one pager for the inode area (which I assume to
> be contiguous; if not...)

There are several ways to go about it.  The way I think of it is that we
have the pair (pager, address) as the name space in which we place all the
metadata, and we can divide up that name space however is most convenient.
So perhaps a single pager will have different disjoint regions of its
memory object that are for different kinds of metadata.  For the
bottom-level metadata like the superblock and the group descriptors, we can
use the plain disk pager; we will only be mapping small regions of it.

> I don't understand the "implicitly".  Currently, we call getblk for each
> file pager read and write.  Surely the pager code will cache the metadata
> somewhere?  When you say "cached implicitly by the pager", do you mean not
> as paged content, but somewhere else?

No, I mean as paged content.  For example, the inode tables and the
allocation bitmaps are found by looking at pointers stored in the group
descriptors; in `dino' from ext2fs.h you can see where it looks up the
group descriptor and then finds the inode table on disk from that.

What I have in mind for this example is having a single flat mapped inode
table.  That is, a metadata pager where pages full of inodes are filled
from the appropriate disk locations; the pager looks up the group
descriptors and does the indirection when a pageful of inodes comes in,
rather than the active code doing the indirection for every inode.  It is
also possible to do a similar thing (at least in this case) by doing a
bunch of adjacent mappings from the disk pager, one for each group's
contiguous chunk of inodes; but that has problems if the chunks are not
fully page-aligned.  I prefer having it in the pager because it's more lazy
and cache-like rather than using the VM map entries as up-front constant
data structures (wired in the kernel, no less).
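
For concreteness, a sketch of such a read routine (every declaration here
is a hypothetical stand-in, not the real ext2fs one):

  #include <stdint.h>

  /* Hypothetical stand-ins for the real ext2fs/libpager declarations.  */
  struct group_desc { uint32_t inode_table_block; };

  extern struct group_desc *group_descs;  /* mapped via the bottom-level pager */
  extern uint32_t inodes_per_group, inode_size, block_size;
  extern int device_read (uint64_t byte_offset, void *buf, uint32_t nbytes);

  /* Sketch of the metadata pager's read routine: fill one page of the
     flat inode-table object.  The group-descriptor indirection happens
     here, once per page-in, instead of in dino() on every inode access.
     (A real version must handle a page that straddles two groups.)  */
  static int
  inode_table_read_page (uint64_t page_offset, void *page, uint32_t page_size)
  {
    uint64_t first_inode = page_offset / inode_size;
    uint32_t group = first_inode / inodes_per_group;
    uint64_t index_in_group = first_inode % inodes_per_group;
    uint64_t disk_byte =
      (uint64_t) group_descs[group].inode_table_block * block_size
      + index_in_group * inode_size;
    return device_read (disk_byte, page, page_size);
  }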

> For large files, there are quite a number of indirect blocks.

After I made those posts a while back, I thought more about my scheme.  I
think it works very nicely for everything except the indirect blocks.  I
have not thought about this very much lately.  But I can't think of
anything very different from what we do now that will work for indirect
blocks.  They are the same blocks as data blocks, which is to say a free
block can be allocated as either a data block or an indirect block as
new ones are needed, regardless of what it was used for before it became free.

Actually, I did think of one thing before.  But I haven't really thought it
through entirely and I'm somewhat dubious about it.  As I sit here, more of
that is coming back to me.  So I think I will ponder it a wee bit more and
then send a separate message.

------------------------------------------------------------------------------
From: address@hidden (Thomas Bushnell, BSG)
Date: 02 Mar 2001 16:52:29 -0800

Roland McGrath <address@hidden> writes:

> There are several ways to go about it.  The way I think of it is that we
> have the pair (pager, address) as the name space in which we place all the
> metadata, and we can divide up that name space however is most convenient.
> So perhaps a single pager will have different disjoint regions of its
> memory object that are for different kinds of metadata.  For the
> bottom-level metadata like the superblock and the group descriptors, we can
> use the plain disk pager; we will only be mapping small regions of it.

Something like this was how I once started writing ufs.

I had a set of several pagers: 

* one per cylinder group for cylinder group structures (CG)
* one per fs for inode structures (DINODE)
* one per file for the single indirect blocks of the file (SINDIR)
* one per file for the double indirect blocks of the file (DINDIR)
* one per fs for the triple indirect blocks of all the files (TINDIR)
* one per file with the data of the file (FILE)

The superblocks were not paged at all, but read and written the way
they are now.

This turned out to be quite unworkable; there were amazing levels of
hair and interlocks between these pagers (reading a file page could
potentially require reading something from every other pager), with
numerous deadlock possibilities that could and did happen.

It was not possible to just have a single lock per file that blocked
new disk-block allocations, for example, but rather one per
pager-type: a lock for the sindir, a lock for dindir, a lock for
tindir, and so forth.

I fully understand the apparent elegance of treating all this with
different types of pagers.  However I would strongly strongly strongly
urge you to reconsider.  It's a horrid strategy; I should know, I
tried it.

We have lots of problems now with ext2fs because the disk pager and
the file pager both represent certain disk blocks (because the
allocation units on disk may be smaller than a page); this requires
hairiness to get right; serious bugs with much lossage have resulted.
If we go from two pager types to six, the complexity will be eighteen
times worse.

We can't map the entire disk, but we don't need to.  We just need a
cache of mappings.  We can't have one memory object for the entire
disk, but we don't need one.  The cache of mappings can just use a set
of memory objects which linearly span the disk.
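
A sketch of that shape, with hypothetical names (the 4GB span size is the
figure Roland uses later in this thread, but it is just a tunable here):

  #include <stdint.h>

  /* The disk is covered by a linear array of pagers, one per fixed-size
     span; callers map only the pieces they actually need.  */
  #define SPAN_SIZE (1ULL << 32)    /* e.g. one memory object per 4GB */

  struct pager;                     /* opaque; one per span */
  extern struct pager **disk_pagers;  /* disk_pagers[i] covers span i */

  static struct pager *
  pager_for_offset (uint64_t disk_offset, uint64_t *offset_in_object)
  {
    *offset_in_object = disk_offset % SPAN_SIZE;
    return disk_pagers[disk_offset / SPAN_SIZE];
  }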

Thomas

------------------------------------------------------------------------------
Date: Sun, 4 Mar 2001 23:16:04 -0500 (EST)
From: Roland McGrath <address@hidden>

I was not aware you had tried something along these lines before.
That must have been a very long time ago, before the librarification
of the servers, right?  That was before I had really started to read
and understand the filesystem code.

Of course, I appreciate the benefit of your experience in the prior
attempt.  But I would like to consider this in a lot more detail
before abandoning it.  The discussion is useful to me and Marcus
regardless of what solutions we actually implement in the end.  In
other words, you're going to have to talk me out of it at a finer
granularity. :-)

I'd like to leave aside indirect blocks for now and just talk about
what I'll call the "fixed metadata".  That is, super block, group
descriptors, bitmaps, and inodes.  I call this "fixed" metadata
because those disk blocks are permanently dedicated to their metadata
purposes and in no case are those blocks ever used for other purposes.
This is in contrast to indirect blocks, which lie in the general pool
of blocks that may be allocated and reallocated over time to data
blocks or to any of the flavors of indirect blocks.

It's easy for the pagers for the fixed metadata to never write any
blocks other than exactly the fixed metadata areas.  Likewise it's
easy for whatever other pagers (file pagers and whatever handles
indirect blocks) never to touch the fixed metadata blocks.  This
requires a moment of thought for filesystem blocks smaller than pages,
but I see no difficulty.  As I see it, the present complexities of
interaction come from the very fact that the memory object covers the
flat disk rather than only the sequence of metadata regions.  It seems
to me that having the fixed metadata pager(s) separate removes a
factor in your multiplication of complexity, rather than adding any.

As to synchronization.  Let's take it one piece of metadata at a time.

At the "bottom" are the superblock and the group descriptors.  The
pagers for these depend on nothing but the actual device io.  The only
data updated here is the summary information.  Those changes are never
sync'd except when the whole filesystem is in synchronous mode.

Next are the allocation bitmaps (block and inode bitmaps).  The pagers
for these depend on the group descriptors to locate the bitmap blocks.
These are modified and sync'd at exactly the same times as the summary
information, and no others.

Finally, the inodes.  (It seems to me this is the only kind of fixed
metadata that really presents any potential complications.)  The
pagers for inodes depend on the group descriptors to find the inode
blocks.  I guess there are lots of kinds of synchronization between
the inodes, the indirect blocks, and the data blocks of directories.
I don't really know enough about this yet.  But I am pretty confident
there are no "downward" synchronizations, i.e. between the inodes and
the rest of the fixed metadata (aside from fully synchronous mode).
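
One way to picture the hierarchy (illustrative only; these names are not
from the ext2fs source):

  #include <assert.h>

  /* The dependency levels described above: a pager only faults in pages
     from a strictly lower level, so no cycles are possible.  */
  enum meta_level
  {
    LEVEL_SUMMARY = 0,  /* superblock + group descriptors: need only disk io */
    LEVEL_BITMAPS,      /* block/inode bitmaps: located via the summary */
    LEVEL_INODES,       /* inode table: located via the summary */
  };

  /* Sanity check for any cross-pager wait.  */
  static void
  assert_may_wait_on (enum meta_level waiter, enum meta_level target)
  {
    assert (target < waiter);
  }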


My current thinking is to have four mapped regions of fixed metadata.
I am inclined to use four disjoint page-aligned regions of a single
memory object.  My rationale is that fewer memory objects means less
overhead.  I am presuming that a page-in for one page (e.g. a bitmap
page) that then faults on a different page from the same memory object
(the group descriptor) does not present any sort of deadlock issue.
If there is some characteristic of the external pager interface that
makes this a concern, then it could just as well be four separate memory
objects, and it's all the same to me.

The first region is the summary information, i.e. the superblock and
group descriptors.

The second is the block allocation bitmap.  This presents a contiguous
region of virtual memory made up of all the groups' bitmap blocks.
Bitmap searches for free blocks can then use a flat loop irrespective
of the group boundaries.

The third is the inode allocation bitmap.  This is just like the block
allocation bitmap.

Finally, the inode table.  This is a contiguous region of all the
groups' inode blocks.  Now dino() is just &inodes[cache_id].

A fringe benefit of this plan is that you could support other-endian
filesystems by byte-swapping the metadata in page-in/page-out.
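
Laid out as code, the four regions might look like this (a sketch; all
names are hypothetical, and the inode structure is abridged):

  #include <stdint.h>

  /* Abridged stand-in for the on-disk inode (real layout in ext2fs.h).  */
  struct ext2_inode { uint32_t i_block[15]; /* ... */ };

  /* Four disjoint page-aligned windows into one metadata memory object.  */
  extern void *summary_info;          /* superblock + group descriptors */
  extern uint8_t *block_bitmap;       /* every group's block bitmap, flat */
  extern uint8_t *inode_bitmap;       /* every group's inode bitmap, flat */
  extern struct ext2_inode *inodes;   /* every group's inode table, flat */

  /* With a flat inode table, dino() is just indexing, as the text says;
     no group-descriptor lookup on the access path.  */
  static inline struct ext2_inode *
  dino (uint32_t cache_id)
  {
    return &inodes[cache_id];
  }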

------------------------------------------------------------------------------
From: address@hidden (Thomas Bushnell, BSG)
Date: 27 Mar 2001 15:20:29 -0800

Roland McGrath <address@hidden> writes:

> I was not aware you had tried something along these lines before.
> That must have been a very long time ago, before the librarification
> of the servers, right?  That was before I had really started to read
> and understand the filesystem code.

Yes.

> Of course, I appreciate the benefit of your experience in the prior
> attempt.  But I would like to consider this in a lot more detail
> before abandoning it.  The discussion is useful to me and Marcus
> regardless of what solutions we actually implement in the end.  In
> other words, you're going to have to talk me out of it at a finer
> granularity. :-)

No problem.

> At the "bottom" are the superblock and the group descriptors.  The
> pagers for these depend on nothing but the actual device io.  The only
> data updated here is the summary information.  Those changes are never
> sync'd except when the whole filesystem is in synchronous mode.
> 
> Next are the allocation bitmaps (block and inode bitmaps).  The pagers
> for these depend on the group descriptors to locate the bitmap blocks.
> These are modified and sync'd at exactly the same times as the summary
> information, and no others.
> 
> Finally, the inodes.  (It seems to me this is the only kind of fixed
> metadata that really presents any potential complications.)  The
> pagers for inodes depend on the group descriptors to find the inode
> blocks.  I guess there are lots of kinds of synchronization between
> the inodes, the indirect blocks, and the data blocks of directories.
> I don't really know enough about this yet.  But I am pretty confident
> there are no "downward" synchronizations, i.e. between the inodes and
> the rest of the fixed metadata (aside from fully synchronous mode).

Yes, this is basically right; you can indeed arrange the pagers
hierarchically this way, such that pagers never block except on other
pagers further up, with the top level blocking on only disk.

But the point is that very little is gained by this kind of
cleverness.

It's easier to just say "I want to map this part of the disk here",
than have separate *pagers* to handle the mapping.  And you totally
save the hassle of even thinking about it.

So I totally agree that of course you don't want to map the entire
disk.  That was a stopgap, and it's overdue to remove it.
Corresponding to it is the single big disk pager.  But the single big
disk pager can just be split up into enough separate pagers to cover
the disk in a trivial linear array, and then map only the things from
it that you actually want, instead of mapping everything.

Nothing is gained by having one pager which maps cylinder groups,
another which maps inodes, etc.  It just creates extra complexity.  

Thomas

------------------------------------------------------------------------------
Date: Thu, 16 Aug 2001 03:36:09 +0200
From: Marcus Brinkmann <address@hidden>

Hi,

before attempting a summary for hurd-devel on the issue, there is something
that has bugged me since Thomas made his proposal:

On Tue, Mar 27, 2001 at 03:20:29PM -0800, Thomas Bushnell, BSG wrote:
> Roland McGrath <address@hidden> writes:
> So I totally agree that of course you don't want to map the entire
> disk.  That was a stopgap, and it's overdue to remove it.
> Corresponding to it is the single big disk pager.  But the single big
> disk pager can just be split up into enough separate pagers to cover
> the disk in a trivial linear array, and then map only the things from
> it that you actually want, instead of mapping everything.

Did you do any concrete calculations on how large a region individual pagers
should cover, and how many pagers would result from that?  And how many
distinct regions would be active at any given time?

For one, you can't make the regions too large, because then you will
constantly hit all of the regions (say you have 64 regions and 500 files
opened by different processes), and then it is not unlikely that you will
thrash on pager creation/destruction just to serve all the requests.

So, you want smaller regions.  Of course, if you make the regions too small,
you end up with a lot of rarely used pagers, and if you make them very small
(at the extreme: one pager per memory page), this becomes unbearable.

This seems obvious, and the naive solution is to use something in between.
However, my point is that indirect blocks are usually interleaved with the
data blocks for efficient sequential reading of the data.  The distance
between indirect blocks on the disk depends on the block size, of course,
but for a block size of 4096 it is 1024 blocks, or 4 MB!  So, if you don't
use ranges >= 8 MB, you might just as well use very small ranges of 32 KB
to 64 KB.  But if you do use ranges >= 8 MB, no more than about 128 ranges
can be active at the same time, or thrashing will occur, and each range
will only cover two "interesting" pages anyway.
More than 100 random parallel accesses on a 10 GB disk might be unusual for
a single-user system, but with a file server or web server...
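
The 4 MB figure, spelled out (4k blocks and 4-byte block numbers):

  #include <stdio.h>
  #include <stdint.h>

  int
  main (void)
  {
    /* With 4k blocks, an indirect block holds 4096/4 = 1024 block
       numbers, so consecutive indirect blocks of a large file are
       separated by 1024 data blocks on disk.  */
    uint32_t block_size = 4096;
    uint32_t ptrs_per_indirect = block_size / sizeof (uint32_t);
    uint64_t gap = (uint64_t) ptrs_per_indirect * block_size;
    printf ("gap between indirect blocks: %llu MB\n",
            (unsigned long long) (gap >> 20));  /* prints 4 */
    return 0;
  }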

I think that a collecting pager would be more efficient here.  But I only
did some rough calculations, and maybe I am missing the right argument for
smaller or bigger ranges.

Thanks,
Marcus

------------------------------------------------------------------------------
From: Roland McGrath <address@hidden>
Date: Wed, 15 Aug 2001 22:17:01 -0400 (EDT)

I'm not really sure I'm following you.  You're referring to Thomas's
proposal, which is very simple to understand.  I think you may be confusing
the memory objects with the mapping windows.  In Thomas's plan, there is
one memory object for each naturally-aligned 4GB region of the disk.
However, there can be a flexible number of mapping windows and we can tune
the window size to whatever works well.  Every access to the disk would use
some interface (in place of diskfs_catch_exception) to secure a mapping
window that covers the region it needs.
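
As a sketch, with every name here hypothetical, such a window interface
might look like:

  #include <stdint.h>

  /* A window pins a mapping of [start, start+length) of the disk, built
     from the per-4GB memory objects; the caller gets a pointer valid
     until the window is released.  */
  struct disk_window
  {
    void *base;           /* start of the mapped view */
    uint64_t start;       /* disk offset the view begins at */
    uint64_t length;
  };

  /* In place of diskfs_catch_exception: secure a window covering the
     region, creating or reusing a cached mapping as needed.  */
  extern int disk_window_secure (uint64_t start, uint64_t length,
                                 struct disk_window *w);
  extern void disk_window_release (struct disk_window *w);

  /* Usage sketch: read a word of metadata at a given disk offset.  */
  static uint32_t
  peek_u32 (uint64_t disk_offset)
  {
    struct disk_window w;
    if (disk_window_secure (disk_offset, sizeof (uint32_t), &w))
      return 0;                                 /* error path elided */
    uint32_t v = *(uint32_t *) ((char *) w.base + (disk_offset - w.start));
    disk_window_release (&w);
    return v;
  }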

At any rate, I think these details of that plan can be debated on
hurd-devel after you post a summary of the basic plan and Thomas's reasons
for favoring it.  I'm personally not that interested in thinking too hard
about these details, since I strongly prefer my opposing plan regardless.

------------------------------------------------------------------------------
Date: Sat, 3 Mar 2001 18:16:11 -0500 (EST)
From: Roland McGrath <address@hidden>

I was going to write up thoughts about indirect blocks but have not gotten
that coherent yet.  But I wanted to make this observation.  There are some
concerns that arise if and only if the filesystem block size is smaller
than page size.  By default mke2fs uses 1k blocks for filesystems up to
512MB and 4k blocks for larger ones.  So, in all likelihood, any filesystem
with blocks smaller than page size will be small enough to map the whole
disk, and we could use that approach when it works and a more complex
approach only when necessary.
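
That decision, as a sketch (the 512MB threshold is just mke2fs's default
quoted above; the name is hypothetical):

  #include <stdbool.h>
  #include <stdint.h>

  /* By default mke2fs only gives sub-page (1k) blocks to filesystems of
     512MB or less, so any filesystem where the sub-page complications
     arise is also small enough for the simple whole-disk mapping.  */
  static bool
  whole_disk_mapping_ok (uint64_t disk_bytes)
  {
    return disk_bytes <= 512ULL << 20;  /* small enough to map flat */
  }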

-- 
`Rhubarb is no Egyptian god.' Debian http://www.debian.org address@hidden
Marcus Brinkmann              GNU    http://www.gnu.org    address@hidden
address@hidden
http://www.marcus-brinkmann.de


