Ognyan's libpager changes

From: Neal H. Walfield
Subject: Ognyan's libpager changes
Date: Mon, 02 Aug 2004 11:06:52 -0400
User-agent: Wanderlust/2.8.1 (Something) SEMI/1.14.3 (Ushinoya) FLIM/1.14.3 (Unebigorymae) APEL/10.6 Emacs/21.2 (i386-debian-linux-gnu) MULE/5.0 (SAKAKI)

Hi Ognyan et al.,

As a start to getting back into Hurd hacking mode, I have begun to
review your ext2fs patch for large file system support.  At this
point, I have only reviewed the libpager changes in any detail.  I
want to work with you to develop a well thought out interface and
implementation that not only does the job but is elegant and
represents something with which I feel comfortable in advocating
acceptance into the mainline.

Before I offer my reactions to your changes, I want to quickly
summarize the approach that you have taken for the benefit of those
who are not that familiar with your work.

The problem that this patch attempts to solve is how to page more than
2GB (the approximate maximum contiguous memory area in a Hurd task on
a 32-bit architecture after a program has been loaded) using libpager.
Currently, libdiskfs-based file systems map the underlying store into
their address space and page it (cf.
libdiskfs/diskfs_start_disk_pager:disk-pager.c and
e.g. ext2fs/pager.c).

When Mach requests a page (as the result of a page fault), libpager
translates this into a pager_read_page call serviced by the file
system.  After this page is read from the backing store, it is
returned to Mach which assumes management of it.  (As a side note, one
needn't use a pager to manage the backing store: the only user of the
disk pager is the file system, which also serves the paging requests.
On a page fault we go to the kernel, which makes an upcall to a
different thread in the same process to read the page; all of this
could be done with a simple store_read function call.  The reason that
the pager is nevertheless useful is that there is a limited amount of
memory in the system and something has to manage it.  Since Mach has
global knowledge and the mechanisms in place to manage physical
memory, it makes the most sense to delegate this task to it.)  An
association between the page and the disk
block(s) becomes mostly fixed: Mach has a reference to the offset into
the memory object which when returned to the pager must be recognized
and flushed to the backing store and the file system has the whole
memory object mapped into the address space.  Thus, the maximum
partition size is limited to what can be mapped into the task's
address space.  The most obvious way to fix this is to increase the
address space.  As this is infeasible if we wish to support 32-bit
architectures, another approach must be found.

Ognyan resolves this dilemma using a twofold observation: first, only a
small portion of the disk is in the kernel's cache at any one time;
and second, the only user of the disk pager is the file system (files
are mapped to users using file pagers).  As we can trust and control
the types of access to the disk pager, we provide a hash between the
cache contents and the partition contents and remap the pages as
required.  In other words, we currently use a one-to-one mapping
between offsets in the memory object and the partition and we are
changing this such that offsets in the memory object correspond to
disk blocks in the partition only indirectly.  Specifically, we use a
hash table to convert memory offsets to blocks on the backing store
and vice versa.
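
To make the indirection concrete, here is a minimal sketch of such a
remapping table.  It is not Ognyan's code: the names (disk_cache,
cache_map, cache_release, NSLOTS) are hypothetical, and a linear
search stands in for the hash table to keep the example short.

```c
#include <assert.h>
#include <stdint.h>

#define NSLOTS   4            /* hypothetical: pages in the mapped window */
#define NO_BLOCK UINT64_MAX   /* marks a slot that maps no disk block */
#define NO_SLOT  (-1)

struct disk_cache
{
  /* Index: page offset within the memory object; value: disk block.  */
  uint64_t slot_to_block[NSLOTS];
};

void
cache_init (struct disk_cache *c)
{
  for (int i = 0; i < NSLOTS; i++)
    c->slot_to_block[i] = NO_BLOCK;
}

/* Disk block -> slot, or NO_SLOT if BLOCK is not currently mapped.  */
int
cache_lookup (const struct disk_cache *c, uint64_t block)
{
  for (int i = 0; i < NSLOTS; i++)
    if (c->slot_to_block[i] == block)
      return i;
  return NO_SLOT;
}

/* Return the slot holding BLOCK, claiming the first free slot if BLOCK
   is not mapped yet.  Returns NO_SLOT when every slot is in use; the
   caller would then have to wait for an eviction.  */
int
cache_map (struct disk_cache *c, uint64_t block)
{
  assert (block != NO_BLOCK);
  int slot = cache_lookup (c, block);
  if (slot != NO_SLOT)
    return slot;
  for (int i = 0; i < NSLOTS; i++)
    if (c->slot_to_block[i] == NO_BLOCK)
      {
        c->slot_to_block[i] = block;
        return i;
      }
  return NO_SLOT;
}

/* Forget the mapping for SLOT so that it can be reused.  */
void
cache_release (struct disk_cache *c, int slot)
{
  assert (slot >= 0 && slot < NSLOTS);
  c->slot_to_block[slot] = NO_BLOCK;
}
```

The point of the sketch is only that a block number far beyond the
window size can still be reached through a small, reusable set of
offsets; when it is safe to call cache_release is the subject of the
rest of this message.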

In order to safely remap memory object offsets pointing to one set of
disk blocks to another we must be sure that Mach has no references to
the offset (in the form of a copy of the data)--if it did and
eventually flushed the data, it would go to the wrong place and
corrupt the file system--and that no users of the pager have any
references to the memory object offsets (in the form of a mapping or
an assumption that a given offset points to specific content).  Using
m_o_lock_request, we can force the kernel to flush potentially cached
data from memory regions that we want to reuse.  This is, however,
expensive and it means that the server imposes a page eviction scheme
on Mach which does not cooperate with Mach's eviction scheme (the way
in which it chooses regions to flush is necessarily different from
Mach's as it does not have access to the same type of information as
Mach).  The ideal situation, then, is to somehow work with Mach's
facilities to provide the required functionality.

When Mach returns pages to the pager, it indicates whether or not it
has kept a copy (it sets the kcopy parameter when calling
m_o_d_return).  Since we know when a region is in use by the file
system, these virtual memory areas can be trivially remapped.
However, by default, Mach drops pages which are not dirty.  Given a
sufficient amount of time, mappings which are dropped by Mach will
consume all of the virtual memory area available for the disk image
and leave us in the same situation as before.  Mach will, however,
also return pages marked as precious.  If we mark all pages which we
want to potentially remap as precious, these will also be returned to
us.  This will allow us to remap all of the virtual memory in the

The actual remapping occurs in the file systems, e.g. ext2fs.
libpager, however, must provide a function to indicate that a region
has been evicted from the kernel cache and can potentially be remapped
(note that the file system still needs to assert that it does not have
a reference to the region).  Ognyan proposes that we use a call back
mechanism on a per page basis, pager_notify_pageout.  He has also
extended pager_read_page by adding a new flag which, if set by the
callee, indicates that when the returned page is evicted from the
kernel cache, the call back should be executed.  This, I feel, is an
excellent approach; my sense, however, is that it is a bit overkill.  My
intuition suggests that a pager will either be fully hashed or employ
the old scheme (i.e. have a fixed mapping scheme for the life of the
pager).  If there are any exceptions, they will likely be small
(e.g. meta-data) and can be easily excluded.  That is, I believe the
added fine grainedness of selecting the call back on a per page basis
is only useful in cases where many pages are fixed or only a few are
remapped (but this latter scenario is not really useful, as far as I
can see).  As such, I advise that we remove this extra flexibility and
have pager_notify_pageout (which I prefer to call pager_notify_evict
as data can be paged out without being evicted in the case of flushing
dirty pages) called on a _per pager_ basis, with a field in struct
pager, set by pager_create, indicating whether or not it should be
called.  If it is useful, we may also have a pair of get and set
methods to toggle it; however, turning this feature on after a pager
has been executing for some time seems problematic: all pages would
need to be marked as precious, and I can't think of an easy way to do
this offhand.
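
As a rough illustration of the per-pager variant, consider the
following sketch.  Everything in it is a hypothetical stand-in
(demo_pager, demo_pager_init, demo_page_returned and so on); it does
not reproduce the real struct pager or the pager_create signature.
It only shows the flag being fixed at creation time and consulted
both when pages are supplied and when they come back.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the relevant part of struct pager.  */
struct demo_pager
{
  bool notify_on_evict;   /* chosen once, at creation time */
  size_t evict_calls;     /* number of times the call back ran */
  size_t last_evicted;    /* page index passed to the last call back */
};

/* Stand-in for the file system's pager_notify_evict call back.  */
void
demo_notify_evict (struct demo_pager *p, size_t page)
{
  p->evict_calls++;
  p->last_evicted = page;
}

/* Stand-in for pager_create: the flag cannot change afterwards, so
   there is never a mix of precious and non-precious pages.  */
void
demo_pager_init (struct demo_pager *p, bool notify_on_evict)
{
  assert (p != NULL);
  p->notify_on_evict = notify_on_evict;
  p->evict_calls = 0;
  p->last_evicted = 0;
}

/* A notifying pager supplies every page as precious so that Mach
   returns it even when it is clean.  */
bool
demo_supply_as_precious (const struct demo_pager *p)
{
  return p->notify_on_evict;
}

/* Called when the kernel returns PAGE; KCOPY says whether the kernel
   kept a copy, in which case the page has not been evicted.  */
void
demo_page_returned (struct demo_pager *p, size_t page, bool kcopy)
{
  if (p->notify_on_evict && !kcopy)
    demo_notify_evict (p, page);
}
```

With this shape, a pager using the old fixed mapping pays nothing for
the feature: its pages are not marked precious and the call back is
never reached.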

Second, Ognyan suggests that there is a race.  From his document
describing his changes, he says:

    There is an optimization in m_o_d_request when page is being paged
    out.  In the beginning of m_o_d_return, all pages being return are
    marked as PM_PAGINGOUT.  This mark is cleared after m_o_d_supply
    (which supplies page content to Mach) is called.  If m_o_d_request
    is called on page that is marked as PM_PAGINGOUT, this page is
    marked with PM_PAGEINWAIT, and m_o_d_supply inside m_o_d_return is
    not called for this page.  This is possible because neither of
    these functions hold pager->interlock during the whole execution
    of function.  This lock is temporarily unlocked during call to
    user callbacks pager_read_page and pager_write_page.
    So what is the implication of this optimization to our page
    eviction notification?  When page is paged out, we get notified
    and we can decide to reuse it.  After arranging disk_cache_info,
    etc, page is touched, but if this happens fast enough, the
    optimization is triggered and we get the old content!  Reading the
    page is "optimized" and pager_read_page is not called, but instead
    the content of old block is used.
    This is solved by marking flushed and synced pages (via
    pager_{flush,sync}{,_some} with PM_FORCEREAD.  (These functions
    call lock-object.c:_pager_lock_object which marks pages with
    PM_FORCEREAD if they are already marked with PM_NOTIFY_PAGEOUT.)
    In handling m_o_d_request, pages marked as PM_FORCEREAD are not
    optimized in this way.

My sense is that this problem can be easily avoided by not sending a
pager_notify_pageout in data-return.c if PM_PAGEINWAIT is set, thereby
eliminating all of the hubbub with PM_FORCEREAD and the inefficiencies
which it introduces.  Or perhaps I am missing some subtlety.
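
The check I have in mind would look roughly like the sketch below.
The bit names are the ones used in the text above, but the values are
illustrative only and are not taken from libpager's headers, and
demo_should_notify is a hypothetical helper rather than an existing
function.  Whether this check alone closes the race is exactly the
open question.

```c
#include <assert.h>
#include <stdbool.h>

/* Names as used above; values are placeholders for this sketch.  */
#define PM_PAGINGOUT  0x1
#define PM_PAGEINWAIT 0x2

/* Decide, at the end of a data return for one page, whether the
   eviction call back should fire.  PM_ENTRY is the page's pagemap
   entry, KCOPY says whether the kernel kept a copy, and
   PAGER_NOTIFIES is the per-pager setting.  */
bool
demo_should_notify (unsigned pm_entry, bool kcopy, bool pager_notifies)
{
  /* We only get here for pages that are in the middle of a pageout.  */
  assert (pm_entry & PM_PAGINGOUT);

  if (!pager_notifies)
    return false;
  if (kcopy)
    return false;   /* the kernel still holds the data */
  if (pm_entry & PM_PAGEINWAIT)
    return false;   /* a request for this page is already pending, so
                       the old contents will be handed back; the slot
                       must not be reused */
  return true;
}
```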

Finally, when pager_notify_evict is called on a page, the page is
potentially changed.  Hence any state associated with the page must
also be changed.  That is, its pagemap entry needs to be cleared;
otherwise, a stale mark such as PM_EIO will show up for the wrong page.
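
In code, the requirement amounts to something like the following
sketch.  The PM_EIO value is a placeholder, and
demo_pagemap_evict is a hypothetical helper, not part of libpager.

```c
#include <assert.h>
#include <stddef.h>

#define PM_EIO 0x1   /* placeholder value for this sketch */

/* Reset all per-page state for PAGE once it has been evicted, so that
   whatever block is mapped at this offset next starts out clean.  */
void
demo_pagemap_evict (unsigned short *pagemap, size_t npages, size_t page)
{
  assert (page < npages);
  pagemap[page] = 0;
}
```

Without the reset, an I/O error recorded for the old block would be
reported for whichever block is mapped at that offset next.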

There are also some smaller details but these may change depending on
how the code changes in reaction to my above comments.

