From: Kevin Wolf
Subject: Re: [Qemu-devel] [Qemu-block] [PATCH] qcow2: do lazy allocation of the L2 cache
Date: Fri, 24 Apr 2015 11:45:10 +0200
User-agent: Mutt/1.5.21 (2010-09-15)

On 24.04.2015 at 11:26, Stefan Hajnoczi wrote:
> On Thu, Apr 23, 2015 at 01:50:28PM +0200, Alberto Garcia wrote:
> > On Thu 23 Apr 2015 12:15:04 PM CEST, Stefan Hajnoczi wrote:
> > 
> > >> For a cache size of 128MB, the PSS is actually ~10MB larger without
> > >> the patch, which seems to come from posix_memalign().
> > >
> > > Do you mean RSS or are you using a tool that reports a "PSS" number
> > > that I don't know about?
> > >
> > > We should understand what is going on instead of moving the code
> > > around to hide/delay the problem.
> > 
> > Both RSS and PSS ("proportional set size", also reported by the kernel).
> > 
> > I'm not an expert in memory allocators, but I measured the overhead like
> > this:
> > 
> > An L2 cache of 128MB implies a refcount cache of 32MB, in total 160MB.
> > With a default cluster size of 64k, that's 2560 cache entries.
> > 
> > So I wrote a test case that allocates 2560 blocks of 64k each using
> > posix_memalign and mmap, and here's how their /proc/<pid>/smaps compare:
> > 
> > -Size:             165184 kB
> > -Rss:               10244 kB
> > -Pss:               10244 kB
> > +Size:             161856 kB
> > +Rss:                   0 kB
> > +Pss:                   0 kB
> >  Shared_Clean:          0 kB
> >  Shared_Dirty:          0 kB
> >  Private_Clean:         0 kB
> > -Private_Dirty:     10244 kB
> > -Referenced:        10244 kB
> > -Anonymous:         10244 kB
> > +Private_Dirty:         0 kB
> > +Referenced:            0 kB
> > +Anonymous:             0 kB
> >  AnonHugePages:         0 kB
> >  Swap:                  0 kB
> >  KernelPageSize:        4 kB
> > 
> > Those are the 10MB I saw. For the record, I also tried malloc() and the
> > results are similar to those of posix_memalign().
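(For reference, a minimal standalone sketch of that kind of measurement;
this is not the actual test program, just the 2560 blocks of 64 KiB from
above allocated one by one and the process's smaps dumped for inspection:)

    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    #define NB_BLOCKS  2560
    #define BLOCK_SIZE 65536

    int main(void)
    {
        static void *blocks[NB_BLOCKS];
        char cmd[64];
        int i;

        for (i = 0; i < NB_BLOCKS; i++) {
            /* Allocate one 64 KiB block per cache entry; the blocks are
             * never written to, so the question is how many pages get
             * faulted in just by allocating them. */
            if (posix_memalign(&blocks[i], BLOCK_SIZE, BLOCK_SIZE)) {
                fprintf(stderr, "posix_memalign failed\n");
                return 1;
            }
        }

        /* Print the mappings so Rss/Pss can be compared by hand; an
         * mmap()-based variant would give the other column. */
        snprintf(cmd, sizeof(cmd), "cat /proc/%d/smaps", (int)getpid());
        return system(cmd);
    }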
> 
> The posix_memalign() call wastes memory.  I compared:
> 
>   posix_memalign(&memptr, 65536, 2560 * 65536);
>   memset(memptr, 0, 2560 * 65536);
> 
> with:
> 
>   for (i = 0; i < 2560; i++) {
>       posix_memalign(&memptr, 65536, 65536);
>       memset(memptr, 0, 65536);
>   }
> 
> Here are the results:
> 
> -Size:             163920 kB
> -Rss:              163860 kB
> -Pss:              163860 kB
> +Size:             337800 kB
> +Rss:              183620 kB
> +Pss:              183620 kB
> 
> Note the memset simulates a fully occupied cache.
> 
> The 19 MB RSS difference between the two seems wasteful.  The large
> "Size" difference hints that the mmap pattern is very different when
> posix_memalign() is called multiple times.
> 
> We could avoid the 19 MB overhead by switching to a single allocation.
> 
> What's more, dropping the memset() to simulate no cache entry
> usage (like your example) gives us a grand total of 20 kB RSS.  There is
> no point in delaying allocations if we do a single big upfront
> allocation.

Report a bug against glibc? 19 MB is certainly much more than is
required for metadata managing 2560 memory blocks. That's something like
8k per allocation.

> I'd prefer a patch that replaces the small allocations with a single big
> one.  That's a win in both empty and full cache cases.
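(As a rough sketch of that single-allocation idea, with made-up names and
the 2560 tables of 64 KiB from above rather than the real qcow2 cache
structures:)

    #include <stdlib.h>

    #define NB_TABLES  2560
    #define TABLE_SIZE 65536

    /* One aligned allocation for all cache tables; table i is just an
     * offset into the block.  Pages behind tables that are never written
     * stay unfaulted, so an empty cache costs almost no RSS. */
    static void *cache_alloc_all_tables(void)
    {
        void *tables;

        if (posix_memalign(&tables, TABLE_SIZE,
                           (size_t)NB_TABLES * TABLE_SIZE)) {
            return NULL;
        }
        return tables;
    }

    static void *cache_table(void *tables, int i)
    {
        return (char *)tables + (size_t)i * TABLE_SIZE;
    }

    int main(void)
    {
        void *tables = cache_alloc_all_tables();
        return tables && cache_table(tables, NB_TABLES - 1) ? 0 : 1;
    }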

Or in cases where the object size is stored anyway (like in the qcow2
cache), we could just directly use mmap() and avoid any memory
management overhead in glibc.
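
(A minimal sketch of the mmap() variant, again with made-up names and
illustrative sizes; not the actual qcow2 cache code:)

    #include <stdio.h>
    #include <sys/mman.h>

    #define NB_TABLES  2560
    #define TABLE_SIZE 65536

    int main(void)
    {
        size_t len = (size_t)NB_TABLES * TABLE_SIZE;

        /* One anonymous private mapping for the whole cache: no glibc
         * heap bookkeeping, and pages are only faulted in when a table
         * is actually written to. */
        void *tables = mmap(NULL, len, PROT_READ | PROT_WRITE,
                            MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (tables == MAP_FAILED) {
            perror("mmap");
            return 1;
        }

        /* ... use (char *)tables + i * TABLE_SIZE as table i ... */

        munmap(tables, len);
        return 0;
    }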

Kevin
