Re: [PATCH] Optimise memset on i386

From: Vladimir 'φ-coder/phcoder' Serbinenko
Subject: Re: [PATCH] Optimise memset on i386
Date: Fri, 25 Jun 2010 20:04:41 +0200
User-agent: Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv: Gecko/20100515 Icedove/3.0.4

On 06/23/2010 11:38 PM, Colin Watson wrote:
> With this approach, one of the most noticeable time sinks is that
> setting a graphical video mode (I'm using the VBE backend) takes ages:
> 1.6 seconds, which is a substantial percentage of this project's total
> boot time.  It turns out that most of this is spent initialising
> double-buffering: doublebuf_pageflipping_init calls
> grub_video_fb_create_render_target_from_pointer twice, and each call
> takes a little over 600 milliseconds.  Now,
> grub_video_fb_create_render_target_from_pointer is basically just a big
> grub_memset to clear framebuffer memory, so this equates to under two
> frames per second.  What's going on?
> It turns out that write caching is disabled on video memory when GRUB is
> running, so we take a cache stall on every single write, and it's
> apparently hard to enable caching without implementing MTRRs.  People
> who know more about this than I do tell me that this can get
> unpleasantly CPU-specific at times, although I still hold out some hope
> that it's possible in GRUB.
On non-device memory GRUB should take advantage of the cache. On MIPS,
enabling or disabling the cache is done by using a different address, so
all the infrastructure necessary for differentiating cacheable from
non-cacheable memory is already present. Enabling the cache on video
memory is, however, more trouble. One of the reasons is that cache
mishandling produces bugs that are difficult to track down.
> However, there's a way to substantially speed things up without that.
> The naïve implementation of grub_memset writes a byte at a time, and for
> that matter on i386 it compiles to a poorly-optimised loop rather than
> using REP STOS or similar.  grub_memset is an inner loop practically by
> definition, and it's worth optimising.  We can fix both of these
> weaknesses by importing the optimised memset from GNU libc: since it
> writes four bytes at a time except (sometimes) at the start and end, it
> should take about a quarter the number of cache stalls.  And, indeed,
> measurement bears this out: instead of taking over 600 milliseconds per
> call to grub_video_fb_create_render_target_from_pointer (I think it was
> actually 630 or so, though I neglected to write that down), GRUB now
> takes about 160 milliseconds per call.  Much better!
> The optimised memset is LGPLv2.1 or later, and I've preserved that
> notice, but as far as I know this should be fine for use in GRUB; it can
> be upgraded to LGPLv3, and that's just GPLv3 with some additional
> permissions.  It's already assigned to the FSF due to being in glibc.
It's OK to use this code, but be sure to mention its origin. It's also OK
to keep its license unless a big divergence is to be expected.

Did you test it on x86_64?
> +void *
> +grub_memset (void *s, int c, grub_size_t n)
> +{
> +  unsigned char *p = (unsigned char *) s;
> +
> +  while (n--)
> +    *p++ = (unsigned char) c;
> +
> +  return s;
> +}
This can be optimised the same way as the i386 part: just replace stos
with a loop that stores through a word-sized pointer once that pointer is
aligned on its size.
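
A minimal sketch of such a loop, for illustration only. The name
sketch_memset is hypothetical, and this is neither the glibc code nor the
code from the patch. It assumes that unsigned long is the natural word
size of the target and that storing words into the caller's buffer is
acceptable under the compiler's aliasing settings:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: word-at-a-time memset without any inline asm. */
void *
sketch_memset (void *s, int c, size_t n)
{
  unsigned char *p = (unsigned char *) s;
  unsigned char byte = (unsigned char) c;
  unsigned long word;
  unsigned long *w;

  /* Head: byte stores until p is aligned on the word size (or n runs out). */
  while (n > 0 && ((uintptr_t) p % sizeof (unsigned long)) != 0)
    {
      *p++ = byte;
      n--;
    }

  /* Replicate the byte into every byte of a word:
     ~0UL / 0xff is 0x0101...01, so the product is 0xcccc...cc for byte 0xcc. */
  word = (unsigned long) byte * (~0UL / 0xff);

  /* Body: one aligned word store per iteration.  This assumes that the
     compiler tolerates storing words into the caller's buffer.  */
  w = (unsigned long *) p;
  while (n >= sizeof (unsigned long))
    {
      *w++ = word;
      n -= sizeof (unsigned long);
    }

  /* Tail: the remaining bytes, fewer than one word. */
  p = (unsigned char *) w;
  while (n > 0)
    {
      *p++ = byte;
      n--;
    }

  return s;
}
```

The head and tail loops still write single bytes, so only the aligned
middle of the buffer benefits. For a framebuffer-sized region that is
nearly all of it.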
> Thanks,

Vladimir 'φ-coder/phcoder' Serbinenko

Attachment: signature.asc
Description: OpenPGP digital signature