Re: msgmerge speedup: fstrcmp and diffseq improvements

From: Ralf Wildenhues
Subject: Re: msgmerge speedup: fstrcmp and diffseq improvements
Date: Sat, 20 Sep 2008 17:30:58 +0200
User-agent: Mutt/1.5.18 (2008-05-17)

Hi Bruno,

Thanks for your reply.

* Bruno Haible wrote on Sat, Sep 20, 2008 at 04:20:02PM CEST:
> > > It is all committed now.
> > 
> > Thanks.  Some change between then and now caused differences in the
> > coreutils po files, though.
> Yes, this is intentional.

OK, good to know.  I was afraid some bug had crept in.

> > The shortcut ratios are now (in percentage of remaining pairs taken;
> > averaged over all .po files in coreutils not needing iconv):
> Thanks for these figures. I'm adding them as comments to the code, to help
> future maintenance.


> > I tried a 2-gram and 4-gram bound but their hit rate was around 1% so
> > I discarded that again (for the 4-gram that was to be expected due to
> > the indexing already done).
> Btw, why 4-grams? Why not 3-grams or 5-grams? I have not benchmarked it.

Uh, the coding effort was rather low: I just used shorts (resp. ints)
to hold the grams and sorted them with qsort.  I didn't go for an
efficient data structure just to measure hit rates, so I didn't reuse
the msgl-fsearch code.  Also, at least when looking for a strict lower
bound, N-grams with larger N are less efficient as a bound anyway,
judging from the formula.

> > > + #if CHAR_BIT <= 8
> > 
> > FWIW, CHAR_BIT cannot be smaller than 8.
> The intent is to disable this code when CHAR_BIT is 16 or 32.
> I could also have written '#if CHAR_BIT <= 10' but that sounds too much
> like nonsense.

Erm, what about 'CHAR_BIT == 8'?  I know this is really picky, but the
above simply looks to me like someone didn't know CHAR_BIT cannot be
smaller than 8.

> > FWIW2, not inlining the bound functions allows
> > - for less stack usage (in the frequency bound case), or
> We're near the end of the stack here: only compareseq and diag come below it.
> Anyway, 1 KB of stack space looks OK, the default stack size for threads being
> at least 16 KB.

OK.  Doesn't take into account cache locality though.

> > - the compiler to decide about it (all modern compilers can do that),
> But the compiler does not know that fstrcmp is called millions of times and
> that this piece of code needs to be optimized for speed rather than for space.

This assumes that inlining frequency_bound is actually a win.  That is
not clear at all: the larger function increases register pressure on
register-poor machines and may decrease cache locality; also, I'm not
sure the function call overhead is even measurable compared to the
amount of work this function does.  Anyway, it was in the low 1% range
of total time when gprof'ing unoptimized code.

> Does GCC meanwhile have a heuristic that forces inlining of a 'static'
> function that is only used once in the compilation unit?

Dunno; does adding 'inline' help?  If profile-based optimization doesn't
do a good job here (and see above for why it may not be obvious which
choice is better), then that optimization isn't worth its name.

Anyway, for these issues I tend to think that compilers know better than
I do what is best, or at least will be able to soonish, and the best
answer may not be the same across all systems and processors.  And
readability is IMVHO a lot better with short, meaningful functions.
Granted, both points are to some extent personal preference and experience.

> > - for easier profiling with gprof (when optimization is turned off),
> Sorry, I hate to write source code to care about deficiencies of a
> particular tool.

Erm, like compilers with deficient inlining capabilities?  Sorry, but
this is a non-argument: you would not be accommodating the deficient
tool, you would be accommodating me, or whoever else happens to be
willing to work on this code, by keeping it easier to read.

Speaking of which, it would be very cool if the gettext CVS had commit
emails turned on to some list.  Not that I have concrete intentions to
do a lot with the code, but even just for following development, I've
become so spoiled with git that CVS feels like "no idea what's going
on".
> > - to eventually look into building a product metric out of them, for
> >   "clustering" or so.
> Show me that you can do something with the "clustering" idea here, then
> I'm open to all patches that you send :-)

Point well taken.  Hot air doesn't accomplish much.

> > Just curious: is it advantageous to traverse the strings backwards?
> I think the order should have no effect on the cache, since the cache lines
> will simply be traversed in opposite order. The next most important topic
> to look at, for speed, are the CPU instructions. And indeed, when the compiler
> does not apply loop strength reduction (i.e. rearrange the loop in completely
> unobvious ways) comparing i against 0 rather than against xvec_length saves
> 2 instructions inside the loop on x86, 1 instruction on SPARC.  See the
> attached output, produced with gcc 4.3.0.

Wow, how disappointing; I hope 4.4.0 will do better.  :-/
Thanks for measuring this.

> +/* In the code below, branch probabilities were measured by Ralf Wildenhues,
> +   by running "msgmerge LL.po coreutils.pot" with msgmerge 0.18 for many
> +   values of LL.  The probability indicates that the condition evaluates
> +   to true; whether that leads to a branch or a non-branch in the code,
> +   depends on the compiler's reordering of basic blocks.  */

Would the code benefit from 'likely' and 'unlikely' macros such as:

#define likely(x)       __builtin_expect((x),1)
#define unlikely(x)     __builtin_expect((x),0)

That looks like something gnulib could benefit from, moving away from
having tests for this in macros like gt_INTL_SUBDIR_CORE, or in
memmem.c but only when _LIBC is defined.  (Yep, there goes that hot
air again ... ;-)

