qemu-devel
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Qemu-devel] [RFC] find_next_bit optimizations


From: ronnie sahlberg
Subject: Re: [Qemu-devel] [RFC] find_next_bit optimizations
Date: Mon, 11 Mar 2013 10:06:47 -0700

Even more efficient might be to do bitwise instead of logical or

>        if (tmp | d1 | d2 | d3) {

that should remove 3 of the 4 conditional jumps
and should become 3 bitwise ors and one conditional jump


On Mon, Mar 11, 2013 at 8:58 AM, Paolo Bonzini <address@hidden> wrote:
> Il 11/03/2013 16:37, Peter Lieven ha scritto:
>>
>> Am 11.03.2013 um 16:29 schrieb Paolo Bonzini <address@hidden>:
>>
>>> Il 11/03/2013 16:24, Peter Lieven ha scritto:
>>>>
>>>>> How would that be different in your patch?  But you can solve it by
>>>>> making two >= loops, one checking for 4*BITS_PER_LONG and one checking
>>>>> BITS_PER_LONG.
>>>>
>>>> This is what I have now:
>>>>
>>>> diff --git a/util/bitops.c b/util/bitops.c
>>>> index e72237a..b0dc93f 100644
>>>> --- a/util/bitops.c
>>>> +++ b/util/bitops.c
>>>> @@ -24,12 +24,13 @@ unsigned long find_next_bit(const unsigned long *addr, 
>>>> unsigned long size,
>>>>     const unsigned long *p = addr + BITOP_WORD(offset);
>>>>     unsigned long result = offset & ~(BITS_PER_LONG-1);
>>>>     unsigned long tmp;
>>>> +    unsigned long d0,d1,d2,d3;
>>>>
>>>>     if (offset >= size) {
>>>>         return size;
>>>>     }
>>>>     size -= result;
>>>> -    offset %= BITS_PER_LONG;
>>>> +    offset &= (BITS_PER_LONG-1);
>>>>     if (offset) {
>>>>         tmp = *(p++);
>>>>         tmp &= (~0UL << offset);
>>>> @@ -42,7 +43,19 @@ unsigned long find_next_bit(const unsigned long *addr, 
>>>> unsigned long size,
>>>>         size -= BITS_PER_LONG;
>>>>         result += BITS_PER_LONG;
>>>>     }
>>>> -    while (size & ~(BITS_PER_LONG-1)) {
>>>> +    while (size >= 4*BITS_PER_LONG) {
>>>> +        d0 = *p;
>>>> +        d1 = *(p+1);
>>>> +        d2 = *(p+2);
>>>> +        d3 = *(p+3);
>>>> +        if (d0 || d1 || d2 || d3) {
>>>> +            break;
>>>> +        }
>>>> +        p+=4;
>>>> +        result += 4*BITS_PER_LONG;
>>>> +        size -= 4*BITS_PER_LONG;
>>>> +    }
>>>> +    while (size >= BITS_PER_LONG) {
>>>>         if ((tmp = *(p++))) {
>>>>             goto found_middle;
>>>>         }
>>>>
>>>
>>> Minus the %= vs. &=,
>>>
>>> Reviewed-by: Paolo Bonzini <address@hidden>
>>>
>>> Perhaps:
>>>
>>>        tmp = *p;
>>>        d1 = *(p+1);
>>>        d2 = *(p+2);
>>>        d3 = *(p+3);
>>>        if (tmp) {
>>>            goto found_middle;
>>>        }
>>>        if (d1 || d2 || d3) {
>>>            break;
>>>        }
>>
>> i do not know what gcc interally makes of the d0 || d1 || d2 || d3 ?
>
> It depends on the target and how expensive branches are.
>
>> i would guess its sth like one addition w/ carry and 1 test?
>
> It could be either 4 compare-and-jump sequences, or 3 bitwise ORs
> followed by a compare-and-jump.
>
> That is, either:
>
>      test %r8, %r8
>      jz   second_loop
>      test %r9, %r9
>      jz   second_loop
>      test %r10, %r10
>      jz   second_loop
>      test %r11, %r11
>      jz   second_loop
>
> or
>
>      or %r9, %r8
>      or %r11, %r10
>      or %r8, %r10
>      jz   second_loop
>
> Don't let the length of the code fool you.  The processor knows how to
> optimize all of these, and GCC knows too.
>
>> your proposed change would introduce 2 tests (maybe)?
>
> Yes, but I expect they to be fairly well predicted.
>
>> what about this to be sure?
>>
>>        tmp = *p;
>>        d1 = *(p+1);
>>        d2 = *(p+2);
>>        d3 = *(p+3);
>>        if (tmp || d1 || d2 || d3) {
>>            if (tmp) {
>>                goto found_middle;
>
> I suspect that GCC would rewrite it my version (definitely if it
> produces 4 compare-and-jumps; but possibly it does it even if it goes
> for bitwise ORs, I haven't checked.
>
> Regarding your other question ("one last thought. would it make sense to
> update only `size`in the while loops and compute the `result` at the end
> as `orgsize` - `size`?"), again the compiler knows better and might even
> do this for you.  It will likely drop the p increases and use p[result],
> so if you do that change you may even get the same code, only this time
> p is increased and you get an extra subtraction at the end. :)
>
> Bottom line: don't try to outsmart an optimizing C compiler on
> micro-optimization, unless you have benchmarked it and it shows there is
> a problem.
>
> Paolo
>
>>            }
>>            break;
>>        }
>>
>> Peter
>>
>
>



reply via email to

[Prev in Thread] Current Thread [Next in Thread]