Re: [Qemu-devel] [PATCHv4 0/9] buffer_is_zero / migration optimizations


From: Peter Lieven
Subject: Re: [Qemu-devel] [PATCHv4 0/9] buffer_is_zero / migration optimizations
Date: Mon, 25 Mar 2013 14:32:13 +0100

On 25.03.2013 at 14:23, Peter Lieven <address@hidden> wrote:

> 
> On 25.03.2013 at 14:02, Paolo Bonzini <address@hidden> wrote:
> 
>>> Maybe I should have explained the output in more detail. The percentages
>>> are cumulative: 35.8% in the second-to-last column means that
>>> 35.8% of pages have a return value that is less than TARGET_PAGE_SIZE.
>>> This was meant to illustrate how many 64-bit chunks you have
>>> to look at to catch a certain percentage of the non-zero pages.
>> 
>> Ok, I wrongly understood that many pages had 4088 zero bytes but
>> the last 8 were not zero.  Now it's clearer, and more logical too. :)
>> 
>>> Looking e.g. at the third value: checking the first three 64-bit
>>> chunks already catches 34.0% of all pages. It turns out that the
>>> non-zeroness of a page can usually be detected by looking at the
>>> first 256 or so bits; only a low percentage of pages turn out to be
>>> non-zero at a later position. So after having checked the first
>>> chunks one by one, there is no big penalty in looking at the
>>> remaining chunks with the vectorized loop.
>> 
>> I think it makes most sense to unroll the first four non-vectorized
>> iterations, i.e. not use SSE and use three or four ifs.  Either:
>> 
>>  if (foo[0]) return 0;
>>  if (foo[1]) return 8;
>>  if (foo[2]) return 16;
>>  if (foo[3]) return 24;
>> 
>> or
>> 
>>  if (foo[0]) return 0;
>>  if (foo[1] | foo[2] | foo[3]) return 8;
>> 
>> and then proceed on the remaining 4096-4*sizeof(long) bytes with
>> the vectorized loop.  foo+4 is aligned for SIMD operations on both
>> 32- and 64-bit machines (4*sizeof(long) is 16 bytes on 32-bit and
>> 32 bytes on 64-bit, both multiples of the 16-byte SSE alignment),
>> which makes this a nice choice.
> 
> I can't start at foo+4 since the remaining X-4*sizeof(long) bytes
> are not divisible by 8*sizeof(VECTYPE) (with 128-bit SSE registers
> that is 128 bytes, and e.g. 4096 - 32 = 4064 is not a multiple of 128).
> 
> I could just do something like the following:
> 
>    const unsigned long *tmp = buf;
> 
>    for (i = 0; 
>         i < sizeof(VECTYPE) * BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR
>             / sizeof(unsigned long);
>         i += 4) {
>        if (tmp[i + 0]) return i * sizeof(unsigned long);
>        if (tmp[i + 1]) return (i+1) * sizeof(unsigned long);
>        if (tmp[i + 2]) return (i+2) * sizeof(unsigned long);
>        if (tmp[i + 3]) return (i+3) * sizeof(unsigned long);
>    }
> 
>    for (i = BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR; 
>         i < len / sizeof(VECTYPE); 
>         i += BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR) {
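>        /* the unrolled vectorized loop over the remaining chunks
>         * goes here */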
>
>    }

The performance of the above is poor compared to:

    for (i = 0; i < BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR; i++) {
        if (!ALL_EQ(p[i], zero)) {
            return i * sizeof(VECTYPE);
        }
    }

…

The above is basically what the old is_dup_page is doing, but after the
first 8 iterations the optimized version kicks in.
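
For reference, a minimal self-contained sketch of the combined approach
(assuming SSE2; VECTYPE, ALL_EQ and UNROLL_FACTOR below are illustrative
stand-ins for the macros in the patch, not the actual code):

    /* Minimal sketch, assuming SSE2 and a VECTYPE-aligned buffer whose
     * length is a nonzero multiple of UNROLL_FACTOR * sizeof(VECTYPE). */
    #include <emmintrin.h>
    #include <stddef.h>

    #define VECTYPE        __m128i
    #define ALL_EQ(a, b)   (_mm_movemask_epi8(_mm_cmpeq_epi8(a, b)) == 0xFFFF)
    #define UNROLL_FACTOR  8

    /* Returns the byte offset of the first chunk (or, in the unrolled
     * loop, chunk group) that is not all zero, or len if the whole
     * buffer is zero. */
    static size_t find_nonzero_offset(const void *buf, size_t len)
    {
        const VECTYPE *p = buf;
        const VECTYPE zero = _mm_setzero_si128();
        size_t i;

        /* Check the first chunks one by one: as the statistics above
         * show, most non-zero pages are caught here. */
        for (i = 0; i < UNROLL_FACTOR; i++) {
            if (!ALL_EQ(p[i], zero)) {
                return i * sizeof(VECTYPE);
            }
        }

        /* The rest of the page is almost always zero, so the coarser
         * unrolled vector loop pays off from here on. */
        for (i = UNROLL_FACTOR; i < len / sizeof(VECTYPE); i += UNROLL_FACTOR) {
            VECTYPE t01 = _mm_or_si128(p[i + 0], p[i + 1]);
            VECTYPE t23 = _mm_or_si128(p[i + 2], p[i + 3]);
            VECTYPE t45 = _mm_or_si128(p[i + 4], p[i + 5]);
            VECTYPE t67 = _mm_or_si128(p[i + 6], p[i + 7]);
            if (!ALL_EQ(_mm_or_si128(_mm_or_si128(t01, t23),
                                     _mm_or_si128(t45, t67)), zero)) {
                return i * sizeof(VECTYPE); /* start of the dirty group */
            }
        }
        return len;
    }

The short one-by-one prefix keeps the common case (non-zero data near the
start of the page) cheap, while the OR-tree in the unrolled loop keeps the
all-zero case fast.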

Peter



