Re: [Qemu-devel] [PATCH v3 1/1] target-arm: Use Neon for zero checking

qemu-devel

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Qemu-devel] [PATCH v3 1/1] target-arm: Use Neon for zero checking

From:	Vijay Kilari
Subject:	Re: [Qemu-devel] [PATCH v3 1/1] target-arm: Use Neon for zero checking
Date:	Tue, 5 Jul 2016 17:54:18 +0530

On Sat, Jul 2, 2016 at 3:37 AM, Richard Henderson <address@hidden> wrote:
> On 06/30/2016 06:45 AM, Peter Maydell wrote:
>>
>> On 29 June 2016 at 09:47,  <address@hidden> wrote:
>>>
>>> From: Vijay <address@hidden>
>>>
>>> Use Neon instructions to perform zero checking of
>>> buffer. This is helps in reducing total migration time.
>>
>>
>>> diff --git a/util/cutils.c b/util/cutils.c
>>> index 5830a68..4779403 100644
>>> --- a/util/cutils.c
>>> +++ b/util/cutils.c
>>> @@ -184,6 +184,13 @@ int qemu_fdatasync(int fd)
>>>  #define SPLAT(p)       _mm_set1_epi8(*(p))
>>>  #define ALL_EQ(v1, v2) (_mm_movemask_epi8(_mm_cmpeq_epi8(v1, v2)) ==
>>> 0xFFFF)
>>>  #define VEC_OR(v1, v2) (_mm_or_si128(v1, v2))
>>> +#elif __aarch64__
>>> +#include "arm_neon.h"
>>> +#define VECTYPE        uint64x2_t
>>> +#define ALL_EQ(v1, v2) \
>>> +        ((vgetq_lane_u64(v1, 0) == vgetq_lane_u64(v2, 0)) && \
>>> +         (vgetq_lane_u64(v1, 1) == vgetq_lane_u64(v2, 1)))
>>> +#define VEC_OR(v1, v2) ((v1) | (v2))
>>
>>
>> Should be '#elif defined(__aarch64__)'. I have made this
>> tweak and put this patch in target-arm.next.
>
>
> Consider
>
> #define VECTYPE        uint32x4_t
> #define ALL_EQ(v1, v2) (vmaxvq_u32((v1) ^ (v2)) == 0)
>
>
> which compiles down to
>
>   1c:   6e211c00        eor     v0.16b, v0.16b, v1.16b
>   20:   6eb0a800        umaxv   s0, v0.4s
>   24:   1e260000        fmov    w0, s0
>   28:   6b1f001f        cmp     w0, wzr
>   2c:   1a9f17e0        cset    w0, eq
>   30:   d65f03c0        ret

For me this code compiles as below and migration time is ~100ms more.

See below 3 trails of migration time

  7039cc:       6eb0a800        umaxv   s0, v0.4s
  7039d0:       0e043c02        mov     w2, v0.s[0]
  7039d4:       350000c2        cbnz    w2, 7039ec <f0+0xf4>
  7039d8:       91002084        add     x4, x4, #0x8
  7039dc:       91020063        add     x3, x3, #0x80
  7039e0:       eb01009f        cmp     x4, x1

(qemu) info migrate
capabilities: xbzrle: off rdma-pin-all: off auto-converge: off
zero-blocks: off compress: off events: off x-postcopy-ram: off
Migration status: completed
total time: 3070 milliseconds
downtime: 55 milliseconds
setup: 4 milliseconds
transferred ram: 300637 kbytes
throughput: 802.49 mbps
remaining ram: 0 kbytes
total ram: 8519872 kbytes
duplicate: 2062834 pages
skipped: 0 pages
normal: 70489 pages
normal bytes: 281956 kbytes
dirty sync count: 3

(qemu) info migrate
capabilities: xbzrle: off rdma-pin-all: off auto-converge: off
zero-blocks: off compress: off events: off x-postcopy-ram: off
Migration status: completed
total time: 3067 milliseconds
downtime: 47 milliseconds
setup: 5 milliseconds
transferred ram: 290277 kbytes
throughput: 775.61 mbps
remaining ram: 0 kbytes
total ram: 8519872 kbytes
duplicate: 2064185 pages
skipped: 0 pages
normal: 67901 pages
normal bytes: 271604 kbytes
dirty sync count: 3
(qemu)

(qemu) info migrate
capabilities: xbzrle: off rdma-pin-all: off auto-converge: off
zero-blocks: off compress: off events: off x-postcopy-ram: off
Migration status: completed
total time: 3067 milliseconds
downtime: 34 milliseconds
setup: 5 milliseconds
transferred ram: 294614 kbytes
throughput: 787.19 mbps
remaining ram: 0 kbytes
total ram: 8519872 kbytes
duplicate: 2063365 pages
skipped: 0 pages
normal: 68985 pages
normal bytes: 275940 kbytes
dirty sync count: 3

>
> vs
>
>   34:   4e083c20        mov     x0, v1.d[0]
>   38:   4e083c01        mov     x1, v0.d[0]
>   3c:   eb00003f        cmp     x1, x0
>   40:   52800000        mov     w0, #0
>   44:   54000040        b.eq    4c <f0+0x18>
>   48:   d65f03c0        ret
>   4c:   4e183c20        mov     x0, v1.d[1]
>   50:   4e183c01        mov     x1, v0.d[1]
>   54:   eb00003f        cmp     x1, x0
>   58:   1a9f17e0        cset    w0, eq
>   5c:   d65f03c0        ret
>

My patch compiles to below code and takes ~100ms less time

#define VECTYPE        uint64x2_t
#define ALL_EQ(v1, v2) \
        ((vgetq_lane_u64(v1, 0) == vgetq_lane_u64(v2, 0)) && \
         (vgetq_lane_u64(v1, 1) == vgetq_lane_u64(v2, 1)))

  7039d0:       4e083c02        mov     x2, v0.d[0]
  7039d4:       b5000102        cbnz    x2, 7039f4 <f0+0xfc>
  7039d8:       4e183c02        mov     x2, v0.d[1]
  7039dc:       b50000c2        cbnz    x2, 7039f4 <f0+0xfc>
  7039e0:       91002084        add     x4, x4, #0x8
  7039e4:       91020063        add     x3, x3, #0x80
  7039e8:       eb04003f        cmp     x1, x4

capabilities: xbzrle: off rdma-pin-all: off auto-converge: off
zero-blocks: off compress: off events: off x-postcopy-ram: off
Migration status: completed
total time: 2973 milliseconds
downtime: 67 milliseconds
setup: 5 milliseconds
transferred ram: 293659 kbytes
throughput: 809.45 mbps
remaining ram: 0 kbytes
total ram: 8519872 kbytes
duplicate: 2062791 pages
skipped: 0 pages
normal: 68748 pages
normal bytes: 274992 kbytes
dirty sync count: 3
(qemu)

capabilities: xbzrle: off rdma-pin-all: off auto-converge: off
zero-blocks: off compress: off events: off x-postcopy-ram: off
Migration status: completed
total time: 2972 milliseconds
downtime: 47 milliseconds
setup: 5 milliseconds
transferred ram: 295972 kbytes
throughput: 816.10 mbps
remaining ram: 0 kbytes
total ram: 8519872 kbytes
duplicate: 2062861 pages
skipped: 0 pages
normal: 69325 pages
normal bytes: 277300 kbytes
dirty sync count: 3
(qemu)

capabilities: xbzrle: off rdma-pin-all: off auto-converge: off
zero-blocks: off compress: off events: off x-postcopy-ram: off
Migration status: completed
total time: 2982 milliseconds
downtime: 40 milliseconds
setup: 5 milliseconds
transferred ram: 293386 kbytes
throughput: 806.26 mbps
remaining ram: 0 kbytes
total ram: 8519872 kbytes
duplicate: 2063199 pages
skipped: 0 pages
normal: 68679 pages
normal bytes: 274716 kbytes
dirty sync count: 4
(qemu)

Regards
Vijay

[Prev in Thread]

Current Thread

[Next in Thread]

Re: [Qemu-devel] [PATCH v3 1/1] target-arm: Use Neon for zero checking, Richard Henderson, 2016/07/01
- Re: [Qemu-devel] [PATCH v3 1/1] target-arm: Use Neon for zero checking, Peter Maydell, 2016/07/02
- Re: [Qemu-devel] [PATCH v3 1/1] target-arm: Use Neon for zero checking, Vijay Kilari <=
  - Re: [Qemu-devel] [PATCH v3 1/1] target-arm: Use Neon for zero checking, Peter Maydell, 2016/07/11

Prev by Date: Re: [Qemu-devel] [PATCH for-2.7 7/8] s390x/css: Factor out virtual css bridge and bus
Next by Date: Re: [Qemu-devel] [PATCH v2 0/6] x86: Physical address limit patches
Previous by thread: Re: [Qemu-devel] [PATCH v3 1/1] target-arm: Use Neon for zero checking
Next by thread: Re: [Qemu-devel] [PATCH v3 1/1] target-arm: Use Neon for zero checking
Index(es):
- Date
- Thread