Re: [Qemu-devel] [RFC 0/5] Slow-path for atomic instruction translation


From: alvise rigo
Subject: Re: [Qemu-devel] [RFC 0/5] Slow-path for atomic instruction translation
Date: Mon, 11 May 2015 11:08:05 +0200

Hi,

On Fri, May 8, 2015 at 5:22 PM, Alex Bennée <address@hidden> wrote:
>
> Alvise Rigo <address@hidden> writes:
>
>> This patch series provides an infrastructure for atomic
>> instruction implementation in QEMU, paving the way for TCG multi-threading.
>> The adopted design does not rely on host atomic
>> instructions and is intended to propose a 'legacy' solution for
>> translating guest atomic instructions.
>
> Thanks for posting this.
>
>> The underlying idea is to provide new TCG instructions that guarantee
>> atomicity to some memory accesses or in general a way to define memory
>> transactions. More specifically, a new pair of TCG instructions are
>> implemented, qemu_ldlink_i32 and qemu_stcond_i32, that behave as
>> LoadLink and StoreConditional primitives (only 32 bit variant
>> implemented).  In order to achieve this, a new bitmap is added to the
>> ram_list structure (always unique) which flags all memory pages that
>> could not be accessed directly through the fast-path, due to previous
>> exclusive operations. This new bitmap is coupled with a new TLB flag
>> which forces the slow-path execution. All stores which take place
>> between an LL/SC operation by other vCPUs in the same memory page, will
>> fail the subsequent StoreConditional.
>
> Do you have any figures for contention with these page aligned exclusive
> locks? On ARMv8 the global monitor reservation granule is IMDEF but
> generally smaller than the page size. If a large number of exclusively
> accessed variables end up in the same page there could be a lot of
> failed exclusive ops compared to the real world.

The reservation granule here is certainly much larger than in a
typical ARM implementation.
However, this could be improved by tracking one exact 'linked'
address per memory page.
All accesses to the page would still follow the slow-path, but only
writes to the linked address would mark the page dirty.
In essence, the following limitations would remain:
- the slow-path is forced at page granularity
- only one linked address per page (this might not match the real behaviour)
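
As a rough illustration, the one-linked-address-per-page idea could be
modelled as below. This is a self-contained toy in C, not code from the
series: PageState, toy_ldlink, toy_store and toy_stcond are hypothetical
names, and the sketch assumes word-aligned 32-bit accesses, a single
reservation per page, and no real TLB or vCPU threads. The return value
of the store-conditional follows the cover letter (0 on success, 1 on
failure).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PAGE_BITS 12
#define NUM_PAGES 16                 /* toy guest RAM: 16 pages of 4 KiB */

/* Hypothetical per-page state; the series itself uses a bitmap in ram_list. */
typedef struct {
    bool excl;       /* page is forced onto the slow-path */
    bool dirty;      /* linked address was written since the last LoadLink */
    uint32_t linked; /* the single linked address tracked for this page */
} PageState;

static PageState pages[NUM_PAGES];
static uint32_t ram[(NUM_PAGES << PAGE_BITS) / 4];

static PageState *page_of(uint32_t addr)
{
    uint32_t idx = addr >> PAGE_BITS;
    assert(idx < NUM_PAGES);
    return &pages[idx];
}

/* LoadLink: remember the linked address, clear dirty, force the slow-path. */
static uint32_t toy_ldlink(uint32_t addr)
{
    PageState *p = page_of(addr);
    p->excl = true;
    p->linked = addr;
    p->dirty = false;
    return ram[addr / 4];
}

/* Ordinary store: only a write to the linked address dirties the page. */
static void toy_store(uint32_t addr, uint32_t val)
{
    PageState *p = page_of(addr);
    if (p->excl && p->linked == addr) {
        p->dirty = true;
    }
    ram[addr / 4] = val;
}

/* StoreConditional: returns 0 on success, 1 on failure. */
static int toy_stcond(uint32_t addr, uint32_t val)
{
    PageState *p = page_of(addr);
    bool ok = p->excl && p->linked == addr && !p->dirty;
    p->excl = false;                 /* the reservation is consumed either way */
    if (!ok) {
        return 1;
    }
    ram[addr / 4] = val;
    return 0;
}
```

In this model a store to another word of the same page between
toy_ldlink() and toy_stcond() leaves the reservation intact, while a
store to the linked address itself makes the toy_stcond() fail.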

>
>>
>> In theory, the provided implementation of TCG LoadLink/StoreConditional
>> can be used to properly handle atomic instructions on any architecture.
>>
>> The new slow-path is implemented such that:
>> - the LoadLink behaves as a normal load slow-path, except for cleaning
>>   the dirty flag in the bitmap. The TLB entries created from now on will
>>   force the slow-path. To ensure it, we flush the TLB cache for the
>>   other vCPUs
>> - the StoreConditional behaves as a normal store slow-path, except for
>>   checking the state of the dirty bitmap and returning 0 or 1 whether or
>>   not the StoreConditional succeeded (0 when no vCPU has touched the
>>   same memory in the mean time).
>>
>> All those write accesses that are forced to follow the 'legacy'
>> slow-path will set the accessed memory page to dirty.
>>
>> In this series only the ARM ldrex/strex instructions are implemented.
>> The code was tested with bare-metal test cases and with Linux, using
>> upstream QEMU.
>
> Have you developed any specific test cases to exercise the logic?

Yes, some trivial bare-metal tests in which I wrote to the linked
address in the middle of an LL/SC sequence.
I think the next mandatory test is to run it with real multi-threading
to exercise the logic even more.

Thank you,
alvise

>
>>
>> This work has been sponsored by Huawei Technologies Dusseldorf GmbH.
>>
>> Alvise Rigo (5):
>>   exec: Add new exclusive bitmap to ram_list
>>   Add new TLB_EXCL flag
>>   softmmu: Add helpers for a new slow-path
>>   tcg-op: create new TCG qemu_ldlink and qemu_stcond instructions
>>   target-arm: translate: implement qemu_ldlink and qemu_stcond ops
>>
>>  cputlb.c                |  11 ++-
>>  include/exec/cpu-all.h  |   1 +
>>  include/exec/cpu-defs.h |   2 +
>>  include/exec/memory.h   |   3 +-
>>  include/exec/ram_addr.h |  19 +++-
>>  softmmu_llsc_template.h | 233 ++++++++++++++++++++++++++++++++++++++++++++++++
>>  softmmu_template.h      |  52 ++++++++++-
>>  target-arm/translate.c  |  94 ++++++++++++++++++-
>>  tcg/arm/tcg-target.c    | 105 ++++++++++++++++------
>>  tcg/tcg-be-ldst.h       |   2 +
>>  tcg/tcg-op.c            |  20 +++++
>>  tcg/tcg-op.h            |   3 +
>>  tcg/tcg-opc.h           |   4 +
>>  tcg/tcg.c               |   2 +
>>  tcg/tcg.h               |  20 +++++
>>  15 files changed, 538 insertions(+), 33 deletions(-)
>>  create mode 100644 softmmu_llsc_template.h
>
> --
> Alex Bennée


