qemu-devel


From: wang Tiger
Subject: Re: [Qemu-devel] Re: Release of COREMU, a scalable and portable full-system emulator
Date: Thu, 22 Jul 2010 23:19:48 +0800

On Thu, Jul 22, 2010 at 9:00 PM, Jan Kiszka <address@hidden> wrote:
> Stefan Hajnoczi wrote:
>> On Thu, Jul 22, 2010 at 9:48 AM, Chen Yufei <address@hidden> wrote:
>>> On 2010-7-22, at 1:04 AM, Stefan Weil wrote:
>>>
>>>> On 21.07.2010 09:03, Chen Yufei wrote:
>>>>> On 2010-7-21, at 5:43 AM, Blue Swirl wrote:
>>>>>
>>>>>
>>>>>> On Sat, Jul 17, 2010 at 10:27 AM, Chen Yufei<address@hidden>  wrote:
>>>>>>
>>>>>>> We are pleased to announce COREMU, which is a "multicore-on-multicore" 
>>>>>>> full-system emulator built on Qemu. (Simply speaking, we made Qemu 
>>>>>>> parallel.)
>>>>>>>
>>>>>>> The project web page is located at:
>>>>>>> http://ppi.fudan.edu.cn/coremu
>>>>>>>
>>>>>>> You can also download the source code, images for playing on sourceforge
>>>>>>> http://sf.net/p/coremu
>>>>>>>
>>>>>>> COREMU is composed of
>>>>>>> 1. a parallel emulation library
>>>>>>> 2. a set of patches to qemu
>>>>>>> (We worked on the master branch, commit 
>>>>>>> 54d7cf136f040713095cbc064f62d753bff6f9d2)
>>>>>>>
>>>>>>> It currently supports full-system emulation of x64 and ARM MPcore 
>>>>>>> platforms.
>>>>>>>
>>>>>>> By leveraging the underlying multicore resources, it can emulate up to 
>>>>>>> 255 cores running commodity operating systems (even on a 4-core 
>>>>>>> machine).
>>>>>>>
>>>>>>> Enjoy,
>>>>>>>
>>>>>> Nice work. Do you plan to submit the improvements back to upstream QEMU?
>>>>>>
>>>>> It would be great if we could submit our code to QEMU, but we do not 
>>>>> know the process.
>>>>> Could you please give us some instructions?
>>>>>
>>>>> --
>>>>> Best regards,
>>>>> Chen Yufei
>>>>>
>>>> Some hints can be found here:
>>>> http://wiki.qemu.org/Contribute/StartHere
>>>>
>>>> Kind regards,
>>>> Stefan Weil
>>> The patch is in the attachment, produced with the command
>>> git diff 54d7cf136f040713095cbc064f62d753bff6f9d2
>>>
>>> In order to separate what needs to be done to make QEMU parallel, we created 
>>> a separate library, and the patched QEMU needs to be compiled and linked 
>>> with that library. To submit our enhancement to QEMU, we may need to 
>>> incorporate this library into QEMU. I don't know what the best solution 
>>> would be.
>>>
>>> Our approach to making QEMU parallel is described at 
>>> http://ppi.fudan.edu.cn/coremu
>>>
>>> I will give a short summary here:
>>>
>>> 1. Each emulated core thread runs a separate binary translation engine and 
>>> has a private code cache. We marked some variables in TCG as thread-local, 
>>> and we also modified the TB invalidation mechanism.
>>>
>>> 2. Each core has a queue holding pending interrupts. The COREMU library 
>>> provides this queue, and interrupt notification is done by sending realtime 
>>> signals to the emulated core thread.
>>>
>>> 3. Atomic instruction emulation has to be modified for parallel emulation. 
>>> We use lightweight memory transactions, which require only a compare-and-swap 
>>> instruction, to emulate atomic instructions.
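
As an illustration of point 3, the idea can be sketched as a compare-and-swap
retry loop (a minimal sketch with made-up names and the GCC `__sync` builtin,
not the actual COREMU code):

```c
#include <stdint.h>

/* Illustrative sketch only: emulate a guest atomic add with a
 * compare-and-swap retry loop.  Each vcpu thread computes the new
 * value privately and commits it only if no other thread modified
 * the location in the meantime. */
static uint32_t emulate_atomic_add(volatile uint32_t *guest_addr, uint32_t val)
{
    uint32_t prev, next;
    do {
        prev = *guest_addr;      /* snapshot the current guest value   */
        next = prev + val;       /* compute the result off to the side */
        /* commit; retry if another vcpu raced us on this address */
    } while (!__sync_bool_compare_and_swap(guest_addr, prev, next));
    return prev;                 /* old value, as e.g. x86 xadd returns */
}
```

The same loop shape covers other read-modify-write guest instructions; only
the "compute next" step changes.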
>>>
>>> 4. Some code in the original QEMU may cause data-race bugs once it is made 
>>> parallel. We fixed these problems.
>>>
>>> --
>>> Best regards,
>>> Chen Yufei
>>
>> Looking at the patch it seems there is a global lock for hardware
>> access via coremu_spin_lock(&cm_hw_lock).  How many cores have you
>> tried running and do you have lock contention data for cm_hw_lock?

The global lock for hardware access is only used for the ARM target in our
implementation, mainly because we are not very familiar with ARM. Four ARM
cores (a Cortex-A9 limitation) can be emulated this way.
For the x86_64 target, we have already made hardware emulation support
concurrent access, and we can emulate 255 cores on a quad-core machine.

>
> BTW, this kind of lock is called qemu_global_mutex in QEMU, so it is a
> sleeping lock there, which is likely better for the code paths it protects
> in upstream. Are they shorter in COREMU?
>
>> Have you thought about making hardware emulation concurrent?
>>
>> These are issues that qemu-kvm faces today since it executes vcpu
>> threads in parallel.  Both qemu-kvm and the COREMU patches could
>> benefit from a solution for concurrent hardware access.

In our implementation for the x86_64 target, all devices except the LAPIC are
emulated in a separate thread, and VCPUs are emulated in other threads
(one thread per VCPU).
By observing some device drivers in Linux, we formed the hypothesis that
drivers in the OS already ensure correct synchronization for concurrent
hardware accesses.

For example, when emulating IDE with bus-master DMA:
1. Two VCPUs will not send disk read/write requests at the same time.
2. A new DMA request will not be sent until the previous one has completed.
These two points guarantee that the emulated IDE with DMA can be concurrently
accessed by either a VCPU thread or the hardware thread without additional
locks.

The only work we need to do is to fix some misbehaving emulated devices in
current QEMU.
For example, in the function ide_write_dma_cb of QEMU:

    if (s->nsector == 0) {
        s->status = READY_STAT | SEEK_STAT;
        ide_set_irq(s->bus);
        /* In parallel emulation, the OS may receive the interrupt here
         * before the DMA state is updated */
    eot:
        bm->status &= ~BM_STATUS_DMAING;
        bm->status |= BM_STATUS_INT;
        bm->dma_cb = NULL;
        bm->unit = -1;
        bm->aiocb = NULL;
        return;
    }

The DMA state is changed after the IRQ has been sent. This is correct in
sequential emulation, but in parallel emulation the OS may find the DMA
still busy even after an end-of-request interrupt has been received.
The correct solution is:

    if (s->nsector == 0) {
        s->status = READY_STAT | SEEK_STAT;
        /* For COREMU, the DMA state needs to be changed before the IRQ is sent */
        bm->status &= ~BM_STATUS_DMAING;
        bm->status |= BM_STATUS_INT;
        bm->dma_cb = NULL;
        bm->unit = -1;
        bm->aiocb = NULL;
        ide_set_irq(s->bus);
        return;
    eot:
        ...
    }

The DMA state needs to be changed before the IRQ is sent, matching what
real hardware does.

Our evaluation shows that the implementation based on this hypothesis handles
concurrent device accesses correctly.
We also use a lock-free queue per VCPU to hold pending interrupt information.
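
A per-VCPU queue of this kind can be sketched as a single-producer/
single-consumer ring buffer (a minimal illustration with made-up names,
not our actual code):

```c
#include <stdint.h>
#include <stdbool.h>

#define INTR_QUEUE_SIZE 64u  /* power of two; illustrative capacity */

/* Sketch of a per-VCPU SPSC lock-free interrupt queue: the device
 * thread enqueues, the vcpu thread dequeues.  No lock is needed
 * because each index has exactly one writer. */
typedef struct {
    volatile unsigned head;          /* written only by the consumer */
    volatile unsigned tail;          /* written only by the producer */
    uint8_t vec[INTR_QUEUE_SIZE];    /* pending interrupt vectors    */
} intr_queue;

static bool intr_enqueue(intr_queue *q, uint8_t vector)
{
    if (q->tail - q->head == INTR_QUEUE_SIZE)
        return false;                /* queue full; caller retries   */
    q->vec[q->tail % INTR_QUEUE_SIZE] = vector;
    __sync_synchronize();            /* publish data before the index */
    q->tail++;
    return true;
}

static bool intr_dequeue(intr_queue *q, uint8_t *vector)
{
    if (q->head == q->tail)
        return false;                /* nothing pending              */
    *vector = q->vec[q->head % INTR_QUEUE_SIZE];
    __sync_synchronize();            /* consume data before the index */
    q->head++;
    return true;
}
```

After a successful enqueue, the device thread would then kick the VCPU
thread, e.g. with a realtime signal via pthread_kill; the exact signal
number is an implementation detail, not something fixed here.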

For your convenience, here is the URL for our project:
http://sourceforge.net/p/coremu/
We will do our best to merge our code upstream. :-)

>
> While we are all looking forward to see more scalable hardware models
> :), I think it is a topic that can be addressed widely independent of
> parallelizing TCG VCPUs. The latter can benefit from the former, for
> sure, but it first of all has to solve its own issues.
>
> Note that --enable-io-thread provides truly parallel KVM VCPUs also in
> upstream these days. Just for TCG, we need that slightly suboptimal CPU
> scheduling inside single-threaded tcg_cpu_exec (was renamed to
> cpu_exec_all today).
>
> Jan
>
> --
> Siemens AG, Corporate Technology, CT T DE IT 1
> Corporate Competence Center Embedded Linux
>
>



-- 
Zhaoguo Wang, Parallel Processing Institute, Fudan University

Address: Room 320, Software Building, 825 Zhangheng Road, Shanghai, China

address@hidden
http://ppi.fudan.edu.cn/zhaoguo_wang


