qemu-devel

Re: [Qemu-devel] [PATCH v6 01/22] instrument: Add documentation


From: Lluís Vilanova
Subject: Re: [Qemu-devel] [PATCH v6 01/22] instrument: Add documentation
Date: Mon, 25 Sep 2017 21:03:39 +0300
User-agent: Gnus/5.13 (Gnus v5.13) Emacs/25.2 (gnu/linux)

First, sorry for the late response; I was away for a few days.


Peter Maydell writes:

> On 18 September 2017 at 18:09, Lluís Vilanova <address@hidden> wrote:
>> Peter Maydell writes:
>>> It's also exposing internal QEMU implementation detail.
>>> What if in future we decide to switch from our current
>>> setup to always interpreting guest instructions as a
>>> first pass with JITting done only in the background for
>>> hot code?
>> 
>> TCI still has a separation of translation-time (translate.c) and
>> execution-time (interpreting the TCG opcodes), and I don't think that's
>> gonna go away anytime soon.

> I didn't mean TCI, which is nothing like what you'd use for
> this if you did it (TCI is slower than just JITting.)

My point is that even on the cold path you need to decode a guest instruction
(equivalent to translating) and emulate it on the spot (equivalent to
executing).


>> Even if it did, I think there still will be a translation/execution
>> separation easy enough to hook into (even if it's a "fake" one for the
>> cold-path interpreted instructions).

> But what would it mean? You don't have basic blocks any more.

Every instruction emulated on the spot can be seen as a newly translated block
(of one instruction only), which is executed immediately after.
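
To illustrate, here is a toy sketch in Python (not QEMU code; every name in it
is hypothetical and not part of any QEMU API). It only shows that a cold-path
interpreter can still report each instruction as a one-instruction block that
is "translated" and then executed right away:

```python
# Toy model: every interpreted instruction is reported as a freshly
# "translated" block of one instruction, executed immediately after.
# All names are hypothetical; none of this is QEMU API.

def interpret(instructions, on_translate, on_execute):
    for pc, opcode in enumerate(instructions):
        block = (pc, (opcode,))   # a block holding exactly one instruction
        on_translate(block)       # decoding stands in for translation
        on_execute(block)         # emulating stands in for execution


events = []
interpret(
    ["load", "add", "store"],
    on_translate=lambda b: events.append(("translate", b[0])),
    on_execute=lambda b: events.append(("execute", b[0])),
)
```

With this model, instrumentation written against translate/execute events
sees the same kind of event stream whether the code was JITted or interpreted.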


>>> Sticking to instrumentation events that correspond exactly to guest
>>> execution events means they won't break or expose internals.
>> 
>> It also means we won't be able to "conditionally" instrument instructions
>> (e.g., based on their opcode, address range, etc.).

> You can still do that, it's just less efficient (your
> condition-check happens in the callout to the instrumentation
> plugin). We can add "filter" options later if we need them
> (which I would rather do than have translate-time callbacks).

Before answering, a short summary of when knowing about translate/execute makes
a difference:

* Record some information only once when an instruction is translated, instead
  of recording it on every executed instruction. For example, a study of opcode
  distribution can be derived from a file of per-TB opcodes (generated at
  translation time) and a list of executed TBs (generated at execution time).
  The translate/execute separation makes this run faster *and* produces much
  smaller files with the recorded info.

  Another typical example that benefits from this is writing a simulator that
  feeds off a stream of instruction information (a common reason why people want
  to trace memory accesses and information about executed instructions).

* Conditionally instrumenting instructions.

Adding filtering to the instrumentation API would only solve the second point,
but not the first one.
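
As a rough illustration of the first point, here is a minimal Python sketch
(again not QEMU code; on_translate, on_execute and the TB ids are hypothetical
names made up for this example). Opcodes are recorded once per TB at
translation time, only TB ids are recorded at execution time, and the opcode
distribution is reconstructed offline from the two:

```python
from collections import Counter

# Hypothetical instrumentation state, not part of any QEMU API.
tb_opcodes = {}     # TB id -> opcode list, written once at translation time
executed_tbs = []   # TB ids, appended on every execution


def on_translate(tb_id, opcodes):
    # Runs once per translated block.
    tb_opcodes[tb_id] = list(opcodes)


def on_execute(tb_id):
    # Runs on every execution; only a small id is recorded.
    executed_tbs.append(tb_id)


def opcode_distribution():
    # Offline step: weight each TB's opcodes by its execution count.
    dist = Counter()
    for tb_id, count in Counter(executed_tbs).items():
        for op in tb_opcodes[tb_id]:
            dist[op] += count
    return dist


on_translate(0, ["load", "add"])
on_translate(1, ["add", "store", "branch"])
for tb in [0, 1, 1, 1]:
    on_execute(tb)
dist = opcode_distribution()
```

The per-execution record is a single id regardless of how many instructions
the TB contains, which is where the smaller files come from.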

Now, do we need/want to support the first point?


>> Of course we can add the translation/execution differentiation later if we
>> find it necessary for performance, but I would rather avoid leaving
>> "historical" instrumentation points behind on the API.
>> 
>> What are the use-cases you're aiming for?

> * I want to be able to point the small stream of people who come
> into qemu-devel asking "how do I trace all my guest's memory
> accesses" at a clean API for it.

> * I want to be able to have less ugly and confusing tracing
> than our current -d output (and perhaps emit tracing in formats
> that other analysis tools want as input)

> * I want to keep this initial tracing API simple enough that
> we can agree on it and get a first working useful version.

Fair enough.

I know it's not exactly the same thing we're discussing, but the plot in [1]
compares a few different ways to trace memory accesses on SPEC benchmarks:

* The first bar uses Intel's tool called PIN [2].
* The second calls into an instrumentation function on every executed memory
  access in QEMU.
* The third embeds the hot path of writing the memory access info to an array
  into the TCG opcode stream (more or less equivalent to supporting filtering;
  when the array is full, a user's callback is called on the cold path).
* The fourth bar can be ignored.

This ran on a much older version of instrumentation for QEMU, but I can
implement something that does the first use-case point above and some filtering
example (second use-case point) to see what the performance difference is.
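
For reference, the buffering scheme behind the third bar can be sketched like
this in Python (a toy model only; the real thing emitted the array write as
TCG opcodes, and AccessBuffer, record and on_full are hypothetical names):

```python
class AccessBuffer:
    """Toy model of the buffered tracing scheme; not QEMU code."""

    def __init__(self, capacity, on_full):
        self.capacity = capacity
        self.on_full = on_full      # user's callback (cold path)
        self.entries = []

    def record(self, addr, size, is_write):
        # Hot path: just append one entry to the array.
        self.entries.append((addr, size, is_write))
        if len(self.entries) == self.capacity:
            self.flush()

    def flush(self):
        # Cold path: hand the whole batch to the user's callback.
        if self.entries:
            self.on_full(self.entries)
            self.entries = []


batches = []
buf = AccessBuffer(capacity=4, on_full=batches.append)
for i in range(10):
    buf.record(0x1000 + 8 * i, 8, is_write=(i % 2 == 1))
buf.flush()   # drain whatever is left at the end of the run
```

The user's code runs once per batch instead of once per memory access, which
is the effect the plot is measuring.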

[1] https://filetea.me/n3wy9WwyCCZR72E9OWXHArHDw
[2] https://software.intel.com/en-us/articles/pin-a-dynamic-binary-instrumentation-tool


Thanks!
  Lluis


