qemu-devel

Re: [PATCH v7 00/46] CXL 2.0 emulation Support


From: Mark Cave-Ayland
Subject: Re: [PATCH v7 00/46] CXL 2.0 emulation Support
Date: Thu, 17 Mar 2022 08:12:56 +0000
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:91.0) Gecko/20100101 Thunderbird/91.6.2

On 16/03/2022 18:26, Jonathan Cameron via wrote:
On Wed, 16 Mar 2022 17:58:46 +0000
Jonathan Cameron <Jonathan.Cameron@Huawei.com> wrote:

On Wed, 16 Mar 2022 17:16:55 +0000
Mark Cave-Ayland <mark.cave-ayland@ilande.co.uk> wrote:

On 16/03/2022 16:50, Jonathan Cameron via wrote:
On Thu, 10 Mar 2022 16:02:22 +0800
Peter Xu <peterx@redhat.com> wrote:
On Wed, Mar 09, 2022 at 11:28:27AM +0000, Jonathan Cameron wrote:
Hi Peter,

Hi, Jonathan,

https://lore.kernel.org/qemu-devel/20220306174137.5707-35-Jonathan.Cameron@huawei.com/

Having mr->ops set but with memory_access_is_direct() returning true sounds
weird to me.

Sorry to have no understanding of the whole picture, but.. could you share
more on what's the interleaving requirement on the proxying, and why it
can't be done with adding some IO memory regions as sub-regions upon the
file one?

The proxying requirement is simply a means to read/write to a computed address
within a memory region. There may well be a better way to do that.

If I understand your suggestion correctly you would need a very high
number of IO memory regions to be created dynamically when particular sets of
registers across multiple devices in the topology are all programmed.

The interleave can be 256 bytes across up to 16 many-terabyte devices.
So assuming a simple set of 16 1TB devices, I think you'd need about 4x10^9
IO regions per device (1TB / 256 bytes = 2^32).  Even for a minimal useful
test case of the largest interleave set, 16x 256MB devices (256MB is the
minimum size the specification allows per decoded region at the device)
with 16-way interleave, we'd need 10^6 IO regions.
Any idea if that approach would scale sensibly to this number of regions?

There are also complexities in getting all the information in one place to
work out which IO memory region maps where in PA space. The current solution
is to do that mapping in the same way the hardware does, which is hierarchical:
we walk the path to the device, picking directions based on each interleave
decoder that we meet.
Obviously this is a bit slow but I only really care about correctness at the
moment.  I can think of various approaches to speeding it up but I'm not sure
if we will ever care about performance.

https://gitlab.com/jic23/qemu/-/blob/cxl-v7-draft-2-for-test/hw/cxl/cxl-host.c#L131
has the logic for that and as you can see it's fairly simple because we are
always going down the topology following the decoders.

Below I have mapped out an algorithm I think would work for doing it with
IO memory regions as subregions.

We could fake the whole thing by limiting ourselves to small host
memory windows which are always directly backed, but then I wouldn't
achieve the main aim of this, which is to provide a test base for the OS code.
To do that I need real interleave so I can seed the files with test patterns
and verify the accesses hit the correct locations. Emulating what the hardware
is actually doing on a device-by-device basis is the easiest way I have
come up with to do that.

Let me try to provide some more background so you hopefully don't have
to have read the specs to follow what is going on!
There is an example of a directly connected (no switches) topology in the
docs:

https://gitlab.com/jic23/qemu/-/blob/cxl-v7-draft-2-for-test/docs/system/devices/cxl.rst

The overall picture is we have a large number of CXL Type 3 memory devices,
which at runtime (by the OS at boot or on hotplug) are configured into various
interleaving sets with hierarchical decoding at the host + host bridge
+ switch levels. For test setups I probably need to go to around 32 devices
so I can hit various configurations simultaneously.
No individual device has visibility of the full interleave setup - hence
the walk in the existing code through the various decoders to find the
final Device Physical address.

At the host level, the host provides a set of Physical Address windows with
a fixed interleave decoding across the different host bridges in the system
(CXL Fixed Memory Windows, CFMWs).
On a real system these have to be large enough to allow for any memory
devices that might be hotplugged and all possible configurations (so
with 2 host bridges you need at least 3 windows in the many-TB range,
and it gets much worse as the number of host bridges goes up). It'll be worse
than this when we have QoS groups, but the current QEMU code just puts all
the windows in group 0.  Hence my first thought of just putting memory
behind those doesn't scale (a similar approach to this was in the
earliest versions of this patch set - though the full access path
wasn't wired up).

The granularity can be in powers of 2 from 256 bytes to 16 kbytes.

Next, each host bridge has programmable address decoders which take the
incoming (often already interleaved) memory accesses and direct them to
the appropriate root ports.  The root ports can be connected to a switch
which has additional address decoders in the upstream port to decide
which downstream port to route to.  Note we currently only support 1 level
of switches, but it's easy to make this algorithm recursive to support
multiple switch levels (currently the kernel proposals only support 1 level).
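
As a very rough sketch of what the target selection at one of these decoders
(a CFMW, a host bridge HDM decoder, or a switch upstream port) does for the
power-of-2 case - the function and parameter names here are mine for
illustration, not the patch code:

#include "qemu/osdep.h"
#include "exec/hwaddr.h"

/*
 * Sketch only: pick which target (host bridge or downstream port) an
 * incoming host physical address routes to, for power-of-2 interleave.
 * 'granule' is the interleave granularity (256 bytes .. 16 kbytes) and
 * 'ways' is the number of targets programmed into the decoder.
 */
static unsigned decoder_pick_target(hwaddr hpa_offset, uint64_t granule,
                                    unsigned ways)
{
    return (hpa_offset / granule) % ways;
}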

Finally the End Point with the actual memory receives the interleaved request
and takes the full address and (for power of 2 decoding - we don't yet support
3, 6 and 12 way, which is more complex and there is no kernel support yet)
it drops a few address bits and adds an offset for the decoder used to
calculate its own device physical address.  Note a device will support
multiple interleave sets for different parts of its file once we add
multiple decoder support (on the todo list).
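
The bit manipulation at the end point might look something like the sketch
below (again power-of-2 only, names invented for illustration): the
interleave-selection bits are squeezed back out of the host physical address
offset and the decoder's DPA offset is added.

#include "qemu/osdep.h"
#include "exec/hwaddr.h"

/*
 * Sketch only: convert the offset into the committed HPA range into a
 * device physical address.  granule = 2^(8 + ig), ways = 2^iw, so we
 * drop the 'iw' selection bits that sit just above the granule and
 * then add the decoder's DPA base.
 */
static hwaddr hpa_offset_to_dpa(hwaddr hpa_offset, unsigned ig,
                                unsigned iw, hwaddr dpa_base)
{
    hwaddr low = hpa_offset & ((1ULL << (8 + ig)) - 1);
    hwaddr high = (hpa_offset >> (8 + ig + iw)) << (8 + ig);

    return dpa_base + high + low;
}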

So the current solution is straightforward (with the exception of that
proxying) because it follows the same decoding as used in real hardware
to route the memory accesses. As a result we get a read/write to a
device physical address and hence proxy that.  If any of the decoders
along the path are not configured then we error out at that stage.

To create the equivalent with IO subregions I think we'd have to do the
following (this might be mediated by some central entity that
doesn't currently exist, or done on demand from whichever CXL device
happens to have its decoder set up last); a rough sketch of step 3 follows the list:

1) Wait for a decoder commit (enable) on any component. Goto 2.
2) Walk the topology (up to host decoder, down to memory device)
If a complete interleaving path has been configured -
     i.e. we have committed decoders all the way to the memory
     device goto step 3, otherwise return to step 1 to wait for
     more decoders to be committed.
3) For the memory region being supplied by the memory device,
     add subregions to map the device physical address (address
     in the file) for each interleave stride to the appropriate
     host Physical Address.
4) Return to step 1 to wait for more decoders to commit.
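
For illustration, step 3 could look something like the sketch below (all
names are made up, and the loop is exactly where the region count explodes
for small granularities):

#include "qemu/osdep.h"
#include "exec/memory.h"

/*
 * Sketch only (invented names): alias each interleave stride of a
 * device's file-backed region into the fixed memory window at the host
 * physical address that decodes to it.  'way' is this device's position
 * in a 'ways'-way interleave set.  With 256-byte granularity and
 * terabyte devices this loop runs ~2^32 times per device, which is the
 * scaling problem described earlier.
 */
static void map_device_strides(MemoryRegion *cfmw_mr, MemoryRegion *dev_mr,
                               hwaddr hpa_base, hwaddr dev_size,
                               unsigned ways, unsigned way, uint64_t granule)
{
    for (hwaddr dpa = 0; dpa < dev_size; dpa += granule) {
        uint64_t stride = dpa / granule;
        hwaddr hpa = hpa_base + (stride * ways + way) * granule;
        MemoryRegion *alias = g_new0(MemoryRegion, 1);
        g_autofree char *name = g_strdup_printf("cxl-stride-%" PRIx64,
                                                (uint64_t)dpa);

        memory_region_init_alias(alias, NULL, name, dev_mr, dpa, granule);
        memory_region_add_subregion(cfmw_mr, hpa, alias);
    }
}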

So the summary is: we can do it with IO regions, but there are a lot of them,
and the setup is somewhat complex as we don't have a single point in
time where all the necessary information is available to compute
the right addresses.

Looking forward to your suggestions if I haven't caused more confusion!

Hi Peter,

Thanks for the write-up - I must confess it's a lot! :)

I only learned what CXL is today, and I'm not very experienced with
device modeling either, so please bear with me if I ask stupid questions...

IIUC so far CXL traps these memory accesses using CXLFixedWindow.mr.
That's a normal IO region, which looks very reasonable.

However I'm confused why patch "RFC: softmmu/memory: Add ops to
memory_region_ram_init_from_file" helped.

Per my knowledge, all the memory accesses upon this CFMW window are trapped
using this IO region already.  There can be multiple memory file objects
underneath, and when a read/write happens the target object will be decoded by
cxl_cfmws_find_device() as you referenced.

Yes.

However I see nowhere that these memory objects get mapped as sub-regions
into the parent (CXLFixedWindow.mr).  So I don't understand why they cannot
be trapped.

As you note, they aren't mapped into the parent mr, hence we are trapping.
The parent's mem_ops are responsible for decoding 'which device' +
'what address in device memory space'. Once we've got that info,
the question is how do I actually do the access?

Mapping as subregions seems unwise due to the huge number required.

To ask in another way: what will happen if you simply revert this RFC
patch?  What will go wrong?

The call to memory_region_dispatch_read()
https://gitlab.com/jic23/qemu/-/blob/cxl-v7-draft-2-for-test/hw/mem/cxl_type3.c#L556

would call memory_region_access_valid(), which calls
mr->ops->valid.accepts(); that is set to
unassigned_mem_accepts() and hence...
you get a MEMTX_DECODE_ERROR back and an exception in the
guest.

That wouldn't happen with a non-proxied access to the RAM, as
those paths never use the ops: memory_access_is_direct() is called
and a simple memcpy is used without any involvement of the ops.
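
To make the failure mode concrete, the proxied path boils down to something
like this sketch (illustrative only, not the patch code):

#include "qemu/osdep.h"
#include "exec/memory.h"

/*
 * Sketch only: dispatching directly on a RAM-backed MemoryRegion.
 * Plain RAM regions keep the default unassigned_mem ops, so the
 * validity check rejects the access and MEMTX_DECODE_ERROR comes back
 * instead of the data - whereas the normal (non-proxied) RAM path
 * never consults the ops at all and just memcpys.
 */
static MemTxResult proxy_read_via_dispatch(MemoryRegion *mr, hwaddr dpa,
                                           uint64_t *val)
{
    return memory_region_dispatch_read(mr, dpa, val, MO_64,
                                       MEMTXATTRS_UNSPECIFIED);
}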

Is there a better way to proxy those writes to the backing files?

I was fishing a bit in the dark here and saw the existing ops defined
for a different purpose for VFIO in

4a2e242bbb ("memory: Don't use memcpy for ram_device regions")

and those allowed the use of memory_region_dispatch_write() to work.

Hence the RFC marking on that patch :)

FWIW I had a similar issue implementing manual aliasing in one of my q800
patches where I found that dispatching a read to a non-IO memory region didn't
work with memory_region_dispatch_read(). The solution in my case was to switch
to using the address space API instead, which whilst requiring an absolute
address for the target address space, handles the dispatch correctly across
all different memory region types.

Have a look at
https://gitlab.com/mcayland/qemu/-/commit/318e12579c7570196187652da13542db86b8c722
to see how I did this in macio_alias_read().

IIRC from my experiments in this area, my conclusion was that
memory_region_dispatch_read() can only work correctly if mapping directly
between 2 IO memory regions, and for anything else you need to use the
address space API.
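
Roughly along these lines - a sketch only, with invented names: route the
access through an AddressSpace rooted at the region you want to reach, and
let the flatview code choose between the direct RAM path and ops dispatch.

#include "qemu/osdep.h"
#include "exec/memory.h"

/*
 * Sketch only (invented names): instead of memory_region_dispatch_read(),
 * wrap the target region in its own AddressSpace and use the address
 * space API, which handles RAM, IO and everything in between.
 */
static AddressSpace proxy_as;

static void proxy_as_setup(MemoryRegion *root)
{
    address_space_init(&proxy_as, root, "proxy-as");
}

static MemTxResult proxy_read(hwaddr addr, void *buf, hwaddr len)
{
    return address_space_read(&proxy_as, addr, MEMTXATTRS_UNSPECIFIED,
                              buf, len);
}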

Hi Mark,

I'd wondered about the address space API as an alternative approach.

From that reference it looks like you have the memory mapped into the system
address space and are providing an alias to that.  That's something I'd
ideally like to avoid doing, as there is no meaningful way to do it, so I'd
just be hiding the memory somewhere up high.  The memory should only be
accessible through the one route.

I think I could spin up a separate address space for this purpose (one per
CXL type 3 device, probably), but that seems like another nasty hack to make.
I'll try a quick prototype of this tomorrow.

Turned out to be trivial, so it's already done.  Will send out as v8 unless
anyone feeds back that there is a major disadvantage to just spinning up one
address space per CXL type 3 device.  That will mean dropping the RFC patch
as well, as it's no longer used :)

Thanks for the hint Mark.

Jonathan

Ah great! As you've already noticed my particular case was performing partial decoding on a memory region, but there are no issues if you need to dispatch to another existing address space such as PCI/IOMMU. Creating a separate address space per device shouldn't be an issue either, as that's effectively how the PCI bus master requests are handled.

The address spaces are visible in "info mtree" so if you haven't already, I would recommend generating a dynamic name for the address space based upon the device name/address to make it easier for development and debugging.
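
A sketch of what that could look like, just to illustrate the naming
suggestion (the struct, field and prefix names are invented here, not the
actual v8 code):

#include "qemu/osdep.h"
#include "exec/memory.h"

/* Hypothetical per-device state: one AddressSpace per type 3 device,
 * rooted at its host memory backend's MemoryRegion. */
typedef struct CXLType3Proxy {
    MemoryRegion *hostmem_mr;
    AddressSpace hostmem_as;
} CXLType3Proxy;

static void cxl_type3_proxy_init(CXLType3Proxy *p, MemoryRegion *mr)
{
    /* Derive the AS name from the region so it is identifiable in
     * "info mtree" (address_space_init() duplicates the string). */
    g_autofree char *as_name = g_strdup_printf("cxl-type3-as-%s",
                                               memory_region_name(mr));

    p->hostmem_mr = mr;
    address_space_init(&p->hostmem_as, mr, as_name);
}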


ATB,

Mark.


