Re: [PATCH v7 00/46] CXL 2.0 emulation Support

From: Jonathan Cameron
Subject: Re: [PATCH v7 00/46] CXL 2.0 emulation Support
Date: Wed, 16 Mar 2022 18:26:18 +0000
On Wed, 16 Mar 2022 17:58:46 +0000
Jonathan Cameron <Jonathan.Cameron@Huawei.com> wrote:
> On Wed, 16 Mar 2022 17:16:55 +0000
> Mark Cave-Ayland <mark.cave-ayland@ilande.co.uk> wrote:
>
> > > On 16/03/2022 16:50, Jonathan Cameron wrote:
> >
> > > On Thu, 10 Mar 2022 16:02:22 +0800
> > > Peter Xu <peterx@redhat.com> wrote:
> > >
> > >> On Wed, Mar 09, 2022 at 11:28:27AM +0000, Jonathan Cameron wrote:
> > >>> Hi Peter,
> > >>
> > >> Hi, Jonathan,
> > >>
> > >>>
> > >>>>
> > >>>> https://lore.kernel.org/qemu-devel/20220306174137.5707-35-Jonathan.Cameron@huawei.com/
> > >>>>
> > >>>> Having mr->ops set but with memory_access_is_direct() returning true
> > >>>> sounds weird to me.
> > >>>>
> > >>>> Sorry to have no understanding of the whole picture, but.. could you
> > >>>> share more on what's the interleaving requirement on the proxying, and
> > >>>> why it can't be done with adding some IO memory regions as sub-regions
> > >>>> upon the file one?
> > >>>
> > >>> The proxying requirement is simply a means to read/write to a computed
> > >>> address within a memory region. There may well be a better way to do
> > >>> that.
> > >>>
> > >>> If I understand your suggestion correctly, you would need a very high
> > >>> number of IO memory regions to be created dynamically when particular
> > >>> sets of registers across multiple devices in the topology are all
> > >>> programmed.
> > >>>
> > >>> The interleave can be 256 bytes across up to 16x, many terabyte,
> > >>> devices. So assuming a simple set of 16 1TB devices I think you'd need
> > >>> about 4x10^9 IO regions. Even for a minimal useful test case of the
> > >>> largest interleave set of 16x 256MB devices (256MB is the minimum size
> > >>> the specification allows per decoded region at the device) and 16-way
> > >>> interleave, we'd need 10^6 IO regions. Any idea if that approach would
> > >>> scale sensibly to this number of regions?
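
(Checking that arithmetic: 1TB / 256B = 2^40 / 2^8 = 2^32, i.e. roughly
4x10^9 strides per device; for the minimal case, 256MB / 256B = 2^28 / 2^8
= 2^20, i.e. roughly 10^6 per device.)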
> > >>>
> > >>> There are also complexities to getting all the information in one
> > >>> place to work out which IO memory region maps where in PA space. The
> > >>> current solution is to do that mapping the same way the hardware does,
> > >>> which is hierarchical, so we walk the path to the device, picking
> > >>> directions based on each interleave decoder that we meet.
> > >>> Obviously this is a bit slow but I only really care about correctness
> > >>> at the moment. I can think of various approaches to speeding it up but
> > >>> I'm not sure if we will ever care about performance.
> > >>>
> > >>> https://gitlab.com/jic23/qemu/-/blob/cxl-v7-draft-2-for-test/hw/cxl/cxl-host.c#L131
> > >>> has the logic for that, and as you can see it's fairly simple because
> > >>> we are always going down the topology following the decoders.
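
To expand on that walk for anyone following along: each decoder level boils
down to picking a target port from the interleave selector bits. A minimal
sketch (illustrative names only, not the actual cxl-host.c code, and
power-of-2 ways/granularity only):

    /* Select which target port an address routes to at one decoder level:
     * the interleave-way selector bits sit directly above the
     * granularity bits. */
    static unsigned int cxl_decoder_target(uint64_t hpa_offset,
                                           unsigned int iw_log2,
                                           unsigned int ig_log2)
    {
        return (hpa_offset >> ig_log2) & ((1u << iw_log2) - 1);
    }

The walk just applies this at each level with that level's ways and
granularity and descends into the selected port.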
> > >>>
> > >>> Below I have mapped out an algorithm I think would work for doing it
> > >>> with IO memory regions as subregions.
> > >>>
> > >>> We could fake the whole thing by limiting ourselves to small host
> > >>> memory windows which are always directly backed, but then I wouldn't
> > >>> achieve the main aim of this, which is to provide a test base for the
> > >>> OS code. To do that I need real interleave so I can seed the files
> > >>> with test patterns and verify the accesses hit the correct locations.
> > >>> Emulating what the hardware is actually doing on a device by device
> > >>> basis is the easiest way I have come up with to do that.
> > >>>
> > >>> Let me try to provide some more background so you hopefully don't have
> > >>> to have read the specs to follow what is going on!
> > >>> There is an example for a directly connected (no switches) topology in
> > >>> the docs:
> > >>>
> > >>> https://gitlab.com/jic23/qemu/-/blob/cxl-v7-draft-2-for-test/docs/system/devices/cxl.rst
> > >>>
> > >>> The overall picture is we have a large number of CXL Type 3 memory
> > >>> devices, which at runtime (by the OS at boot/on hotplug) are configured
> > >>> into various interleaving sets with hierarchical decoding at the host +
> > >>> host bridge + switch levels. For test setups I probably need to go to
> > >>> around 32 devices so I can hit various configurations simultaneously.
> > >>> No individual device has visibility of the full interleave setup - hence
> > >>> the walk in the existing code through the various decoders to find the
> > >>> final Device Physical Address.
> > >>>
> > >>> At the host level, the host provides a set of Physical Address windows
> > >>> with a fixed interleave decoding across the different host bridges in
> > >>> the system (CXL Fixed Memory Windows, CFMWs).
> > >>> On a real system these have to be large enough to allow for any memory
> > >>> devices that might be hotplugged and all possible configurations (so
> > >>> with 2 host bridges you need at least 3 windows in the many-TB range,
> > >>> much worse as the number of host bridges goes up). It'll be worse than
> > >>> this when we have QoS groups, but the current QEMU code just puts all
> > >>> the windows in group 0. Hence my first thought of just putting memory
> > >>> behind those doesn't scale (a similar approach to this was in the
> > >>> earliest versions of this patch set - though the full access path
> > >>> wasn't wired up).
> > >>>
> > >>> The granularity can be any power of 2 from 256 bytes to 16 kbytes.
> > >>>
> > >>> Next, each host bridge has programmable address decoders which take
> > >>> the incoming (often already interleaved) memory accesses and direct
> > >>> them to the appropriate root ports. The root ports can be connected to
> > >>> a switch which has additional address decoders in the upstream port to
> > >>> decide which downstream port to route to. Note we currently only
> > >>> support 1 level of switches, but it's easy to make this algorithm
> > >>> recursive to support multiple switch levels (currently the kernel
> > >>> proposals only support 1 level).
> > >>>
> > >>> Finally the End Point with the actual memory receives the interleaved
> > >>> request, takes the full address and (for power-of-2 decoding - we
> > >>> don't yet support 3, 6 and 12 way, which is more complex and has no
> > >>> kernel support yet) drops a few address bits and adds an offset for
> > >>> the decoder used to calculate its own device physical address. Note a
> > >>> device will support multiple interleave sets for different parts of
> > >>> its file once we add multiple decoder support (on the todo list).
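
A minimal sketch of that endpoint calculation (power-of-2 only; names are
illustrative rather than the series' actual helpers, and any per-decoder
DPA offset is omitted):

    /* HPA within a committed decoder's range -> device physical address:
     * keep the offset within the stride, drop the interleave-way selector
     * bits, and concatenate the stride index back on top. */
    static uint64_t cxl_hpa_to_dpa(uint64_t hpa, uint64_t decoder_base,
                                   unsigned int iw_log2,
                                   unsigned int ig_log2)
    {
        uint64_t offset = hpa - decoder_base;
        uint64_t in_stride = offset & ((1ULL << ig_log2) - 1);
        uint64_t stride_idx = offset >> (ig_log2 + iw_log2);

        return (stride_idx << ig_log2) | in_stride;
    }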
> > >>>
> > >>> So the current solution is straightforward (with the exception of
> > >>> that proxying) because it follows the same decoding as used in real
> > >>> hardware to route the memory accesses. As a result we get a read/write
> > >>> to a device physical address and hence proxy that. If any of the
> > >>> decoders along the path are not configured then we error out at that
> > >>> stage.
> > >>>
> > >>> To create the equivalent as IO subregions I think we'd have to do the
> > >>> following (this might be mediated by some central entity that doesn't
> > >>> currently exist, or done on demand from whichever CXL device happens
> > >>> to have its decoder set up last):
> > >>>
> > >>> 1) Wait for a decoder commit (enable) on any component. Goto 2.
> > >>> 2) Walk the topology (up to host decoder, down to memory device)
> > >>> If a complete interleaving path has been configured -
> > >>> i.e. we have committed decoders all the way to the memory
> > >>> device goto step 3, otherwise return to step 1 to wait for
> > >>> more decoders to be committed.
> > >>> 3) For the memory region being supplied by the memory device,
> > >>> add subregions to map the device physical address (address
> > >>> in the file) for each interleave stride to the appropriate
> > >>> host Physical Address.
> > >>> 4) Return to step 1 to wait for more decoders to commit.
> > >>>
> > >>> So, in summary, we can do it with IO regions, but there are a lot of
> > >>> them and the setup is somewhat complex, as we don't have one single
> > >>> point in time where we know all the necessary information is available
> > >>> to compute the right addresses.
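
For concreteness, step 3 would come down to a loop of roughly this shape
per device in the interleave set (a sketch only - num_strides, ways, way,
granularity, device_mr, fw and ct3d are illustrative; this loop is exactly
where the region count explodes):

    for (uint64_t s = 0; s < num_strides; s++) {
        MemoryRegion *alias = g_new0(MemoryRegion, 1);
        uint64_t dpa = s * granularity;                  /* address in file */
        hwaddr hpa_off = (s * ways + way) * granularity; /* window offset */

        memory_region_init_alias(alias, OBJECT(ct3d), "cxl-stride",
                                 device_mr, dpa, granularity);
        memory_region_add_subregion(&fw->mr, hpa_off, alias);
    }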
> > >>>
> > >>> Looking forward to your suggestions if I haven't caused more confusion!
> > >>>
> > >
> > > Hi Peter,
> > >
> > >>
> > >> Thanks for the write up - I must confess it's a lot! :)
> > >>
> > >> I only learned what CXL is today, and I'm not very experienced with
> > >> device modeling either, so please bear with my stupid questions..
> > >>
> > >> IIUC so far CXL traps these memory accesses using CXLFixedWindow.mr.
> > >> That's a normal IO region, which looks very reasonable.
> > >>
> > >> However I'm confused why patch "RFC: softmmu/memory: Add ops to
> > >> memory_region_ram_init_from_file" helped.
> > >>
> > >> Per my knowledge, all the memory accesses upon this CFMW window are
> > >> already trapped using this IO region. There can be multiple memory file
> > >> objects underneath, and when a read/write happens the object will be
> > >> decoded by cxl_cfmws_find_device() as you referenced.
> > >
> > > Yes.
> > >
> > >>
> > >> However I see nowhere that these memory objects got mapped as
> > >> sub-regions into the parent (CXLFixedWindow.mr). Then I don't
> > >> understand why they cannot be trapped.
> > >
> > > As you note they aren't mapped into the parent mr, hence we are
> > > trapping. The parent mem_ops are responsible for decoding the 'which
> > > device' + 'what address in device memory space'. Once we've got that
> > > info, the question is how do I actually do the access?
> > >
> > > Mapping as subregions seems unwise due to the huge number required.
> > >
> > >>
> > >> To ask in another way: what will happen if you simply revert this RFC
> > >> patch? What will go wrong?
> > >
> > > The call to memory_region_dispatch_read()
> > > https://gitlab.com/jic23/qemu/-/blob/cxl-v7-draft-2-for-test/hw/mem/cxl_type3.c#L556
> > >
> > > would call memory_region_access_valid(), which calls
> > > mr->ops->valid.accepts(). That is set to unassigned_mem_accepts()
> > > and hence you get a MEMTX_DECODE_ERROR back and an exception in
> > > the guest.
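
i.e. the failing pattern is roughly (sketch; the real call site is at the
link above):

    uint64_t val;
    MemTxResult res = memory_region_dispatch_read(mr, dpa, &val,
                                                  size_memop(size),
                                                  MEMTXATTRS_UNSPECIFIED);
    /* For a plain RAM region with no usable ops this returns
     * MEMTX_DECODE_ERROR instead of performing the access. */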
> > >
> > > That wouldn't happen with a non-proxied access to the RAM, as
> > > those paths never use the ops: memory_access_is_direct() is called
> > > and memcpy is simply used without any involvement of the ops.
> > >
> > > Is there a better way to proxy those writes to the backing files?
> > >
> > > I was fishing a bit in the dark here and saw the existing ops defined
> > > for a different purpose for VFIO in
> > >
> > > 4a2e242bbb ("memory: Don't use memcpy for ram_device regions")
> > >
> > > and those allowed memory_region_dispatch_write() to work.
> > >
> > > Hence the RFC marking on that patch :)
> >
> > FWIW I had a similar issue implementing manual aliasing in one of my q800
> > patches, where I found that dispatching a read to a non-IO memory region
> > didn't work with memory_region_dispatch_read(). The solution in my case
> > was to switch to using the address space API instead, which, whilst
> > requiring an absolute address for the target address space, handles the
> > dispatch correctly across all different memory region types.
> >
> > Have a look at
> > https://gitlab.com/mcayland/qemu/-/commit/318e12579c7570196187652da13542db86b8c722
> > to see how I did this in macio_alias_read().
> >
> > IIRC from my experiments in this area, my conclusion was that
> > memory_region_dispatch_read() can only work correctly if mapping directly
> > between 2 IO memory regions, and for anything else you need to use the
> > address space API.
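
In other words, something along these lines (sketch; target_as and the
surrounding variables are illustrative):

    MemTxResult res = address_space_read(&target_as, absolute_addr,
                                         MEMTXATTRS_UNSPECIFIED,
                                         &val, size);

address_space_read()/address_space_write() resolve the destination region
themselves, so the access works whether it lands on RAM, IO or an alias.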
>
> Hi Mark,
>
> I'd wondered about the address space API as an alternative approach.
>
> From that reference it looks like you have the memory mapped into the
> system address space and are providing an alias to that. That's something
> I'd ideally like to avoid doing, as there is no meaningful way to do it,
> so I'd just be hiding the memory somewhere up high. The memory should
> only be accessible through the one route.
>
> I think I could spin up a separate address space for this purpose (one per
> CXL type 3 device probably) but that seems like another nasty hack to
> make. I'll try a quick prototype of this tomorrow.
Turned out to be trivial, so that's already done. Will send out as v8 unless
anyone feeds back that there is a major disadvantage to just spinning up one
address space per CXL type 3 device. That will mean dropping the RFC patch
as well, as it is no longer used :)
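
For anyone following along, the prototype amounts to roughly this (field
and variable names are illustrative, not necessarily what v8 will use):

    /* Once, at realize time: an address space rooted at the device's own
     * DPA memory region. */
    address_space_init(&ct3d->hostmem_as, ct3d->hostmem_mr, "cxl-type3-dpa");

    /* At access time, in place of memory_region_dispatch_read(): */
    MemTxResult res = address_space_read(&ct3d->hostmem_as, dpa,
                                         MEMTXATTRS_UNSPECIFIED,
                                         &val, size);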
Thanks for the hint Mark.
Jonathan
>
> What do people think is the least horrible way to do this?
>
> Thanks for the suggestion and I'm glad I'm not the only one trying to get this
> sort of thing to work ;)
>
> Jonathan
>
> >
> >
> > ATB,
> >
> > Mark.
> >
>