qemu-devel

Re: [Qemu-devel] Multi GPU passthrough via VFIO


From: Maik Broemme
Subject: Re: [Qemu-devel] Multi GPU passthrough via VFIO
Date: Fri, 16 Jan 2015 13:21:15 +0100
User-agent: Mutt/1.5.21 (2010-09-15)

Hi Alex,

Maik Broemme <address@hidden> wrote:
> Hi Alex,
> 
> Maik Broemme <address@hidden> wrote:
> > Hi Alex,
> > 
> > Alex Williamson <address@hidden> wrote:
> > > On Fri, 2014-02-14 at 01:01 +0100, Maik Broemme wrote:
> > > > Hi Alex,
> > > > 
> > > > Maik Broemme <address@hidden> wrote:
> > > > > Hi Alex,
> > > > > 
> > > > > Alex Williamson <address@hidden> wrote:
> > > > > > On Fri, 2014-02-07 at 01:22 +0100, Maik Broemme wrote:
> > > > > > > > What is interesting is the diff between the 1st and 2nd boot, taken
> > > > > > > > by running lspci prior to booting. The only differences between the
> > > > > > > > 1st start and the 2nd start are:
> > > > > > > 
> > > > > > > > --- 001-lspci.290x.before.1st.log 2014-02-07 01:13:41.498827928 +0100
> > > > > > > > +++ 004-lspci.290x.before.2nd.log 2014-02-07 01:16:50.966611282 +0100
> > > > > > > > @@ -24,7 +24,7 @@
> > > > > > > >                   ClockPM- Surprise- LLActRep- BwNot-
> > > > > > > >           LnkCtl: ASPM L0s L1 Enabled; RCB 64 bytes Disabled- CommClk+
> > > > > > > >                   ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
> > > > > > > > -         LnkSta: Speed 5GT/s, Width x16, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
> > > > > > > > +         LnkSta: Speed 2.5GT/s, Width x1, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
> > > > > > > >           DevCap2: Completion Timeout: Not Supported, TimeoutDis-, LTR-, OBFF Not Supported
> > > > > > > >           DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-, OBFF Disabled
> > > > > > > >           LnkCtl2: Target Link Speed: 8GT/s, EnterCompliance- SpeedDis-
> > > > > > > > @@ -33,13 +33,13 @@
> > > > > > > >           LnkSta2: Current De-emphasis Level: -3.5dB, EqualizationComplete-, EqualizationPhase1-
> > > > > > > >                    EqualizationPhase2-, EqualizationPhase3-, LinkEqualizationRequest-
> > > > > > > >   Capabilities: [a0] MSI: Enable- Count=1/1 Maskable- 64bit+
> > > > > > > > -         Address: 0000000000000000  Data: 0000
> > > > > > > > +         Address: 00000000fee00000  Data: 0000
> > > > > > > >   Capabilities: [100 v1] Vendor Specific Information: ID=0001 Rev=1 Len=010 <?>
> > > > > > > >   Capabilities: [150 v2] Advanced Error Reporting
> > > > > > > >           UESta:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
> > > > > > > >           UEMsk:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
> > > > > > > >           UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
> > > > > > > > -         CESta:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-
> > > > > > > > +         CESta:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
> > > > > > > >           CEMsk:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
> > > > > > > >           AERCap: First Error Pointer: 00, GenCap+ CGenEn- ChkCap+ ChkEn-
> > > > > > > >   Capabilities: [270 v1] #19
> > > > > > > 
> > > > > > > > After that, if I do the suspend-to-ram / resume trick, I again get
> > > > > > > > the lspci output from before the 1st boot.
> > > > > > 
> > > > > > > The Link Status change after X is stopped seems the most interesting to
> > > > > > > me.  The MSI change is probably explained by the MSI save/restore of the
> > > > > > > device, but should be harmless since MSI is disabled.  I'm a bit
> > > > > > > surprised the Correctable Error Status in the AER capability didn't get
> > > > > > > cleared.  I would have thought that a bus reset would have caused the
> > > > > > > link to retrain back to the original speed/width as well.  Let's check
> > > > > > > that we're actually getting a bus reset, try this in addition to the
> > > > > > > previous qemu patch.  This just enables debug logging for the bus reset
> > > > > > > function.  Thanks,
> > > > > > 
> > > > > 
> > > > > > Below are the outputs from 2 boots, each with VGA, loading fglrx and
> > > > > > starting X. (The 2nd time, X gets killed and an oops happens.)
> > > > > 
> > > > > - 1st boot:
> > > > > 
> > > > > vfio: vfio_pci_hot_reset(0000:01:00.1) multi
> > > > > vfio: 0000:01:00.1: hot reset dependent devices:
> > > > > vfio:         0000:01:00.0 group 1
> > > > > vfio:         0000:01:00.1 group 1
> > > > > vfio: 0000:01:00.1 hot reset: Success
> > > > > vfio: vfio_pci_hot_reset(0000:01:00.1) one
> > > > > vfio: 0000:01:00.1: hot reset dependent devices:
> > > > > vfio:         0000:01:00.0 group 1
> > > > > vfio: vfio: found another in-use device 0000:01:00.0
> > > > > vfio: vfio_pci_hot_reset(0000:01:00.0) one
> > > > > vfio: 0000:01:00.0: hot reset dependent devices:
> > > > > vfio:         0000:01:00.0 group 1
> > > > > vfio:         0000:01:00.1 group 1
> > > > > vfio: vfio: found another in-use device 0000:01:00.1
> > > > > 
> > > > > - 2nd boot:
> > > > > 
> > > > > vfio: vfio_pci_hot_reset(0000:01:00.1) multi
> > > > > vfio: 0000:01:00.1: hot reset dependent devices:
> > > > > vfio:         0000:01:00.0 group 1
> > > > > vfio:         0000:01:00.1 group 1
> > > > > vfio: 0000:01:00.1 hot reset: Success
> > > > > vfio: vfio_pci_hot_reset(0000:01:00.1) one
> > > > > vfio: 0000:01:00.1: hot reset dependent devices:
> > > > > vfio:         0000:01:00.0 group 1
> > > > > vfio: vfio: found another in-use device 0000:01:00.0
> > > > > vfio: vfio_pci_hot_reset(0000:01:00.0) one
> > > > > vfio: 0000:01:00.0: hot reset dependent devices:
> > > > > vfio:         0000:01:00.0 group 1
> > > > > vfio:         0000:01:00.1 group 1
> > > > > vfio: vfio: found another in-use device 0000:01:00.1
> > > > > 
> > > > 
> > > > Have you already had a chance to look into it, or is there anything
> > > > else I can help with?
> > > 
> > > According to the log we're doing the bus reset on both the first and 2nd
> > > boot (it's expected that only the "multi" call gets to success).  I'm
> > > surprised then that the link doesn't retrain back to the original width.
> > > You could try forcing the link to retrain.  Look at the root port
> > > upstream from the GPU, lspci -t is handy for this.  Run lspci on the
> > > root port to get the PCI express capability offset, then use setpci to
> > > set the link retrain bit.  For example:
> > > 
> > > # lspci -tv | grep NVIDIA
> > >            +-07.0-[03]--+-00.0  NVIDIA Corporation GK106GL [Quadro K4000]
> > >            |            \-00.1  NVIDIA Corporation GK106 HDMI Audio Controller
> > > 
> > > (upstream root port is 00:07.0)
> > > 
> > > # lspci -v -s 7.0 | grep Capabilities
> > >   Capabilities: [40] Subsystem: Intel Corporation 5520/5500/X58 I/O Hub PCI Express Root Port 7
> > >   Capabilities: [60] MSI: Enable+ Count=1/2 Maskable+ 64bit-
> > >   Capabilities: [90] Express Root Port (Slot+), MSI 00
> > >   Capabilities: [e0] Power Management version 3
> > >   Capabilities: [100] Advanced Error Reporting
> > >   Capabilities: [150] Access Control Services
> > >   Capabilities: [160] Vendor Specific Information: ID=0002 Rev=0 Len=00c <?>
> > > 
> > > (PCI express capability is offset 0x90, Link Control is 0x10 off that)
> > > 
> > > # setpci -s 7.0 a0.w
> > > 0040
> > > 
> > > (retrain is bit 5, 0x20, OR'd with read value is 0x60)
> > > 
> > > # setpci -s 7.0 a0.w=60
> > > 
> > > # lspci... did it work?
> > > 
> > > Try doing that after the first boot to see if you can get back to a x16
> > > link.  If that works, we may need to add something in the kernel to do
> > > it automatically around a bus reset.  Thanks,
> > > 
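The register arithmetic in the steps above can be sketched as follows. This is only an illustration of the offsets and bit values quoted in the mail (capability offset + 0x10 for Link Control, retrain as bit 5); the function names are made up for this note, and the script only builds the setpci command string, it does not run it.

```python
# Offsets and bits as described in the mail above.
LINK_CONTROL_OFFSET = 0x10   # Link Control, relative to the PCI Express capability
RETRAIN_LINK = 1 << 5        # bit 5 of Link Control (0x20)


def link_control_address(pcie_cap_offset: int) -> int:
    """Config-space address of Link Control for a given capability offset."""
    return pcie_cap_offset + LINK_CONTROL_OFFSET


def retrain_value(current: int) -> int:
    """Current Link Control value with the retrain bit OR'd in."""
    return (current | RETRAIN_LINK) & 0xFFFF


def setpci_retrain_command(device: str, pcie_cap_offset: int, current: int) -> str:
    """Build (but do not execute) the setpci write that requests a retrain."""
    addr = link_control_address(pcie_cap_offset)
    return f"setpci -s {device} {addr:x}.w={retrain_value(current):x}"


if __name__ == "__main__":
    # Root port from the example: capability at 0x90, Link Control reads 0040.
    print(setpci_retrain_command("7.0", 0x90, 0x0040))
    # A root port with the capability at 0x58 and the same Link Control value.
    print(setpci_retrain_command("00:02.0", 0x58, 0x0040))
```

For the values in the mail this reproduces `setpci -s 7.0 a0.w=60`; with the capability at 0x58 the register address becomes 0x68 instead.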
> > 
> > Well, this doesn't help either, and it looks like the VFIO reset is
> > already setting it back to the original width. For example:
> > 
> >            +-02.0-[01]--+-00.0  Advanced Micro Devices, Inc. [AMD/ATI] Hawaii XT [Radeon HD 8970]
> >            |            \-00.1  Advanced Micro Devices, Inc. [AMD/ATI] Device aac8
> > 
> > Before 1st run:
> > 
> > address@hidden:~# lspci -vvv -s 00:02.0 | grep LnkSta:
> >             LnkSta: Speed 5GT/s, Width x16, TrErr- Train- SlotClk+ DLActive+ BWMgmt- ABWMgmt-
> > address@hidden:~# lspci -vvv -s 01:00.0 | grep LnkSta:
> >             LnkSta: Speed 5GT/s, Width x16, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
> > 
> > After power down of VM:
> > 
> > address@hidden:~# lspci -vvv -s 00:02.0 | grep LnkSta:
> >             LnkSta: Speed 2.5GT/s, Width x1, TrErr- Train- SlotClk+ DLActive+ BWMgmt- ABWMgmt+
> > address@hidden:~# lspci -vvv -s 01:00.0 | grep LnkSta:
> >             LnkSta: Speed 2.5GT/s, Width x1, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
> > 
> > After 2nd start once VFIO did reset:
> > 
> > address@hidden:~# lspci -vvv -s 00:02.0 | grep LnkSta:
> >             LnkSta: Speed 5GT/s, Width x16, TrErr- Train- SlotClk+ DLActive+ BWMgmt- ABWMgmt+
> > address@hidden:~# lspci -vvv -s 01:00.0 | grep LnkSta:
> >             LnkSta: Speed 5GT/s, Width x16, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
> > 
> > The only difference on the bus I see here is ABWMgmt- vs ABWMgmt+, but it
> > shouldn't be relevant, as it is the same if I unload the fglrx module
> > before shutting down the VM, which is the only case where I can run
> > multiple VM reboot cycles.
> > 
> > So the only difference on bus is the following:
> > 
> > -60: 10 08 00 00 02 cd 31 00 40 00 02 b1 80 25 14 00
> > +60: 10 08 00 00 02 cd 31 00 40 00 11 b0 80 25 14 00
> > 
> > 6a (before 02, after 11)
> > 6b (before b1, after b0)
> > 
> > But I cannot write these values using setpci. My PCI Express capability
> > is at offset 0x58, so Link Control is at 0x58 + 0x10, and it is already
> > set back to 40:
> > address@hidden:~# lspci -vvv -s 00:02.0 | grep Capa
> >     Capabilities: [50] Power Management version 3
> >     Capabilities: [58] Express (v2) Root Port (Slot+), MSI 00
> >     Capabilities: [a0] MSI: Enable- Count=1/1 Maskable- 64bit-
> >     Capabilities: [b0] Subsystem: Gigabyte Technology Co., Ltd Device 5000
> >     Capabilities: [b8] HyperTransport: MSI Mapping Enable+ Fixed+
> >     Capabilities: [100 v1] Vendor Specific Information: ID=0001 Rev=1 Len=010 <?>
> >     Capabilities: [190 v1] Access Control Services
> > 
> 
> Wouldn't a possible solution be to do a D0 -> D3 -> D0 transition for
> devices which don't support FLR? The setpci way doesn't help me at all.
> 
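For reference, the D0 -> D3 -> D0 idea can be sketched in terms of the Power Management capability. This assumes the standard layout (PMCSR at capability offset + 4, power state in bits 1:0, No_Soft_Reset in bit 3). The GPU's PM capability offset is not shown in this thread, so the 0x50 used below is a hypothetical placeholder, as are the function names. A device only performs an internal reset on the D3hot -> D0 transition when No_Soft_Reset is clear, and a delay is needed between the two writes.

```python
PMCSR_OFFSET = 0x04          # PMCSR, relative to the Power Management capability
POWER_STATE_MASK = 0x3       # bits 1:0 of PMCSR
D0, D3HOT = 0x0, 0x3
NO_SOFT_RESET = 1 << 3       # bit 3 of PMCSR


def with_power_state(pmcsr: int, state: int) -> int:
    """Return the PMCSR value with only the power state field replaced."""
    return (pmcsr & ~POWER_STATE_MASK & 0xFFFF) | state


def resets_on_d3_exit(pmcsr: int) -> bool:
    """True if the device is expected to reset itself when leaving D3hot."""
    return not (pmcsr & NO_SOFT_RESET)


def d3_cycle_commands(device: str, pm_cap_offset: int, pmcsr: int) -> list:
    """Build (but do not execute) the two setpci writes for D0 -> D3hot -> D0."""
    addr = pm_cap_offset + PMCSR_OFFSET
    return [
        f"setpci -s {device} {addr:x}.w={with_power_state(pmcsr, D3HOT):04x}",
        f"setpci -s {device} {addr:x}.w={with_power_state(pmcsr, D0):04x}",
    ]


if __name__ == "__main__":
    # Hypothetical values: PM capability at 0x50, PMCSR currently 0x0000.
    for command in d3_cycle_commands("01:00.0", 0x50, 0x0000):
        print(command)
    print("internal reset expected:", resets_on_d3_exit(0x0000))
```

Whether such a cycle actually returns this GPU to a clean state is exactly the open question in the mail; the sketch only shows which bits would be involved.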

I'd like to revive this thread: with the latest slot/bus reset code some
things have changed, but it still doesn't work in all cases.

#1 QEMU+OVMF (UEFI):

I've flashed my R9 290X with a UEFI-compatible BIOS, and QEMU+OVMF
(without CSM) boots Windows 8.1 fine. The Catalyst 14.12 drivers install
without issues and work fine. However, an attempt to reboot the VM
results in the typical Windows 8.1 "Something went wrong :(" screen. The
suspend/resume trick still works between VM reboots.

#2 QEMU (BIOS):

In this scenario I use secondary GPU passthrough (no VGA as primary
adapter) with Windows 7. The Catalyst 14.12 drivers install without
issues and work fine. I was also surprised that rebooting the VM works:
Windows 7 restarts fine, I see the login screen and there are no
performance issues. But it doesn't always work. Sometimes it works for
3-4 reboots and the next one fails with just a black screen (although the
Windows VM is still pingable and ACPI shutdown still works); sometimes it
works for only one reboot. In all cases the suspend/resume trick still
works.

So I would like to narrow down the problem. Alex, is there anything I can
try, such as collecting QEMU debug logs?

The QEMU version used is 2.2.0; the kernel is 3.18.2.

> > > Alex
> > > 
> > > > > > diff --git a/hw/misc/vfio.c b/hw/misc/vfio.c
> > > > > > index 8db182f..7fec259 100644
> > > > > > --- a/hw/misc/vfio.c
> > > > > > +++ b/hw/misc/vfio.c
> > > > > > @@ -2927,6 +2927,10 @@ static bool 
> > > > > > vfio_pci_host_match(PCIHostDeviceAddress *hos
> > > > > >              host1->slot == host2->slot && host1->function == 
> > > > > > host2->function);
> > > > > >  }
> > > > > >  
> > > > > > +#undef DPRINTF
> > > > > > +#define DPRINTF(fmt, ...) \
> > > > > > +    do { fprintf(stderr, "vfio: " fmt, ## __VA_ARGS__); } while (0)
> > > > > > +
> > > > > >  static int vfio_pci_hot_reset(VFIODevice *vdev, bool single)
> > > > > >  {
> > > > > >      VFIOGroup *group;
> > > > > > @@ -3104,6 +3108,15 @@ out_single:
> > > > > >      return ret;
> > > > > >  }
> > > > > >  
> > > > > > +#undef DPRINTF
> > > > > > +#ifdef DEBUG_VFIO
> > > > > > +#define DPRINTF(fmt, ...) \
> > > > > > +    do { fprintf(stderr, "vfio: " fmt, ## __VA_ARGS__); } while (0)
> > > > > > +#else
> > > > > > +#define DPRINTF(fmt, ...) \
> > > > > > +    do { } while (0)
> > > > > > +#endif
> > > > > > +
> > > > > >  /*
> > > > > > >   * We want to differentiate hot reset of mulitple in-use devices vs hot reset
> > > > > > >   * of a single in-use device.  VFIO_DEVICE_RESET will already handle the case
> > > > > > 
> > > > > > 
> > > > > 
> > > > > --Maik
> > > > > 
> > > > 
> > > > --Maik
> > > 
> > > 
> > > 
> > 
> > --Maik
> > 
> 
> --Maik
> 

--Maik


