Re: [Qemu-block] [RFC 1/1] ide: bug #1777315: io_buffer_size and sg.size


From: Amol Surati
Subject: Re: [Qemu-block] [RFC 1/1] ide: bug #1777315: io_buffer_size and sg.size can represent partial sector sizes
Date: Wed, 20 Jun 2018 06:57:42 +0530
User-agent: Mutt/1.10.0 (2018-05-17)

On Wed, Jun 20, 2018 at 06:23:19AM +0530, Amol Surati wrote:
> On Tue, Jun 19, 2018 at 05:43:52PM -0400, John Snow wrote:
> > 
> > 
> > On 06/19/2018 05:26 PM, Amol Surati wrote:
> > > On Tue, Jun 19, 2018 at 08:04:03PM +0530, Amol Surati wrote:
> > >> On Tue, Jun 19, 2018 at 09:45:15AM -0400, John Snow wrote:
> > >>>
> > >>>
> > >>> On 06/19/2018 04:53 AM, Kevin Wolf wrote:
> > >>>> Am 19.06.2018 um 06:01 hat Amol Surati geschrieben:
> > >>>>> On Mon, Jun 18, 2018 at 08:14:10PM -0400, John Snow wrote:
> > >>>>>>
> > >>>>>>
> > >>>>>> On 06/18/2018 02:02 PM, Amol Surati wrote:
> > >>>>>>> On Mon, Jun 18, 2018 at 12:05:15AM +0530, Amol Surati wrote:
> > >>>>>>>> This patch fixes the assumption that io_buffer_size is always a
> > >>>>>>>> perfect multiple of the sector size. The assumption is the cause
> > >>>>>>>> of the firing of 'assert(n * 512 == s->sg.size);'.
> > >>>>>>>>
> > >>>>>>>> Signed-off-by: Amol Surati <address@hidden>
> > >>>>>>>> ---
> > >>>>>>>
> > >>>>>>> The repository https://github.com/asurati/1777315 contains a
> > >>>>>>> module for QEMU's 8086:7010 ATA controller, which exercises the
> > >>>>>>> code path described in [RFC 0/1] of this series.
> > >>>>>>>
> > >>>>>>
> > >>>>>> Thanks, this made it easier to see what was happening. I was able
> > >>>>>> to write an ide-test test case using this source as a guide, and
> > >>>>>> reproduce the error.
> > >>>>>>
> > >>>>>> static void test_bmdma_partial_sector_short_prdt(void)
> > >>>>>> {
> > >>>>>>     QPCIDevice *dev;
> > >>>>>>     QPCIBar bmdma_bar, ide_bar;
> > >>>>>>     uint8_t status;
> > >>>>>>
> > >>>>>>     /* Read 2 sectors but only give 1 sector in PRDT */
> > >>>>>>     PrdtEntry prdt[] = {
> > >>>>>>         {
> > >>>>>>             .addr = 0,
> > >>>>>>             .size = cpu_to_le32(0x200),
> > >>>>>>         },
> > >>>>>>         {
> > >>>>>>             .addr = 512,
> > >>>>>>             .size = cpu_to_le32(0x44 | PRDT_EOT),
> > >>>>>>         }
> > >>>>>>     };
> > >>>>>>
> > >>>>>>     dev = get_pci_device(&bmdma_bar, &ide_bar);
> > >>>>>>     status = send_dma_request(CMD_READ_DMA, 0, 2,
> > >>>>>>                               prdt, ARRAY_SIZE(prdt), NULL);
> > >>>>>>     g_assert_cmphex(status, ==, 0);
> > >>>>>>     assert_bit_clear(qpci_io_readb(dev, ide_bar, reg_status), DF | ERR);
> > >>>>>>     free_pci_device(dev);
> > >>>>>> }
> > >>>>>>
> > >>>>>>> Loading the module reproduces the bug. Tested on the latest master
> > >>>>>>> branch.
> > >>>>>>>
> > >>>>>>> Steps:
> > >>>>>>> - Install a Linux distribution as a guest, ensuring that the boot
> > >>>>>>>   disk resides on non-IDE controllers (such as virtio)
> > >>>>>>> - Attach another disk as a master device on the primary
> > >>>>>>>   IDE controller (i.e. attach at -hda.)
> > >>>>>>> - Blacklist ata_piix, pata_acpi and ata_generic modules, and reboot.
> > >>>>>>> - Copy the source files into the guest and build the module.
> > >>>>>>> - Load the module. The QEMU process should die with the message:
> > >>>>>>>   qemu-system-x86_64: hw/ide/core.c:871: ide_dma_cb:
> > >>>>>>>   Assertion `n * 512 == s->sg.size' failed.
> > >>>>>>>
> > >>>>>>>
> > >>>>>>> -Amol
> > >>>>>>>
> > >>>>>>
> > >>>>>> I'm less sure of the fix -- certainly the assert is wrong, but just
> > >>>>>> incrementing 'n' is wrong too -- we didn't copy (n+1) sectors, we
> > >>>>>> copied (n) and a few extra bytes.
> > >>>>>
> > >>>>> That is true.
> > >>>>>
> > >>>>> There are (at least) two fields that represent the total size of a
> > >>>>> DMA transfer -
> > >>>>> (1) The size, as requested through the NSECTOR field.
> > >>>>> (2) The size, as calculated through the length fields of the PRD
> > >>>>>     entries.
> > >>>>>
> > >>>>> It makes sense to consider the most restrictive of the sizes, as
> > >>>>> the factor which determines both the end of a successful DMA
> > >>>>> transfer and the condition to assert.
> > >>>>>
> > >>>>>>
> > >>>>>> The sector-based math here would need to be adjusted to be able
> > >>>>>> to cope with partial sector reads... or we ought to avoid doing
> > >>>>>> any partial sector transfers.
> > >>>>>>
> > >>>>>>
> > >>>>>> I'm not sure which is more correct tonight, it depends:
> > >>>>>>
> > >>>>>> - If it's OK to transfer partial sectors before reporting overflow,
> > >>>>>> adjusting the command loop to work with partial sectors is OK.
> > >>>>>>
> > >>>>>> - If it's NOT OK to do partial sector transfer, the sglist
> > >>>>>> preparation phase needs to produce a truncated SGList that's some
> > >>>>>> multiple of 512 bytes that leaves the excess bytes in a second
> > >>>>>> sglist that we don't throw away and can use as a basis for building
> > >>>>>> the next sglist. (Or the DMA helpers need to take a max_bytes
> > >>>>>> parameter and return an sglist representing unused buffer space if
> > >>>>>> the command underflowed.)
> > >>>>>
> > >>>>> Support for partial sector transfers is built into the DMA
> > >>>>> interface's PRD mechanism itself, because an entry is allowed to
> > >>>>> transfer in units of an even number of bytes.
> > >>>>>
> > >>>>> I think the controller's IO process runs in two parts (probably
> > >>>>> looping over them for a single transfer):
> > >>>>>
> > >>>>> (1) The controller's disk interface transfers between its internal
> > >>>>>     buffer and the disk storage. The transfers are likely to be in
> > >>>>>     multiples of a sector.
> > >>>>> (2) The controller's DMA interface transfers between its internal
> > >>>>>     buffer and the system memory. The transfers can be sub-sector in
> > >>>>>     size (and preserve the areas of the internal buffer not subject
> > >>>>>     to a write).
> > >>>>
> > >>>> The spec isn't clear about this (or at least I can't find anything
> > >>>> where the exact behaviour is specified), but I agree that that's my
> > >>>> mental model as well. So I would make IDE send a byte-granularity
> > >>>> request with the final partial sector to the block layer, so that
> > >>>> the data is actually transferred up to that point.
> > >>>>
> > >>>> In practice it probably doesn't matter much because a too short PRDT
> > >>>> means that the request doesn't complete successfully (the condition is
> > >>>> indicated by clear Interrupt, Active and Error flags in the BMDMA
> > >>>> controller) and I suppose the guest won't actually look at the data
> > >>>> then.
> > >>>>
> > >>>> Providing the data anyway (consistent with our assumption how real
> > >>>> hardware works) is erring on the safe side because it doesn't hurt a
> > >>>> reasonable guest that did not expect the data to be transferred in this
> > >>>> condition.
> > >>>>
> > >>>> Kevin
> > >>>>
> > >>>
> > >>> Partial transfers it is, since I didn't see anything in AHCI or BMDMA
> > >>> that suggested we shouldn't and it seems like the actual easiest fix
> > >>> because it avoids having to modify the sglists or the sglist preparation
> > >>> functions.
> > >>>
> > >>> Amol, would you like to author a fix, or would you prefer that I do it?
> > >>
> > >> Yes, I would like to author it. I assume that the simplification of the
> > >> calls to prepare_buf is better kept as a change that is separate from
> > >> this fix.
> > >>
> > >>>
> > >>> If you do, please copy my ide-test case above and check it in as patch
> > >>> 2/2 to your series as a regression test (tests/ide-test.c). It may need
> > >>> some further editing after the command submission to pass; I only made
> > >>> sure it crashes QEMU.
> > >>
> > >> Will do.
> > >>
> > >> Thanks,
> > >> -amol
> > >>
> > >>>
> > >>> Thanks,
> > >>> --js
> > > 
> > > After ide_dma_cb checks the return value from prepare_buf, it comments
> > > thus:
> > > 
> > > "The PRDs were too short. Reset the Active bit, but don't raise an
> > > interrupt."
> > > 
> > > So, QEMU already has a policy in place for short PRDs. I apologize
> > > that I did not notice it earlier.
> > > 
> > 
> > Any sane human would be forgiven for not being able to read this function.
> 
> :D
> 
> > 
> > > It is enforced by the condition:
> > > "(s->bus->dma->ops->prepare_buf(s->bus->dma, s->io_buffer_size) < 512)".
> > > 
> > > The definition of short PRDs it uses is: those which result in a
> > > "< 512" bytes transfer.
> > > 
> > > Is there a reason to not define them, as those which result in a
> > > "< n * 512" bytes transfer, to begin with (or now)?
> > > (n = s->nsector).
> > 
> > ide_dma_cb is written to possibly execute more than once per command. It
> > tries to gather up what it can from prepare_buf, but as long as it gets
> > ENOUGH sglist to transfer "at least one sector" it goes ahead with the
> > read, under the idea that maybe when we return to ide_dma_cb, the next
> > call to prepare_buf will give it more space to transfer the rest of the
> > data to.
> > 
> > This is why it checks for < 512 instead of < (s->nsectors * 512).
> > "As long as I can transfer a sector's worth, I can make progress on the
> > command."
> > 
> > As we now know, this mechanism is faulty because we don't always build
> > sglists that are multiples of 512. A way to fix this would be to use
> > s->sg.size as an offset when building the next transfer instead of
> > (n * 512).
> 
> Understood. For legacy or other reasons, QEMU supports piece-wise r/w
> cmds, where each piece is assumed to be a multiple of a sector.
> 
> That requires maintaining state (such as bm->cur_addr) across multiple
> calls to ide_dma_cb (and to prepare_buf), both inside QEMU (already in
> place for those reasons) and inside the guest. It also requires
> providing interrupts to signal state transitions. I am assuming such
> transitions exist, since the design assumes them too, in some form.
> 
> > 
> > However, I am pretty sure that for both pci bmdma and ahci, we are
> > guaranteed to have the entire sglist from the guest before reaching
> > ide_dma_cb, so we might be able to adjust this loop to understand that
> > prepare_buf will never yield further results.
> > 
> > (...I think. I'm actually not confident, since I never wrote the
> > original DMA loop. It looks right, though.)
> > 
> > In this case, you could just change the overflow condition to check for
> > the entire transfer size and declare the overflow sooner instead of later.
> 
> To consider "< n * 512" as a fix, it is not necessary to look for the
> guarantee of receiving the entire sglist in one go on the first call to
> ide_dma_cb.
> 
> The legacy paths are expected to behave consistently, whether the
> guest provides one PRD per state transition (or per call to
> prepare_buf), or 2 PRDs per state transition, or all PRDs per state
> transition. (I know there might not be such a guest, but QEMU does
> assume its existence in its design.)
> 
> A guest does not expect differences in outward behaviour regardless
> of the way it provides the PRDs.
> 
> Consider the example of a command that transfers 5 consecutive sectors.
> Each line below represents a call to prepare_buf and its return value,
> also indicating the number of PRD entries the call consumed and
> their sizes.
> 
> 
> E1:
> ---
> 512
> 512
> 512
> 128 <--- No interrupt, Active bit reset, 3 wasted transfers.
> 512
> 384
> 
> 
> Next,
> E2:
> ---
> 512+512=1024
> 512+128=640 <--- no 'short PRD', crash, but the same command as above.
> 512+384=896
> 
> 
> Next,
> E3:
> ---
> 512+512+512+128+512+384 <--- success, but the same command as above.
> 
> 
> The behaviour is indeed guest-visibly inconsistent, but manageable.
> 
> By fixing this using s->sg.size as the offset for the next transfer,
> we bring consistency to all 3 cases E1, E2 and E3, but we also begin
> managing granularity at the 2-byte level and introduce extensive code
> changes.
> 
> 
> By fixing through "< n * 512",
> - all instances of E2 (of which there has been only one reported) are
>   included in the set E1, while
> - E3 continues to succeed as before (since "< n * 512" is false for E3),
>   and
> - E1 continues to fail as before (since anything "< 512" is
>   necessarily "< n * 512", assuming n >= 1 and no integer overflows.)
> There were already two types of instances before (E1 and E3); after this
> fix, they still remain two in number.
> 
> For me, this line of reasoning is sufficient to go with the "< n * 512"
> overflow fix. Let me know if there are objections - I am okay writing
> either/both of the fixes. Apologies for this turn-about - I might have
> spoken too soon in my earlier emails.
> 

But "< n * 512" fails for:

512 <---- fails right here in the first call itself
512
512
512
512

where the original implementation would succeed. That's unacceptable.

To keep allowing subsequent iterations to gather more PRDs, an
additional condition is needed.

Let retval be the return value of prepare_buf.
The failure condition needed is:

if ((retval % 512) != 0 && retval < n * 512), then drop-goto-eot

I hope that that covers all.


