[Qemu-devel] Overlapping buffers in I/O requests


From: Stefan Hajnoczi
Subject: [Qemu-devel] Overlapping buffers in I/O requests
Date: Thu, 30 Sep 2010 13:00:47 +0100

There is a block I/O corner case that I don't fully understand.  I'd
appreciate thoughts on the expected behavior.

At one point during a Windows Server 2008 install to an IDE disk, the
guest sends a read request with overlapping sglist buffers.  It looks
like this:
[0] addr=A len=4k
[1] addr=B len=4k
[2] addr=C len=4k
[3] addr=B len=4k

Buffers 1 and 3 are the same guest memory; their addresses match.

If I understand correctly, IDE will perform each operation in turn and
DMA the result back to the buffers in order.  Therefore, the disk
contents at +12k should be written to address B.

Unfortunately QEMU does not guarantee this today.  Sometimes the disk
contents at +4k (buffer 1) are read and other times the disk contents
at +12k (buffer 3) are read.

QEMU can be taken out of the picture and replaced by a simple test
program that calls preadv(2) directly with the same overlapping buffer
pattern.  There doesn't appear to be a guarantee that the disk
contents at +12k (buffer 3) will be read instead of +4k (buffer 1).
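
For illustration, a minimal sketch of such a test program might look
like this (not the exact program used; the buffer sizes match the
4 x 4k pattern above, and O_DIRECT can be dropped to go through the
page cache instead):

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/uio.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    char *a, *b, *c, *check;
    struct iovec iov[4];
    int fd;

    if (argc != 2) {
        fprintf(stderr, "usage: %s <file>\n", argv[0]);
        return 1;
    }

    /* O_DIRECT needs aligned buffers; 4k alignment is enough here. */
    if (posix_memalign((void **)&a, 4096, 4096) ||
        posix_memalign((void **)&b, 4096, 4096) ||
        posix_memalign((void **)&c, 4096, 4096) ||
        posix_memalign((void **)&check, 4096, 4096)) {
        perror("posix_memalign");
        return 1;
    }

    /* Drop O_DIRECT to go through the page cache instead. */
    fd = open(argv[1], O_RDONLY | O_DIRECT);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    /* Same layout as the guest's sglist: buffer b appears twice. */
    iov[0] = (struct iovec){ .iov_base = a, .iov_len = 4096 };
    iov[1] = (struct iovec){ .iov_base = b, .iov_len = 4096 };
    iov[2] = (struct iovec){ .iov_base = c, .iov_len = 4096 };
    iov[3] = (struct iovec){ .iov_base = b, .iov_len = 4096 };

    if (preadv(fd, iov, 4, 0) != 16384) {
        perror("preadv");
        return 1;
    }

    /* Re-read +4k and +12k individually to see which one b matches. */
    if (pread(fd, check, 4096, 4096) == 4096 && !memcmp(b, check, 4096))
        printf("b holds the contents at +4k (buffer 1 wins)\n");
    if (pread(fd, check, 4096, 12288) == 4096 && !memcmp(b, check, 4096))
        printf("b holds the contents at +12k (buffer 3 wins)\n");

    close(fd);
    return 0;
}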

When the page cache is active preadv(2) produces consistent results.

When the page cache is bypassed (O_DIRECT), preadv(2) produces
consistent results against a physical disk:
               a-22904 [001]  3042.186790: block_bio_queue: 8,0 R 2048 + 32 [a]
               a-22904 [001]  3042.186807: block_getrq: 8,0 R 2048 + 32 [a]
               a-22904 [001]  3042.186812: block_plug: [a]
               a-22904 [001]  3042.186816: block_rq_insert: 8,0 R 0 () 2048 + 32 [a]
               a-22904 [001]  3042.186822: block_unplug_io: [a] 1
               a-22904 [001]  3042.186829: block_rq_issue: 8,0 R 0 () 2048 + 32 [a]
 pam-foreground--22912 [001]  3042.187066: block_rq_complete: 8,0 R () 2048 + 32 [0]

Notice that a single 32-sector read is issued on /dev/sda (8,0).  This
makes sense under the assumption that the disk honors DMA buffer
ordering within a request.

However, when the page cache is bypassed, preadv(2) produces
inconsistent results against a file on ext3 -> LVM -> dm-crypt ->
/dev/sda:
               a-22834 [001]  3038.425802: block_bio_queue: 254,3 R 32616672 + 8 [a]
               a-22834 [001]  3038.425812: block_remap: 254,0 R 58544736 + 8 <- (254,3) 32616672
               a-22834 [001]  3038.425813: block_bio_queue: 254,0 R 58544736 + 8 [a]
      kcryptd_io-379   [001]  3038.425832: block_remap: 8,0 R 59044807 + 8 <- (8,2) 58546792
      kcryptd_io-379   [001]  3038.425833: block_bio_queue: 8,0 R 59044807 + 8 [kcryptd_io]
      kcryptd_io-379   [001]  3038.425841: block_getrq: 8,0 R 59044807 + 8 [kcryptd_io]
      kcryptd_io-379   [001]  3038.425845: block_plug: [kcryptd_io]
      kcryptd_io-379   [001]  3038.425848: block_rq_insert: 8,0 R 0 () 59044807 + 8 [kcryptd_io]
      kcryptd_io-379   [001]  3038.425859: block_rq_issue: 8,0 R 0 () 59044807 + 8 [kcryptd_io]
               a-22834 [001]  3038.425894: block_bio_queue: 254,3 R 32616792 + 16 [a]
               a-22834 [001]  3038.425898: block_remap: 254,0 R 58544856 + 16 <- (254,3) 32616792
               a-22834 [001]  3038.425899: block_bio_queue: 254,0 R 58544856 + 16 [a]
      kcryptd_io-379   [001]  3038.425908: block_remap: 8,0 R 59044927 + 16 <- (8,2) 58546912
      kcryptd_io-379   [001]  3038.425909: block_bio_queue: 8,0 R 59044927 + 16 [kcryptd_io]
      kcryptd_io-379   [001]  3038.425911: block_getrq: 8,0 R 59044927 + 16 [kcryptd_io]
      kcryptd_io-379   [001]  3038.425913: block_plug: [kcryptd_io]
      kcryptd_io-379   [001]  3038.425914: block_rq_insert: 8,0 R 0 () 59044927 + 16 [kcryptd_io]
               a-22834 [001]  3038.425920: block_bio_queue: 254,3 R 32616992 + 8 [a]
               a-22834 [001]  3038.425922: block_remap: 254,0 R 58545056 + 8 <- (254,3) 32616992
               a-22834 [001]  3038.425923: block_bio_queue: 254,0 R 58545056 + 8 [a]
               a-22834 [001]  3038.425929: block_unplug_io: [a] 0
               a-22834 [001]  3038.425930: block_unplug_io: [a] 0
               a-22834 [001]  3038.425931: block_unplug_io: [a] 2
               a-22834 [001]  3038.425934: block_rq_issue: 8,0 R 0 () 59044927 + 16 [a]
      kcryptd_io-379   [001]  3038.425948: block_remap: 8,0 R 59045127 + 8 <- (8,2) 58547112
      kcryptd_io-379   [001]  3038.425949: block_bio_queue: 8,0 R 59045127 + 8 [kcryptd_io]
      kcryptd_io-379   [001]  3038.425951: block_getrq: 8,0 R 59045127 + 8 [kcryptd_io]
      kcryptd_io-379   [001]  3038.425953: block_plug: [kcryptd_io]
      kcryptd_io-379   [001]  3038.425954: block_rq_insert: 8,0 R 0 () 59045127 + 8 [kcryptd_io]
          <idle>-0     [001]  3038.427414: block_unplug_timer: [swapper] 3
       kblockd/1-21    [001]  3038.427437: block_unplug_io: [kblockd/1] 3
       kblockd/1-21    [001]  3038.427440: block_rq_issue: 8,0 R 0 () 59045127 + 8 [kblockd/1]
          <idle>-0     [000]  3038.436786: block_rq_complete: 8,0 R () 59044807 + 8 [0]
         kcryptd-380   [001]  3038.436960: block_bio_complete: 254,0 R 58544736 + 8 [0]
         kcryptd-380   [001]  3038.436963: block_bio_complete: 254,3 R 32616672 + 8 [0]
          <idle>-0     [001]  3038.437070: block_rq_complete: 8,0 R () 59044927 + 16 [0]
         kcryptd-380   [000]  3038.437343: block_bio_complete: 254,0 R 58544856 + 16 [611733513]
         kcryptd-380   [000]  3038.437346: block_bio_complete: 254,3 R 32616792 + 16 [-815025730]
          <idle>-0     [000]  3038.437428: block_rq_complete: 8,0 R () 59045127 + 8 [0]
         kcryptd-380   [000]  3038.437569: block_bio_complete: 254,0 R 58545056 + 8 [-2107963545]
         kcryptd-380   [000]  3038.437571: block_bio_complete: 254,3 R 32616992 + 8 [176593183]

The 32 sectors are broken up into 8-, 16-, and 8-sector requests.  I
believe the filesystem is doing this before LVM is reached.  This
makes sense since a file may not be contiguous on disk and several
extents need to be read.
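
One way to check this theory (a sketch only, assuming the filesystem
supports the FIEMAP ioctl; filefrag from e2fsprogs reports the same
information) is to dump the file's extent map and see whether it has
more than one extent:

#include <fcntl.h>
#include <linux/fiemap.h>
#include <linux/fs.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    struct fiemap *fm;
    unsigned int i;
    int fd;

    if (argc != 2) {
        fprintf(stderr, "usage: %s <file>\n", argv[0]);
        return 1;
    }
    fd = open(argv[1], O_RDONLY);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    /* Room for up to 32 extents, plenty for a small test file. */
    fm = calloc(1, sizeof(*fm) + 32 * sizeof(struct fiemap_extent));
    if (!fm) {
        perror("calloc");
        return 1;
    }
    fm->fm_start = 0;
    fm->fm_length = ~0ULL;          /* map the whole file */
    fm->fm_extent_count = 32;

    if (ioctl(fd, FS_IOC_FIEMAP, fm) < 0) {
        perror("FS_IOC_FIEMAP");
        return 1;
    }

    for (i = 0; i < fm->fm_mapped_extents; i++) {
        printf("extent %u: logical %llu physical %llu length %llu\n",
               i,
               (unsigned long long)fm->fm_extents[i].fe_logical,
               (unsigned long long)fm->fm_extents[i].fe_physical,
               (unsigned long long)fm->fm_extents[i].fe_length);
    }

    close(fd);
    return 0;
}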

These three independent requests can complete in any order, and the
completion order determines which contents end up at address B when
preadv(2) returns.

So now my question:

Is QEMU risking data corruption when buffers overlap?  If IDE
guarantees that the buffers are filled in order, then we are doing it
wrong (at least when O_DIRECT is used).

Perhaps there is no ordering guarantee in IDE, Windows is doing
something crazy, and QEMU is within its rights to use preadv(2) like
this.

Stefan


