Re: [Qemu-devel] virtio-scsi and error handling
From: Stefan Hajnoczi
Subject: Re: [Qemu-devel] virtio-scsi and error handling
Date: Wed, 12 Jun 2013 09:56:20 +0200
User-agent: Mutt/1.5.21 (2010-09-15)

On Tue, Jun 11, 2013 at 01:41:38PM +0200, Hannes Reinecke wrote:
> I'm currently playing around with improving SCSI EH, optimizing
> command aborts and the like.
>
> And, supposing it to be a nice testbed, tried to make things work
> with virtio_scsi.
>
> However, looking at the code there I've found virtscsi_tmf() just
> uses 'wait_for_completion', with no timeout specified. So in effect
> any abort might stall forever.
>
> Wouldn't it be more sensible to use 'wait_for_completion_timeout'
> here, to allow the error escalation to continue?
> This would especially be useful when running with multipathing,
> as the underlying device might stall, and aio_cancel() doesn't work
> reliably, if at all.

Hi,

I agree that we need a timeout. bdrv_aio_cancel() is not guaranteed to
complete in bounded time.
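
For illustration, here is a minimal userspace sketch of the bounded wait
being discussed. It is not the virtio_scsi driver code: the demo_* names
and DEMO_TMF_* values are hypothetical stand-ins, and
pthread_cond_timedwait() plays the role that wait_for_completion_timeout()
would play in the kernel. The only point is that a wait which expires
reports failure, so error escalation can move on to the next level.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>
#include <time.h>

/* Hypothetical userspace stand-in for a kernel completion object. */
struct demo_completion {
    pthread_mutex_t lock;
    pthread_cond_t  cond;
    bool            done;
};

/* Hypothetical result codes, loosely modelled on SCSI EH handler returns. */
enum demo_tmf_result { DEMO_TMF_SUCCESS, DEMO_TMF_FAILED };

static void demo_completion_init(struct demo_completion *c)
{
    pthread_mutex_init(&c->lock, NULL);
    pthread_cond_init(&c->cond, NULL);
    c->done = false;
}

/* Called by whoever finishes the request (the device side in this sketch). */
static void demo_complete(struct demo_completion *c)
{
    pthread_mutex_lock(&c->lock);
    c->done = true;
    pthread_cond_signal(&c->cond);
    pthread_mutex_unlock(&c->lock);
}

/* Wait until completed or until timeout_ms elapses.
 * Returns true if the completion fired, false on timeout. */
static bool demo_wait_timeout(struct demo_completion *c, long timeout_ms)
{
    struct timespec deadline;
    int rc = 0;
    bool done;

    assert(timeout_ms >= 0);

    /* Build an absolute deadline; condition variables default to CLOCK_REALTIME. */
    clock_gettime(CLOCK_REALTIME, &deadline);
    deadline.tv_sec  += timeout_ms / 1000;
    deadline.tv_nsec += (timeout_ms % 1000) * 1000000L;
    if (deadline.tv_nsec >= 1000000000L) {
        deadline.tv_sec  += 1;
        deadline.tv_nsec -= 1000000000L;
    }

    pthread_mutex_lock(&c->lock);
    /* Loop to tolerate spurious wakeups; stop once the deadline has passed. */
    while (!c->done && rc != ETIMEDOUT)
        rc = pthread_cond_timedwait(&c->cond, &c->lock, &deadline);
    done = c->done;
    pthread_mutex_unlock(&c->lock);

    return done;
}

/* A task management request that gives up after timeout_ms instead of
 * blocking forever, so the caller can escalate. */
static enum demo_tmf_result demo_tmf(struct demo_completion *c, long timeout_ms)
{
    return demo_wait_timeout(c, timeout_ms) ? DEMO_TMF_SUCCESS
                                            : DEMO_TMF_FAILED;
}
```

Note that a timeout only unblocks the waiter. The request may still be in
flight on the host, so anything it references has to stay valid until it
really completes or the device is reset.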
> Also I've found that there is no host reset. Currently the virtio
> semantics seem to require reliable communication, ie for every
> command sent there _has_ to be a response.
>
> Long and painful experience with RAID HBAs has shown that this model
> works okay for the lower-level escalations, but you absolutely need
> a host reset to restore communication.
> In the case of virtio I would think that a virtio-level reset for
> host_reset would be a sensible idea.

One thing to watch out for is that a virtio-scsi reset will likely hang
too because it resets all pending requests.

Paolo Bonzini has done the lion's share of virtio-scsi work over the
past year (or two?). He might have some more thoughts.

Stefan