On Tue, Sep 19, 2017 at 12:09:06PM +0200, Nicolas Ecarnot wrote:
Hello,
First post here, so maybe I should introduce myself:
- I've been a sysadmin for decades and currently manage 4 oVirt clusters,
made up of tens of hypervisors, all CentOS 7.2+ based.
- I'm very happy with this solution, which we chose especially because it is
based on qemu-kvm (open source, reliable, documented).
On one VM, we experienced the following:
- oVirt/vdsm is detecting an issue on the image
- following the hints at https://access.redhat.com/solutions/1173623, I
managed to detect one error and fix it
- the VM is now running perfectly
On two other VMs, we experienced a similar situation, except the check stage
is showing something like 14000+ errors, and the relevant logs are:
Repairing refcount block 14 is outside image
ERROR could not resize image: Invalid argument
ERROR cluster 425984 refcount=0 reference=1
ERROR cluster 425985 refcount=0 reference=1
[... repeating the previous line 7000+ times...]
ERROR cluster 457166 refcount=0 reference=1
Rebuilding refcount structure
ERROR writing refblock: No space left on device
qemu-img: Check failed: No space left on device
Please run strace qemu-img info /the/relevant/logical/volume/path. It
will print all the syscalls that qemu-img makes. That way we'll be able
to verify that the ENOSPC error is coming from a pwritev syscall.
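If the trace is long, a small filter can pick out the failing calls. The
following is only a sketch: it assumes the usual strace line layout of
"name(args) = -1 ERRNO (message)", optionally preceded by a PID, and the
function name find_failures is made up for illustration.

```python
import re

# Assumed strace layout: optional PID, then "name(args) = -1 ERRNO (message)".
FAILED_CALL = re.compile(
    r"^(?:\d+\s+)?(?P<name>\w+)\((?P<args>.*)\)\s+=\s+-1\s+(?P<errno>E[A-Z0-9]+)\b"
)


def find_failures(lines, errno="ENOSPC"):
    """Return (syscall_name, raw_args) for every call that failed with errno."""
    hits = []
    for line in lines:
        match = FAILED_CALL.match(line.strip())
        if match and match.group("errno") == errno:
            hits.append((match.group("name"), match.group("args")))
    return hits
```

With the trace saved to a file, passing the open file object to
find_failures() lists each call that returned ENOSPC, so it is easy to see
whether pwritev is among them.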
You surely know that oVirt/RHEV is storing its qcow2 images in dedicated
logical volumes.
pvs/vgs/lvs are all showing there is plenty of space available, so I
understand that I don't understand what "No space left on device" means.
After you have the strace data you can look at the file offset from the
failing pwritev syscall and check that it's really within the LV.
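The bounds check itself is simple arithmetic. Here is a rough sketch,
assuming strace prints the pwritev offset as the last, decimal argument and
shows every iov_len in full (it may abbreviate long vectors, in which case
the total below would be too small). The LV size in bytes can be read with
"blockdev --getsize64" on the device; the helper names are hypothetical.

```python
import re


def parse_pwritev(args):
    """Return (offset, total_length) from the raw argument text of a pwritev call.

    Assumes the offset is the last comma-separated argument, in decimal.
    """
    offset = int(args.rsplit(",", 1)[1])
    total = sum(int(n) for n in re.findall(r"iov_len=(\d+)", args))
    return offset, total


def write_fits(offset, length, device_size):
    """True if the byte range [offset, offset + length) lies inside the device."""
    return 0 <= offset and offset + length <= device_size
```

If write_fits() is False for the failing call, qemu-img tried to write past
the end of the LV, which would explain the error even though the volume group
has free space.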
I think there is no fancy thin provisioning going on at the LVM level
with oVirt, but if there is then perhaps a write within the LV could
still result in an ENOSPC error. It would be worth confirming that
these are classic "thick" LVs.
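One way to confirm this is to look at the segment type that LVM reports. The
sketch below assumes the text comes from
"lvs --noheadings -o lv_name,segtype" and treats the segment types "thin" and
"thin-pool" as thin provisioning and everything else (for example "linear")
as thick. The function name is made up for illustration.

```python
def classify_lvs(report):
    """Map each LV name to "thin" or "thick".

    `report` is assumed to be the output of
    `lvs --noheadings -o lv_name,segtype`: one "name segtype" pair per line.
    An LV with several segments appears on several lines; it is reported as
    thin if any of its segments is thin.
    """
    kinds = {}
    for line in report.splitlines():
        fields = line.split()
        if len(fields) != 2:
            continue  # skip blank or unexpected lines
        name, segtype = fields
        is_thin = segtype in ("thin", "thin-pool")
        if is_thin or kinds.get(name) == "thin":
            kinds[name] = "thin"
        else:
            kinds[name] = "thick"
    return kinds
```

Any volume that comes back as "thin" would be worth a closer look, since its
pool can run out of space independently of the volume group.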