Re: cut -b on huge files
From: Bob Proulx
Subject: Re: cut -b on huge files
Date: Wed, 8 Oct 2008 14:09:34 -0600
User-agent: Mutt/1.5.13 (2006-08-11)
Klein, Roger wrote:
> I am using cut in an awkward situation: I got huge files that for any
> reason show larger file sizes than they actually have.
Those files are probably sparse files. A sparse file can be created by
using lseek(2) to seek past the end of the data and then writing more
data. The result is a file with data at different locations and a gap
between them. The filesystem can take advantage of this by not
allocating disk blocks for the gap, so the file consumes fewer disk
blocks than if the gap had been written out as zeros.
For example you can use dd to create a sparse file:
dd bs=1 seek=1G if=/dev/null of=big
That will have an apparent size of 1G but will consume almost no
disk space.
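You can see both numbers with GNU stat; a quick sketch using the same
file name `big` as the dd example above:

```shell
# Create a 1 GiB sparse file: dd writes no data here, it only
# extends the output file to the seek offset.
dd bs=1 seek=1G if=/dev/null of=big 2>/dev/null

# %s is the apparent size in bytes; %b is the number of 512-byte
# blocks actually allocated on disk.
stat -c 'size=%s blocks=%b' big
```

On a filesystem that supports holes the block count stays near zero
even though the size reads 1073741824.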
> 'du' reports the correct sizes b.t.w.:
> # du -k boot_image.clone2fs
> 56740 boot_image.clone2fs
'du' reports the disk usage of the file. This value may be smaller
than the size of the file.
Try using the --apparent-size option.
du -k --apparent-size boot_image.clone2fs
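On a sparse file the two views of du diverge; a small sketch with a
hypothetical 1 MiB sparse file named `big`:

```shell
# Make a 1 MiB sparse file, then ask du for both numbers.
dd bs=1 seek=1M if=/dev/null of=big 2>/dev/null

du -k big                  # blocks actually allocated: near zero
du -k --apparent-size big  # size by byte count: 1024
```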
> Now I found a hint on the Web
> (http://www.programmersheaven.com/mb/linux/187697/245244/re-how-to-change-filesize-in-linux/?S=B20000)
> for how the change the incorrect filesize by using cut to take over
> only a given amount of bytes into a new file: cut -b 1-500 oldFile > newFile
Of course that will read every byte and write every byte and the
result will no longer be sparse, assuming that the input file was
sparse. I don't think truncating the file is really what you want to
be doing here. If you really want to flatten the file then simply
copying it would seem to be better.
cp --sparse=never file1 file2
or
cat file1 > file2
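Either way the copy holds the same bytes but allocates real blocks for
the former holes. A sketch with hypothetical file names:

```shell
# Start from a 1 MiB sparse file, then flatten it with cat.
dd bs=1 seek=1M if=/dev/null of=sparse1 2>/dev/null
cat sparse1 > flat1

cmp sparse1 flat1    # the contents are identical
du -k sparse1 flat1  # but flat1 now occupies the full 1024K on disk
```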
> I never tried it on short files, but when I use this on the above file I
> get a very different result than expected:
> # cut -b 1-58101760 boot_image.clone2fs > boot_image.clone2fs_correct
Won't you need the bytes at the end of the file that you are
removing? That command discards everything past byte 58101760, and I
expect that you will be needing that data at some point.
> # stat boot_image.clone2fs_correct
> File: `boot_image.clone2fs_correct'
> Size: 309987280 Blocks: 606048 IO Block: 4096 regular file
For what it is worth those numbers don't seem to be right to me
either. If the original stat shows 1077411840 bytes then that is the
correct size that I would hope to see in any copy.
> The number of blocks and the apparent size is all but correct now.
Try comparing the two files.
cmp boot_image.clone2fs boot_image.clone2fs_correct
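A toy illustration of what a truncated copy looks like to cmp, using
hypothetical file names:

```shell
# Make a 6-byte file and a copy cut off after 3 bytes.
printf 'abcdef' > orig
head -c 3 orig > short

# cmp exits non-zero and reports EOF on the shorter file.
cmp orig short || echo 'the copy is truncated'
```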
If they don't compare equal then I believe you have corrupted the file.
> To me this looks like a typical overflow problem. Could you please
> investigate this?
I think your problem is understanding the difference between the
file size and the disk space consumed to hold it.
du --apparent-size
ls -l
stat size
wc -c
...most normal commands...
Versus:
du
stat blocks
Try this experiment:
rm -f big big2
dd bs=1 seek=1M if=/dev/null of=big
cat big > big2
wc -c big big2
cmp big big2
ls -log big big2
du big big2
du --apparent-size big big2
Hope this helps,
Bob