bug-coreutils

Suggestion for enhancement


From: summer
Subject: Suggestion for enhancement
Date: Fri, 5 Aug 2005 12:05:15 +0800

I have a need that split does not currently satisfy, but which split
could. At present, I'm doing it in Perl (and that's probably adequate
for what I need to do).

Contemplate a document such as one returned by this command:
wget -O /tmp/houses.html \
        "http://www.realestate.com.au/cgi-bin/rsearch?a=qfp&cat=House&p=200&s=wa&o=p&t=res&id=6285}";

That will produce a list of homes available for purchase in a region
that has become famous for fine wines.

The document has a heap of junk at the top and bottom, and some
identifiable material between the descriptions of individual homes.

What I suggest is that split be enhanced to provide a means of splitting
based on content. At present, it can do so to a limited extent (on end
of line).

I suggest something like this:
split --regex "<some regular expression>" <input> <prefix>
For example
split --regex="/begin [^>]*PropertyRow.html/i" \
        /tmp/houses.html /tmp/split-houses

(I think that regex doesn't actually work now; the site has changed its
format a little.)

One could consider what to do with the text used to split: it might be
appropriate to allow the user some choice there. For my purposes,
dropping is fine but in other cases it might be necessary to preserve
it in one or both files.
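
The choices for the matched text can be sketched in Python. This is only
an illustration of the suggested behaviour, not an existing split
feature: the function name split_on_regex, its keep parameter and the
sample markup are all invented for the example.

```python
import re


def split_on_regex(text, pattern, keep="drop", flags=0):
    """Split text at every match of pattern (hypothetical helper).

    keep="drop"   discards the matched text,
    keep="before" leaves it at the end of the preceding piece,
    keep="after"  puts it at the start of the following piece.
    """
    if keep not in ("drop", "before", "after"):
        raise ValueError("keep must be 'drop', 'before' or 'after'")
    pieces = []
    start = 0
    for match in re.finditer(pattern, text, flags):
        if match.start() == match.end():
            continue  # an empty match gives nothing to split on
        if keep == "before":
            pieces.append(text[start:match.end()])
            start = match.end()
        elif keep == "after":
            pieces.append(text[start:match.start()])
            start = match.start()
        else:
            pieces.append(text[start:match.start()])
            start = match.end()
    pieces.append(text[start:])  # whatever follows the last match
    return pieces


# Made-up sample; the real page markup is not reproduced here.
sample = ("header<!-- begin PropertyRow.html -->house 1"
          "<!-- BEGIN PropertyRow.html -->house 2")
pieces = split_on_regex(sample, r"<!-- begin [^>]*PropertyRow\.html -->",
                        keep="drop", flags=re.IGNORECASE)
```

With keep="drop" the separators vanish from the output; the other two
modes keep every input character in exactly one piece, so joining the
pieces gives back the original text.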

Consider too, the possibility of allowing users control over the output
file's name by something like this:
split --regex="/begin [^>]*PropertyRow.html/i" \
        /tmp/houses.html --output="/tmp/split-houses-%3d.html"

where % would introduce variable text much as in C:
%s for a string
%4s for a four-character string. I'm not keen on leading spaces to fill
        the four characters, but that's an implementation detail.
%d for a decimal number
%3d for three decimal digits (leading zeros).
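
A small Python sketch of how such an output template might be expanded,
assuming one numeric field per template. The helper names output_name
and write_pieces are invented for the example. Note that in C itself
(and in Python's % operator) %3d pads with spaces and %03d pads with
zeros, so the sketch uses %03d to get leading zeros.

```python
import os
import tempfile


def output_name(template, index):
    """Expand a C-style template such as 'split-houses-%03d.html'."""
    return template % (index,)


def write_pieces(pieces, template, first=0):
    """Write each piece to its own file and return the names used."""
    names = []
    for index, piece in enumerate(pieces, start=first):
        name = output_name(template, index)
        with open(name, "w", encoding="utf-8") as handle:
            handle.write(piece)
        names.append(name)
    return names


# Write two made-up pieces into a scratch directory.
with tempfile.TemporaryDirectory() as tmpdir:
    template = os.path.join(tmpdir, "split-houses-%03d.html")
    names = [os.path.basename(name)
             for name in write_pieces(["house 1", "house 2"], template)]
```

A real implementation would also need to decide what to do with a
template that has no field, or more than one; the sketch simply lets
Python raise an error in those cases.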



--


Cheers
John Summerfield.

Pics http://portgeographe.environmentaldisasters.cds.merseine.nu/




