add regexp-split: a summary and new proposal
From: Daniel Hartwig
Subject: add regexp-split: a summary and new proposal
Date: Sat, 31 Dec 2011 13:54:31 +0800
An attempt to summarize the pertinent points of the thread [1].
[1] http://lists.gnu.org/archive/html/guile-devel/2011-12/msg00241.html
* Semantics, generally
`regexp-split' is similar to `string-split'. However, its semantics
vary between implementations on the following two points. It is
important to stay reasonably compatible with these other
implementations whilst still offering the user a good set of
functionality.
* Captured groups
The Python [2] implementation has distinctive semantics whereby the
text of any captured groups in the pattern is included in the
result:
>>> re.split(r'\W+', 'Words, words, words.')
['Words', 'words', 'words', '']
>>> re.split(r'(\W+)', 'Words, words, words.')
['Words', ', ', 'words', ', ', 'words', '.', '']
This is considered useful functionality to have [3], though not
necessarily by default. Consider a simple parser [4], which needs
access to the tokens for processing.
Other implementations such as Racket [3], Chicken [5], and Perl do not
return the captured groups in their result.
If there were two separate functions (or one function with an
optional argument controlling the output) then the user could have a
single regexp perform both the task of just splitting and the task
of extracting the tokens. [6]
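
A minimal Python sketch of that idea, relying only on the `re.split'
behaviour shown above: with a single capturing group, fields land at
the even indices of the result and delimiters at the odd ones, so one
pattern can serve both tasks. The helper name `fields_and_tokens' is
made up for illustration and is not part of any library.

```python
import re

def fields_and_tokens(pattern, text):
    """Split text with a pattern containing exactly ONE capturing group.

    Hypothetical helper: returns (fields, tokens).
    """
    parts = re.split(pattern, text)
    # With one capturing group the result alternates:
    # field, delimiter, field, delimiter, ..., field
    return parts[0::2], parts[1::2]

fields, tokens = fields_and_tokens(r'(\W+)', 'Words, words, words.')
print(fields)  # ['Words', 'words', 'words', '']
print(tokens)  # [', ', ', ', '.']
```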
[2] http://docs.python.org/library/re.html#re.split
[3] http://lists.gnu.org/archive/html/guile-devel/2011-12/msg00257.html
[4] http://80.68.89.23/2003/Oct/26/reSplit/
[5] http://lists.gnu.org/archive/html/guile-devel/2011-12/msg00249.html
[6] http://lists.gnu.org/archive/html/guile-devel/2011-12/msg00266.html
* Empty strings
Some implementations (e.g. Chicken and Perl) drop (some) empty
strings from their result. In the case of Perl this is likely meant
to make things "nice" for the user in the majority case, but the
behaviour is hard to undo when the empty strings are wanted. [7]
As per the example of `string-split', having empty strings in the
result is useful to keep track of which "field" is which.
In Scheme, if the empty strings are not desired, it is trivial to
remove them:
(filter (negate string-null?) lst)
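
To illustrate the "field" argument, here is a small Python sketch
(the sample record is invented). Keeping the empty string preserves
the position of every field; filtering afterwards, as in the Scheme
one-liner above, is an explicit choice by the caller.

```python
import re

record = 'alice::1000'           # three ':'-separated fields, the middle one empty
fields = re.split(r':', record)
print(fields)                    # ['alice', '', '1000']
print(fields[2])                 # 1000 (still the third field)

# Dropping empty strings is easy, but the field positions are lost.
compact = [f for f in fields if f]
print(compact)                   # ['alice', '1000']
```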
[7] http://lists.gnu.org/archive/html/guile-devel/2011-12/msg00269.html
* Naming
> Also, to me the name seems unintuitive -- it is STR being split, not
> RE -- perhaps this can be folded in to the existing string-split
> function.
[8] http://lists.gnu.org/archive/html/guile-devel/2011-12/msg00245.html
Hopefully I have not missed out anything important :-)
Anyway, what do people think of this proposal which tries to address
that whole discussion:
* [Vanilla `string-split' expanded to support the CHAR_PRED
semantics of `string-index' et al.]
* New function `string-explode' similar to `string-split' but returns
the delimiters in its result.
* Regex module replaces both of these with regexp-enhanced versions.
Thus:
scheme@(guile-user)> ;; with a char predicate
scheme@(guile-user)> (string-split "123+456*/" (negate char-numeric?))
$8 = ("123" "456" "" "")
scheme@(guile-user)> (string-explode "123+456*/" (negate char-numeric?))
$9 = ("123" "+" "456" "*" "" "/" "")
scheme@(guile-user)> ;; with a regular expression
scheme@(guile-user)> (use-modules (ice-9 regex))
scheme@(guile-user)> (define rx (make-regexp "([^0-9])"))
scheme@(guile-user)> (string-split "123+456*/" rx)
$10 = ("123" "456" "" "")
scheme@(guile-user)> ;; didn't want empty strings
scheme@(guile-user)> (filter (negate string-null?) $10)
$11 = ("123" "456")
scheme@(guile-user)> (string-explode "123+456*/" rx)
$12 = ("123" "+" "456" "*" "" "/" "")
and so on.
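
For comparison, a rough Python model of the proposed semantics. This
is only a sketch under stated assumptions: `explode_by_pred' and
`split_by_pred' are hypothetical names standing in for the proposed
`string-explode' and `string-split' with a char predicate, and
`str.isdigit' is used as an approximation of `char-numeric?'. The
results agree with what Python's `re.split' gives for the equivalent
patterns.

```python
import re

def explode_by_pred(s, is_delim):
    """Model of the proposed string-explode: split and keep each delimiter."""
    pieces, current = [], []
    for ch in s:
        if is_delim(ch):
            pieces.append(''.join(current))  # field before the delimiter
            pieces.append(ch)                # the delimiter itself
            current = []
        else:
            current.append(ch)
    pieces.append(''.join(current))          # trailing field, possibly empty
    return pieces

def split_by_pred(s, is_delim):
    """Model of the proposed string-split: the fields only."""
    return explode_by_pred(s, is_delim)[0::2]

def not_digit(ch):
    return not ch.isdigit()

text = "123+456*/"
print(split_by_pred(text, not_digit))    # ['123', '456', '', '']
print(explode_by_pred(text, not_digit))  # ['123', '+', '456', '*', '', '/', '']

# The regexp-based variants correspond to re.split without and with a group.
print(re.split(r'[^0-9]', text) == split_by_pred(text, not_digit))      # True
print(re.split(r'([^0-9])', text) == explode_by_pred(text, not_digit))  # True
```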
I'm happy to throw together a patch for the above, but would like
some feedback first :-)
Regards