Re: add regexp-split: a summary and new proposal

From: Eli Barzilay
Subject: Re: add regexp-split: a summary and new proposal
Date: Sat, 31 Dec 2011 02:30:21 -0500

An hour ago, Daniel Hartwig wrote:
> Anyway, what do people think of this proposal which tries to address
> that whole discussion:
> * [Vanilla `string-split' expanded to support the CHAR_PRED
>   semantics of `string-index' et al.]
> * New function `string-explode' similar to `string-split' but returns
>   the delimiters in its result.
> * Regex module replaces both of these with regexp-enhanced versions.

Aha -- I was looking for a new name, and `-explode' sounds good and
not misleading like `-split' (misleading in that I wouldn't have
expected a "split" function to return stuff from the gaps).

But there's one more point that bugs me about the python thing: the
resulting list has both the matches and the non-matching gaps, and
knowing which is which is tricky.  For example, if you do this (I'll
use our syntax here, so note the minor differences):

  (define (foo rx)
    (regexp-split rx "some string"))

then you can't tell which is which in its output without knowing how
many grouping parens are in the input regexp.  It therefore makes
sense to me to have this instead:

  > (regexp-explode #rx"([^0-9])" "123+456*/")
  '("123" ("+") "456" ("*") "" ("/") "")

and now it's easy to know which is which.  This is of course a simple
example with a single group so it doesn't look like much help, but
with more than one group things can get confusing otherwise: for
example, in Python you can get `None's in the result:

  >>> re.split('([^0-9](4)?)', '123+456*/')
  ['123', '+4', '4', '56', '*', None, '', '/', None, '']

but with the above, this becomes:

  > (regexp-explode #rx"([^0-9](4)?)" "123+456*/")
  '("123" ("+4" "4") "56" ("*" #f) "" ("/" #f) "")

so you can rely on the odd-numbered elements (first, third, ...) to be
strings, and on the even-numbered ones to be lists of group matches.
This is probably going to be different for you, since you allow string
predicates instead of regexps.
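
To make the shape of that result concrete, here is a rough Python
sketch built on `re.finditer'.  The name `regexp_explode' and the
choice of a Python list for each group of matches are assumptions made
for illustration only; this is not an existing function in Python,
Guile or Racket, and it does nothing special for patterns that can
match the empty string.

```python
import re

def regexp_explode(pattern, text):
    """Alternate gap strings with lists of each delimiter's group matches."""
    result = []
    pos = 0
    for m in re.finditer(pattern, text):
        result.append(text[pos:m.start()])  # gap before this delimiter
        result.append(list(m.groups()))     # groups that did not take part are None
        pos = m.end()
    result.append(text[pos:])               # trailing gap, possibly empty
    return result

print(regexp_explode(r"([^0-9])", "123+456*/"))
print(regexp_explode(r"([^0-9](4)?)", "123+456*/"))
```

Whatever the number of groups in the pattern, the gaps sit at every
other position starting from the first, so a caller does not need to
count parens to separate them from the delimiters.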

Finally, the Racket implementation will probably be a little different
still -- our `regexp-match' returns a list with the matched substring
first, and then the matches for the capturing groups.  Following this,
a more uniform behavior for a `regexp-explode' would be to return
these lists, so we'd actually get:

  > (regexp-explode #rx"[^0-9]" "123+456*/")
  '("123" ("+") "456" ("*") "" ("/") "")
  > (regexp-explode #rx"([^0-9])" "123+456*/")
  '("123" ("+" "+") "456" ("*" "*") "" ("/" "/") "")

And again, this looks silly in this simple example, but would be more
useful in more complex ones.  We would also have a similar
`regexp-explode-positions' function that returns position pairs for
cases where you don't want to allocate all substrings.
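
A hedged Python sketch of both variants follows.  The function names
are hypothetical and only mirror the names used above; the
(start, end) tuples are one possible reading of "position pairs", not
a description of what Racket would actually return.

```python
import re

def regexp_explode_full(pattern, text):
    """Like an explode, but each delimiter entry starts with the whole match."""
    result, pos = [], 0
    for m in re.finditer(pattern, text):
        result.append(text[pos:m.start()])
        result.append([m.group(0), *m.groups()])  # whole match, then each group
        pos = m.end()
    result.append(text[pos:])
    return result

def regexp_explode_positions(pattern, text):
    """Same layout, but with (start, end) pairs instead of substrings."""
    result, pos = [], 0
    for m in re.finditer(pattern, text):
        result.append((pos, m.start()))
        # span(0) is the whole match; a group that did not take part gives (-1, -1)
        result.append([m.span(i) for i in range(m.re.groups + 1)])
        pos = m.end()
    result.append((pos, len(text)))
    return result

print(regexp_explode_full(r"([^0-9])", "123+456*/"))
print(regexp_explode_positions(r"[^0-9]", "123+456*/"))
```

The positions version never slices the input, so the caller decides
which substrings are worth allocating.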

One last not-too-related note: this is IMO all a by-product of a bad
choice of common regexp practices where capturing groups always refer
to the last match only.  In a world that would have made a better
choice, I'd expect:

  > (regexp-match #rx"(foo+)+ bar" "blah foofoooo bar")
  '("foofoooo bar" ("foo" "foooo"))

and, of course:

  > (regexp-match #rx"(fo(o)+)+ bar" "blah foofoooo bar")
  '("foofoooo bar" (("foo" ("o")) ("foooo" ("o" "o" "o"))))

But my guess is that many people wouldn't like that much...  (Probably
similar to disliking sexprs which are needed for the results of these
things.)  With such a thing, many of these additional constructs
wouldn't be necessary -- for example, we have `regexp-match*' that
returns all matches, and that wouldn't have been necessary.
`regexp-split' would probably not have been necessary either.
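
For comparison, Python's `re' module follows the same common practice:
a repeated group keeps only its last match.  The sketch below shows
that, plus a workaround that happens to work for this simple pattern
(capture the whole repetition, then rescan it with the inner pattern).
The rescan is an assumption-laden trick for illustration, not a
general substitute for nested results.

```python
import re

text = "blah foofoooo bar"

# The repeated group reports only its final iteration.
m = re.search(r"(foo+)+ bar", text)
last_only = m.groups()

# Workaround for this simple case: capture the whole run of repetitions,
# then rescan that run with the inner pattern to list every iteration.
outer = re.search(r"((?:foo+)+) bar", text)
all_reps = re.findall(r"foo+", outer.group(1))

print(last_only)
print(all_reps)
```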

          ((lambda (x) (x x)) (lambda (x) (x x)))          Eli Barzilay:
                             Maze is Life!
