help-octave
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

[Changeset] Re: Aw: Re: regexp: matching expressions b4 and after ....


From: David Bateman
Subject: [Changeset] Re: Aw: Re: regexp: matching expressions b4 and after ....
Date: Tue, 09 Sep 2008 18:10:49 +0200
User-agent: Thunderbird 2.0.0.16 (X11/20080725)

Ben Abbott wrote:
On Tuesday, September 09, 2008, at 09:41AM, "David Bateman" <address@hidden> 
wrote:
Grrrr, its more annoying than I thought. PCRE CAN do arbitrary length lookahead, but not arbitrary length lookbehind. Thus "(?[a-z]*)" is ok but "(?<[a-z]*)" isn't. I'd hoped to replace this with "(?<[a-z]{0,MAXLENGTH})" but the variable but not arbitrary length is not ok either. What I'd have to do is replace it with

((?<[a-z]{0})(?<[a-z]{1})...(?<[a-z]{MAXLENGTH}))

which used the alternate operator and MALENGTH+1 copies of the lookbehind expression to get the effect. This seems to be a ridiculous amount of extra crap in the pattern space to get this functionality. Is it worth supporting arbitrary length lookbehind expressions like "(?<[a-z]*)" if this is what is needed to get it to work with PCRE? Is it worth supporting it but limits max_length, and print a warning? If so what value should be the limit?

Frankly I wonder how mathworks got this to work as they appear to be using the Boost regex library which also doesn't support arbitrary length lookbehind expressions....

D.

David,

Have you tried the example in Matlab?

Using 2007b, It does *not* work for me. My 2008a/b is busy running some 
simulations, so I can't try it there until later.

g='x^(-1)+y(-1)+z(-1)=0';
regexprep(g,'(?<=[a-z]*)\(\-[1-9]*\)','\_minus1')
ans =
x^_minus1+y_minus1+z_minus1=0

If I understand correctly the result should be
ans =
x^(-1)+y_minus1+z_minus1=0

Correct?

Ben




The message

http://groups.google.com/group/comp.soft-sys.matlab/browse_thread/thread/babf37252132fd99/250b037e60b345ff?lnk=gst&q=lookbehind#250b037e60b345ff

seems to imply that mathworks have their own regexp engine and that lookbehind is inefficient. I therefore don't consider it that much of an issue to duplicate the lookbehind pattern in the pattern space and so propose the attached changeset that replaces "(?>=[a-z]*)" with "((?>=[a-z]{0})|(?>=[a-z]{1})|...(?>=[a-z]{10}))" before calling PCRE on it. It also issues a warning about the maximum length string if the lookbehind might be an issue. So the limitation is that "+" then represents 1 to 10 characters and "*" 0 to 10 characters in a lookbehind expression. This limitation doesn't apply to lookaheads, etc.

D.

--
David Bateman                                address@hidden
Motorola Labs - Paris +33 1 69 35 48 04 (Ph) Parc Les Algorithmes, Commune de St Aubin +33 6 72 01 06 33 (Mob) 91193 Gif-Sur-Yvette FRANCE +33 1 69 35 77 01 (Fax) The information contained in this communication has been classified as: [x] General Business Information [ ] Motorola Internal Use Only [ ] Motorola Confidential Proprietary

# HG changeset patch
# User David Bateman <address@hidden>
# Date 1220976632 -7200
# Node ID efb313d0d98265789218594c6d3dd5d0e19f9ce1
# Parent  917a977250d994aceb97d2bd299e9a6048ca750f
Treat PCRE lookbehind operators in a manner that is approximately correct

diff --git a/src/ChangeLog b/src/ChangeLog
--- a/src/ChangeLog
+++ b/src/ChangeLog
@@ -1,3 +1,10 @@ 2008-09-02  Michael Goffioul  <michael.g
+2008-09-09  David Bateman  <address@hidden>
+
+       * DLD-FUNCTIONS/regexp.cc (octregexp_list): Distinguish between
+       matlab named tokens and perl lookbehind expressions. For
+       lookbehind expression replace "*" and "+" with a limited number of
+       fixed length expressions to simulate arbitrary length look behind.
+
 2008-09-02  Michael Goffioul  <address@hidden>
 
        * graphics.cc (hggroup::update_axis_limits): Also reacts on
diff --git a/src/DLD-FUNCTIONS/regexp.cc b/src/DLD-FUNCTIONS/regexp.cc
--- a/src/DLD-FUNCTIONS/regexp.cc
+++ b/src/DLD-FUNCTIONS/regexp.cc
@@ -80,6 +80,9 @@ public:
 
 typedef std::list<regexp_elem>::const_iterator const_iterator;
 
+#define MAXLOOKBEHIND 10
+static bool lookbehind_warned = false;
+
 static int
 octregexp_list (const octave_value_list &args, const std::string &nm, 
                bool case_insensitive, std::list<regexp_elem> &lst, 
@@ -96,6 +99,9 @@ octregexp_list (const octave_value_list 
   once = false;
 
   std::string buffer = args(0).string_value ();
+  size_t max_length = (buffer.length () > MAXLOOKBEHIND ? 
+                      MAXLOOKBEHIND: buffer.length ());
+
   if (error_state)
     {
       gripe_wrong_type_arg (nm.c_str(), args(0));
@@ -190,12 +196,6 @@ octregexp_list (const octave_value_list 
 
       // named tokens "(?<name>...)" are only treated with PCRE not regex.
 #if HAVE_PCRE
-      // The syntax of named tokens in pcre is "(?P<name>...)" while we need
-      // a syntax "(?<name>...)", so fix that here. Also an expression like
-      // "(?<first>\w+)\s+(?<last>\w+)|(?<last>\w+),\s+(?<first>\w+)" should
-      // be perfectly legal, while pcre does not allow the same named token
-      // name on both sides of the alternative. Also fix that here by replacing
-      // name tokens by dummy names, and dealing with the dummy names later.
       
       size_t pos = 0;
       size_t new_pos;
@@ -204,44 +204,131 @@ octregexp_list (const octave_value_list 
       std::ostringstream buf;
       Array<int> named_idx;
 
-      while ((new_pos = pattern.find ("(?<",pos)) != std::string::npos)
-       {
-         size_t tmp_pos = pattern.find_first_of ('>',new_pos);
-
-         if (tmp_pos == std::string::npos)
+      while ((new_pos = pattern.find ("(?",pos)) != std::string::npos)
+       {
+         if (pattern.at (new_pos + 2) == '<' &&  
+             !(pattern.at (new_pos + 3) == '=' ||
+               pattern.at (new_pos + 3) == '!'))
            {
-             error ("syntax error in pattern");
-             break;
+             // The syntax of named tokens in pcre is "(?P<name>...)" while
+             // we need a syntax "(?<name>...)", so fix that here. Also an 
+             // expression like 
+             // "(?<first>\w+)\s+(?<last>\w+)|(?<last>\w+),\s+(?<first>\w+)" 
+             // should be perfectly legal, while pcre does not allow the same
+             // named token name on both sides of the alternative. Also fix
+             // that here by replacing name tokens by dummy names, and dealing
+             // with the dummy names later.
+
+             size_t tmp_pos = pattern.find_first_of ('>',new_pos);
+
+             if (tmp_pos == std::string::npos)
+               {
+                 error ("syntax error in pattern");
+                 break;
+               }
+
+             std::string tmp_name = 
+               pattern.substr(new_pos+3,tmp_pos-new_pos-3);
+             bool found = false;
+
+             for (int i = 0; i < nnames; i++)
+               if (named(i) == tmp_name)
+                 {
+                   named_idx.resize(inames+1);
+                   named_idx(inames) = i;
+                   found = true;
+                   break;
+                 }
+             if (! found)
+               {
+                 named_idx.resize(inames+1);
+                 named_idx(inames) = nnames;
+                 named.append(tmp_name);
+                 nnames++;
+               }
+
+             if (new_pos - pos > 0)
+               buf << pattern.substr(pos,new_pos-pos);
+             if (inames < 10)
+               buf << "(?P<n00" << inames++;
+             else if (inames < 100)
+               buf << "(?P<n0" << inames++;
+             else
+               buf << "(?P<n" << inames++;
+             pos = tmp_pos;
            }
-
-         std::string tmp_name = pattern.substr(new_pos+3,tmp_pos-new_pos-3);
-         bool found = false;
-
-         for (int i = 0; i < nnames; i++)
-           if (named(i) == tmp_name)
-             {
-               named_idx.resize(inames+1);
-               named_idx(inames) = i;
-               found = true;
-               break;
-             }
-         if (! found)
+         else if (pattern.at (new_pos + 2) == '<')
            {
-             named_idx.resize(inames+1);
-             named_idx(inames) = nnames;
-             named.append(tmp_name);
-             nnames++;
+             // Find lookbehind operators of arbitrary length (ie like 
+             // "(?<=[a-z]*)") and replace with a maximum length operator 
+             // as PCRE can not yet handle arbitrary length lookahead 
+             // operators. Use the string length as the maximum length to 
+             // avoid issues.
+
+             int brackets = 1;
+             size_t tmp_pos1 = new_pos + 2;
+             size_t tmp_pos2 = tmp_pos1;
+             while (tmp_pos1 <= pattern.length () && brackets > 0)
+               {
+                 char ch = pattern.at (tmp_pos1);
+                 if (ch == '(')
+                   brackets++;
+                 else if (ch == ')')
+                   {
+                     if (brackets > 1)
+                       tmp_pos2 = tmp_pos1;
+
+                     brackets--;
+                   }
+                 tmp_pos1++;
+               }
+
+             if (brackets != 0)
+               {
+                 buf << pattern.substr (pos, new_pos - pos) << "(?";
+                 pos = new_pos + 2;
+               }
+             else
+               {
+                 size_t tmp_pos3 = pattern.find_first_of ("*+", tmp_pos2);
+                 if (tmp_pos3 != std::string::npos && tmp_pos3 < tmp_pos1)
+                   {
+                     if (!lookbehind_warned)
+                       {
+                         lookbehind_warned = true;
+                         warning ("%s: arbitrary length lookbehind patterns 
are only support up to length %d", nm.c_str(), MAXLOOKBEHIND);
+                       }
+
+                     buf << pattern.substr (pos, new_pos - pos) << "(";
+
+                     size_t i;
+                     if (pattern.at (tmp_pos3) == '*')
+                       i = 0;
+                     else
+                       i = 1;
+
+                     for (; i < max_length + 1; i++)
+                       {
+                         buf <<pattern.substr(new_pos, tmp_pos3 - new_pos)
+                             << "{" << i << "}";
+                         buf << pattern.substr(tmp_pos3 + 1, 
+                                               tmp_pos1 - tmp_pos3 - 1);
+                         if (i != max_length)
+                           buf << "|";
+                       }
+                     buf << ")";
+                   }
+                 else
+                   buf << pattern.substr (pos, tmp_pos1 - pos);
+                 pos = tmp_pos1;
+               }
            }
-
-         if (new_pos - pos > 0)
-           buf << pattern.substr(pos,new_pos-pos);
-         if (inames < 10)
-           buf << "(?P<n00" << inames++;
-         else if (inames < 100)
-           buf << "(?P<n0" << inames++;
          else
-           buf << "(?P<n" << inames++;
-         pos = tmp_pos;
+           {
+             buf << pattern.substr (pos, new_pos - pos) << "(?";
+             pos = new_pos + 2;
+           }
+
        }
 
       buf << pattern.substr(pos);
@@ -1041,6 +1128,16 @@ The pattern is taken literally.\n\
 
%!assert(regexp({'asdfg-dfd','-dfd-dfd-','qasfdfdaq'},'-'),{6,[1,5,9],zeros(1,0)})
 
%!assert(regexp({'asdfg-dfd';'-dfd-dfd-';'qasfdfdaq'},{'-';'f';'q'}),{6;[3,7];[1,9]})
 %!assert(regexp('Strings',{'t','s'}),{2,7})
+
+## Test case for lookaround operators
+%!assert(regexp('Iraq','q(?!u)'),4)
+%!assert(regexp('quit','q(?!u)'), zeros(1,0))
+%!assert(regexp('quit','q(?=u)','match'), {'q'})
+%!assert(regexp("quit",'q(?=u+)','match'), {'q'})
+%!assert(regexp("qit",'q(?=u+)','match'), cell(1,0))
+%!assert(regexp("qit",'q(?=u*)','match'), {'q'})
+
+%!assert(regexp('thingamabob','(?<=a)b'), 9)
 
 */
 
@@ -1569,6 +1666,9 @@ Alternatively, use (?x) or (?-x) in the 
 %!assert(regexprep({"abc","cba"},"b","?"),{"a?c","c?a"})
 %!assert(regexprep({"abc","cba"},{"b","a"},{"?","!"}),{"!?c","c?!"})
 
+# Nasty lookbehind expression
+%!assert(regexprep('x^(-1)+y(-1)+z(-1)=0','(?<=[a-z]+)\(\-[1-9]*\)','_minus1'),'x^(-1)+y_minus1+z_minus1=0')
+
 */
 
 /*

reply via email to

[Prev in Thread] Current Thread [Next in Thread]