octave-maintainers
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: new strsplit function


From: Philip Nienhuis
Subject: Re: new strsplit function
Date: Sat, 20 Apr 2013 21:51:54 +0200
User-agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.1.11) Gecko/20100701 SeaMonkey/2.0.6

Ben Abbott wrote:
On Apr 3, 2013, at 8:44 AM, Ben Abbott wrote:

On Apr 2, 2013, at 11:42 PM, John W. Eaton wrote:

On 04/02/2013 07:40 PM, Ben Abbott wrote:

On Apr 2, 2013, at 6:35 PM, Carnë Draug wrote:

On 2 April 2013 18:04, Ben Abbott<address@hidden>   wrote:

On Apr 2, 2013, at 12:00 PM, Carnë Draug wrote:

On 2 April 2013 13:02, Ben Abbott<address@hidden>   wrote:
I've added a "conventional" to the possible delimitertypes, and selected this 
as the default when all delimiters are scalar characters. I've also enabled 2D character 
array inputs. The following replacements may be made into OF, and Octave core, to use 
Jaroslav's implementation.  A changeset to core to use these substitutions will also be 
needed.

- strsplit (str, del)
+ strsplit (str, del, "collapsedelimiters", false, "delimitertype", 
"conventional")

- strsplit (str, del, false)
+ strsplit (str, del, "collapsedelimiters", false, "delimitertype", 
"conventional")

- strsplit (str, del, true)
+ strsplit (str, del, "collapsedelimiters", true, "delimitertype", 
"conventional")

That means that OF packages with this change, will only work after 3.8
being released, and the ones without the change will stop working once
3.8 is released, so the releases would have to be timed.

Instead, I propose we add the old strsplit in the private directory of
each package. Then, once 3.8 gets released, we can remove them and
make the change as each package gets re-released.

That's a good idea!

I have just made the change on revision 11779. I have also created a
new task on the trackers https://savannah.gnu.org/task/index.php?12561
so we don't forget what packages got affected.

Finally, geometry and mechanics package need to be fixed some other
way since they make use of it on PKG_DEL and PKD_ADD scripts.

Carnë

I pushed a slightly modified changeset.

        http://hg.savannah.gnu.org/hgweb/octave/rev/5be43435bd5b

In a recent changeset, you made the following change:

-    list = strsplit (list, " \n\t", true);
+    list = strsplit (list, " \n\t", true, "delimitertype", "legacy");

If people will have to make a change in their code to recover previous behavior 
does is really help to introduce a new option like this?  Isn't this equivalent 
to

list = strsplit (list, {" ", "\n", "\t"})

?  Why not just tell people to make this change instead of introducing the 
"legacy" delimiter type?

The "legacy" delimiter type was introduced because Jaroslav's implementation 
was roughly 30x faster than the regexp approach.  In many instances the slow down 
probably isn't a big deal, but for strread / textscan the slow down would be a heavy 
burden.

Also, the new strsplit does not seem to be working correctly for single-quoted 
strings that contain escape sequences.  I think that

strsplit ("foo\tbar", '\t')

should split on the TAB, but it is currently returning the original string for 
me.

Opps.  I hadn't considered single quoted strings.  Your example does work in 
Matlab.  How is this sort of thing handled in other places?  It is sufficient 
to just ...

        delimiter = sprintf (delimiter) ?

This is starting to look a bit messy.  Maybe we should revert this change until 
we have a better plan for the transition.  Maybe it would be good to give 
people some advance warning before making this change all at once.


Sorry, I hadn't considered that.  I say Jordi's request for someone to take 
this on, and it looked like a short project I could find the time for.

        
https://mailman.cae.wisc.edu/pipermail/octave-maintainers/2013-March/032734.html

To revert, the following three changesets need to be included.

        http://hg.savannah.gnu.org/hgweb/octave/rev/1de4ec2a856d

        http://hg.savannah.gnu.org/hgweb/octave/rev/5be43435bd5b

        http://hg.savannah.gnu.org/hgweb/octave/rev/61989cde13ae

Ben

All,

The attached changeset adds sq-string support for "simple" delimiters.  Unless, 
I've missed something else, that should make strsplit() fully Matlab compatible.

Regarding the "legacy" delimiter type, jwe made a good point.  I also sympathised with 
Rik/Philip suggestion to preserve the original algorithm for its speed advantage.   Personally, I'm 
inclined to leave "legacy" in place, but can be easily swayed to revert that.

Comments?

Obviously the point is not to have the "legacy" option per sé, but to preserve the 'legacy speed' as much as possible for simple cases. Could that be done by parsing the input and -depending on the results- choose a flexible but slow path, or a fast but legacy/limited path? (note I haven't delved deep in your patches yet)

I'm not opposed to requiring a cellstr array for multiple delimiters. If all delimiters are single characters we could do s/th like:
  delimiter = [delimiter{:}]
and choose the 'legacy' path, if there's a multi-char delimiter we'd need regexp() and, consequently, a slower path.

BTW I have a multi-char delimiter strread.m lying around somewhere that worked with strrep(), one call for each multi-char delimiter on the entire input character array. Worked slow as well as strrep() is slow (does it invoke regexp() behind the scenes?) but there are several other calls to strrep() in strread.m anyway, no big deal. With the new strsplit, multi-char delimiters can more formally be integrated in strread.m, we'd have some advantage (and a useful one) over ML then.

And then there is the bug John mentioned with escape characters.
On a stable MinGW 3.6.4 build it doesn't work either:

octave:1> strsplit ("foo\tbar", '\t')
ans =
{
  [1,1] = foo   bar
}

...but:

octave:2> strsplit ("foo\tbar", "\t")  ## double quoted \t
ans =
{
  [1,1] = foo
  [1,2] = bar
}

I think that hints to another bug, not so much in strsplit itself.

Philip


reply via email to

[Prev in Thread] Current Thread [Next in Thread]