Re: new strsplit function

octave-maintainers
[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: new strsplit function

From:	Ben Abbott
Subject:	Re: new strsplit function
Date:	Sat, 20 Apr 2013 16:49:04 -0400
On Apr 20, 2013, at 3:51 PM, Philip Nienhuis wrote:

> Ben Abbott wrote:
>> On Apr 3, 2013, at 8:44 AM, Ben Abbott wrote:
>> 
>>> On Apr 2, 2013, at 11:42 PM, John W. Eaton wrote:
>>> 
>>>> On 04/02/2013 07:40 PM, Ben Abbott wrote:
>>>> 
>>>>> On Apr 2, 2013, at 6:35 PM, Carnë Draug wrote:
>>>>> 
>>>>>> On 2 April 2013 18:04, Ben Abbott<address@hidden>   wrote:
>>>>>> 
>>>>>>> On Apr 2, 2013, at 12:00 PM, Carnë Draug wrote:
>>>>>>> 
>>>>>>>> On 2 April 2013 13:02, Ben Abbott<address@hidden>   wrote:
>>>>>>>>> I've added a "conventional" to the possible delimitertypes, and 
>>>>>>>>> selected this as the default when all delimiters are scalar 
>>>>>>>>> characters. I've also enabled 2D character array inputs. The 
>>>>>>>>> following replacements may be made into OF, and Octave core, to use 
>>>>>>>>> Jaroslav's implementation.  A changeset to core to use these 
>>>>>>>>> substitutions will also be needed.
>>>>>>>>> 
>>>>>>>>> - strsplit (str, del)
>>>>>>>>> + strsplit (str, del, "collapsedelimiters", false, "delimitertype", 
>>>>>>>>> "conventional")
>>>>>>>>> 
>>>>>>>>> - strsplit (str, del, false)
>>>>>>>>> + strsplit (str, del, "collapsedelimiters", false, "delimitertype", 
>>>>>>>>> "conventional")
>>>>>>>>> 
>>>>>>>>> - strsplit (str, del, true)
>>>>>>>>> + strsplit (str, del, "collapsedelimiters", true, "delimitertype", 
>>>>>>>>> "conventional")
>>>>>>>> 
>>>>>>>> That means that OF packages with this change, will only work after 3.8
>>>>>>>> being released, and the ones without the change will stop working once
>>>>>>>> 3.8 is released, so the releases would have to be timed.
>>>>>>>> 
>>>>>>>> Instead, I propose we add the old strsplit in the private directory of
>>>>>>>> each package. Then, once 3.8 gets released, we can remove them and
>>>>>>>> make the change as each package gets re-released.
>>>>>>> 
>>>>>>> That's a good idea!
>>>>>> 
>>>>>> I have just made the change on revision 11779. I have also created a
>>>>>> new task on the trackers https://savannah.gnu.org/task/index.php?12561
>>>>>> so we don't forget what packages got affected.
>>>>>> 
>>>>>> Finally, geometry and mechanics package need to be fixed some other
>>>>>> way since they make use of it on PKG_DEL and PKD_ADD scripts.
>>>>>> 
>>>>>> Carnë
>>>>> 
>>>>> I pushed a slightly modified changeset.
>>>>> 
>>>>>   http://hg.savannah.gnu.org/hgweb/octave/rev/5be43435bd5b
>>>> 
>>>> In a recent changeset, you made the following change:
>>>> 
>>>> -    list = strsplit (list, " \n\t", true);
>>>> +    list = strsplit (list, " \n\t", true, "delimitertype", "legacy");
>>>> 
>>>> If people will have to make a change in their code to recover previous 
>>>> behavior does is really help to introduce a new option like this?  Isn't 
>>>> this equivalent to
>>>> 
>>>> list = strsplit (list, {" ", "\n", "\t"})
>>>> 
>>>> ?  Why not just tell people to make this change instead of introducing the 
>>>> "legacy" delimiter type?
>>> 
>>> The "legacy" delimiter type was introduced because Jaroslav's 
>>> implementation was roughly 30x faster than the regexp approach.  In many 
>>> instances the slow down probably isn't a big deal, but for strread / 
>>> textscan the slow down would be a heavy burden.
>>> 
>>>> Also, the new strsplit does not seem to be working correctly for 
>>>> single-quoted strings that contain escape sequences.  I think that
>>>> 
>>>> strsplit ("foo\tbar", '\t')
>>>> 
>>>> should split on the TAB, but it is currently returning the original string 
>>>> for me.
>>> 
>>> Opps.  I hadn't considered single quoted strings.  Your example does work 
>>> in Matlab.  How is this sort of thing handled in other places?  It is 
>>> sufficient to just ...
>>> 
>>>     delimiter = sprintf (delimiter) ?
>>> 
>>>> This is starting to look a bit messy.  Maybe we should revert this change 
>>>> until we have a better plan for the transition.  Maybe it would be good to 
>>>> give people some advance warning before making this change all at once.
>>> 
>>> 
>>> Sorry, I hadn't considered that.  I say Jordi's request for someone to take 
>>> this on, and it looked like a short project I could find the time for.
>>> 
>>>     
>>> https://mailman.cae.wisc.edu/pipermail/octave-maintainers/2013-March/032734.html
>>> 
>>> To revert, the following three changesets need to be included.
>>> 
>>>     http://hg.savannah.gnu.org/hgweb/octave/rev/1de4ec2a856d
>>> 
>>>     http://hg.savannah.gnu.org/hgweb/octave/rev/5be43435bd5b
>>> 
>>>     http://hg.savannah.gnu.org/hgweb/octave/rev/61989cde13ae
>>> 
>>> Ben
>> 
>> All,
>> 
>> The attached changeset adds sq-string support for "simple" delimiters.  
>> Unless, I've missed something else, that should make strsplit() fully Matlab 
>> compatible.
>> 
>> Regarding the "legacy" delimiter type, jwe made a good point.  I also 
>> sympathised with Rik/Philip suggestion to preserve the original algorithm 
>> for its speed advantage.   Personally, I'm inclined to leave "legacy" in 
>> place, but can be easily swayed to revert that.
>> 
>> Comments?
> 
> Obviously the point is not to have the "legacy" option per sé, but to 
> preserve the 'legacy speed' as much as possible for simple cases.
> Could that be done by parsing the input and -depending on the results- choose 
> a flexible but slow path, or a fast but legacy/limited path? (note I haven't 
> delved deep in your patches yet)
> 
> I'm not opposed to requiring a cellstr array for multiple delimiters. If all 
> delimiters are single characters we could do s/th like:
>  delimiter = [delimiter{:}]
> and choose the 'legacy' path, if there's a multi-char delimiter we'd need 
> regexp() and, consequently, a slower path.

This is already being done for "delimitertype" == "simple" !

> BTW I have a multi-char delimiter strread.m lying around somewhere that 
> worked with strrep(), one call for each multi-char delimiter on the entire 
> input character array. Worked slow as well as strrep() is slow (does it 
> invoke regexp() behind the scenes?) but there are several other calls to 
> strrep() in strread.m anyway, no big deal.
> With the new strsplit, multi-char delimiters can more formally be integrated 
> in strread.m, we'd have some advantage (and a useful one) over ML then.
> 
> And then there is the bug John mentioned with escape characters.
> On a stable MinGW 3.6.4 build it doesn't work either:
> 
> octave:1> strsplit ("foo\tbar", '\t')
> ans =
> {
>  [1,1] = foo   bar
> }

Is this with the changeset I attached applied?

Ben
[Prev in Thread]
Current Thread
[Next in Thread]
Re: new strsplit function, (continued)
Prev by Date: Re: new strsplit function
Next by Date: Interested in GSoC
Previous by thread: Re: new strsplit function
Next by thread: Re: new strsplit function
Index(es):
- Date
- Thread