behavior of regexp ( ) function
From: Daniel J Sebald
Subject: behavior of regexp ( ) function
Date: Thu, 01 Jan 2009 00:34:17 -0600
User-agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7.3) Gecko/20041020
Below are some results from regexp() that seem questionable given what the
documentation says (or I'm misunderstanding it). Say I want to pull the
substrings from a line of a tab-separated data file. Let
octave:6> a = sprintf('20\t50\tcelcius\t80')
a = 20 50 celcius 80
octave:7> b = sprintf('20\t50\t\t80')
b = 20 50 80
be some sample lines that might come from a data file. String a has at least
one character between each pair of tabs; b has a case where there are zero
characters between two tabs. For regexp, the character class [^\t] matches any
single character other than a tab. The metacharacter + means match one or more
times. Here are the results for a and b processed with this pattern:
octave:8> regexp(a, '[^\t]+', 'match')
ans =
{
[1,1] = 20
[1,2] = 50
[1,3] = celcius
[1,4] = 80
}
Looks good.
octave:9> regexp(b, '[^\t]+', 'match')
ans =
{
[1,1] = 20
[1,2] = 50
[1,3] = 80
}
I'll go along with that result too. There are zero characters between the
second and third tab, and + requires at least one character to match.
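To see where the empty field goes, the match positions help. This is a small sketch, assuming the 'start' and 'end' options of regexp are available in the installed version:

```octave
% Same sample line as b above: nothing between the second and third tabs.
b = sprintf('20\t50\t\t80');

% Index of the first and last character of each match of [^\t]+.
s = regexp(b, '[^\t]+', 'start');
e = regexp(b, '[^\t]+', 'end');
```

The tabs sit at positions 3, 6 and 7, so the matches cover positions 1-2, 4-5 and 8-9. Nothing is reported between positions 6 and 7 because + cannot match an empty string.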
Now, according to the documentation, * is similar to + in concept, but it
matches _zero_ or more times. Here are the results for a and b processed
with that metacharacter:
octave:10> regexp(a, '[^\t]*', 'match')
ans =
{
[1,1] = 20
}
Doesn't look correct. I'm thinking this should be pretty much the same result
as with metacharacter +, i.e.,
[1,1] = 20
[1,2] = 50
[1,3] = celcius
[1,4] = 80
because + means one or more matches, and "one or more" is a subset of "zero or
more". Next result:
octave:11> regexp(b, '[^\t]*', 'match')
ans =
{
[1,1] = 20
}
Same as previous, but the way I see it, this case should result in
[1,1] = 20
[1,2] = 50
[1,3] = []
[1,4] = 80
where the empty third element comes from the fact that there are zero
characters between two tabs, i.e., "zero or more".
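For comparison, one way to get all four fields, the empty one included, is to split on the tab rather than match the fields themselves. This is only a sketch, and it assumes a regexp that supports the 'split' option, which may not be available in every Octave release:

```octave
% Sample line with an empty field between the second and third tabs.
b = sprintf('20\t50\t\t80');

% Split at every tab; the text between consecutive tabs becomes a field,
% even when that text is empty.
fields = regexp(b, '\t', 'split');

% Number of fields recovered from the line.
n = numel(fields);
```

With this approach the empty field between the two adjacent tabs is kept as an empty string in the third cell instead of being dropped.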
Am I correctly understanding what "zero or more" means?
Dan