Bug in [...]* matching with acute-u
From: Jorge Stolfi
Subject: Bug in [...]* matching with acute-u
Date: Sat, 27 Jan 2001 06:46:11 -0200 (EDT)
Hi,
I think I have run into a bug in gawk's handling of REs of the
form [...]* when the bracketed list includes certain 8-bit characters,
specifically u-acute (octal \372).
The problem occurs in GNU Awk 3.0.4, under both
Linux 2.2.14-5.0 (Intel i686) and SunOS 5.5 (Sun SPARC).
Here is a program that illustrates the bug, and its output.
Since the test string contains no u-acute, the first two lines of the
output should be identical, shouldn't they? The same holds for the
[an]+n and [anú]+n lines.
----------------------------------------------------------------------
#! /usr/bin/gawk -f
BEGIN {
  s = "bananas and ananases in canaan";
  t = s; gsub(/[an]*n/, "AN", t); printf "%-8s %s\n", "[an]*n", t;
  t = s; gsub(/[anú]*n/, "AN", t); printf "%-8s %s\n", "[anú]*n", t;
  print "";
  t = s; gsub(/[aú]*n/, "AN", t); printf "%-8s %s\n", "[aú]*n", t;
  print "";
  t = s; gsub(/[an]n/, "AN", t); printf "%-8s %s\n", "[an]n", t;
  t = s; gsub(/[aú]n/, "AN", t); printf "%-8s %s\n", "[aú]n", t;
  t = s; gsub(/[anú]n/, "AN", t); printf "%-8s %s\n", "[anú]n", t;
  print "";
  t = s; gsub(/[an]?n/, "AN", t); printf "%-8s %s\n", "[an]?n", t;
  t = s; gsub(/[aú]?n/, "AN", t); printf "%-8s %s\n", "[aú]?n", t;
  t = s; gsub(/[anú]?n/, "AN", t); printf "%-8s %s\n", "[anú]?n", t;
  print "";
  t = s; gsub(/[an]+n/, "AN", t); printf "%-8s %s\n", "[an]+n", t;
  t = s; gsub(/[aú]+n/, "AN", t); printf "%-8s %s\n", "[aú]+n", t;
  t = s; gsub(/[anú]+n/, "AN", t); printf "%-8s %s\n", "[anú]+n", t;
}
----------------------------------------------------------------------
[an]*n   bANas ANd ANases iAN cAN
[anú]*n  bananas and ananases in canaan

[aú]*n   bANANas ANd ANANases iAN cANAN

[an]n    bANANas ANd ANANases in cANaAN
[aú]n    bANANas ANd ANANases in cANaAN
[anú]n   bANANas ANd ANANases in cANaAN

[an]?n   bANANas ANd ANANases iAN cANaAN
[aú]?n   bANANas ANd ANANases iAN cANaAN
[anú]?n  bANANas ANd ANANases iAN cANaAN

[an]+n   bANas ANd ANases in cAN
[aú]+n   bANANas ANd ANANases in cANAN
[anú]+n  bananas and ananases in canaan
----------------------------------------------------------------------
Apparently the problem is specific to u-acute; I've tried several
other 8-bit characters and they seem to behave as expected.
By comparing the second and third output lines, it would seem that the
problem involves backtracking out of a partial match of [...]* in
order to match the next sub-expression, when the latter begins with
one of the given characters.
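For anyone who wants to check another awk quickly, the comparison can
be reduced to a single yes/no test. The following is only a sketch: the
function name compare_classes is made up for this example, the locale is
pinned with LC_ALL=C so that octal \372 is treated as a single byte, and
it assumes that writing the character as the escape \372 exercises the
same code path as the literal byte used in the program above, which I
have not verified.

```shell
#!/bin/sh
# compare_classes is a hypothetical helper name, not part of gawk.
# It applies the same substitution with and without \372 in the
# bracket list; the input has no such byte, so the results must agree.
compare_classes() {
    LC_ALL=C awk 'BEGIN {
        s = "bananas and ananases in canaan"
        t1 = s; gsub(/[an]*n/, "AN", t1)       # reference bracket list
        t2 = s; gsub(/[an\372]*n/, "AN", t2)   # same list plus octal \372
        if (t1 == t2)
            print "same"
        else
            print "DIFFERENT: " t1 " | " t2
    }'
}

compare_classes
```

An implementation without the problem should print "same"; an affected
one would be expected to print the two differing strings.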
All the best,
--stolfi
------------------------------------------------------------------------
Jorge Stolfi | http://www.dcc.unicamp.br/~stolfi | address@hidden
Institute of Computing (formerly DCC-IMECC) | Wrk +55 (19)3788-5858
Universidade Estadual de Campinas (UNICAMP) | +55 (19)3788-5840
Av. Albert Einstein 1251 - Caixa Postal 6176 | Fax +55 (19)3788-5847
13083-970 Campinas, SP -- Brazil | Hom +55 (19)3287-4069
------------------------------------------------------------------------