Bug in [...]* matching with acute-u

bug-gnu-utils

From:	Jorge Stolfi
Subject:	Bug in [...]* matching with acute-u
Date:	Sat, 27 Jan 2001 06:46:11 -0200 (EDT)

Hi,

I think I have run into a bug in gawk's handling of REs of the
form [...]* when the bracketed list includes certain 8-bit characters,
specifically u-acute (octal \372).

The problem occurs in GNU Awk 3.0.4 under both
Linux 2.2.14-5.0 (Intel i686) and SunOS 5.5 (Sun SPARC).

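Here is a program that illustrates the bug, followed by its output.
The first two lines of the output should be identical, shouldn't they?
The test string contains no u-acute, so adding it to the bracketed
list should change nothing. The same applies to the [an]+n and
[anú]+n lines at the end.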

----------------------------------------------------------------------
#! /usr/bin/gawk -f

BEGIN {
  s = "bananas and ananases in canaan";
  t = s; gsub(/[an]*n/, "AN", t);   printf "%-8s  %s\n", "[an]*n", t;
  t = s; gsub(/[anú]*n/, "AN", t);  printf "%-8s  %s\n", "[anú]*n", t;
  print "";
  t = s; gsub(/[aú]*n/, "AN", t);   printf "%-8s  %s\n", "[aú]*n", t;
  print "";
  t = s; gsub(/[an]n/, "AN", t);    printf "%-8s  %s\n", "[an]n", t;
  t = s; gsub(/[aú]n/, "AN", t);    printf "%-8s  %s\n", "[aú]n", t;
  t = s; gsub(/[anú]n/, "AN", t);   printf "%-8s  %s\n", "[anú]n", t;
  print "";
  t = s; gsub(/[an]?n/, "AN", t);   printf "%-8s  %s\n", "[an]?n", t;
  t = s; gsub(/[aú]?n/, "AN", t);   printf "%-8s  %s\n", "[aú]?n", t;
  t = s; gsub(/[anú]?n/, "AN", t);  printf "%-8s  %s\n", "[anú]?n", t;
  print "";
  t = s; gsub(/[an]+n/, "AN", t);   printf "%-8s  %s\n", "[an]+n", t;
  t = s; gsub(/[aú]+n/, "AN", t);   printf "%-8s  %s\n", "[aú]+n", t;
  t = s; gsub(/[anú]+n/, "AN", t);  printf "%-8s  %s\n", "[anú]+n", t;
}
----------------------------------------------------------------------
[an]*n    bANas ANd ANases iAN cAN
[anú]*n   bananas and ananases in canaan

[aú]*n    bANANas ANd ANANases iAN cANAN

[an]n     bANANas ANd ANANases in cANaAN
[aú]n     bANANas ANd ANANases in cANaAN
[anú]n    bANANas ANd ANANases in cANaAN

[an]?n    bANANas ANd ANANases iAN cANaAN
[aú]?n    bANANas ANd ANANases iAN cANaAN
[anú]?n   bANANas ANd ANANases iAN cANaAN

[an]+n    bANas ANd ANases in cAN
[aú]+n    bANANas ANd ANANases in cANAN
[anú]+n   bananas and ananases in canaan
----------------------------------------------------------------------

Apparently the problem is specific to u-acute; I've tried several
other 8-bit characters and they seem to behave as expected.
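
The check over other 8-bit characters can be repeated mechanically by
sweeping all codes from 128 to 255. What follows is only a sketch, not
the test originally used: it assumes a POSIX shell with some awk on the
PATH, the function name sweep_high_chars is made up for illustration,
and LC_ALL=C is an assumption that keeps each character a single byte.

```shell
#!/bin/sh
# Hypothetical helper: list the 8-bit codes (128..255) whose presence in
# the bracketed list changes the result of gsub(/[an]*n/, ...).
# None of these characters occurs in the test string, so an awk without
# the bug should print nothing.
sweep_high_chars() {
  LC_ALL=C awk 'BEGIN {
    s = "bananas and ananases in canaan"
    want = s; gsub(/[an]*n/, "AN", want)      # baseline, ASCII-only list
    for (c = 128; c <= 255; c++) {
      re = "[an" sprintf("%c", c) "]*n"       # same list plus one 8-bit char
      got = s; gsub(re, "AN", got)
      if (got != want)
        printf "%d (octal %o): %s\n", c, c, got
    }
  }'
}

sweep_high_chars
```

If the problem really is specific to u-acute, an affected gawk should
print a single line for code 250 (octal 372).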

Comparing the second and third output lines ([anú]*n fails, [aú]*n
works), it seems that the problem involves backtracking out of a
partial match of [...]* in order to match the next sub-expression,
when the latter begins with one of the characters in the list.
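
The hypothesis above can be probed on a smaller input with match(),
which reports where a pattern matches and for how long. This is a
sketch under assumptions, not part of the original test: it needs a
POSIX shell with an awk on the PATH, reduced_case and show are made-up
names, "\372" is written as an octal escape instead of a literal
u-acute, and LC_ALL=C is assumed so that the escape stays a single byte.

```shell
#!/bin/sh
# Hypothetical reduced test: report where each pattern matches in "banana".
reduced_case() {
  LC_ALL=C awk 'BEGIN {
    s = "banana"
    # The list contains "n", so [...]* has to give characters back
    # before the trailing "n" can match.
    show("[an]*n",      s, "[an]*n")
    show("[an\\372]*n", s, "[an\372]*n")
    # The list lacks "n", so nothing has to be given back.
    show("[a\\372]*n",  s, "[a\372]*n")
  }
  function show(label, str, re) {
    if (match(str, re))
      printf "%-12s RSTART=%d RLENGTH=%d (%s)\n", label, RSTART, RLENGTH, substr(str, RSTART, RLENGTH)
    else
      printf "%-12s no match\n", label
  }'
}

reduced_case
```

With leftmost-longest matching, the first two patterns should both
report RSTART=2 RLENGTH=4 (anan) and the third RSTART=2 RLENGTH=2 (an).
A different result on the second line only would be consistent with the
give-back explanation.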


All the best,

--stolfi

------------------------------------------------------------------------
Jorge Stolfi | http://www.dcc.unicamp.br/~stolfi | address@hidden 
Institute of Computing (formerly DCC-IMECC)      | Wrk +55 (19)3788-5858
Universidade Estadual de Campinas (UNICAMP)      |     +55 (19)3788-5840
Av. Albert Einstein 1251 - Caixa Postal 6176     | Fax +55 (19)3788-5847
13083-970 Campinas, SP -- Brazil                 | Hom +55 (19)3287-4069
------------------------------------------------------------------------
