Re: [bug-gawk] Problem with substr() after match() with non-ASCII characters
Sun, 23 Aug 2015 22:32:12 +0100
2015-08-22 22:33:52 +0200, Janis Papanagnou:
> The issue was observed using GNU awk 4.1.2 and confirmed to show the
> same behaviour in GNU awk 4.1.3.
> With the attached program 'testprog' applied on the attached data 'testdata'
> I do *not* get the expected result of four lines containing "2007" each, but
> instead I get:
> The problem is caused/triggered by non-ASCII characters in 'testdata'.
> Note: I can run 'testprog' with LC_ALL=C and the output is as expected.
> My understanding is, though, that the implicit results from the match()
> function, RSTART and RLENGTH, should be consistently usable in substr(),
> independent of the locale setting.
Note that in a UTF-8 locale, that testdata is not valid text.
Those bytes don't form valid characters.
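One way to check this for a given file, as a minimal sketch: pass the data through iconv, which exits non-zero on input that is not valid in the declared source encoding. The helper name is_valid_utf8 and the sample bytes are made up for illustration; they are not the original testdata.

```shell
# Hypothetical helper: succeeds only if stdin is valid UTF-8.
is_valid_utf8() {
    iconv -f UTF-8 -t UTF-8 >/dev/null 2>&1
}

# Byte 0xE9 on its own is "e acute" in ISO-8859-1,
# but it is not a valid UTF-8 sequence here.
if printf 'caf\351 2007\n' | is_valid_utf8; then
    echo "valid UTF-8"
else
    echo "not valid UTF-8"
fi
```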
While the behaviour would be unspecified by POSIX, here I'd
agree gawk has some inconsistency in that those invalid byte
sequences are considered of length 0 for length, index and
substr but of length 1 for match.
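A minimal sketch of how one might probe for this mismatch: print the position reported by index() next to the RSTART set by match() for the same substring. The helper name probe_offsets, its locale argument and the sample line are invented for illustration. In the C locale the two numbers agree; under a UTF-8 locale on an affected gawk they would be expected to differ, going by the description above.

```shell
# Hypothetical helper: print index() and match() positions of "2007",
# running awk under the locale given as $1 (default: C).
probe_offsets() {
    LC_ALL="${1:-C}" awk '{ match($0, /2007/); print index($0, "2007"), RSTART }'
}

# One byte (0xE9) that is invalid as UTF-8 precedes the year.
printf 'caf\351 2007\n' | probe_offsets C    # prints: 6 6

# To compare, rerun with a UTF-8 locale that is installed on your
# system (the locale name below is an assumption):
#   printf 'caf\351 2007\n' | probe_offsets en_US.UTF-8
```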
To me, the best approach would be for them to be of length 1 all
the time, and for them to also match /./. They don't in GNU tools
in general; they don't even match ? in GNU fnmatch, though they
do in the GNU shell's ?.
Here though, you should use a locale where that data is valid
text. If you don't know the encoding but don't care and know it's
single-byte, the C locale is a good option.
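As a rough illustration of that workaround (not the original testprog, which is not shown here), the sketch below forces the C locale for the awk invocation only, so that match() and substr() both count bytes. The helper name extract_year and the sample input are hypothetical.

```shell
# Hypothetical helper: print the first four-digit run on each input
# line, treating the input as single-byte text via the C locale.
extract_year() {
    LC_ALL=C awk '
        match($0, /[0-9][0-9][0-9][0-9]/) {
            print substr($0, RSTART, RLENGTH)
        }'
}

printf 'caf\351 2007\n' | extract_year    # prints: 2007
```

Setting LC_ALL on the awk command line alone leaves the rest of the shell session in its usual locale.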