Re: [bug-gawk] Problem with substr() after match() with non-ASCII characters
Mon, 24 Aug 2015 18:30:48 +0300
> From: Janis Papanagnou <address@hidden>
> To: "address@hidden" <address@hidden>
> Date: Sat, 22 Aug 2015 22:33:52 +0200
> Subject: [bug-gawk] Problem with substr() after match() with non-ASCII
> The issue was observed using GNU awk 4.1.2 and confirmed to show the
> same behaviour in GNU awk 4.1.3.
> With the attached program 'testprog' applied on the attached data 'testdata'
> I do *not* get the expected result of four lines containing "2007" each, but
> instead I get:
> The problem is caused/triggered by non-ASCII characters in 'testdata'.
> Note: I can run 'testprog' with LC_ALL=C and the output is as expected.
The problem is that you're feeding gawk invalid multibyte data for
the locale you're in. When gawk tries to figure out where, in terms of
characters, the match starts, it gets confused because of this invalid
data. Running with --lint reports it:
$ LC_ALL=en_US.UTF-8 gawk --lint -f testprog testdata
gawk: testprog:2: (FILENAME=testdata FNR=2) warning: Invalid multibyte
data detected. There may be a mismatch between your data and your locale.
> My understanding is, though, that the implicit results from the match()
> function, RSTART and RLENGTH, should be consistently usable in substr(),
> independent of the locale setting.
*When the data is valid*, this is correct and things work as expected.
In your case, it's Garbage In, Garbage Out. :-(
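One rough way to check up front whether a file is valid for a UTF-8 locale is to round-trip it through iconv, which exits non-zero when it hits a byte sequence it cannot decode. This is only a sketch: the helper name is_valid_utf8 and the sample strings are made up for illustration, and it assumes an iconv command is installed.

```shell
# Hypothetical helper: succeeds only if stdin is well-formed UTF-8.
is_valid_utf8() {
    # Converting UTF-8 to UTF-8 changes nothing, but iconv fails
    # on the first invalid sequence it reads.
    iconv -f UTF-8 -t UTF-8 >/dev/null 2>&1
}

# \351 is Latin-1 "e acute"; on its own it is not valid UTF-8.
printf 'caf\351 2007\n' | is_valid_utf8 || echo "invalid UTF-8 input"
```

For a real file you would run something like `is_valid_utf8 < testdata` before deciding which locale to use.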
If there's a way to set the locale to latin-whatever for where you
are, then things will probably work ok. Otherwise, you should use
LC_ALL=C or the -b option.
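As a minimal sketch of the LC_ALL=C route: the record below is invented (it is not the original 'testdata') and contains a Latin-1 byte that is invalid in UTF-8. In the C locale every byte counts as one character, so RSTART and RLENGTH from match() line up with what substr() expects.

```shell
# Hypothetical sample record; \351 is a Latin-1 byte, invalid as UTF-8.
extract_year() {
    printf 'caf\351 2007 rest\n' |
        LC_ALL=C awk 'match($0, /[0-9][0-9][0-9][0-9]/) {
            # RSTART/RLENGTH are byte offsets here, and substr() uses bytes too.
            print substr($0, RSTART, RLENGTH)
        }'
}

extract_year
```

With gawk, passing -b (--characters-as-bytes) instead of setting LC_ALL=C should have the same effect on how characters are counted.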
There really is no way around this; the underlying C library routines
depend on the value of the locale variables in order to interpret
the input data.
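To illustrate how much the locale matters, here is a small sketch that counts the characters of a single UTF-8 encoded "é" (the two bytes \303\251) in the C locale. The function name byte_len is just for illustration.

```shell
byte_len() {
    # In the C locale each byte is one character, so length() counts bytes.
    printf '\303\251\n' | LC_ALL=C awk '{ print length($0) }'
}

byte_len
```

This prints 2. Under a working UTF-8 locale gawk would count the same input as one character, which is why offsets computed under one interpretation cannot be trusted under the other.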