Re: [bug-gawk] binary mode for strings?

From: Petr Slansky
Subject: Re: [bug-gawk] binary mode for strings?
Date: Sun, 6 May 2018 15:59:43 +0200

Thank you for your creative ideas!

I would like to point out that you assume GAWK is always the installed version of AWK. That is not the case. I have to deal with CP1250-related issues only when GAWK is installed.
When a fresh Ubuntu is installed, MAWK is installed (mawk 1.3.3 Nov 1996). That legacy AWK lacks many of GAWK's great extensions, but it has no problem with input files in CP1250 encoding.
MAWK has no "-b" switch and reports an error (awk: not an option: -b).

filenc.awk doesn't work with data from STDIN. It also requires the external utility "iconv". I think iconv is installed on Ubuntu by default, but it is not installed on Windows.

The hint to use LC_ALL="CP1250" works, but I don't see any difference from LC_ALL="C". I don't use sort. And what if I have a mix of files, some in CP1250 and others in ISO-8859-2, and I don't want to change the encoding at all? I just want to reformat the input file into a different format and keep its encoding.

I prefer to write AWK code that works with the old MAWK as well as with GAWK. I have a simple task that doesn't require the power of GAWK. I just need to parse some files in a fixed-length field format, which means a lot of "substr" calls. The problem is that the data files are in a "legacy" encoding (CP1250) while modern operating systems use UTF-8 by default. In this respect, MAWK and GAWK are in a similar position to Python 2 and Python 3: the great Python 3 doesn't replace the old Python 2 because it is not backward compatible; they are two different languages...

I have found a way to detect the "locale" problem with GAWK, a function blength():

# get string length, ignore locale
function blength(X,     _XX_)
{
  return(split(X, _XX_, ""));
}

I exit with an error when (length(S) != blength(S)); in that case I know I cannot use substr(), index(), match(), length(), etc., and GAWK has to be called with the "-b" switch.
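
A minimal sketch of that check, wrapped in a shell function so it can be tried directly. The wrapper name check_bytes and the error message are made up for illustration. It assumes, as described above, that an empty-separator split() counts every byte even where length() does not. Empty-separator splitting is an extension rather than POSIX behaviour, although both GAWK and MAWK accept it.

```sh
# Hypothetical wrapper: pass lines through unchanged, or stop at the first
# line where the two length counts disagree.
check_bytes() {
  awk '
    # Number of pieces produced by an empty-separator split (as in blength).
    function blength(s,    parts) {
        return split(s, parts, "")
    }
    # A mismatch means the locale-aware string functions cannot be trusted.
    length($0) != blength($0) {
        print "line " NR ": locale problem, rerun gawk with -b" | "cat 1>&2"
        exit 1
    }
    { print }
  ' "$@"
}
```

With plain ASCII input the two counts agree and every line is printed as-is; whether a given CP1250 line trips the check depends on the AWK implementation and the active locale.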

I have a similar function for substr, which I call bsubstr. It is just a workaround: even the strings returned by bsubstr are still not perfect. I cannot work with them in the old way (length(), substr(), index()), but I can send them to output (print, printf, etc.).
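
The actual bsubstr is not shown in this message. The following is only a guess at how such a function could be built on the same split() trick; the argument order mirrors substr(s, start, len), and the wrapper name bsubstr_demo and the fixed columns 3 to 6 are arbitrary choices for the example.

```sh
# Hypothetical sketch, not the original bsubstr.
bsubstr_demo() {
  awk '
    # Rebuild a substring from the pieces of an empty-separator split,
    # so that the locale-aware substr() is never called.
    function bsubstr(s, start, len,    parts, n, i, out) {
        n = split(s, parts, "")
        if (start < 1) { len += start - 1; start = 1 }   # clamp like substr()
        out = ""
        for (i = start; i < start + len && i <= n; i++)
            out = out parts[i]
        return out
    }
    # Example use: extract a fixed-width field of 4 positions starting at 3.
    { print bsubstr($0, 3, 4) }
  '
}
```

For example, `printf 'ABCDEFGH\n' | bsubstr_demo` prints CDEF. Unlike substr(), this sketch always needs all three arguments.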

BTW, this is the GAWK I use (Ubuntu 16.04.5):

# awk -W version
GNU Awk 4.1.3, API: 1.1 (GNU MPFR 3.1.4, GNU MP 6.1.0)

From my point of view, GAWK should add some way to switch the string-processing functions to the "old way", and to do it from within the script. Just add another magic variable... ;-)

On Wed, May 2, 2018 at 5:11 PM, Manuel Collado <address@hidden> wrote:
On 02/05/2018 at 0:17, address@hidden wrote:

You may wish to consider using iconv along with gawk, and letting iconv
convert the data from CP1250 to UTF-8 or something else more suitable
for gawk on Linux.

The attached "filenc.awk" script can automate the conversion process. Just add an '::encoding' suffix to the foreign input file arguments, and they will be converted to the current locale before processing.

Petr Slansky <address@hidden> wrote:

$ awk -f demo1.awk test.txt
Price is 35.12� (EUR).
Price is 35.12 (EUR).

Change this to:

$ gawk -i filenc -f demo1.awk test.txt::cp1252

It works on my Windows 10/Cygwin64 platform, and should work on Linux platforms with iconv installed.

Hope this helps. Regards.
Manuel Collado - http://lml.ls.fi.upm.es/~mcollado
