Re: [bug-gawk] gawk 4.x series mmap attempts to allocate 32GB of memory


From: Eli Zaretskii
Subject: Re: [bug-gawk] gawk 4.x series mmap attempts to allocate 32GB of memory ...
Date: Sat, 12 Jul 2014 11:31:29 +0300

> Date: Thu, 10 Jul 2014 19:35:56 -0600 (MDT)
> From: "Nelson H. F. Beebe" <address@hidden>
> 
> Green Fox <address@hidden> writes today:
> 
> >> when one is reading from a ( disk / server ) that does not match
> >> the local character set, the current gawk setup fails really badly.
> >> When handling filenames that are not valid UTF-8, ....
> 
> There is an extensive discussion going on now about that issue on the
> TeX Live list, which is archived at
> 
>       http://tug.org/mailman/listinfo/tex-live
> 
> The message traffic exhibits significant problems in the support of
> non-ASCII characters in filenames.  

Not really.  AFAIU, that discussion is about problems that have
already been solved successfully elsewhere.  And many of the messages
in that thread specifically refer to MS-Windows, whose console indeed
has some tough issues supporting characters outside of the system
locale.

> The problem is much more complex than some people think, and part of
> the difficulty arises because:
> 
>       (a) strings (such as filenames) are virtually never tagged
>           with their character sets;
> 
>       (b) filesystems can be shared between disparate operating
>           systems with different character set conventions; and
> 
>       (c) filesystem syntax generally views filenames as byte
>           sequences, rather than character strings.

That is true, but Gawk does not even have a reasonable solution for
the case where the user who invokes the script does in fact know the
encoding of the file names.  I hope you agree that this use case can
and should have a reasonably practical and easy solution.  Other
programs have found and implemented such a solution.

The solution is, generally, to convert text on input into some
universal representation, be it UCS-4 or UTF-8, use that for internal
processing, and convert back to the original encoding when interfacing
with external APIs, such as standard C functions.  (If the original
encoding is already UTF-8, then the conversions are largely no-ops.)
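
To illustrate (a minimal standalone sketch, not Gawk code; the helper
name, the fixed-size buffer, and the ISO-8859-1 example are mine),
converting a file name from a user-specified encoding into UTF-8 can
be done with the standard iconv(3) interface:

    #include <iconv.h>
    #include <stdio.h>
    #include <string.h>

    /* Hypothetical helper: convert NAME from the encoding FROMCODE,
       which the user told us about, into UTF-8 in OUT for internal
       processing.  Error handling is deliberately minimal.  */
    static int
    to_utf8 (const char *fromcode, const char *name,
             char *out, size_t outsize)
    {
      iconv_t cd = iconv_open ("UTF-8", fromcode);
      char *inp = (char *) name;
      char *outp = out;
      size_t inleft = strlen (name);
      size_t outleft = outsize - 1;

      if (cd == (iconv_t) -1)
        return -1;              /* unknown encoding */
      if (iconv (cd, &inp, &inleft, &outp, &outleft) == (size_t) -1)
        {
          iconv_close (cd);
          return -1;            /* invalid input, or OUT too small */
        }
      *outp = '\0';
      iconv_close (cd);
      return 0;
    }

    int
    main (void)
    {
      char buf[256];

      /* "börse" in ISO-8859-1: the 'ö' is the single byte 0xF6.  */
      if (to_utf8 ("ISO-8859-1", "b\xf6rse", buf, sizeof buf) == 0)
        printf ("as UTF-8: %s\n", buf);
      return 0;
    }

The conversion back to the original encoding would happen at the other
boundary, just before handing the name to open(2) and friends; when
the original encoding is already UTF-8, both directions are largely
no-ops.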

> This is all a HUGE can of worms, and I suspect that we should avoid
> opening it.

I don't see why, since others have successfully opened it and lived to
tell the story.

> It is worth recalling that Brian Kernighan at one point
> added some limited support in nawk for multibyte coding and
> internationalization, then withdrew it on finding the portability
> problems that it exposed.  His FIXES file entry of 28 July 2003 says:
> 
>       a moratorium is hereby declared on internationalization changes.
>       i apologize to friends and colleagues in other parts of the world.
>       i would truly like to get this "right", but i don't know what
>       that is, and i do not want to keep making changes until it's clear.

In the 11 years since then, things have happened that IMNSHO require
revisiting the issue.  IMO, it is now known quite well what is "right"
in this case, so the main problem Brian Kernighan faced back then is
gone.

> The awk, mawk, nawk, and oawk implementations treat files as character
> streams, where NUL (0x00) is a string terminator.  By contrast, gawk's
> view of files is that they are simply byte streams, and no byte value
> has any more significance than any other byte value: 0x00 is just a
> normal data byte.  Thus, with care, gawk can be used to read and write
> arbitrary files.  From that point of view, the less it knows about
> `characters', the better.

You cannot do useful text processing if you treat text as a byte
stream.  E.g., regular expression matching of non-ASCII text will not
work, which means regexp matching is reliable only in English locales.
Since regular expressions are at the core of Gawk, this renders a
large chunk of Gawk features unusable in non-English locales or, more
generally, when processing languages other than English.  I think this
is unacceptable in the 2nd decade of the 21st century.
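
To make the regexp point concrete, here is a minimal standalone C
sketch (the UTF-8 locale name is an assumption; it varies between
systems) contrasting byte-oriented matching in the "C" locale with
locale-aware matching of the same UTF-8 bytes:

    #include <locale.h>
    #include <regex.h>
    #include <stdio.h>

    /* Return the length, in bytes, of the initial [[:alpha:]]+ match
       in TEXT under the locale currently in effect.  The pattern is
       compiled after setlocale, because the meaning of [[:alpha:]]
       depends on the locale at regcomp time.  */
    static int
    match_len (const char *text)
    {
      regex_t re;
      regmatch_t m;

      regcomp (&re, "[[:alpha:]]+", REG_EXTENDED);
      if (regexec (&re, text, 1, &m, 0) != 0)
        {
          regfree (&re);
          return 0;             /* no match at all */
        }
      regfree (&re);
      return (int) (m.rm_eo - m.rm_so);
    }

    int
    main (void)
    {
      const char *text = "caf\xc3\xa9";   /* "café" encoded in UTF-8 */

      setlocale (LC_ALL, "C");
      printf ("C locale: %d bytes match\n", match_len (text));

      /* The locale name below is an assumption about the system.  */
      setlocale (LC_ALL, "en_US.UTF-8");
      printf ("UTF-8 locale: %d bytes match\n", match_len (text));
      return 0;
    }

In the "C" locale the match stops after "caf", because the two bytes
of the UTF-8 "é" are not alphabetic there; in the UTF-8 locale the
regexp engine decodes the multibyte character and the whole word
matches.  A byte-stream Gawk gives its users the first behavior
regardless of their data.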


