Re: [bug-gawk] Exposing wcwidth(3) as a built-in function


From: arnold
Subject: Re: [bug-gawk] Exposing wcwidth(3) as a built-in function
Date: Wed, 29 Nov 2017 03:25:50 -0700
User-agent: Heirloom mailx 12.4 7/29/08

Hi Eric.

Including wcwidth in gawk has not come up before. It makes little sense
since, as you note, there's no way to get the Unicode code point for a
multibyte character.

Your awk function should be valid to use, at least with gawk. It might
also work with the Solaris awk; I don't know of any others that are
multibyte aware.

The section in the documentation that you noted was written over 20 years
ago, when life was somewhat simpler.  It could use a little updating,
I admit.

Adding Unicode support in gawk is a tar pit that I have successfully
avoided for quite a number of years, and I intend to do my best to
continue avoiding it.  :-)

In particular, gawk works in all multibyte encodings supported by the
underlying OS it's running on; adding language-level support for Unicode
becomes problematic as soon as you run gawk in a non-Unicode locale.
I don't intend to get anywhere close to that Pandora's Box.

That said, I think it'd be possible to write some extension functions
to expose some Unicode features in a reasonable way.  If you wish
to take that on, I'm happy to provide suggestions and guidance.
I don't have the cycles to do it myself, and prefer that users
develop such things to meet real needs, instead of my attempting
to develop such things in a vacuum.
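
For illustration only, here is a minimal C sketch of the kind of
computation such an extension function might wrap: decode a multibyte
string with mbrtowc(3) in the current locale and sum wcwidth(3) over
the resulting wide characters. The name display_width is hypothetical,
and the gawk extension API glue is deliberately left out; this is not
code from gawk.

```c
#ifndef _XOPEN_SOURCE
#define _XOPEN_SOURCE 700   /* ask the C library to declare wcwidth() */
#endif
#include <assert.h>
#include <string.h>
#include <wchar.h>

/*
 * display_width (hypothetical name): total column width of the
 * multibyte string s, decoded in the current locale's encoding.
 * Returns -1 if s contains an invalid or truncated sequence, or a
 * character for which wcwidth() reports no width (e.g. a control
 * character).
 */
int display_width(const char *s)
{
    mbstate_t st;
    size_t left;
    int total = 0;

    assert(s != NULL);
    memset(&st, 0, sizeof st);      /* start in the initial shift state */
    left = strlen(s);

    while (left > 0) {
        wchar_t wc;
        size_t n = mbrtowc(&wc, s, left, &st);
        int w;

        if (n == (size_t)-1 || n == (size_t)-2)
            return -1;              /* invalid or incomplete sequence */
        if (n == 0)
            break;                  /* decoded a NUL; nothing more to do */

        w = wcwidth(wc);
        if (w < 0)
            return -1;              /* non-printable character */

        total += w;
        s += n;
        left -= n;
    }
    return total;
}
```

A caller would normally run setlocale(LC_ALL, "") first, so that the
decoding follows the user's locale. That is also why the result
depends on the encoding in effect rather than assuming Unicode.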

If this interests you, let's discuss further, off-list.

Thanks,

Arnold

Eric Pruitt <address@hidden> wrote:

> On Tue, Nov 28, 2017 at 04:03:59PM -0800, Eric Pruitt wrote:
> > I generated a table consisting of character widths and two values
> > representing the beginning & end of Unicode codepoint ranges which I
> > translated into an AWK function. At ~2,000 lines, the code is small
> > enough that parsing it doesn't add any noticeable latency on my machine.
> > Unfortunately the function is fairly useless because there isn't a way
> > to efficiently get the numeric codepoint of a character. The code in
> > https://www.gnu.org/software/gawk/manual/html_node/Ordinal-Functions.html
> > ("Translating Between Characters and Numbers") uses an array as a lookup
> > table. The documentation reads "Both functions are written very nicely
> > in awk; there is no real reason to build them into the awk interpreter"
> > but creating a lookup table that spans all of the Unicode codepoints
> > takes a non-trivial amount of time.
>
> I was able to create a working wcwidth function in pure AWK script
> without converting characters to numeric codepoints by (ab)using lexical
> comparisons. I have no clue how portable this is. I've attached the
> generated file in the event that it might be useful to someone else.
>
> Eric
