Re: [bug-gawk] feature request: iconv/recode dynamic extension

From: Wolfgang Laun
Subject: Re: [bug-gawk] feature request: iconv/recode dynamic extension
Date: Sat, 22 Dec 2018 12:37:35 +0100

It is correct that the Unicode Database contains a wealth of information,
but would you like to process 31MB of XML just to learn that your Á is
composed from codepoints 0041 and 0301?
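
To show that this decomposition data is available without parsing the XML, here is a minimal sketch in Python rather than gawk. It assumes Python's standard unicodedata module, which ships tables derived from the Unicode Character Database. The helper name strip_marks is made up for this example.

```python
import unicodedata

def strip_marks(text):
    """Hypothetical helper: decompose to NFD, then drop every character
    that has a nonzero canonical combining class (accents and the like)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

# Code points of the canonical decomposition of U+00C1 (A with acute)
print([f"{ord(ch):04X}" for ch in unicodedata.normalize("NFD", "\u00c1")])

# The sample characters from the quoted genf.awk command line
print(strip_marks("üöóäěščřžýáíéúů"))
```

This only covers characters that have a canonical decomposition. Letters such as "ø" or "ł" have none, so they would still need an explicit table.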

But I *guess* that OP's problem may be related to establishing a (more or
less) correct sort order. Some sort orders for European languages
containing letters with diacritical marks equate such letters to the
"stripped" letter, e.g., "dd" < "de" = "dé" = "dè" < "df". This is where
stripping the accents works. But even within the same language there may be
different sort orders, and there may be one where stripping the diacritical
mark would not work. gawk's sort has an extension that can handle that, and
it would be just as easy to generate a suitable function from a string with
minimal mark-up: "...no=öp..." in contrast to "...noöp...".
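
The mark-up idea can be sketched as follows, again in Python rather than gawk and only as an illustration. The function name make_sort_key and the exact spec syntax are assumptions made for this example: characters are listed in ascending order, and "=" between two characters gives them the same rank. Characters missing from the spec are assumed to sort after all listed ones, by code point.

```python
def make_sort_key(spec):
    """Hypothetical generator: build a sort key from a spec like "de=é=èf".

    Each character gets the next rank, unless it follows "=", in which
    case it shares the rank of the character before it.
    """
    rank = {}
    level = -1
    tie = False
    for ch in spec:
        if ch == "=":
            tie = True
            continue
        if not tie:
            level += 1
        rank[ch] = level
        tie = False
    fallback = level + 1  # rank for characters not named in the spec

    def key(word):
        return [(rank[ch], 0) if ch in rank else (fallback, ord(ch))
                for ch in word]
    return key

words = ["df", "dé", "dd", "de", "dè"]

# Accented letters equated with the stripped letter: de = dé = dè
print(sorted(words, key=make_sort_key("de=é=èf")))

# Accented letters sorted as distinct letters after "e"
print(sorted(words, key=make_sort_key("deéèf")))
```

With the first spec the three "e" variants compare equal, so a stable sort leaves them in input order. With the second spec they sort strictly as de < dé < dè.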


On Sat, 22 Dec 2018 at 11:08, Eli Zaretskii <address@hidden> wrote:

> > From: Wolfgang Laun <address@hidden>
> > Date: Sat, 22 Dec 2018 09:17:07 +0100
> >
> > The most general case of transliteration is handled by defining
> > individual characters.
>
> As I tried to explain, it isn't transliteration that is being sought
> here, it's removal of combining accents and diacritics.
>
> > You can add such a transliteration function ("dedia(str)") to
> > any awk program ("foo.awk") using a simple generator like genf.awk:
> >      gawk -- "`gawk -f genf.awk <<<"üöóäěščřžýáíéúů uooaescrzyaieuu foo.awk"`
>
> Yes, of course.  But coming up with the list of such translations on
> one's own is a huge job, and the Unicode database already has all that
> figured out.  So my suggestion would be to import their tables, rather
> than create them from scratch manually.
>
> Of course, for one-off jobs that need to handle only a small set of
> accented characters, what you suggest is sufficient.  My
> interpretation of the question was that a solution for a more general
> problem was sought.
