bug-gawk

Re: [bug-gawk] feature request: iconv/recode dynamic extension


From: Eli Zaretskii
Subject: Re: [bug-gawk] feature request: iconv/recode dynamic extension
Date: Sat, 22 Dec 2018 09:40:32 +0200

> Date: Sat, 22 Dec 2018 02:29:37 +0100
> From: Franta Hanzlík <address@hidden>
> 
> I'm not sure whether this is a good idea, but I think it may be useful
> to others as well: I'm doing some word processing in gawk, and part
> of it is comparing two strings. These strings are plain ASCII
> strings obtained by removing diacritics from the original Latin-1
> and Latin-2 strings, so I need a conversion such as
>  "äáéěóöščýíüúů" -> "aaeeooscyiuuu".
> For now I solve this by calling an external conversion program, e.g.
> 
> iconv -f UTF-8 -t US-ASCII//TRANSLIT <<< "üöóäěščřžýáíéúů"
>    or
> recode -f u8..flat <<< "üöóäěščřžýáíéúů"
> 
> but for thousands of strings this is too slow (and resource-intensive).

libiconv's TRANSLIT will only work for Latin characters, so it's not
what you want in general.  What you want is the "decomposition" of
each character into the base character and the diacriticals/combining
accents; then you want to throw out the non-base parts.  How to do
that is defined by the Unicode Standard, and needs to use the various
data files provided by the UCD, the Unicode Character Database.
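
To illustrate the decompose-then-drop idea outside of Gawk, here is a
minimal sketch in Python, whose standard unicodedata module ships the
relevant UCD data.  The function name strip_diacritics is made up for
this example; it is not part of any existing Gawk extension or API.

```python
import unicodedata


def strip_diacritics(text):
    """Hypothetical helper: decompose, then drop the combining marks."""
    # NFD splits each precomposed character into its base character
    # followed by its combining marks (canonical decomposition).
    decomposed = unicodedata.normalize("NFD", text)
    # Combining marks have a non-zero canonical combining class;
    # base characters have class 0 and are kept.
    return "".join(ch for ch in decomposed if unicodedata.combining(ch) == 0)


if __name__ == "__main__":
    print(strip_diacritics("äáéěóöščýíüúů"))
```

Note that this only handles characters that have a canonical
decomposition.  Letters such as "ø" or "ł" have none in the UCD, so
they pass through unchanged and would need a separate mapping.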

> There are perhaps a lot of similar text conversion cases where a gawk
> dynamic extension for this need would be very useful.

It could be useful for such jobs, yes.  How frequently these jobs
happen in typical Gawk usage is another question; I don't have an
answer for that.


