Re: [bug-gawk] feature request: iconv/recode dynamic extension


From: Franta Hanzlík
Subject: Re: [bug-gawk] feature request: iconv/recode dynamic extension
Date: Sun, 23 Dec 2018 03:44:28 +0100

On Sat, 22 Dec 2018 13:32:48 +0100
Franta Hanzlík <address@hidden> wrote:

> On Sat, 22 Dec 2018 12:37:35 +0100
> Wolfgang Laun <address@hidden> wrote:
> 
> > It is correct that the Unicode Database contains a wealth of information
> > but would you like to process 31MB of XML just to learn that your Á is
> > composed from codepoints 0041 and 0301?
> > 
> > But I *guess* that OP's problem may be related to establishing a (more or
> > less) correct sort order. Some sort orders for European languages
> > containing letters with diacritical marks equate such letters to the
> > "stripped" letter, e.g., "dd" < "de" = "dé" = "dè" < "df". This is where
> > stripping the accents works. But even within the same language there may be
> > different sort orders, and there may be one where stripping the diacritical
> > mark would not work. gawk's sort has an extension that can handle that, and
> > it would be just as easy to generate a suitable function from a string with
> > minimal mark-up: "...no=öp..." in contrast to "...noöp...".  
> 
> In my case, I process data from a web form, where users fill in their
> name, address etc., and compare this data with a 'database' of users (which
> is a simple text file, one user per row, with <TAB>-separated items) to
> decide whether the user is already in the database. Because some users fill
> in the form without diacritics, for better accuracy I also want to compare
> the data with diacritics removed.
> The user DB has approx. 40 thousand rows (users). One solution could also be
> to convert the DB file (with the recode or iconv utility) to a file without
> diacritics and join both files, or process them in awk separately; this
> should avoid a long processing time. The subsequent removal of diacritics
> from the form data can be done with some slow method, as it involves only a
> small number of strings (<10).
> 
> I will try your dedia() function and measure its speed. If it is fast
> enough, the problem will be solved.

After implementing Wolfgang's dedia() function and running some tests, I got
these results (tested on a flat-file UserDB with 55000 users, average length
approx. 160 chars/user; dedia() knows 52 accented chars):

iconv   - 0.12 sec
recode  - 0.17 sec
dedia() - 4.0 sec (without dedia() I fill my arrays in <2 sec; with dedia()
          the time increased to about 6 sec)

Although dedia() is much slower than iconv/recode, the result is very nice
and good enough for me. Conversion with the awk dedia() function is much
faster than I expected.
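
For illustration, a table-driven replacement of the kind described above might
look like the minimal Python sketch below. It is not Wolfgang's dedia() (an awk
function that is not reproduced in this message); the name strip_known and the
short Czech-oriented table are hypothetical and chosen only for the example.

```python
# Hypothetical lookup table: each accented character in ACCENTED maps to the
# plain letter at the same position in PLAIN. Both strings have 30 characters.
ACCENTED = "áčďéěíňóřšťúůýžÁČĎÉĚÍŇÓŘŠŤÚŮÝŽ"
PLAIN = "acdeeinorstuuyzACDEEINORSTUUYZ"

# str.maketrans builds a per-character translation table from the two strings.
_TABLE = str.maketrans(ACCENTED, PLAIN)


def strip_known(text):
    """Replace only the accented characters listed in the table.

    Characters that are not in the table are passed through unchanged.
    """
    return text.translate(_TABLE)


if __name__ == "__main__":
    print(strip_known("Franta Hanzlík"))
```

As with any hand-made table, only the listed characters are handled; anything
else (for example "ö" here) is left as it is.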

So IMO the conclusion is that conversion using an iconv/recode dynamic
extension would be much faster and more comprehensive, but for less demanding
needs (as here) a solution using awk's own routines is enough.
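
For comparison, the more general route raised earlier in the thread (taking the
decompositions from the Unicode database instead of a hand-made table) can be
sketched in Python with the standard unicodedata module. This is only an
illustration under that assumption, not an awk or gawk-extension
implementation; the function name strip_marks is hypothetical.

```python
import unicodedata


def strip_marks(text):
    """Remove combining marks after canonical decomposition (NFD).

    NFD splits a precomposed character such as U+00C1 into its base letter
    (U+0041) and a combining mark (U+0301); the marks are then dropped.
    """
    decomposed = unicodedata.normalize("NFD", text)
    # unicodedata.combining() returns a non-zero class for combining marks.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


if __name__ == "__main__":
    print(strip_marks("Franta Hanzlík"))
```

Letters that have no canonical decomposition, such as "ł" or "ø", pass through
unchanged, so this removes combining accents rather than transliterating.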

Thanks, Franta

> > On Sat, 22 Dec 2018 at 11:08, Eli Zaretskii <address@hidden> wrote:
> >   
>  [...]  
> > > individual    
>  [...]  
> > >
> > > As I tried to explain, it isn't transliteration that is being sought
> > > here, it's removal of combining accents and diacritics.
> > >    
>  [...]  
> > >
> > > Yes, of course.  But coming up with the list of such translations on
> > > one's own is a huge job, and the Unicode database already has all that
> > > figured out.  So my suggestion would be to import their tables, rather
> > > than create them from scratch manually.
> > >
> > > Of course, for one-off jobs that need to handle only a small set of
> > > accented characters, what you suggest is sufficient.  My
> > > interpretation of the question was that a solution for a more general
> > > problem was sought.  
-- 
Franta Hanzlik


