Re: [bug-gawk] feature request: iconv/recode dynamic extension


From: Franta Hanzlík
Subject: Re: [bug-gawk] feature request: iconv/recode dynamic extension
Date: Sun, 23 Dec 2018 11:20:12 +0100

Hi Wolfgang,

I think you understood it well (I asked for an awk extension without
describing in more detail what I had tried). My mistake.

I was originally thinking of a function like this (simplified, without
error handling, sanitizing, etc.):

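# Strip diacritics from s by piping it through recode (one subprocess per call).
function dedia(s,   CMD, r) {
    # NOTE: s is pasted into the shell command unescaped (no sanitizing).
    CMD = "echo -n \"" s "\" | recode -f u8..flat"
    CMD | getline r
    close(CMD)    # release the pipe
    return r
}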

This works well, but it is slow and resource-intensive, since every call
starts a shell and a recode process.
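
For comparison, here is a minimal sketch of the pure-awk approach discussed
in this thread: a lookup table of accented characters applied with gsub(),
with no subprocess per call. It is not Wolfgang's actual dedia(), which is
not reproduced in this message. The wrapper name dedia_awk and the mapping
table (only a few Czech letters) are assumptions made for illustration, and
the input is assumed to be UTF-8.

```shell
# Hypothetical wrapper: reads lines on stdin, prints them without diacritics.
dedia_awk() {
    awk '
    BEGIN {
        # Illustrative, incomplete table: "accented plain" pairs.
        t = "á a č c ď d é e ě e í i ň n ó o ř r š s ť t ú u ů u ý y ž z"
        t = t " Á A Č C Ď D É E Ě E Í I Ň N Ó O Ř R Š S Ť T Ú U Ů U Ý Y Ž Z"
        n = split(t, m, " ")    # m[odd] = accented, m[even] = replacement
    }
    # Replace every accented character listed in the table; one gsub() per pair.
    function dedia(s,   i) {
        for (i = 1; i < n; i += 2)
            gsub(m[i], m[i + 1], s)
        return s
    }
    { print dedia($0) }
    '
}
```

For example, `printf 'Žluťoučký kůň\n' | dedia_awk` should print
`Zlutoucky kun`. Characters missing from the table pass through unchanged.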

Happy Christmas!

On Sun, 23 Dec 2018 03:57:30 +0100
Wolfgang Laun <address@hidden> wrote:

> It seems that I misunderstood your original post.
> Apparently you call iconv/recode only once for each run of your awk program.
> I wouldn't have posted my function if I had read your original post more
> carefully.
> Well, since you think it's useful :-)
> 
> Best wishes
> Wolfgang
> 
> On Sun, 23 Dec 2018 at 03:45, Franta Hanzlík <address@hidden> wrote:
> 
> > On Sat, 22 Dec 2018 13:32:48 +0100
> > Franta Hanzlík <address@hidden> wrote:
> >  
> > > On Sat, 22 Dec 2018 12:37:35 +0100
> > > Wolfgang Laun <address@hidden> wrote:
> > >  
> > >
> > > In my case, I process data from a web form, where users fill in their
> > > name, address, etc., and compare this data with a 'database' of users
> > > (a simple text file, one user per row, with <TAB>-separated items) to
> > > decide whether the user is already in the database. Because some users
> > > fill in the form without diacritics, for better accuracy I also want to
> > > compare the data with diacritics removed.
> > > The user DB has approx. 40 thousand rows (users). One solution could
> > > also be to convert the DB file (with the recode or iconv utility) to a
> > > file without diacritics and then join both files or process them in awk
> > > separately; this should avoid a long processing time. The subsequent
> > > removal of diacritics from the form data can be done with some slow
> > > method, as it involves only a small number of strings (<10).
> > >
> > > I will try your dedia() function and measure its speed. If it is
> > > sufficient, the problem will be solved.
> >
> > After implementing Wolfgang's dedia() function and running some tests, I
> > got these results (tested on a flat-file UserDB with 55000 users, average
> > length approx. 160 chars/user; dedia() knows 52 accented chars):
> >
> > iconv   - 0.12 sec
> > recode  - 0.17 sec
> > dedia() - 4.0 sec (without dedia() I fill my arrays within <2 sec; with
> >           dedia() the time increased to about 6 sec)
> >
> > Although dedia() works much more slowly than iconv/recode, the result is
> > very nice and it's enough for me. Conversion with awk dedia() is much
> > faster than I thought.
> >
> > Thus, IMO, the conclusion is that conversions using an iconv/recode
> > dynamic extension would be much quicker and more comprehensive, but for
> > less demanding needs (as here) a solution using awk's own routines is
> > enough.
> >
> > Thanks, Franta
-- 
Franta Hanzlik