From: Jarno Suni
Subject: Re: [bug-gawk] mystrtonum for any awk (Was: Handling hexadecimals in different modes)
Date: Tue, 22 Sep 2015 13:51:14 +0300

On Wed, 9 Sep 2015 21:55:49 +0300
Jarno Suni <address@hidden> wrote:

> On Tue, 08 Sep 2015 07:34:12 -0600
> address@hidden wrote:
> 
> > Jarno Suni <address@hidden> wrote:
> > 
> > > But if you want to write portable scripts, strtonum() is hard to
> > > recommend as it is not found in other awk implementations AFAIK.
> > 
> > The gawk doc includes an awk implementation of strtonum that can
> > be used with other awks, so the above isn't an issue.
> 
> Well, there are a few issues with the mystrtonum() implementation
> available at
> http://www.gnu.org/software/gawk/manual/gawk.html#Strtonum-Function :
> 
> The implicit decimal conversion "ret = str + 0" might not work if the
> awk implementation follows the locale setting for the decimal
> separator.
> 
> Using
> /^[-+]?([0-9]+([.][0-9]*([Ee][0-9]+)?)?|([.][0-9]+([Ee][-+]?[0-9]+)?))$/
> to check whether the string represents a valid decimal number is not
> correct. It reports e.g. "3e-2" as invalid, i.e. "NOT-A-NUMBER".
> 
> By the way, strtonum() seems to
> convert the /^[-+]?[0-9]*\.?[0-9]*([Ee][-+]?[0-9]+)?/ part and ignore
> the rest of the source string. Only in posix mode or in use-lc-numeric
> mode does it use the decimal separator of the current locale instead
> of a period. Still, the gawk doc says "Note also that strtonum() uses
> the current locale’s decimal point for recognizing numbers".

So there is an error in the doc: by default strtonum() does not follow
the locale's decimal point.

> A somewhat stricter validation would
> be /^[-+]?([0-9]+\.?|\.[0-9])[0-9]*([eE][-+]?[0-9]+)?$/ which matches
> AWK numeric constants according to the manual page of mawk.
> 
> The character class [[:xdigit:]] is not understood by all awk
> variants. For example, the old but widely used mawk 1.3.3 does not
> support it. I guess [0-9a-fA-F] is more portable.
> 
> I am writing another mystrtonum() implementation to address these
> issues and more.
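
The difference between the two validation patterns quoted above can be
checked from the shell. This is only a sketch: the wrapper names
old_check and new_check are made up for illustration, and the patterns
are copied verbatim from the quoted text (period as decimal separator).

```sh
# old_check: pattern from the gawk manual's mystrtonum() (as quoted above)
old_check() {
    awk -v s="$1" 'BEGIN {
        if (s ~ /^[-+]?([0-9]+([.][0-9]*([Ee][0-9]+)?)?|([.][0-9]+([Ee][-+]?[0-9]+)?))$/)
            print "valid"
        else
            print "invalid"
    }'
}

# new_check: the stricter pattern proposed above (AWK numeric constants)
new_check() {
    awk -v s="$1" 'BEGIN {
        if (s ~ /^[-+]?([0-9]+\.?|\.[0-9])[0-9]*([eE][-+]?[0-9]+)?$/)
            print "valid"
        else
            print "invalid"
    }'
}

old_check 3e-2   # the manual's pattern rejects this
new_check 3e-2   # the proposed pattern accepts it
```

The old pattern only allows an exponent after a decimal point, and in
that branch it does not allow a sign in the exponent, which is why
"3e-2" fails there.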

So here's the new strtonum replacement, though it is not identical. It
extends the current strtonum so that it recognizes and can handle more
bases. Octal numbers are recognized differently: they need a "0o"
prefix, so a plain leading zero as in "0123" is read as decimal. It
recognizes the same decimal separator as sprintf uses; the decimal
separator varies according to the awk implementation and
implementation-specific options. Besides, conversion of big numbers may
not work in all awk implementations, since not all of them can handle
arbitrarily large numbers. The function does not recognize thousands
separators. It returns "NaN" for invalid numbers or "IB" for an invalid
base.
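
The decimal separator detection can be tried in isolation. The
following is a minimal sketch, not part of the script itself: the
wrapper name decimal_ok is hypothetical, and it builds the same kind of
pattern from whatever separator sprintf produces in the running awk.

```sh
# decimal_ok: prints "ok" if $1 looks like a decimal number written with
# the separator that this awk's sprintf uses, "NaN" otherwise.
decimal_ok() {
    awk -v s="$1" 'BEGIN {
        # second character of "1.1" (or "1,1") is the decimal separator
        d = substr(sprintf("%g", 1.1), 2, 1)
        # escape the separator so that a period is taken literally
        rd = "^[-+]?([0-9]+\\" d "?|\\" d "[0-9])[0-9]*([eE][-+]?[0-9]+)?$"
        if (s ~ rd) print "ok"; else print "NaN"
    }'
}

decimal_ok 1.32E2
decimal_ok "1 000.4"
```

In the C locale the separator is a period, so "1.32E2" is accepted and
"1 000.4" is rejected. Which separator other locales give depends on the
awk implementation and its options, as noted above.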

#!/usr/bin/awk -f
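function convert_from_base(base, str, i,   ret, n, p)
{
    # Convert the digits of str, starting at position i, from the given
    # base. Uses the global digit-value table v[] and the global nan.
    n = length(str)
    if (n < i) return 0
    if ((ret = v[substr(str, i, 1)]) != "" && ret < base) {
        i++
        for (; i <= n; i++) {
            if ((p = v[substr(str, i, 1)]) != "" && p < base)
                ret = ret*base + p
            else
                return nan
        }
        return ret
    } else
        return nan
}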
function mystrtonum(str, base,  c)
{
    if (base) {        
        if (base < 2 || base > ld) {
            return invalid_base
        }
        # natural number of given base
        return convert_from_base(base, str, 1)
    } else if (substr(str, 1, 1)=="0") {
        c=tolower(substr(str, 2, 1))
        if (c=="b") {
            # natural binary
            return convert_from_base(2, str, 3)
        } else if (c=="o") {
            # natural octal
            return convert_from_base(8, str, 3)
        } else if (c=="d") {
            # natural decimal
            return convert_from_base(10, str, 3)
        } else if (c=="x") {
            # natural hexadecimal
            return convert_from_base(16, str, 3)
        }
    }
    
    if (str !~ rd) return nan;
    # decimal, possibly floating point
    return str + 0
}
 BEGIN {
     nan="NaN" # marks "Not a Number"
     invalid_base="IB" # marks "Invalid Base"
     digits="0123456789abcdefghijklmnopqrstuvwxyz"
     ld=length(digits) # maximum base
     for(i=0; i<length(digits); i++) v[substr(digits,i+1,1)]=i
     for(i=10; i<length(digits); i++) v[toupper(substr(digits,i+1,1))]=i
     d=substr(sprintf("%g",1.1),2,1); # d is the decimal separator
     rd="^[-+]?([0-9]+\\" d "?|\\" d "[0-9])[0-9]*([eE][-+]?[0-9]+)?$"
     # rd is regular expression to match decimal floating point number


     # test harness
     a[0]="-.1"
     a[1]="25"
     a[2]=".31"
     a[3]="0123"
     a[4]="0xdeadBEEF"
     a[5]="123.45"
     a[6]="1.e3"
     a[7]="1.32"
     a[8]="1.32E2"
     a[9]=".e2"
     a[10]="3.9e-2"
     a[11]="1e5"
     a[12]=""
     a[13]="1,123"
     a[14]="awk"
     a[15]="1 000.4"
     a[16]=".3e-2"
     a[17]="-"
     a[18]="."
     a[19]="+."
     a[20]="deadBEEF"
     a[21]="deadbeef"
     a[22]="oajiflkajrelakqlquhrelkjZ"
     a[23]="0xdead"
     a[24]=",1"
     a[25]="0,23"
     a[26]="3e-2"
     a[27]="080"
     a[28]="1.2a"
     a[29]="1,2a"
     a[30]="01e1"
     a[31]="ö"
     a[32]="0b101"
     a[33]="0o76"
     a[34]="0d96"
     a[35]="0Xf"
     
     for (i=0; i in a; i++) {
         printf "\"%s\" %g %s %d %s\n",
          a[i],
          mystrtonum(a[i]),
          mystrtonum(a[i]),
          mystrtonum(a[i],ld),
          mystrtonum(a[i],ld)
         print strtonum(a[i]) # gawk only; other awks fail on the undefined function
     }
}
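
For trying the digit-by-digit conversion on its own, here is a
stripped-down sketch of the same idea (multiply the running value by the
base and add the next digit). The wrapper name from_base is hypothetical,
and unlike the script above it looks digits up with index() instead of
the v[] table.

```sh
# from_base BASE DIGITS: prints the value of DIGITS read in BASE (2..36),
# or "NaN" if a character is not a valid digit in that base.
from_base() {
    awk -v base="$1" -v str="$2" 'BEGIN {
        digits = "0123456789abcdefghijklmnopqrstuvwxyz"
        str = tolower(str)
        ret = 0
        for (i = 1; i <= length(str); i++) {
            p = index(digits, substr(str, i, 1)) - 1   # digit value, -1 if unknown
            if (p < 0 || p >= base) { print "NaN"; exit }
            ret = ret * base + p
        }
        print ret
    }'
}

from_base 2 101
from_base 16 F
from_base 8 9
```

As with the script above, an empty digit string gives 0, and large
values are limited by how the awk in use represents numbers.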


-- 
Jarno Ilari Suni - http://www.iki.fi/8/


