bug-gawk

article about gawk best practices in data science and feature proposal


From: Ivan Molineris
Subject: article about gawk best practices in data science and feature proposal
Date: Thu, 11 Feb 2021 10:53:19 +0100

Hi all,
I am starting a new thread, even though it is related to the "complie with
mpfr support" thread that I recently opened.

I'm a bioinformatician and I use gawk in my everyday work.
Like many other data scientists, I use wrapper scripts to set some options
by default, such as -F'\t' -v OFS='\t'.

I recently discovered that in our work it is essential to also set -M,
since we often work with numbers very close to 0 and we must avoid cases
like this:
$ echo 1.8e-308 | gawk '$1<0.05 {print "true"}'
which does not print "true" without -M.

Is there a good article about gawk best practices in data science?

I would like to propose to the community a simple wrapper script that
implements such good practices, including e.g. the setting of -M, -F'\t',
-v OFS='\t'.
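
A minimal sketch of such a wrapper, written as a shell function. The name
"tsvawk" is hypothetical and chosen only for illustration. Note that -M
only has an effect when gawk was built with MPFR/GMP support.

```shell
# Hypothetical wrapper; "tsvawk" is an illustrative name, not an existing tool.
tsvawk() {
    # -M           arbitrary-precision arithmetic (needs gawk built with MPFR/GMP)
    # -F'\t'       split input fields on tabs
    # -v OFS='\t'  join output fields with tabs
    gawk -M -F'\t' -v OFS='\t' "$@"
}

# Usage: swap two tab-separated columns; a field may itself contain spaces.
printf 'a b\tc\n' | tsvawk '{ print $2, $1 }'
```

All remaining arguments are passed through unchanged, so the wrapper can be
used wherever a plain gawk call would be.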

Moreover, one of the biggest drawbacks of gawk in our field is that
referring to input columns by number often produces hard-to-read scripts.
For this reason, the wrapper I commonly use makes it possible to refer to
columns not only by number, but also by name.

For example, if a file is composed like this:

chromosome     start        end
      chr1       241      53521
      chr1       363      43623
      chr2      5243     234562

gawk '{l=$3-$2}'
can also be written as
gawk '{l=$end-$start}'

I know that this syntax is not backward compatible; maybe it can be
improved.
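
For comparison, a similar effect can be obtained in unmodified gawk by
reading the header row into an array that maps column names to field
numbers. This is only a sketch, not the wrapper described above; the
function name "region_lengths" and the array name "col" are assumptions
made for illustration.

```shell
# Hypothetical helper: print each chromosome with the length end - start.
region_lengths() {
    gawk '
    # Header row: remember the field number of every column name.
    NR == 1 { for (i = 1; i <= NF; i++) col[$i] = i; next }
    # Data rows: look fields up by name instead of by number.
    { l = $(col["end"]) - $(col["start"]); print $(col["chromosome"]), l }
    ' "$@"
}

# Usage with the sample table from above.
region_lengths <<'EOF'
chromosome     start        end
      chr1       241      53521
      chr1       363      43623
      chr2      5243     234562
EOF
# prints:
# chr1 53280
# chr1 43260
# chr2 229319
```

This stays backward compatible because $(col["end"]) is ordinary awk, at
the cost of being more verbose than $end.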

Do you know whether anyone has considered a feature like this in the
past?

Best regards

