article about gawk best practices in data science and feature proposal
From: Ivan Molineris
Subject: article about gawk best practices in data science and feature proposal
Date: Thu, 11 Feb 2021 10:53:19 +0100
Hi all,
I am starting a new thread, even though it is related to the "compile with
mpfr support" thread that I recently opened.
I'm a bioinformatician and I use gawk in everyday work.
Like many other data scientists, I use wrapper scripts to set some options
by default, such as -F'\t' -v OFS='\t'.
I recently discovered that in our work it is also essential to set -M,
since we often work with numbers very close to 0 and we must avoid cases
like this:
$ echo 1.8e-308 | gawk '$1<0.05 {print "true"}'
which does not print "true" without -M.
Is there a good article about gawk best practices in data science?
I would like to propose to the community a simple wrapper script that
implements such good practices, including, for example, setting -M, -F'\t',
and -v OFS='\t'.
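A minimal sketch of such a wrapper, written as a shell function. The name
tawk is hypothetical, and -M only has an effect when gawk was built with
MPFR support:

```shell
# Hypothetical wrapper (the name "tawk" is made up for illustration).
# It forwards all arguments to gawk with the defaults discussed above:
#   -M            arbitrary-precision arithmetic (needs a gawk built with MPFR)
#   -F'\t'        tab as input field separator
#   -v OFS='\t'   tab as output field separator
tawk() {
    gawk -M -F'\t' -v OFS='\t' "$@"
}

# Example call (assumes a tab-separated file named input.tsv exists):
#   tawk '$1 < 0.05 { print "true" }' input.tsv
```

Options given on the command line still reach gawk after the defaults, so a
later -F or -v OFS=... should override them for a single call.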
Moreover, one of the biggest drawbacks of gawk in our field is that
referring to input columns by number often produces hard-to-read scripts.
For this reason, the wrapper I commonly use makes it possible to refer to
columns not only by number, but also by name.
For example, if a file looks like this:
chromosome start end
chr1 241 53521
chr1 363 43623
chr2 5243 234562
gawk '{l=$3-$2}'
can also be written as
gawk '{l=$end-$start}'
I know that this syntax is not backward compatible; maybe it can be improved.
Do you know whether anyone has considered a feature like this one in the
past?
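For comparison, something close to this can already be approximated in plain
awk by reading the header line into an array. This is only a sketch, assuming
a tab-separated file with a header row; the array name col and the variable
names are chosen for illustration and are not part of any existing wrapper:

```shell
# Prefer gawk, fall back to any POSIX awk; nothing below is gawk-specific.
awk_bin=$(command -v gawk || command -v awk)

# Illustrative input: the three-column example file, tab-separated.
input=$(printf 'chromosome\tstart\tend\nchr1\t241\t53521\nchr1\t363\t43623\nchr2\t5243\t234562\n')

# On the header line, map each column name to its position ("col" is a
# hypothetical array name); on later lines, address columns by name.
lengths=$(printf '%s\n' "$input" | "$awk_bin" -F'\t' -v OFS='\t' '
    NR == 1 { for (i = 1; i <= NF; i++) col[$i] = i; next }
    { l = $(col["end"]) - $(col["start"]); print $(col["chromosome"]), l }
')

printf '%s\n' "$lengths"
```

This keeps existing scripts valid, at the cost of the more verbose
$(col["end"]) instead of $end.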
Best regards