Re: Adding dot product operation to GNU Datamash

From: Erik Auerswald
Subject: Re: Adding dot product operation to GNU Datamash
Date: Sat, 6 Aug 2022 19:57:28 +0200


On 06.08.22 03:30, Tim Rice wrote:

I've been thinking about this for a while: it would be nice to have an operation which multiplies the corresponding records of two columns and returns the sum of these products. Aka the dot product or scalar product of the two columns.

At the moment, you could do something similar by combining GNU Datamash with GNU Awk:

$ awk '{print $1 * $2}' /tmp/data.txt | datamash sum 1

Or you could do it all in gawk if you want:

$ awk '{sum += $1 * $2} END{print sum}' /tmp/data.txt
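As a quick sanity check of the awk variant above, here is a minimal sketch run against a small made-up data set. The three sample rows and the temporary file are assumptions for illustration only; they are not from the original message.

```shell
# Hypothetical sample data: two whitespace-separated numeric columns.
tmpfile=$(mktemp)
printf '1 4\n2 5\n3 6\n' > "$tmpfile"

# Accumulate the row-wise products: 1*4 + 2*5 + 3*6.
result=$(awk '{sum += $1 * $2} END{print sum}' "$tmpfile")
echo "$result"   # prints 32

rm -f "$tmpfile"
```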

But I think doing it all in GNU Datamash allows a more intuitive command:

$ datamash -W dotprod 1:2 < /tmp/data.txt

A proposed implementation is attached. Please let me know if you see any problems with it.

I looked at the diff and did not see any obvious problems.  I do
not see a reason not to add that operation either.

If this looks good, then it should be trivial to also add a weighted mean. That will just be like the dot product except for dividing the result by one of the column sums. (But which column should be preferred for that? Maybe need to pass an extra option?)

It might suffice to always divide by the sum of the first column,
if the code keeps the order of the given fields.  I think it does,
but I did not verify this.

This would allow using "weighted_mean 1:2" or "weighted_mean 2:1"
to divide by the sum of column 1 or column 2, respectively.

("weighted_mean" is just a placeholder, of course, I just needed
some name to illustrate the idea.)
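The idea can be sketched with plain awk, independent of any Datamash implementation. This is only an illustration: the shell function weighted_mean below is hypothetical (it borrows the placeholder name), it takes the weight column first and the value column second, and the sample rows are made up.

```shell
# Hypothetical helper: weighted_mean WEIGHT_COL VALUE_COL, rows read from stdin.
# Computes the dot product of the two columns divided by the sum of WEIGHT_COL.
weighted_mean() {
    awk -v w="$1" -v v="$2" '
        { dot += $w * $v; wsum += $w }
        END { if (wsum != 0) print dot / wsum }   # no output if the weights sum to zero
    '
}

# Made-up sample rows; the dot product is 1*10 + 3*20 = 70.
printf '1 10\n3 20\n' | weighted_mean 1 2   # divides by 1 + 3,   prints 17.5
printf '1 10\n3 20\n' | weighted_mean 2 1   # divides by 10 + 20, prints 2.33333
```

Swapping the two arguments changes only the divisor, which mirrors the "1:2" versus "2:1" ordering suggested above.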

