## Re: Adding dot product operation to GNU Datamash

**From**: Erik Auerswald

**Subject**: Re: Adding dot product operation to GNU Datamash

**Date**: Sat, 6 Aug 2022 19:57:28 +0200

**User-agent**: Mozilla/5.0 (X11; Linux x86_64; rv:91.0) Gecko/20100101 Thunderbird/91.11.0

Hi,

On 06.08.22 03:30, Tim Rice wrote:

> I've been thinking about this for a while: it would be nice to have an
> operation which multiplies the corresponding records of two columns and
> returns the sum of these products. Aka the dot product or scalar product
> of the two columns.

> At the moment, you could do something similar by combining GNU Datamash
> with GNU Awk:
```
$ awk '{print $1 * $2}' /tmp/data.txt | datamash sum 1
```
Or you could do it all in gawk if you want:
```
$ awk '{sum += $1 * $2} END{print sum}' /tmp/data.txt
```
But I think doing it all in GNU Datamash allows a more intuitive command:
```
$ datamash -W dotprod 1:2 < /tmp/data.txt
```
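As a small worked example of what such an operation computes, here is a sketch using only awk. The function name `dotprod` and the sample numbers are made up for illustration; this is not the proposed Datamash implementation, just the same arithmetic on whitespace-separated input:

```shell
# Hypothetical helper: dot product of columns 1 and 2 read from stdin.
# "sum + 0" makes sure a number is printed even for empty input.
dotprod() {
    awk '{ sum += $1 * $2 } END { print sum + 0 }'
}

# Sample data (made up): 1*4 + 2*5 + 3*6 = 32
printf '1 4\n2 5\n3 6\n' | dotprod
```

With the three sample rows above this prints 32.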

> A proposed implementation is attached. Please let me know if you see any
> problems with it.

I looked at the diff and did not see any obvious problems. I do
not see a reason not to add that operation either.

> If this looks good, then it should be trivial to also add a weighted
> mean. That will just be like the dot product except for dividing the
> result by one of the column sums. (But which column should be preferred
> for that? Maybe need to pass an extra option?)

It might suffice to always divide by the sum of the first column given,
if the code keeps the order of the given fields. I think it does,
but I did not verify this.

This would allow using "weighted_mean 1:2" or "weighted_mean 2:1"
to divide by the sum of column 1 or column 2, respectively.
("weighted_mean" is just a placeholder, of course; I just needed
some name to illustrate the idea.)
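
To illustrate the "divide by the sum of the first given column" idea, here is a rough awk sketch. The shell function `weighted_mean` is only a stand-in with the placeholder name from above, it takes the two column numbers as separate arguments rather than Datamash's `1:2` syntax, and the sample data is made up:

```shell
# Hypothetical helper: weighted_mean W V
# Computes the dot product of columns W and V, divided by the sum of column W,
# i.e. the first column named supplies the weights.
weighted_mean() {
    awk -v w="$1" -v v="$2" '
        { dot += $w * $v; wsum += $w }
        END {
            if (wsum == 0) exit 1    # no input or zero total weight
            print dot / wsum
        }
    '
}

# Sample data (made up): column 1 = 1, 3 and column 2 = 10, 30
printf '1 10\n3 30\n' | weighted_mean 1 2   # (1*10 + 3*30) / (1 + 3)   = 25
printf '1 10\n3 30\n' | weighted_mean 2 1   # (10*1 + 30*3) / (10 + 30) = 2.5
```

Swapping the two arguments changes only the divisor, so no extra option would be needed to pick the weight column under this convention.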
Br,
Erik