Hi Erik,
Good catch on the locale: it is indeed part of the reproduction case.
I had not considered that.
Works:
echo -e "a,14,1\nb,1,14\na,2,1" | \
LC_NUMERIC=en_GB.UTF-8 \
datamash --field-separator=, -s groupby 1 sum 2,3
Fails:
echo -e "a,14,1\nb,1,14\na,2,1" | \
LC_NUMERIC=nl_NL.UTF-8 \
datamash --field-separator=, -s groupby 1 sum 2,3
And indeed, in Dutch as in German, the decimal separator is the comma.
Using . as the field separator (and in the input data) reverses this:
en_GB fails and nl_NL works.
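For anyone who wants to check which decimal separator a given locale uses, here is a minimal sketch. It assumes the POSIX "locale" utility is available. decimal_point_of is a hypothetical helper name of my own, not something GNU Datamash provides.

```shell
# decimal_point_of is a hypothetical helper, not part of GNU Datamash.
# It prints the decimal separator of the locale named in $1, using the
# POSIX "locale" utility. If that locale is not installed, the lookup
# typically falls back to the C locale, which uses ".".
decimal_point_of() {
    LC_ALL="$1" locale decimal_point 2>/dev/null
}

# Compare the locales from the examples above with the C locale.
for loc in C en_GB.UTF-8 nl_NL.UTF-8; do
    printf '%s: %s\n' "$loc" "$(decimal_point_of "$loc")"
done
```

On a system with both locales installed I would expect "." for en_GB.UTF-8 and "," for nl_NL.UTF-8, which matches the works/fails pattern above.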
As an end user, I wouldn't mind if a decimal separator that happens to
match the specified field separator were not interpreted as a decimal
separator at all. I would consider such input faulty. (The same goes for
periods in the relevant locales, I suppose.)
Thanks for looking into this.
On 30-11-2023 14:22, Erik Auerswald wrote:
Hi Jeroen,
I think this is an interaction between the locale support of GNU Datamash
and the way GNU Datamash parses numbers. You can work around it by
temporarily overriding the locale settings:
echo -e "a,14,1\nb,1,14\na,2,1" | \
LC_ALL=C datamash --field-separator=, -s groupby 1 sum 2,3
--> a,16,2
--> b,1,14
The problem occurs as soon as the second column is summed over:
echo -e "a,14,1\nb,1,14\na,2,1" | \
datamash --field-separator=, -s groupby 1 sum 2
--> datamash: invalid numeric value in line 1 field 2: '14'
The root cause is that GNU Datamash uses the locale settings for parsing
its input, and thus treats ',' as the decimal separator in some locales
(e.g., in the de_DE.UTF-8 locale). This conflicts with using ',' as the
field separator.
I have not looked into the code and thus do not know how involved it
would be to fix this. (I do think this is a bug.)
Best regards,
Erik