From: Erik Auerswald
Subject: Re: [Bug-datamash] Unable to use comma as field separator when the locale uses it as a decimal separator too
Date: Thu, 14 Jul 2022 14:41:55 +0200

Hi,

On Wed, Sep 12, 2018 at 05:42:09PM +0200, Jérémie Roquet wrote:
> [...]
> With datamash using the locale to determine which decimal separator to
> use, the behavior becomes inconsistent when the field separator and
> the decimal separator are the same.
> 
> For example:
> 
>   $ printf '1,2\n' | LC_NUMERIC=fr_FR.UTF-8 datamash -t, sum 1
>   datamash: invalid numeric value in line 1 field 1: '1'
>   $ printf '1,2\n' | LC_NUMERIC=fr_FR.UTF-8 datamash -t, sum 2
>   2
> 
> whereas:
> 
>   $ printf '1,2\n' | LC_NUMERIC=en_US.UTF-8 datamash -t, sum 1
>   1
>   $ printf '1,2\n' | LC_NUMERIC=en_US.UTF-8 datamash -t, sum 2
>   2

I can reproduce this with a German locale and current datamash, too:

    $ echo $LC_NUMERIC
    de_DE.UTF-8
    $ ./datamash --version | head -n1
    datamash (GNU datamash) 1.7.34-4187
    $ echo 1,2 | ./datamash -t, sum 1
    ./datamash: invalid numeric value in line 1 field 1: '1'
    $ echo 1,2 | ./datamash -t, sum 2
    2

Not all operations are affected:

    $ echo 1,2 | ./datamash -t, count 1
    1
    $ echo 1,2 | ./datamash -t, count 2
    1

This issue affects using a period ('.') as field separator in locales
where a period is used as decimal point, too:

    $ echo 1.2 | env LC_ALL=C ./datamash -t. sum 1
    ./datamash: invalid numeric value in line 1 field 1: '1'
    $ echo 1.2 | env LC_ALL=C ./datamash -t. sum 2
    2

    $ echo 1.2 | env LC_NUMERIC=en_US.UTF-8 ./datamash -t. sum 1
    ./datamash: invalid numeric value in line 1 field 1: '1'
    $ echo 1.2 | env LC_NUMERIC=en_US.UTF-8 ./datamash -t. sum 2
    2

> In my opinion, this is surprising because:
>  - working on “raw” tabular data, you don't expect “smart” handling of
> anything (that would have been less surprising from Microsoft Excel,
> because it implements escaping of separators in formats like CSV);
>  - parsing works well for the last column, but not for the others;
>  - the error message reports a well-parsed numeric value, not an
> obviously invalid one.

I concur that the error message does not make sense.

I think that this is a bug.  I have not yet looked at the code regarding
this issue, thus I cannot say how much effort it would take to fix it.
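
My guess at the mechanism, as a sketch only and not taken from the
datamash source: if a field is converted with a strtod()-style function
that greedily consumes the longest numeric prefix of the line buffer,
and the result is accepted only when the consumed length equals the
field length, then a decimal point that doubles as field separator makes
the parser run past the end of every field but the last.  The Python
model below mimics that assumption; numeric_prefix_len() and
field_is_numeric() are hypothetical names, and the regular expression
only approximates what strtod() accepts.

```python
import re


def numeric_prefix_len(text, decimal_point):
    """Length of the longest numeric-looking prefix of text.

    Rough stand-in for how far a strtod()-style parser would read in a
    locale whose decimal separator is decimal_point.
    """
    dp = re.escape(decimal_point)
    pattern = r"[+-]?(?:\d+(?:" + dp + r"\d*)?|" + dp + r"\d+)(?:[eE][+-]?\d+)?"
    match = re.match(pattern, text)
    return match.end() if match else 0


def field_is_numeric(line, sep, index, decimal_point):
    """Model of the assumed check for the 1-based field `index` of `line`."""
    fields = line.split(sep)
    # Offset of the field inside the unsplit line buffer.
    start = sum(len(f) + len(sep) for f in fields[:index - 1])
    field_len = len(fields[index - 1])
    # Parse from the field start, but without stopping at the separator.
    consumed = numeric_prefix_len(line[start:], decimal_point)
    return consumed > 0 and consumed == field_len


if __name__ == "__main__":
    cases = [("1,2", ",", ","), ("1,2", ",", "."), ("1.2", ".", ".")]
    for line, sep, dp in cases:
        for index in (1, 2):
            ok = field_is_numeric(line, sep, index, dp)
            verdict = "ok" if ok else "invalid numeric value"
            print(f"line={line!r} sep={sep!r} decimal={dp!r} "
                  f"field={index}: {verdict}")
```

Under this model, field 1 of '1,2' is rejected when ',' is the decimal
separator, because three characters are consumed for a one-character
field, while field 2 passes because nothing follows it.  That would
match the observed results, and it would also explain why the error
message shows the harmless-looking '1'.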

I would say that the best interpretation of using the decimal separator
as field separator is that the input should then not be parsed as
floating point numbers at all (the intent might be to split the input
into two integers, or one might be working with a list of IPv4
addresses).
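
To illustrate that interpretation, here is a minimal Python sketch (not
datamash code; parse_field() is a hypothetical helper): when the field
separator equals the locale's decimal separator, a field can never
contain a decimal separator, so it is parsed as an integer; otherwise
the decimal separator is honored.

```python
def parse_field(field, sep, decimal_point):
    """Parse one already-split field under the proposed interpretation."""
    if sep == decimal_point:
        # Splitting has already consumed every decimal point, so only
        # integer input can appear in a field.
        return int(field)
    if decimal_point != "." and "." in field:
        # Do not silently accept a period in a comma-decimal locale.
        raise ValueError(f"invalid numeric value: {field!r}")
    # Otherwise honor the locale's decimal separator.
    return float(field.replace(decimal_point, "."))


if __name__ == "__main__":
    # Comma as both field and decimal separator: two integers.
    print([parse_field(f, ",", ",") for f in "1,2".split(",")])
    # Period as both: the octets of an IPv4 address.
    print([parse_field(f, ".", ".") for f in "192.0.2.1".split(".")])
    # Distinct separators: the decimal comma still works.
    print(parse_field("1,5", "\t", ","))
```

With this rule, both fields of '1,2' would be summable in a locale with
a decimal comma, and localized floating point input would keep working
whenever a different field separator is used.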

> Assaf, you previously mentioned that you noted that “other GNU
> utilities seem to ignore current locale when parsing input and always
> accept decimal point” [1], and I'd argue that this would be the best
> default approach.
> 
> What do you think?
> [...]
> [1] https://lists.gnu.org/archive/html/bug-datamash/2016-08/msg00001.html

I am not Assaf, but I think that it would be problematic to radically
change input parsing to ignore locale settings.

External data often uses localized representations (e.g., bank
statements).  It can be quite useful to be able to work with the
original data.

Changing program behavior to no longer support such a use case, after
it did work for many years, seems like a bad idea to me.

Br,
Erik


