[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: Possible bug when field-separator used
From: |
Erik Auerswald |
Subject: |
Re: Possible bug when field-separator used |
Date: |
Mon, 4 Dec 2023 13:55:14 +0100 |
Hi,
On Mon, Dec 04, 2023 at 01:31:45PM +0100, Jeroen Hoek wrote:
> On Thu, Nov 30, 2023 at 08:49:44PM +0000, Tim Rice wrote:
>
> > I agree that when there are interactions between the locale and
> > the field separator, the input should be considered faulty. In
> > this case, do you think it would be sufficient to simply report an
> > error and stop further processing?
>
> I think there are three scenarios:
>
> If (and only if) the chosen field separator matches
> LC_NUMERIC.decimal_point:
>
> * Fail even before parsing anything, explain to user about locale
> issue and how to call datamash with LC_NUMERIC set.
>
> Downside: annoying if the input doesn't even use decimal points. It
> is helpful in pointing out the datamash adheres to LC_NUMERIC
> though.
Then there is the problem that not only decimal separator as field
separator can result in this kind of problem, but also using some ASCII
letters as field separators. This is less common (most likely rather
uncommon), but I would prefer it if this could be handled as well.
> * If a line fails, explain to user like above.
>
> Would only work if it is possible to detect the cause of the error.
> Telling the user that datamash may have failed because of the user's
> locale setting is useful though.
I'd expect that to be possible without too much work. Currently, any
strtold() problem results in the same error code FLOCR_INVALID_NUMBER.
The locale-specific problems relate to strtold() parsing over the field
separator. This is caught by "endptr!=(str+slen)" and could result in
some more specific error message.
See src/field-ops.c function field_op_collect().
> * Disable parsing of numeric points.
>
> The last option probably causes the least amount of surprises. If a
> user specifies a field separator, they tend to know that it is
> suitable for the data they are inputting.
I'd expect a CSV file with integers to be usable even if the locale's
decimal separator is the same as the field separator used for the
CSV file.
We could also always use the code inside the "#ifdef HAVE_BROKEN_STRTOLD"
all the time, or use something alike as a fallback in case of the
"endptr!=(str+slen)" error condition after strtold().
> On 01-12-2023 09:16, Erik Auerswald wrote:
>
> > I had added a few failing tests, commented out, as documentation
> > for the issues, to the file "tests/datamash-tests-2.pl"
> > in commit 2b181b38e86f067fa68a7b174171a24e80313361
> > <https://git.savannah.gnu.org/gitweb/?p=datamash.git;a=blob;f=tests/datamash-tests-2.pl;h=3d09c19138f6a83d19074c4aa8c122964f4b24ea;hb=HEAD#l692>.
>
> > On Thu, Nov 30, 2023 at 08:49:44PM +0000, Tim Rice wrote:
> >
> > > If you have some cycles to contribute, it would help me get
> > > started if there was a test available. The tests are written in
> > > perl, so you may find it fairly easy to add a new one, by copying
> > > from another test that already exists. It would be great to see
> > > a patch which implements a test for whether GNU Datamash handles
> > > this faulty data correctly.
>
> Looking at the disabled tests, ifdl12 and ifdl13 address this bug
> for the C (or en_*, etc.) locale.
>
> Unfortunately to add the other relevant test cases (like commas), it
> would be necessary to somehow set the `decimal_point` value taken
> from LC_NUMERIC. I don't know how that can be done without depending
> on locales which might not be available for the build though.
There is one internationalization test file: "tests/datamash-i18n-de.pl".
This has a few commented-out tests for this issue.
Best regards,
Erik