bug-datamash
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

New bugs discovered when adding vnlog support


From: Dima Kogan
Subject: New bugs discovered when adding vnlog support
Date: Mon, 04 Jul 2022 17:21:25 -0700
User-agent: mu4e 1.6.10; emacs 29.0.50

Hi.

I just looked into adding tests for the vnlog support. Almost done. In
the process I found a few bugs that we should fix.

I'm using the latest datamash in version control. I have /tmp/data:

x y
1 0.5
2 1
3 1.5
4 2


A crosstab operation does this:

  $ < /tmp/data ./datamash -W --header-in --header-out crosstab 1,2

  GroupBy(x)    GroupBy(y)      count(x)
          0.5   1       1.5     2
  1     1       N/A     N/A     N/A
  2     N/A     1       N/A     N/A
  3     N/A     N/A     1       N/A
  4     N/A     N/A     N/A     1

Here "1, 2, 3, 4" labels the rows and "0.5, 1, 1.5, 2" labels the
columns. These are present regardless of --header-out. The "GroupBy()"
stuff is added by --header-out, and doesn't label the rows or the
columns. There are always 3 of these. These appear to label the crosstab
y axis, the crosstab x axis, and the operation respectively. This is
very confusing because normally --header-out produces labels for each
column. Proposal: if "--header-out -C" is given, put the "GroupBy" stuff
into a comment, with more text describing what they mean. If
"--header-out" is given without -C, don't produce this output at all.

Next. If I tweak this command to ask for the field by name (which should
workd because we passed --header-in), then this happens instead:

  $ < /tmp/data ./datamash -W --header-in --header-out crosstab x,y

  datamash: src/column-headers.c:62: get_input_field_name: Assertion `field_num 
> 0 && field_num <= num_input_column_headers' failed.
  zsh: IOT instruction  ./datamash -W --header-in --header-out crosstab x,y < 
/tmp/data


Next. There's an issue with the covariance operations. Here I compute
Cov(x,x) and Cov(y,y) and Cov(x,y):

  $ < /tmp/data ./datamash --header-in --header-out -C -W pcov x:x

  pcov(x)
  1.25


  $ < /tmp/data ./datamash --header-in --header-out -C -W pcov y:y

  pcov(y)
  0.3125


  $ < /tmp/data ./datamash --header-in --header-out -C -W pcov x:y

  pcov(y)
  0.625

Note the field label. For the self-covariance, it maybe makes sense. But
for the cross-covariance, it's definitely wrong: Cov(x,y) cannot be
labelled "pcov(y)".


Finally, there's an unexpected implementation detail with "reverse" and
"rmdup" that could be described as a bug. In all other modes, passing
--header-out ends up calling the print_column_headers() function before
writing out the data. But the "reverse" and "rmdup" have a separate code
path, which doesn't work the same way. This is a likely source of bugs.
I haven't looked very hard, but I just found one: "rmdup" appears to
ignore the output delimiter:

  $ < /tmp/data ./datamash --output-delimiter=: -H -W rmdup x

  x y
  1 0.5
  2 1
  3 1.5
  4 2

Probably more bugs exist, and the extra code paths should be eliminated
to be sure.

Thanks for maintaining datamash!



reply via email to

[Prev in Thread] Current Thread [Next in Thread]