[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: [Qemudevel] [PATCH v6 08/15] qdist: add module to represent frequen
From: 
Emilio G. Cota 
Subject: 
Re: [Qemudevel] [PATCH v6 08/15] qdist: add module to represent frequency distributions of data 
Date: 
Wed, 8 Jun 2016 14:06:45 0400 
Useragent: 
Mutt/1.5.23 (20140312) 
On Wed, Jun 08, 2016 at 17:10:03 +0300, Sergey Fedorov wrote:
> On 08/06/16 03:02, Emilio G. Cota wrote:
> >  dist>entries = g_realloc(dist>entries,
> >  sizeof(*dist>entries) * (dist>n + 1));
> > + if (unlikely(dist>n == dist>size)) {
> > + dist>size = dist>size ? dist>size * 2 : 1;
>
> We could initialize 'dist>size' to 1 and allocate a 1entry
> 'dist>entries' array in qdist_init() to avoid this ternary operation ;)
Done. This resulted in quite a few modifications, since dist>entries == NULL
had been used as an equivalent to dist>n == 0.
> >>>> (snip)
> >> So our scale is not linear. I think some users might get confused by this.
> > That's correct. I think specialcasing 0 makes sense though, since
> > it increases the signaltonoise ratio of the histogram. For example:
> >
> > 1) 0 as ' ':
> > TB hash occupancy 31.84% avg chain occ. Histogram: [0,10)%▆ █
> > ▅▁▃▁▁[90,100]%
> > TB hash avg chain 1.015 buckets. Histogram: 1█▁▁3
> >
> > 2) 0 as '1/8':
> > TB hash occupancy 32.07% avg chain occ. Histogram:
> > [0,10)%▆▁█▁▁▅▁▃▁▁[90,100]%
> > TB hash avg chain 1.015 buckets. Histogram: 1▇▁▁3
> >
> > I think in these examples most users would be less confused by 1) than by
> > 2).
>
> I was meaning to represent all bars whose value < 1/8 as a space, not
> only whose value is pure zero. Otherwise we can see 1/8 bar where the
> actual value is negligibly differ from zero as in the second example.
I see. That would be (3):
TB hash occupancy 32.79% avg chain occ. Histogram: [0,10)%▅ █ ▅ ▂
[90,100]%
TB hash avg chain 1.017 buckets. Histogram: 1█ 3
I still think (1) is the representation that gives the most information.
IMO it's valuable that "close to zero" and "zero" are represented differently,
in the same way that max and "close to max" are represented differently
as well (only max gets 8/8).
BTW, while looking into this I fixed a bug; sometimes we'd print 7/8 instead
of 8/8 for the max value, due to the ordering of FP computations
[see 13 in (2) above; it's 7/8 instead of 8/8]. Fixed with:
diff git a/util/qdist.c b/util/qdist.c
index cfe09e6..7842d34 100644
 a/util/qdist.c
+++ b/util/qdist.c
@@ 103,7 +103,7 @@ static const gunichar qdist_blocks[] = {
*/
static char *qdist_pr_internal(const struct qdist *dist)
{
 double min, max, step;
+ double min, max;
GString *s = g_string_new("");
size_t i;
@@ 131,16 +131,14 @@ static char *qdist_pr_internal(const struct qdist *dist)
}
}
 /* floor((count  min) * step) will give us the block index */
 step = (QDIST_NR_BLOCK_CODES  1) / (max  min);

for (i = 0; i < dist>n; i++) {
struct qdist_entry *e = &dist>entries[i];
int index;
/* make an exception with 0; instead of using block[0], print a space
*/
if (e>count) {
 index = (int)((e>count  min) * step);
+ /* divide first to avoid loss of precision when e>count == max */
+ index = (e>count  min) / (max  min) * (QDIST_NR_BLOCK_CODES 
1);
g_string_append_unichar(s, qdist_blocks[index]);
} else {
g_string_append_c(s, ' ');
I also added a test to testqdist (called "test_bin_precision") that
checks for this.
> > The behaviour isn't the same though. With this we have
> > that the two outer bins (leftmost and rightmost) are unnecessarily
> > large (since they're out of the range of the input data).
> >
> > For example, assume the data is between 0 and 100 and n=5 (i.e. step=25),
> > it makes no sense to report the first bin as [12.5,12.5). If we
> > then truncate the unnecessary edges, we'd have [0,12.5), but
> > then the second bin is [12.5,37.5). Bins of unequal size are
> > possible (although a bit unusual) in histograms, but given
> > our Unicodebased representation, we're limited to samewidth bars.
>
> That is why I noted that I'm not sure what is the most correct from
> mathematical point of view. Maybe consider the second option? I.e.
> rounding to the middle of each bin with:
>
> x = left + step / 2;
>
> which would give the picture like this:
>
>
> xmin [*****] xmax  from
>     
> \ / \ / \ / \ /
>    
> V V V V
> [* * * *]  to
This binning is equivalent to what we do right now.
The only difference is where the value is set (either at the left
of the bin, or at the center as above); this, however, isn't too
important, since this value is only used when printing
the labels, i.e. we could print [left, left+step) or
[centerstep/2, center+step/2) and still get the same results.
> Anyway, you may consider if you like whether it's possible to apply some
> simplifications from my code to the final version.
OK. This is how it looks like:
diff git a/util/qdist.c b/util/qdist.c
index 7842d34..3ca2227 100644
 a/util/qdist.c
+++ b/util/qdist.c
@@ 163,7 +163,7 @@ void qdist_bin__internal(struct qdist *to, const struct
qdist *from, size_t n)
{
double xmin, xmax;
double step;
 size_t i, j, j_min;
+ size_t i, j;
qdist_init(to);
@@ 194,7 +194,7 @@ void qdist_bin__internal(struct qdist *to, const struct
qdist *from, size_t n)
}
rebin:
 j_min = 0;
+ j = 0;
for (i = 0; i < n; i++) {
double x;
double left, right;
@@ 210,19 +210,13 @@ void qdist_bin__internal(struct qdist *to, const struct
qdist *from, size_t n)
* To avoid doublecounting we capture [left, right) ranges, except for
* the righmost bin, which captures a [left, right] range.
*/
 for (j = j_min; j < from>n; j++) {
+ while (j < from>n &&
+ (from>entries[j].x < right 
+ (i == n  1 && from>entries[j].x == right))) {
struct qdist_entry *o = &from>entries[j];
 /* entries are ordered so do not check beyond right */
 if (o>x > right) {
 break;
 }
 if (o>x >= left && (o>x < right 
 (i == n  1 && o>x == right))) {
 qdist_add(to, x, o>count);
 /* don't check this entry again */
 j_min = j + 1;
 }
+ qdist_add(to, x, o>count);
+ j++;
}
}
}
Thanks,
Emilio
 Re: [Qemudevel] [PATCH v6 08/15] qdist: add module to represent frequency distributions of data, (continued)