rapp-dev
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Rapp-dev] Pixelwise threshold and conditional pixel operations path


From: Hans-Peter Nilsson
Subject: Re: [Rapp-dev] Pixelwise threshold and conditional pixel operations pathes
Date: Fri, 27 May 2016 07:25:54 +0200

> From: Willie Betschart <address@hidden>
> Date: Wed, 4 May 2016 08:54:01 +0000

> Content-Type: text/x-patch; name="cond-pixop-add-v3.patch"

Thanks.  Similarly to the other patch, committed with minor
changes.  For the record:

> diff --git a/benchmark/rapp_benchmark.c b/benchmark/rapp_benchmark.c

>  static void
> +rapp_bmark_exec_cond_u8(int (*func)(), const int *args)
> +{
> +    const rapp_bmark_data_t *data = &rapp_bmark_data;
> +    int                      idx  = args[0];
> +    (*func)(data->dst, data->dim_u8,
> +            data->src[idx], data->dim_bin,
> +            data->width, data->height, args[1]);
> +}
> +
> +static void
>  rapp_bmark_exec_cond_set_u8(int (*func)(), const int *args)
>  {
>      const rapp_bmark_data_t *data = &rapp_bmark_data;
>      int                      idx  = args[0];
>      (*func)(data->dst, data->dim_u8,
>              data->src[idx], data->dim_bin,
> -            data->width, data->height, 0);
> +            data->width, data->height, args[1]);
> +}
> +
> +static void
> +rapp_bmark_exec_cond_u8_u8(int (*func)(), const int *args)
> +{
> +    const rapp_bmark_data_t *data = &rapp_bmark_data;
> +    int                      idx  = args[0];
> +    (*func)(data->dst, data->dim_u8,
> +            data->set, data->dim_u8,
> +            data->src[idx], data->dim_bin,
> +            data->width, data->height, args[1]);
>  }

Looks like you added rapp_bmark_exec_cond_u8_u8 and
rapp_bmark_exec_cond_u8, but *also* changed
rapp_bmark_exec_cond_set_u8 to be identical to the new
rapp_bmark_exec_cond_u8.  That must be a thinko, so I just
dropped the new rapp_bmark_exec_cond_u8 and used cond_set_u8.

> diff --git a/compute/backend/rc_vec_neon.h b/compute/backend/rc_vec_neon.h

> +#define RC_VEC_SETMASKV(vec, maskv)                      \
> +do {                                                     \
> +    rc_vec_t v_, andv_;                                  \
> +    rc_vec_t mask_ = (rc_vec_t){1<<0, 1<<1, 1<<2, 1<<3,  \
> +                                1<<4, 1<<5, 1<<6, 1<<7}; \
> +    uint8_t indx0_ = vget_lane_u8(maskv, 0);             \
> +    RC_VEC_SPLAT(v_, indx0_);                            \
> +    RC_VEC_AND(andv_, v_, mask_);                        \
> +    (vec) = vceq_u8(andv_, mask_);                       \
> +} while (0)

Any particular reason to not use vtst instead of the and + vceq?
(If so, that should be expressed in a comment here.)
Reality (in the form of ARTPEC-6) seemed to fit the NEON
documentation I found so that's what I committed.

> diff --git a/compute/generic/rc_cond.c b/compute/generic/rc_cond.c

> +#define RC_COND_PIXOP_TEMPLATE(op1, pixop, arg1, arg2, mask)  \

The arg2 was unused so I dropped that.
(Too much copy-pasting from rc_pixop.c I'm afraid; if we *need*
that argument we'll add it later.)

> +#define RC_COND_WORD_TEMPLATE(dst, pixop, arg1, arg2, word)              \

> +                unsigned d1_ = RC_WORD_EXTRACT(dst32_,  0, 8);           \
> +                unsigned d2_ = RC_WORD_EXTRACT(dst32_,  8, 8);           \
> +                unsigned d3_ = RC_WORD_EXTRACT(dst32_, 16, 8);           \
> +                unsigned d4_ = RC_WORD_EXTRACT(dst32_, 24, 8);           \
> +                                                                         \
> +                /* Apply pixop. */                                       \
> +                pixop(d1_, arg1, arg2);                                  \
> +                pixop(d2_, arg1, arg2);                                  \
> +                pixop(d3_, arg1, arg2);                                  \
> +                pixop(d4_, arg1, arg2);                                  \
> +                                                                         \
> +                *d32_ = RC_WORD_INSERT(d1_,  0, 8) |                     \
> +                        RC_WORD_INSERT(d2_,  8, 8) |                     \
> +                        RC_WORD_INSERT(d3_, 16, 8) |                     \
> +                        RC_WORD_INSERT(d4_, 24, 8);                      \

I changed these 4-bytes-expansions into loops.  GCC will unroll
that and nobody should have to see the repetitions.

> +#if RC_IMPL(rc_cond_addc_u8, 1)
> +void
> +rc_cond_addc_u8(uint8_t *restrict dst, int dst_dim,
> +                const uint8_t *restrict map, int map_dim,
> +                int width, int height, int value)
> +{
> +    if (value > 0) {
> +        /* Positive value - use ADDS */
> +        RC_COND_TEMPLATE(dst, dst_dim, map, map_dim, width, height,
> +                         RC_PIXOP_ADDS, value, 0);
> +    }
> +    else if (value < 0) {
> +        /* Negative value - use SUBS */
> +        value = -value;
> +        RC_COND_TEMPLATE(dst, dst_dim, map, map_dim, width, height,
> +                         RC_PIXOP_SUBS, value, 0);
> +    }
> +}
> +#endif

The condition is better put in the driver when choosing between
different vector functions, so I separated these into
rc_cond_addc and rc_cond_subc.  Not that I expect different
tunings between adds and subs, but it's one less thing to handle
in the innermost function.

> diff --git a/compute/vector/rc_cond.c b/compute/vector/rc_cond.c
> +#define RC_PIXOP_COPY(vec1, vec2, arg1, arg2) \
> +    ((vec1) = (vec2))

Similar confusing abundance of unused arguments.  Right,
copy-pasto; I did the same for rc_type.c, now fixed.

> +#define RC_COND_SINGLE_ITER_MAX(max, buf, map, j, i, pixop, \
> +                                arg1, arg2, arg3)           \
> +do {                                                        \
> +    rc_vec_t mv_;                                           \
> +    int k_, cnt_;                                           \
> +    RC_VEC_LOAD(mv_, &(map)[(i)]);                          \
> +    RC_COND_COUNT(cnt_, mv_);                               \
> +    if (cnt_ > 0) {                                         \
> +        for (k_ = 0; k_ < max; k_++, (j) += RC_VEC_SIZE) {  \
> +            rc_vec_t dv_, sv_, tv_;                         \
> +            rc_vec_t exp_mv_, cdv_, cv1_, cv2_;             \
> +                                                            \
> +            /* Run standard pixop. */                       \
> +            RC_PIXOP_ITER(buf, j, sv_, cdv_,                \
> +                          pixop, arg1, arg2, arg3);         \
> +                                                            \
> +            /* Conditional part. */                         \
> +            RC_VEC_SETMASKV(exp_mv_, mv_);                  \
> +            RC_VEC_ANDNOT(cv1_, cdv_, exp_mv_);             \
> +            RC_VEC_AND(cv2_, sv_, exp_mv_);                 \
> +            RC_VEC_OR(dv_, cv1_, cv2_);                     \
> +            RC_VEC_SHLC(tv_, mv_, RC_VEC_SIZE / 8);         \
> +            mv_ = tv_;                                      \
> +            RC_VEC_STORE(&(buf)[(j)], dv_);                 \
> +        }                                                   \
> +    }                                                       \
> +    else {                                                  \
> +        (j) += (max) * RC_VEC_SIZE;                         \
> +    }                                                       \
> +    (i) += RC_VEC_SIZE;                                     \
> +} while (0)

Expanding the conditional using the c=(a & m) | (b & ~m) idiom
certainly works everywhere, but maybe it'd be a good idea to
also add some vector macro that picks one of two vectors, or
maybe a masked write; maybe something combining the above
SETMASKV+ANDNOT+AND+OR+STORE and also letting the backend decide
whether it's worthwhile to check for zero mask (it certainly is
with the above expansion).

Also, I thought it'd be better to do the unrolling at the above
*innermost* level rather than at this level just outside:

> +#define RC_COND_PIXOP_TEMPLATE(dst, dst_dim, map, map_dim,                  \
> +                               width, height, pixop, arg1, arg2, arg3,      \
> +                               unroll)                                      \
> +do {                                                                        \
> +    /* We use the total number of destination vectors as the base. */       \
> +    int tot_ = RC_DIV_CEIL((width), RC_VEC_SIZE * 8 / 8);                   \
> +                                                                            \
> +    /* We count whole source vectors for unrolling. */                      \
> +    int len_ = tot_ / (8 * unroll);                                         \
> +    int rem_ = tot_ % (8 * unroll);                                         \
> +    int y_;                                                                 \

...but after benchmarking that, it was mostly a wash, except for
*one* newly added vector implementation where that was
noticeably worse.  Reality strikes again.

> +/* Verify if the general condition operations are supported. */
> +#if defined RC_VEC_SETMASKV && defined RC_VEC_ANDNOT \
> +    && defined RC_VEC_AND   && defined RC_VEC_OR     \
> +    && defined RC_VEC_SHLC

No, each #if (def) should sit as a guard around only the macro
(or function) using exactly those macros.  Thus better #if
define RC_COND_PIXOP_TEMPLATE.

> diff --git a/include/rapp_cond.h b/include/rapp_cond.h
> index 7ac25d7..2601309 100644
> --- a/include/rapp_cond.h
> +++ b/include/rapp_cond.h

Missing an update to the section that said that only set and
copy was implemented.

Actually, it should no longer be called "Conditional set and
copy".  Fixed, I hope ("Single Conditional Operations").  That
subsection seemed to at one time have been called "Conditional
Operations" only to be renamed, with the main section (including
scatter/gather) named "Conditional Operations".

> @@ -89,6 +89,25 @@ rapp_cond_set_u8(uint8_t *restrict dst, int dst_dim,
>                   int width, int height, unsigned value);
>  
>  /**
> + *  Add signed constant conditionally.
> + *  Computes buf[i] = buf[i] + value, where map[i] is set.
> + *  The result is saturated.

> + *
> + *  @param[out] dst      Destination pixel buffer.
> + *  @param      dst_dim  Row dimension of the destination buffer.
> + *  @param[in]  map      Binary map pixel buffer.
> + *  @param      map_dim  Row dimension of the binary map buffer.
> + *  @param      width    Image width in pixels.
> + *  @param      height   Image height in pixels.
> + *  @param      value    Add signed constant.

(To terse, range missing and also should describe the parameter,
not the operation.)

brgds, H-P



reply via email to

[Prev in Thread] Current Thread [Next in Thread]