Re: GNU Lightning 2.2.1 release


From: Paulo César Pereira de Andrade
Subject: Re: GNU Lightning 2.2.1 release
Date: Sat, 18 Feb 2023 16:48:00 -0300

On Sat, 18 Feb 2023 at 13:44, Paul Cercueil
<paul@crapouillou.net> wrote:
>
> On Saturday, 18 February 2023 at 13:24 -0300, Paulo César Pereira de Andrade
> wrote:
> > On Sat, 18 Feb 2023 at 11:40, Paul Cercueil
> > <paul@crapouillou.net> wrote:
> > >
> > > On Saturday, 18 February 2023 at 11:07 -0300, Paulo César Pereira de
> > > Andrade wrote:
> > > > On Sat, 18 Feb 2023 at 09:29, Paul Cercueil
> > > > <paul@crapouillou.net> wrote:
> > > > >
> > > > > Hi Paulo,
> > > >
> > > >   Hi Paul,
> > > >
> > > > > > On Friday, 17 February 2023 at 16:23 -0300, Paulo César Pereira de
> > > > > > Andrade wrote:
> > > > > > GNU lightning 2.2.1 released!
> > > > > >
> > > > > > GNU lightning is a library to aid in making portable programs
> > > > > > that compile assembly code at run time.
> > > > > >
> > > > > > Development:
> > > > > > http://git.savannah.gnu.org/cgit/lightning.git
> > > > > >
> > > > > > Download release:
> > > > > > ftp://ftp.gnu.org/gnu/lightning/lightning-2.2.1.tar.gz
> > > > > >
> > > > > >   GNU Lightning 2.2.1 main new features:
> > > > > >
> > > > > > > o Variable stack framesize implemented for aarch64, arm, i686, mips,
> > > > > > >   riscv, loongarch and x86_64. This means function calls use only
> > > > > > >   the minimum required stack space for prolog and epilog.
> > > > > > > o Optimization of prolog and epilog to not create a frame pointer if
> > > > > > >   not required, and not even save and restore the stack pointer if
> > > > > > >   not required on a leaf function. These features are implemented for
> > > > > > >   the ports with variable stack framesize.
> > > > > > > o New clor, clzr, ctor and ctzr instructions, that count leading/trailing
> > > > > > >   zeros/ones. These use the hardware implementation when available,
> > > > > > >   otherwise fall back to a software implementation.
> > > > >
> > > > > > That's great. I actually had an alpha version of a patch that added
> > > > > > clzr but never finished it.
> > > > > >
> > > > > > I think you could add an extra one, clsr, "count leading sign bits".
> > > > > > The fallback should be very easy:
> > > > >
> > > > > jit_rshi(rn(tmp), r1, __WORDSIZE - 1);
> > > > > jit_xorr(rn(tmp), r1, rn(tmp));
> > > > > jit_clzr(r0, rn(tmp));
> > > >
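For illustration, the three-instruction fallback quoted above can be sketched in plain C. This is not Lightning code: the names clz64, cls64 and clrsb64 are hypothetical, a 64-bit word is assumed, and clz is assumed to return the word size for an input of zero. The arithmetic shift by __WORDSIZE - 1 is written as a conditional mask, because right-shifting a negative value is implementation-defined in C.

```c
#include <stdint.h>

#define WORD_BITS 64

/* Portable count-leading-zeros; returns WORD_BITS for an input of zero. */
static unsigned clz64(uint64_t x)
{
    unsigned n = 0;
    while (n < WORD_BITS && !(x & (UINT64_C(1) << (WORD_BITS - 1 - n))))
        n++;
    return n;
}

/* Count leading sign bits, including the sign bit itself.
 * Mirrors: tmp = r1 >> (WORD_BITS - 1); tmp ^= r1; r0 = clz(tmp). */
unsigned cls64(int64_t v)
{
    /* All ones for a negative value, zero otherwise: the same value an
     * arithmetic right shift by WORD_BITS - 1 would produce. */
    uint64_t mask = (v < 0) ? ~UINT64_C(0) : 0;
    return clz64((uint64_t)v ^ mask);
}

/* Variant that only counts the sign bits after the MSB, the convention
 * used by GCC's __builtin_clrsb() family. */
unsigned clrsb64(int64_t v)
{
    return cls64(v) - 1;
}
```

With these definitions cls64(0) and cls64(-1) both give 64, and clrsb64 gives 63 for the same inputs.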
> > > >   Yes. Fallback is simple. If I recall correctly, only arm64 has it
> > > > in hardware:
> > > >
> > > > https://developer.arm.com/documentation/dui0801/h/A64-General-Instructions/CLS
> > > >
> > > >   I used it in the first version of clor for aarch64 when experimenting
> > > > with the instruction, but it required a branch, so I changed it to just
> > > > invert the bits and use clz:
> > > > https://git.savannah.gnu.org/cgit/lightning.git/commit/?id=561eed91500f2a31ed9d4305c91940e742613ba8
> > > >
> > > > > Maybe adapted to only return the number of sign bits after the MSB to
> > > > > match GCC's __builtin_clrsb(), if it makes more sense.
> > > > >
> > > > > Speaking about fallbacks, the ones in place look very inefficient (e.g.
> > > > > the bit-swap to count trailing bits). I'm sure there are better
> > > > > algorithms; I'll have a look.
> > > >
> > > >   It is not even in jit_fallback.c. It is a version without lookup
> > > > tables or branches. I think the libgcc variants use lookup tables. This
> > > > is something to optimize.
> > >
> > > My point was that there are better ways to count trailing bits than
> > > bit-swapping.
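For readers unfamiliar with the bit-swap approach being discussed, here is a minimal C sketch: reverse the word with a fixed sequence of mask-and-shift swaps (no tables, no branches), then count leading zeros. It is only an illustration of the idea, assuming a 64-bit word; it is not the code Lightning emits, and the function names are hypothetical.

```c
#include <stdint.h>

/* Reverse the bit order of a 64-bit word with six swap steps. */
uint64_t bitrev64(uint64_t x)
{
    x = ((x >> 1)  & UINT64_C(0x5555555555555555)) | ((x & UINT64_C(0x5555555555555555)) << 1);
    x = ((x >> 2)  & UINT64_C(0x3333333333333333)) | ((x & UINT64_C(0x3333333333333333)) << 2);
    x = ((x >> 4)  & UINT64_C(0x0F0F0F0F0F0F0F0F)) | ((x & UINT64_C(0x0F0F0F0F0F0F0F0F)) << 4);
    x = ((x >> 8)  & UINT64_C(0x00FF00FF00FF00FF)) | ((x & UINT64_C(0x00FF00FF00FF00FF)) << 8);
    x = ((x >> 16) & UINT64_C(0x0000FFFF0000FFFF)) | ((x & UINT64_C(0x0000FFFF0000FFFF)) << 16);
    return (x >> 32) | (x << 32);
}

/* Portable count-leading-zeros; returns 64 for an input of zero. */
static unsigned clz64(uint64_t x)
{
    unsigned n = 0;
    while (n < 64 && !(x & (UINT64_C(1) << (63 - n))))
        n++;
    return n;
}

/* Trailing zeros of x are the leading zeros of the reversed word. */
unsigned ctz64_via_swap(uint64_t x)
{
    return clz64(bitrev64(x));
}
```

The reversal alone costs six rounds of shifts and masks before the count even starts, which is the overhead being questioned here.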
> >
> >   Sure. I just wanted to have it working, not fully optimized, in
> > the first version :) Optimized versions should use a lookup table
> > or some "magic" with float/double.
> >   There is also the comment in check/bit.c that says if the fallback
> > is used, it would be better to implement it as a function; then, it
> > just implements the fallbacks as jit functions.
> >   Using check/bit.tst is a good way to experiment with different
> > versions before converting them to C code. Just change the "#if 0"
> > to "#if 1", rewrite clo, clz, cto and ctz as appropriate, and
> > check the output to validate it is correct.

  Quick adaptation from
http://graphics.stanford.edu/~seander/bithacks.html#ZerosOnRightModLookup
.data ...
#if __WORDSIZE == 64
mod67:
.c    64 0 1 39 2 15 40 23 3 12 16 59 41 19 24 54 4 0 13 10 17 62 60
28 42 30 20 51 25 44 55 47 5 32 0 38 14 22 11 58 18 53 63 9 61 27 29
50 43 46 31 37 21 57 52 8 26 49 45 36 56 7 48 35  6 34 33
#else
mod37:
.c    32 0 1 26 2 23 27 0 3 16 24 30 28 11 0 13 4 7 17 0 25 22 31 15
29 10 12 6 0 21 14 9 5 20 8 19 18
#endif
...
.code
...
/*
    jit_uword_t ctz(jit_uword_t r1) {
#if __WORDSIZE == 32
        static const int mod37[] = {
        32,  0,  1, 26,  2, 23, 27,  0,  3, 16, 24, 30, 28, 11,  0, 13,
         4,  7, 17,  0, 25, 22, 31, 15, 29, 10, 12,  6,  0, 21, 14,  9,
         5, 20, 8, 19, 18
        };
        return mod37[(-r1 & r1) % 37];
#else
        static const int mod67[] = {
        64,  0,  1, 39,  2, 15, 40, 23,  3, 12, 16, 59, 41, 19, 24, 54,
         4,  0, 13, 10, 17, 62, 60, 28, 42, 30,  20, 51, 25, 44, 55, 47,
         5, 32,  0, 38, 14, 22, 11, 58, 18, 53, 63,  9, 61, 27, 29, 50,
        43, 46, 31, 37, 21, 57, 52,  8, 26, 49, 45, 36, 56,  7, 48, 35,
         6, 34, 33
        };
        return mod67[(-r1 & r1) % 67];
#endif
    }
 */
name    ctz
ctz:
    prolog
    arg $in
    getarg %r0 $in
    negr %r1 %r0
    andr %r0 %r0 %r1
#if __WORDSIZE == 32
    remi_u %r0 %r0 37
    movi %r1 mod37
#else
    remi_u %r0 %r0 67
    movi %r1 mod67
#endif
    ldxr_uc %r0 %r1 %r0
    retr %r0
    epilog
...
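The lookup can also be checked outside the jit. The following is a self-contained C sketch, assuming the same two tables as above stored as bytes (matching the ldxr_uc load); the function names and the self-check helper are hypothetical and not part of Lightning. Isolating the lowest set bit with x & -x leaves a power of two, and each power of two below the word size has a distinct remainder modulo 37 (32 bits) or 67 (64 bits), so the remainder indexes the table directly.

```c
#include <stdint.h>

/* table[2^k mod p] == k; table[0] is the word size (result for zero). */
static const unsigned char mod37[37] = {
    32,  0,  1, 26,  2, 23, 27,  0,  3, 16, 24, 30, 28, 11,  0, 13,
     4,  7, 17,  0, 25, 22, 31, 15, 29, 10, 12,  6,  0, 21, 14,  9,
     5, 20,  8, 19, 18
};
static const unsigned char mod67[67] = {
    64,  0,  1, 39,  2, 15, 40, 23,  3, 12, 16, 59, 41, 19, 24, 54,
     4,  0, 13, 10, 17, 62, 60, 28, 42, 30, 20, 51, 25, 44, 55, 47,
     5, 32,  0, 38, 14, 22, 11, 58, 18, 53, 63,  9, 61, 27, 29, 50,
    43, 46, 31, 37, 21, 57, 52,  8, 26, 49, 45, 36, 56,  7, 48, 35,
     6, 34, 33
};

unsigned ctz32_lookup(uint32_t x)
{
    uint32_t lowest = x & (uint32_t)(0u - x);   /* isolate lowest set bit */
    return mod37[lowest % 37u];
}

unsigned ctz64_lookup(uint64_t x)
{
    uint64_t lowest = x & (UINT64_C(0) - x);    /* isolate lowest set bit */
    return mod67[lowest % 67u];
}

/* Returns 1 if both lookups give the expected count for zero and for
 * every possible position of the lowest set bit, 0 otherwise. */
int ctz_lookup_selfcheck(void)
{
    if (ctz32_lookup(0) != 32 || ctz64_lookup(0) != 64)
        return 0;
    for (unsigned k = 0; k < 64; k++) {
        /* All ones shifted left by k: the lowest set bit is bit k. */
        if (ctz64_lookup(~UINT64_C(0) << k) != k)
            return 0;
        if (k < 32 && ctz32_lookup((uint32_t)(UINT32_C(0xFFFFFFFF) << k)) != k)
            return 0;
    }
    return 1;
}
```

The cost is one unsigned remainder plus one byte load, so whether this beats the bit-swap version depends on how expensive division is on the target.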

> > > >    It is also a good extension for extra Lightning instructions. At
> > > > least aarch64 and loongarch have a bit swap/invert instruction:
> > > > https://developer.arm.com/documentation/dui0801/h/A64-General-Instructions/RBIT
> > > > https://loongson.github.io/LoongArch-Documentation/LoongArch-Vol1-EN.html#_bitrev_wd
> > > >
> > > > > Also, you added SLL opcodes to "sign extend top 32 bits" on MIPS, but
> > > > > you do that if (__WORDSIZE == 32). What "top 32 bits" are we talking
> > > > > about there?
> > > >
> > > >   It is a SLL(r0, r1, 0) that is supposed to sign extend the value. I
> > > > do not have access to any mips release 6, so I did not test the
> > > > mips6_p() code variant.
> > >
> > > I tested MIPSr6 a few months ago and it didn't go very well; some
> > > instructions that Lightning emits changed (for instance, the LO/HI
> > > registers are gone, and all opcodes touching those changed).

  I see. I only looked up new documentation to find information
about hardware clz, clo, etc.

> >   Did you test in real hardware or qemu?
> >
> >   I might set up a qemu environment, but it would be far better to test
> > on real hardware. Qemu mips emulation was way too slow the last time
> > I tested...
>
> That was under Qemu; I don't have such hardware.
>
> As I'm using qemu-user it doesn't have to emulate the full system and
> the speed is quite OK.

  Ok. I will put it in my TODO to test and make it work with the changes
for mips release 6.

> > > > The documentation I did use (MD00087-2B-MIPS64BIS-AFP-6.06.pdf)
> > > > says:
> > > >
> > > > """
> > > > Format: CLO rd, rs                                 MIPS32
> > > > Purpose: Count Leading Ones in Word
> > > > To count the number of leading ones in a word.
> > > > ...
> > > > Restrictions:
> > > > Pre-Release 6: To be compliant with the MIPS32 and MIPS64
> > > > Architecture, software must place the same GPR number in both the
> > > > rt and rd fields of the instruction. The operation of the
> > > > instruction is UNPREDICTABLE if the rt and rd fields of the
> > > > instruction contain different values. Release 6’s new instruction
> > > > encoding does not contain an rt field.
> > > >
> > > > If GPR rs does not contain a sign-extended 32-bit value (bits 63..31
> > > > equal), then the results of the operation are UNPREDICTABLE.
> > > > """
> > >
> > > Yes, but in the case where __WORDSIZE == 32, bits 63..32 do not
> > > exist.
> > > Therefore the sign-extension does nothing.
> >
> >   The common case is a 32 bit OS in a 64 bit cpu. This is also how it
> > was tested. If the condition of a "true" 32 bit cpu can be detected,
> > then could add a jit_cpu_t flag to know about it, and omit the sign
> > extension.
>
> It doesn't matter whether the CPU is 32 or 64 bits: if you are only
> ever generating MIPS32 opcodes (aka. no DSLL etc., which is the case
> when __WORDSIZE == 32), then the upper 32 bits will always be sign-
> extended.

  Not always: when creating constants, the top 32 bits might have the
wrong sign extension, due to:

#  if __WORDSIZE == 32
#    define can_sign_extend_int_p(im)    1
#    define can_zero_extend_int_p(im)    1
#  else
...

It appears CLO and CLZ are special cases with respect to the top 32 bits.
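
To make the "bits 63..31 equal" requirement concrete, here is a small C model of a 64-bit MIPS register. It is a sketch under stated assumptions, not Lightning code: sll0 models what SLL rd, rs, 0 does on a 64-bit cpu (keep the low 32 bits and sign-extend them), and both function names are hypothetical.

```c
#include <stdint.h>

/* Model of SLL rd, rs, 0 on a 64-bit MIPS cpu: the low 32 bits of the
 * source are kept and bit 31 is copied into bits 63..32. */
uint64_t sll0(uint64_t reg)
{
    uint32_t low = (uint32_t)reg;
    if (low & UINT32_C(0x80000000))
        return UINT64_C(0xFFFFFFFF00000000) | low;
    return low;
}

/* True when the register already holds a sign-extended 32-bit value,
 * i.e. bits 63..31 are all equal, as CLO and CLZ require. */
int is_sign_extended32(uint64_t reg)
{
    return sll0(reg) == reg;
}
```

For example, a constant 0xFFFFFFFF that ended up zero-extended in the register (upper half all zeros) fails is_sign_extended32, and sll0 turns it into the all-ones 64-bit value that CLO expects.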

> > > >   I made the Lightning 2.2.1 release to publish several bug fixes, but
> > > > I hope to add extra bit manipulation instructions. At least:
> > > >
> > > > o bit invert
> > > > o popcount
> > > > o bit rotate
> > > >
> > > >   But there are several others that are useful, like ways to create
> > > > bit patterns for any kind of mask. These could at least be used
> > > > internally to create constants with repeated patterns.
> > > >
> > > >   If you have other suggestions for new instructions, please let me
> > > > know :)
> > >
> > > Honestly, apart from the "CLS" mentioned before and maybe popcount, I
> > > wouldn't have any use for these - in my particular use case anyway.
> > >
> > > I would maybe benefit from having "mask extract" and "mask insert"
> > > functions similar to EXT/INS on MIPS.
> > >
> > > But in general I like that Lightning is very RISC-like and I would
> > > avoid making it more complex adding instructions that would almost
> > > never be used.
> > >
> > > >   One such instruction could be "multiply and add", available in
> > > > several cpus.
> > > >
> > > >   In the long term, int128 and complex float/double could be added. I
> > > > would like to have them, but implementing them in all ports is not
> > > > trivial, and would require the concept of register pairs, currently
> > > > only barely used for qdiv/qmul, and only to hold the result pair, not
> > > > as input.
> > > >
> > > >   Maybe a way to inject machine code could also be added, just a memcpy
> > > > of a buffer. This would allow optimizations where lightning does not
> > > > generate good code: experiment with an assembler, then, when happy
> > > > with the code, inject it in the jit code.
> > >
> > > One thing somewhat related that would be very useful to me, is
> > > patchable jumps after code generation.
> > >
> > > Basically, if you emit:
> > >
> > > lbl = jit_jmpi();
> > > jit_patch_abs(lbl, my_fn);
> > >
> > > ...
> > > jit_emit();
> > > addr = jit_address(lbl);
> > >
> > > You would then be able to change the function called using something
> > > like:
> > >
> > > jit_patch_again(addr, my_other_fn);
> >
> >   It would be required to unmap and remap the code buffer.
> >
> >   Part of it is done in the example in check/protect.c. After
> > that, currently would need to manually patch it, basically copying
> > the _patch_at() specific to the architecture where it is implemented.
> >   If it is not really in some inner loop that needs to be as fast as
> > possible, could load the pointer from a constant pool.
>
> Loading the jump target from a constant pool would work but it kind of
> defeats the purpose - the goal is to make it easier for the CPU's branch
> target prediction.

  It would basically be self-modifying code. The full approach would
be to first generate the longest sequence; the most expensive is a movi_p
followed by a jmpr. Then rewrite the jump: the worst case is a very far
jump, which still needs the movi_p/jmpr; if a short jump can be encoded,
fill the remaining instructions with nops. Note that for mips it
would also need to not use the 'swap_ds' logic.
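
A rough C sketch of that scheme, using x86_64 encodings only as an example (this is not Lightning's _patch_at(); patch_jump_slot and SLOT_SIZE are hypothetical names). The slot is always reserved at the size of the longest form, movabs rax, imm64 followed by jmp rax, and is rewritten in place as a 5-byte jmp rel32 padded with nops whenever the displacement fits. It assumes rax may be clobbered, and it only fills a byte buffer; making the code page writable again and any synchronization with running threads are left out.

```c
#include <stdint.h>
#include <string.h>

#define SLOT_SIZE 12   /* movabs rax, imm64 (10 bytes) + jmp rax (2 bytes) */

/* Rewrite the jump stored in slot[], which will execute at address
 * slot_addr, so that it transfers control to target.
 * Returns the number of bytes used by the jump sequence itself. */
int patch_jump_slot(uint8_t slot[SLOT_SIZE], uint64_t slot_addr, uint64_t target)
{
    /* rel32 is relative to the end of the 5-byte jmp instruction. */
    uint64_t disp = target - (slot_addr + 5);
    int i;

    /* Displacement fits in a signed 32-bit integer (two's complement). */
    if (disp <= UINT64_C(0x7FFFFFFF) || disp >= UINT64_C(0xFFFFFFFF80000000)) {
        slot[0] = 0xE9;                          /* jmp rel32 */
        for (i = 0; i < 4; i++)
            slot[1 + i] = (uint8_t)(disp >> (8 * i));
        memset(slot + 5, 0x90, SLOT_SIZE - 5);   /* pad with nops */
        return 5;
    }

    slot[0] = 0x48;                              /* REX.W */
    slot[1] = 0xB8;                              /* mov rax, imm64 */
    for (i = 0; i < 8; i++)
        slot[2 + i] = (uint8_t)(target >> (8 * i));
    slot[10] = 0xFF;                             /* jmp rax */
    slot[11] = 0xE0;
    return SLOT_SIZE;
}
```

Because the slot never changes size, it can be patched any number of times, switching between the short and long forms as the target moves.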

> Cheers,
> -Paul
>
> > > > > > > o Correct several bugs with jit_arg_register_p and
> > > > > > >   jit_putarg{r,i}{_f,_d}. These bugs were not noticed earlier due
> > > > > > >   to an incorrect check for correctness in check/carg.c.
> > > > > > > o Add rip relative addressing support for x86_64 and a shorter signed
> > > > > > >   64 bit constant load if the constant fits in a signed 32 bit integer.
> > > > > > >   This significantly reduces generated code size.
> > > > > > > o Correct bugs in branch generation code for ppc and sparc.
> > > > > > > o Correct bug in signed 32 bit integer load in ppc 64 bits.
> > > > > > > o Add short relative unconditional branches and calls to mips,
> > > > > > >   reducing generated code size.
> > > > > > > o And several extra minor optimizations.
> > > > > >
> > > >
> > > > Thanks,
> > > > Paulo
> > >
>


