Re: GNU Lightning 2.2.1 release


From: Paul Cercueil
Subject: Re: GNU Lightning 2.2.1 release
Date: Sat, 18 Feb 2023 16:44:13 +0000

On Saturday, 18 February 2023 at 13:24 -0300, Paulo César Pereira de Andrade
wrote:
> On Sat, 18 Feb 2023 at 11:40, Paul Cercueil
> <paul@crapouillou.net> wrote:
> > 
> > On Saturday, 18 February 2023 at 11:07 -0300, Paulo César Pereira de
> > Andrade wrote:
> > > On Sat, 18 Feb 2023 at 09:29, Paul Cercueil
> > > <paul@crapouillou.net> wrote:
> > > > 
> > > > Hi Paulo,
> > > 
> > >   Hi Paul,
> > > 
> > > > On Friday, 17 February 2023 at 16:23 -0300, Paulo César Pereira de
> > > > Andrade wrote:
> > > > > GNU lightning 2.2.1 released!
> > > > > 
> > > > > GNU lightning is a library to aid in making portable programs
> > > > > that compile assembly code at run time.
> > > > > 
> > > > > Development:
> > > > > http://git.savannah.gnu.org/cgit/lightning.git
> > > > > 
> > > > > Download release:
> > > > > ftp://ftp.gnu.org/gnu/lightning/lightning-2.2.1.tar.gz
> > > > > 
> > > > >   GNU Lightning 2.2.1 main new features:
> > > > > 
> > > > > o Variable stack framesize implemented for aarch64, arm, i686, mips,
> > > > >   riscv, loongarch and x86_64. This means function calls use only
> > > > >   the minimum required stack space for prolog and epilog.
> > > > > o Optimization of prolog and epilog to not create a frame pointer if
> > > > >   not required, and not even save and restore the stack pointer if
> > > > >   not required on a leaf function. These features are implemented for
> > > > >   the ports with variable stack framesize.
> > > > > o New clor, clzr, ctor and ctzr instructions, that count
> > > > >   leading/trailing zeros/ones. These use the hardware implementation
> > > > >   when available, otherwise fall back to a software implementation.
> > > > 
> > > > That's great. I actually had an alpha version of a patch that added
> > > > clzr but never finished it.
> > > > 
> > > > I think you could add an extra one, clsr, "count leading sign bits".
> > > > The fallback should be very easy:
> > > > 
> > > > jit_rshi(rn(tmp), r1, __WORDSIZE - 1);
> > > > jit_xorr(rn(tmp), r1, rn(tmp));
> > > > jit_clzr(r0, rn(tmp));
> > > 
> > >   Yes. Fallback is simple. If I recall correctly, only arm64 has it
> > > in hardware:
> > > 
> > > https://developer.arm.com/documentation/dui0801/h/A64-General-Instructions/CLS
> > > 
> > >   I used it in the first version of clor for aarch64 when experimenting
> > > with the instruction, but it required a branch, so I changed it to just
> > > invert the bits and use clz:
> > > https://git.savannah.gnu.org/cgit/lightning.git/commit/?id=561eed91500f2a31ed9d4305c91940e742613ba8
> > > 
> > > > Maybe adapted to only return the number of sign bits after the MSB,
> > > > to match GCC's __builtin_clrsb(), if it makes more sense.
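The shift/xor/clz fallback quoted above, and the __builtin_clrsb() variant, can be checked in plain C before writing any jit code. What follows is only a sketch under stated assumptions: the word size is fixed at 64 bits, clz of zero is assumed to return the word size, and the helper names clz64, cls64 and clrsb64 are made up for this illustration and are not part of Lightning.

```c
#include <assert.h>
#include <stdint.h>

#define WORD_BITS 64

/* Portable count-leading-zeros; assumed to return WORD_BITS for x == 0. */
static int clz64(uint64_t x)
{
    int n = 0;

    if (x == 0)
        return WORD_BITS;
    while (!(x & (UINT64_C(1) << (WORD_BITS - 1)))) {
        x <<= 1;
        n++;
    }
    return n;
}

/* Count of leading bits equal to the sign bit, sign bit included (1..64).
 * Mirrors the quoted sequence: tmp = v >> 63 (arithmetic), tmp ^= v, clz(tmp). */
static int cls64(int64_t v)
{
    /* Build the "all sign bits" word by hand, so the sketch does not rely
     * on implementation-defined right shifts of negative values. */
    uint64_t sign = (v < 0) ? ~UINT64_C(0) : UINT64_C(0);
    uint64_t tmp = (uint64_t)v ^ sign;   /* sign bits become leading zeros */

    return clz64(tmp);
}

/* Variant matching the __builtin_clrsb() convention: redundant sign bits
 * after the most significant bit (0..63). */
static int clrsb64(int64_t v)
{
    return cls64(v) - 1;
}

/* Small self-check of the two conventions. */
static void cls_self_check(void)
{
    assert(cls64(0) == 64 && clrsb64(0) == 63);
    assert(cls64(-1) == 64 && clrsb64(-1) == 63);
    assert(cls64(1) == 63 && clrsb64(INT64_MIN) == 0);
}
```

The only difference between the two conventions is the final subtraction, so a port with a hardware CLS-style instruction and a port using the fallback could agree on either one.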
> > > > 
> > > > Speaking about fallbacks, the ones in place look very inefficient
> > > > (e.g. the bit-swap to count trailing bits). I'm sure there are better
> > > > algorithms; I'll have a look.
> > > 
> > >   It is not even in jit_fallback.c. It is a version without lookup
> > > tables or branches. I think the libgcc variants use lookup tables.
> > > This is something to optimize.
> > 
> > My point was that there are better ways to count trailing bits than
> > bit-swapping.
> 
>   Sure. I just wanted to have it working, not fully optimized in
> the first version :) Optimized versions should use a lookup table
> or some "magic" with float/double.
>   There is also the comment in check/bit.c that says that, if the
> fallback is used, it would be better to implement it as a function;
> then, it just implements the fallbacks as jit functions.
>   Using check/bit.tst is a good way to experiment with different
> versions before converting them to C code. Just change the "#if 0"
> to "#if 1", rewrite clo, clz, cto and ctz as appropriate, and
> check the output to validate that it is correct.
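As one possible direction for counting trailing bits without a bit swap, the lowest set bit can be isolated with x & -x, and the ones below it counted. The sketch below is plain C and not Lightning's fallback; it uses neither a lookup table nor the float/double trick mentioned above. The names popcount64, ctz64 and cto64 are hypothetical, and a 64-bit word is assumed.

```c
#include <assert.h>
#include <stdint.h>

/* Branch-free population count (classic SWAR reduction). */
static int popcount64(uint64_t x)
{
    x = x - ((x >> 1) & UINT64_C(0x5555555555555555));
    x = (x & UINT64_C(0x3333333333333333)) +
        ((x >> 2) & UINT64_C(0x3333333333333333));
    x = (x + (x >> 4)) & UINT64_C(0x0f0f0f0f0f0f0f0f);
    return (int)((x * UINT64_C(0x0101010101010101)) >> 56);
}

/* Count trailing zeros without reversing the word.
 * x & -x keeps only the lowest set bit; subtracting 1 turns it into a
 * mask of exactly the trailing zero positions. Returns 64 for x == 0,
 * because 0 - 1 wraps to all ones. */
static int ctz64(uint64_t x)
{
    uint64_t lowest = x & (~x + 1);   /* same as x & -x */

    return popcount64(lowest - 1);
}

/* Count trailing ones by complementing the input first. */
static int cto64(uint64_t x)
{
    return ctz64(~x);
}

/* Small self-check on edge cases. */
static void ctz_self_check(void)
{
    assert(ctz64(0) == 64);
    assert(ctz64(1) == 0);
    assert(cto64(~UINT64_C(0)) == 64);
}
```

The same three steps (negate, and, subtract) map onto ordinary register operations, so the remaining cost is the population count, which could itself use a hardware instruction where one exists.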
> 
> > >   It is also a good extension for extra Lightning instructions. At
> > > least aarch64 and loongarch have a bit swap/invert instruction:
> > > https://developer.arm.com/documentation/dui0801/h/A64-General-Instructions/RBIT
> > > https://loongson.github.io/LoongArch-Documentation/LoongArch-Vol1-EN.html#_bitrev_wd
> > > 
> > > > Also, you added SLL opcodes to "sign extend top 32 bits" on MIPS,
> > > > but you do that if (__WORDSIZE == 32). What "top 32 bits" are we
> > > > talking about there?
> > > 
> > >   It is a SLL(r0, r1, 0) that is supposed to sign extend the value.
> > > I do not have access to any mips release 6, so I did not test the
> > > mips6_p() code variant.
> > 
> > I tested MIPSr6 a few months ago and it didn't go very well; some
> > instructions that Lightning emits did change (for instance, the LO/HI
> > registers are gone, and all opcodes touching those changed).
> 
>   Did you test on real hardware or qemu?
> 
>   I might set up a qemu environment, but it would be far better to test
> on real hardware. Qemu mips emulation, last time I tested, was way
> too slow...

That was under Qemu; I don't have such hardware.

As I'm using qemu-user it doesn't have to emulate the full system and
the speed is quite OK.

> > > The documentation I used (MD00087-2B-MIPS64BIS-AFP-6.06.pdf) says:
> > > 
> > > """
> > > Format: CLO rd, rs                                 MIPS32
> > > Purpose: Count Leading Ones in Word
> > > To count the number of leading ones in a word.
> > > ...
> > > Restrictions:
> > > Pre-Release 6: To be compliant with the MIPS32 and MIPS64
> > > Architecture, software must place the same GPR number in both the
> > > rt and rd fields of the instruction. The operation of the
> > > instruction is UNPREDICTABLE if the rt and rd fields of the
> > > instruction contain different values. Release 6’s new instruction
> > > encoding does not contain an rt field.
> > > 
> > > If GPR rs does not contain a sign-extended 32-bit value (bits 63..31
> > > equal), then the results of the operation are UNPREDICTABLE.
> > > """
> > 
> > Yes, but in the case where __WORDSIZE == 32, bits 63..32 do not
> > exist.
> > Therefore the sign-extension does nothing.
> 
>   The common case is a 32 bit OS on a 64 bit cpu. This is also how it
> was tested. If the condition of a "true" 32 bit cpu can be detected,
> then I could add a jit_cpu_t flag to know about it, and omit the sign
> extension.

It doesn't matter whether the CPU is 32 or 64 bits: if you only ever
generate MIPS32 opcodes (i.e. no DSLL etc., which is the case when
__WORDSIZE == 32), then the upper 32 bits will always be sign-extended.
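To illustrate the argument, here is a small C model of a 64-bit register written only by 32-bit MIPS operations. It is a sketch, not Lightning code: the helpers sext32, mips_addu, mips_sll, mips_lui, mips_ori and is_sext32 are invented for the example and model only the result value of each instruction as I understand the MIPS64 definitions (ADDU, SLL and LUI sign-extend their 32-bit result; ORI ors in a zero-extended 16-bit immediate).

```c
#include <assert.h>
#include <stdint.h>

/* Sign-extend the low 32 bits of v into a 64-bit register value. */
static uint64_t sext32(uint64_t v)
{
    uint64_t low = v & UINT64_C(0xffffffff);

    return (low ^ UINT64_C(0x80000000)) - UINT64_C(0x80000000);
}

/* 32-bit operations as seen on a 64-bit register file: the 32-bit
 * result is sign-extended into the full register. */
static uint64_t mips_addu(uint64_t rs, uint64_t rt) { return sext32(rs + rt); }
static uint64_t mips_sll(uint64_t rt, unsigned sa)  { return sext32(rt << (sa & 31)); }
static uint64_t mips_lui(uint16_t imm)              { return sext32((uint64_t)imm << 16); }

/* ORI only touches the low 16 bits, so it keeps an already
 * sign-extended upper half intact. */
static uint64_t mips_ori(uint64_t rs, uint16_t imm) { return rs | imm; }

/* True when bits 63..31 are all equal. */
static int is_sext32(uint64_t v) { return v == sext32(v); }

/* A value built from 32-bit operations is already sign-extended,
 * so the extra SLL by 0 leaves it unchanged. */
static void sext_self_check(void)
{
    uint64_t r = mips_ori(mips_lui(0x8000), 1);   /* 0x80000001 as a word */

    assert(is_sext32(r));
    assert(mips_sll(r, 0) == r);
}
```

In this model a register only loses the sign-extended form if a 64-bit operation writes it, which is the situation the extra SLL would guard against.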

> > >   I did the Lightning 2.2.1 release to make several bug fixes public,
> > > but I hope to add extra bit manipulation instructions. At least:
> > > 
> > > o bit invert
> > > o popcount
> > > o bit rotate
> > > 
> > >   But there are several others that are useful, like ways to create
> > > bit patterns for any kind of masks. These could at least be used
> > > internally to create constants with repeated patterns.
> > > 
> > >   If you have other suggestions for new instructions, please let me
> > > know :)
> > 
> > Honestly, apart from the "CLS" mentioned before and maybe popcount, I
> > wouldn't have any use for these - in my particular use case anyway.
> > 
> > I would maybe benefit from having "mask extract" and "mask insert"
> > functions similar to EXT/INS on MIPS.
> > 
> > But in general I like that Lightning is very RISC-like and I would
> > avoid making it more complex by adding instructions that would almost
> > never be used.
> > 
> > >   One such instruction could be "multiply and add", available in
> > > several cpus.
> > > 
> > >   In the long term, int128 and complex float/double could be added.
> > > I would like to have them, but implementing them in all ports is not
> > > trivial, and would require the concept of register pairs, currently
> > > only barely used for qdiv/qmul, and only to hold the result pair,
> > > not as input.
> > > 
> > >   Maybe a way to inject machine code could also be added, just
> > > memcpy a buffer. This would allow optimizations where lightning does
> > > not generate good code: experiment with an assembler, then, when
> > > happy with the code, inject it in the jit code.
> > 
> > One thing somewhat related that would be very useful to me, is
> > patchable jumps after code generation.
> > 
> > Basically, if you emit:
> > 
> > lbl = jit_jmpi();
> > jit_patch_abs(lbl, my_fn);
> > 
> > ...
> > jit_emit();
> > addr = jit_address(lbl);
> > 
> > You would then be able to change the function called using something
> > like:
> > 
> > jit_patch_again(addr, my_other_fn);
> 
>   It would be required to unmap and remap the code buffer.
> 
>   Part of it is done in the example in check/protect.c. After
> that, one would currently need to patch it manually, basically copying
> the _patch_at() specific to the architecture where it is implemented.
>   If it is not in some inner loop that needs to be as fast as
> possible, the code could load the pointer from a constant pool.

Loading the jump target from a constant pool would work but it kind of
defeats the purpose - the goal is to make it easier for the CPU's branch
target prediction.
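For comparison, the constant-pool alternative can be modelled in plain C as a call through a pointer slot. This is a sketch of the idea only, not Lightning API: pool_slot, call_site and retarget are hypothetical names, and my_fn / my_other_fn stand in for the two targets from the example above.

```c
#include <assert.h>

typedef int (*target_fn)(int);

static int my_fn(int x)       { return x + 1; }
static int my_other_fn(int x) { return x * 2; }

/* Stand-in for a constant-pool entry holding the current jump target. */
static target_fn pool_slot = my_fn;

/* Stand-in for the jitted call site: it loads the target from the pool
 * and performs an indirect call, instead of a direct patched jump. */
static int call_site(int x)
{
    return pool_slot(x);
}

/* Retargeting is a plain data store: the code buffer itself is never
 * written, so it would not need to be unmapped and remapped. */
static void retarget(target_fn fn)
{
    pool_slot = fn;
}

/* Small self-check: the same call site reaches both targets. */
static void pool_self_check(void)
{
    retarget(my_fn);
    assert(call_site(20) == 21);
    retarget(my_other_fn);
    assert(call_site(20) == 40);
    retarget(my_fn);
}
```

The trade-off is the one described above: the slot makes retargeting cheap and keeps the code pages read-only, but the call site becomes an indirect branch, whereas a patched direct jump has a target encoded in the instruction itself.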

Cheers,
-Paul

> > > > > o Correct several bugs with jit_arg_register_p and
> > > > >   jit_putarg{r,i}{_f,_d}.
> > > > >   These bugs were not noticed earlier due to an incorrect check for
> > > > >   correctness in check/carg.c.
> > > > > o Add rip relative addressing support for x86_64 and shorter signed
> > > > >   64 bit constant load if the constant fits in a signed 32 bit
> > > > >   integer. This significantly reduces generated code size.
> > > > > o Correct bugs in branch generation code for ppc and sparc.
> > > > > o Correct bug in signed 32 bit integer load in ppc 64 bits.
> > > > > o Add short relative unconditional branches and calls to mips,
> > > > >   reducing generated code size.
> > > > > o And several extra minor optimizations.
> > > > > 
> > > 
> > > Thanks,
> > > Paulo
> > 



