[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: GNU Lightning 2.2.1 release
From: |
Paul Cercueil |
Subject: |
Re: GNU Lightning 2.2.1 release |
Date: |
Mon, 20 Feb 2023 10:26:17 +0000 |
Hi Paulo,
Le samedi 18 février 2023 à 16:48 -0300, Paulo César Pereira de Andrade
a écrit :
> Em sáb., 18 de fev. de 2023 às 13:44, Paul Cercueil
> <paul@crapouillou.net> escreveu:
> >
> > Le samedi 18 février 2023 à 13:24 -0300, Paulo César Pereira de
> > Andrade
> > a écrit :
> > > Em sáb., 18 de fev. de 2023 às 11:40, Paul Cercueil
> > > <paul@crapouillou.net> escreveu:
> > > >
> > > > Le samedi 18 février 2023 à 11:07 -0300, Paulo César Pereira de
> > > > Andrade
> > > > a écrit :
> > > > > Em sáb., 18 de fev. de 2023 às 09:29, Paul Cercueil
> > > > > <paul@crapouillou.net> escreveu:
> > > > > >
> > > > > > Hi Paulo,
> > > > >
> > > > > Hi Paul,
> > > > >
> > > > > > Le vendredi 17 février 2023 à 16:23 -0300, Paulo César
> > > > > > Pereira
> > > > > > de
> > > > > > Andrade a écrit :
> > > > > > > GNU lightning 2.2.1 released!
> > > > > > >
> > > > > > > GNU lightning is a library to aid in making portable
> > > > > > > programs
> > > > > > > that compile assembly code at run time.
> > > > > > >
> > > > > > > Development:
> > > > > > > http://git.savannah.gnu.org/cgit/lightning.git
> > > > > > >
> > > > > > > Download release:
> > > > > > > ftp://ftp.gnu.org/gnu/lightning/lightning-2.2.1.tar.gz
> > > > > > >
> > > > > > > GNU Lightning 2.2.1 main new features:
> > > > > > >
> > > > > > > o Variable stack framesize implemented for aarch64, arm,
> > > > > > > i686,
> > > > > > > mips,
> > > > > > > riscv, loongarch and x86_64. This means function calls
> > > > > > > use
> > > > > > > only
> > > > > > > the minimum required stack space for prolog and epilog.
> > > > > > > o Optimization of prolog and epilog to not create a frame
> > > > > > > pointer
> > > > > > > if
> > > > > > > not required, and not even save and restore the stack
> > > > > > > pointer
> > > > > > > if
> > > > > > > not required on a leaf function. These features
> > > > > > > implemented
> > > > > > > for
> > > > > > > the
> > > > > > > ports with variable stack framesize.
> > > > > > > o New clor, czr, ctor and ctzr instructions, that count
> > > > > > > leading/trailing
> > > > > > > zeros/ones. These use hardware implementation when
> > > > > > > available,
> > > > > > > otherwise
> > > > > > > fallback to a software implementation.
> > > > > >
> > > > > > That's great. I actually had an alpha version of a patch
> > > > > > that
> > > > > > added
> > > > > > clzr but never finished it.
> > > > > >
> > > > > > I think you could add an extra one, clsr, "count leading
> > > > > > sign
> > > > > > bits".
> > > > > > The fallback should be very easy:
> > > > > >
> > > > > > jit_rshi(rn(tmp), r1, __WORDSIZE - 1);
> > > > > > jit_xorr(rn(tmp), r1, rn(tmp));
> > > > > > jit_clzr(r0, rn(tmp));
> > > > >
> > > > > Yes. Fallback is simple. If I recall correctly, only arm64
> > > > > has
> > > > > it
> > > > > in hardware:
> > > > >
> > > > > https://developer.arm.com/documentation/dui0801/h/A64-General-Instructions/CLS
> > > > >
> > > > > I used it in the first version of clor for aarch64 when
> > > > > experimenting with
> > > > > instruction, but it did require branch, so, changed to just
> > > > > invert
> > > > > bits and
> > > > > use clz:
> > > > > https://git.savannah.gnu.org/cgit/lightning.git/commit/?id=561eed91500f2a31ed9d4305c91940e742613ba8
> > > > >
> > > > > > Maybe adapted to only return the number of sign bits after
> > > > > > the
> > > > > > MSB
> > > > > > to
> > > > > > match GCC's __builtin_clrsb(), if it makes more sense.
> > > > > >
> > > > > > Speaking about fallbacks, the ones in place look very
> > > > > > ineffective
> > > > > > (e.g.
> > > > > > the bit-swap to count trailing bits). I'm sure there are
> > > > > > better
> > > > > > algorithms; I'll have a look.
> > > > >
> > > > > It is not even in jit_fallback.c. It is a version without
> > > > > lookup
> > > > > tables nor
> > > > > branches. I think libgcc variants use lookup tables. This is
> > > > > something
> > > > > to optimize.
> > > >
> > > > My point was that there are better ways to count trailing bits
> > > > than
> > > > bit-swapping.
> > >
> > > Sure. I just did want to have it working. Not fully optimized
> > > in
> > > the first version :) Optimized versions should be with a lookup
> > > table
> > > or some "magic" with float/double.
> > > There is also the comment in check/bit.c that says if the
> > > fallback
> > > is used, it would be better to implement it as a function, then,
> > > it
> > > just implements the fallbacks as jit functions.
> > > Using check/bit.tst is a good way to experiment with different
> > > versions, before converting it to C code. Just change the "#if 0"
> > > to "#if 1" and rewrite clo, clz, cto and ctz as appropriate, and
> > > check output to validate it is correct.
>
> Quick adaptation from
> http://graphics.stanford.edu/~seander/bithacks.html#ZerosOnRightModLookup
> .data ...
> #if __WORDSIZE == 64
> mod67:
> .c 64 0 1 39 2 15 40 23 3 12 16 59 41 19 24 54 4 0 13 10 17 62 60
> 28 42 30 20 51 25 44 55 47 5 32 0 38 14 22 11 58 18 53 63 9 61 27 29
> 50 43 46 31 37 21 57 52 8 26 49 45 36 56 7 48 35 6 34 33
> #else
> mod37:
> .c 32 0 1 26 2 23 27 0 3 16 24 30 28 11 0 13 4 7 17 0 25 22 31 15
> 29 10 12 6 0 21 14 9 5 20 8 19 18
> #endif
> ...
> .code
> ...
> /*
> jit_uword_t ctz(jit_uword_t r1) {
> #if __WORDSIZE == 32
> static const int mod37[] = {
> 32, 0, 1, 26, 2, 23, 27, 0, 3, 16, 24, 30, 28, 11, 0,
> 13,
> 4, 7, 17, 0, 25, 22, 31, 15, 29, 10, 12, 6, 0, 21, 14,
> 9,
> 5, 20, 8, 19, 18
> };
> return mod37[(-r1 & r1) % 37];
> #else
> static const int mod67[] = {
> 64, 0, 1, 39, 2, 15, 40, 23, 3, 12, 16, 59, 41, 19, 24,
> 54,
> 4, 0, 13, 10, 17, 62, 60, 28, 42, 30, 20, 51, 25, 44, 55,
> 47,
> 5, 32, 0, 38, 14, 22, 11, 58, 18, 53, 63, 9, 61, 27, 29,
> 50,
> 43, 46, 31, 37, 21, 57, 52, 8, 26, 49, 45, 36, 56, 7, 48,
> 35,
> 6, 34, 33
> };
> return mod67[(-r1 & r1) % 67];
> #endif
> }
> */
> name ctz
> ctz:
> prolog
> arg $in
> getarg %r0 $in
> negr %r1 %r0
> andr %r0 %r0 %r1
> #if __WORDISZE == 32
> remi_u %r0 %r0 37
> movi %r1 mod37
> #else
> remi_u %r0 %r0 67
> movi %r1 mod67
> #endif
> ldxr_uc %r0 %r1 %r0
> retr %r0
> epilog
> ...
>
> > > > > It is also a good extension for extra Lightning
> > > > > instructions.
>
> > > > > At
> > > > > least
> > > > > aarch64 and loongarch have a bit swap/invert instruction:
> > > > > https://developer.arm.com/documentation/dui0801/h/A64-General-Instructions/RBIT
> > > > > https://loongson.github.io/LoongArch-Documentation/LoongArch-Vol1-EN.html#_bitrev_wd
> > > > >
> > > > > > Also, you added SLL opcodes to "sign extend top 32 bits" on
> > > > > > MIPS,
> > > > > > but
> > > > > > you do that if (__WORDSIZE == 32). What "top 32 bits" are
> > > > > > we
> > > > > > talking
> > > > > > about there?
> > > > >
> > > > > It is a SLL(r0, r1, 0) that is supposed to sign extend the
> > > > > value. I
> > > > > do not
> > > > > have access to any mips release 6, so did not test the
> > > > > mips6_p()
> > > > > code
> > > > > variant.
> > > >
> > > > I tested MIPSr6 a few months ago and it didn't go very well,
> > > > some
> > > > instructions that Lightning emit did change (for instance, the
> > > > LO/HI
> > > > registers are gone, and all opcodes touching those changed).
>
> I see. I did only search for new documentation to find information
> about hardware clz, clo, etc.
>
> > > Did you test in real hardware or qemu?
> > >
> > > I might setup a qemu environment, but would be far better to
> > > test
> > > in real hardware. Qemu mips emulation last time I tested was way
> > > too slow...
> >
> > That was under Qemu; I don't have such hardware.
> >
> > As I'm using qemu-user it doesn't have to emulate the full system
> > and
> > the speed is quite OK.
>
> Ok. I will put it in my TODO to test and make it work with the
> changes
> for mips release 6.
>
> > > > > The documentation I did use (MD00087-2B-MIPS64BIS-AFP-
> > > > > 6.06.pdf)
> > > > > says:
> > > > >
> > > > > """
> > > > > Format: CLO rd, rs MIPS32
> > > > > Purpose: Count Leading Ones in Word
> > > > > To count the number of leading ones in a word.
> > > > > ...
> > > > > Restrictions:
> > > > > Pre-Release 6: To be compliant with the MIPS32 and MIPS64
> > > > > Architecture, software must place the same GPR num-
> > > > > ber in both the rt and rd fields of the instruction. The
> > > > > operation of
> > > > > the instruction is UNPREDICTABLE if the rt and
> > > > > rd fields of the instruction contain different values.
> > > > > Release
> > > > > 6’s
> > > > > new
> > > > > instruction encoding does not contain an rt field.
> > > > >
> > > > > If GPR rs does not contain a sign-extended 32-bit value (bits
> > > > > 63..31
> > > > > equal), then the results of the operation are
> > > > > UNPREDICTABLE.
> > > > > """
> > > >
> > > > Yes, but in the case where __WORDSIZE == 32, bits 63..32 do not
> > > > exist.
> > > > Therefore the sign-extension does nothing.
> > >
> > > The common case is a 32 bit OS in a 64 bit cpu. This is also
> > > how it
> > > was tested. If the condition of a "true" 32 bit cpu can be
> > > detected,
> > > then could add a jit_cpu_t flag to know about it, and omit the
> > > sign
> > > extension.
> >
> > It doesn't matter whether the CPU is 32 or 64 bits: if you are only
> > ever generating MIPS32 opcodes (aka. no DSLL etc., which is the
> > case
> > when __WORDSIZE == 32), then the upper 32 bits will always be sign-
> > extended.
>
> Not always, when creating constants top 32 bit might have wrong
> sign extension, due to:
>
> # if __WORDSIZE == 32
> # define can_sign_extend_int_p(im) 1
> # define can_zero_extend_int_p(im) 1
> # else
> ...
>
> It appears CLO, and CLZ are special cases about the top 32 bits.
No, please trust me on this. If you only ever emit MIPS32 opcodes
(which is the case when __WORDSIZE == 32), the upper 32 bits will
always correspond to the sign bit.
This is ensured by the MIPS64 spec, and would otherwise make MIPS32
backwards-compatibility impossible.
Your code path in _movi() that uses can_zero_extend_int_p (which is
never called on MIPS32 but let's say it was), would generate:
ORI(r0, _ZERO_REGNO, i0 >> 16);
SLL(r0, r0, 16);
On a MIPS64 processor running a 32-bit OS, if i0 == 0x8000.0000, you
would end up with r0 == 0xffff.ffff.8000.0000, because SLL is a MIPS32
instruction.
The CLO and CLZ instructions are no different; their source register
has to be sign-extended, but that is only something that you need to
care about when mixing these instructions with MIPS64 code.
Cheers,
-Paul
> > > > > I did Lightning 2.2.1 release to have public several bug
> > > > > fixes,
> > > > > but
> > > > > I hope to add extra bit manipulation instructions. At least:
> > > > >
> > > > > o bit invert
> > > > > o popcount
> > > > > o bit rotate
> > > > >
> > > > > But there are several other that are useful, like ways to
> > > > > create
> > > > > bit patterns for any kind of masks. These could at least be
> > > > > used
> > > > > internally to create constants with repeated patterns.
> > > > >
> > > > > If you have other suggestions for new instructions, please
> > > > > let
> > > > > me
> > > > > now :)
> > > >
> > > > Honestly, apart from the "CLS" mentioned before and maybe
> > > > popcount,
> > > > I
> > > > wouldn't have any use for these - in my particular usecase
> > > > anyway.
> > > >
> > > > I would maybe benefit from having "mask extract" and "mask
> > > > insert"
> > > > functions similar to EXT/INS on MIPS.
> > > >
> > > > But in general I like that Lightning is very RISC-like and I
> > > > would
> > > > avoid making it more complex adding instructions that would
> > > > almost
> > > > never be used.
> > > >
> > > > > One such instruction could be "multiply and add", available
> > > > > in
> > > > > several
> > > > > cpus.
> > > > >
> > > > > On the long term can add int128 and complex float/double. I
> > > > > would
> > > > > like to have it, but implementing in all ports is not
> > > > > trivial,
> > > > > and
> > > > > would
> > > > > require the concept of register pairs, currently only barely
> > > > > used
> > > > > for
> > > > > qdiv/qmul, and only to put the result pair, not as input.
> > > > >
> > > > > Maybe could add a way to inject machine code also, just
> > > > > memcpy
> > > > > a buffer. This could allow to make optimizations where
> > > > > lightning
> > > > > does
> > > > > not generate good code, just experiment it with an assembler,
> > > > > then,
> > > > > when happy with the code, inject it in the jit code.
> > > >
> > > > One thing somewhat related that would be very useful to me, is
> > > > patchable jumps after code generation.
> > > >
> > > > Basically, if you emit:
> > > >
> > > > lbl = jit_jmpi();
> > > > jit_patch_abs(lbl, my_fn);
> > > >
> > > > ...
> > > > jit_emit();
> > > > addr = jit_address(lbl);
> > > >
> > > > You would then be able to change the function called using
> > > > something
> > > > like:
> > > >
> > > > jit_patch_again(addr, my_other_fn);
> > >
> > > It would be required to unmap and remap the code buffer.
> > >
> > > Part of it is done in the example in check/protect.c. After
> > > that, currently would need to manually patch it, basically
> > > copying
> > > the _patch_at() specific to the architecture where it is
> > > implemented.
> > > If it is not really in some inner loop that needs to be as fast
> > > as
> > > possible, could load the pointer from a constant pool.
> >
> > Loading the jump target from a constant pool would work but it
> > kinds of
> > defeat the purpose - the goal is to make it easier for the CPU's
> > branch
> > target prediction.
>
> It would basically be self modifiying code. The full approach
> should
> be to generate the longest sequence, the most expensive is movi_p
> followed by a jmpr. Then, rewrite the jump; the worst case would be
> a very far jump, that would still need the movi_p/jmpr; and fill
> extra
> instructions with nops if can encode a short jump. Note that for mips
> would also need to not use the 'swap_ds' logic.
>
> > Cheers,
> > -Paul
> >
> > > > > > > o Correct several bugs with jit_arg_register_p and
> > > > > > > jit_putarg{r,i}{_f,_d}.
> > > > > > > These bugs were not noticed earlier due to an incorrect
> > > > > > > check
> > > > > > > for
> > > > > > > correctness in check/carg.c.
> > > > > > > o Add rip relative addressing support for x86_64 and
> > > > > > > shorter
> > > > > > > signed
> > > > > > > 64
> > > > > > > bit constant load if the constant fits in a signed 32
> > > > > > > bit
> > > > > > > integer.
> > > > > > > This significantly reduces code size generation.
> > > > > > > o Correct bugs in branch generation code for pppc and
> > > > > > > sparc.
> > > > > > > o Correct bug in signed 32 bit integer load in ppc 64
> > > > > > > bits.
> > > > > > > o Add short relative unconditional branches and calls to
> > > > > > > mips,
> > > > > > > reducing
> > > > > > > code size generation.
> > > > > > > o And several extra minor optimizations.
> > > > > > >
> > > > >
> > > > > Thanks,
> > > > > Paulo
> > > >
> >
- GNU Lightning 2.2.1 release, Paulo César Pereira de Andrade, 2023/02/17
- Re: GNU Lightning 2.2.1 release, Paul Cercueil, 2023/02/18
- Re: GNU Lightning 2.2.1 release, Paulo César Pereira de Andrade, 2023/02/18
- Re: GNU Lightning 2.2.1 release, Paul Cercueil, 2023/02/18
- Re: GNU Lightning 2.2.1 release, Paulo César Pereira de Andrade, 2023/02/18
- Re: GNU Lightning 2.2.1 release, Paul Cercueil, 2023/02/18
- Re: GNU Lightning 2.2.1 release, Paulo César Pereira de Andrade, 2023/02/18
- Re: GNU Lightning 2.2.1 release,
Paul Cercueil <=
- Re: GNU Lightning 2.2.1 release, Paulo César Pereira de Andrade, 2023/02/20
- Re: GNU Lightning 2.2.1 release, Paul Cercueil, 2023/02/24