lilypond-devel

Re: Blockers for Guile 2.2


From: Jean Abou Samra
Subject: Re: Blockers for Guile 2.2
Date: Sat, 26 Feb 2022 22:48:33 +0100
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:91.0) Gecko/20100101 Thunderbird/91.5.0

[David]

I think where we ultimately want to end up is to have Guile use
optimisation for code loaded from .scm files (which should likely use
byte compilation) while not doing so for Guile code and definitions
invoked from .ly files with # and $, because those more likely than not
are not part of performance-relevant inner loops (and if they are, moving
them to .scm or otherwise specifically marking them might be a
reasonable requirement).

Note that "ultimately" does not mean that this may be the best strategy
right now.  But LilyPond documents contain a huge amount of ad-hoc
Scheme code introduced with # (often as simple as a numeric constant or
quoted code that may warrant special-casing, and that actually _is_
special-cased within #{ #} passages already to avoid having to form
closures for it).



For one thing, I think we definitely do not want to implement
our own caching of byte-compiled code from .ly files, because
Guile has no notion of dependency tracking. If a macro changes,
Guile will not detect that its call sites need to be recompiled:
it will simply have expanded the macro and forgotten about the
origin of the expanded code. Actually, I think compiled code
depends on all bindings from the current module, so caching would
be problematic even without macros, though I am not sure about that.
In any case, introducing caching that does not update automatically
would, in my opinion, be glaringly user-unfriendly for our user base.
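
To give a concrete illustration of the macro problem (the file name and
the default-staff-size macro are made up):

    ;; defs.scm -- imagine this file is byte-compiled and cached
    (define-syntax-rule (default-staff-size) 20)

    ;; A call site elsewhere, say  #(set-global-staff-size (default-staff-size)),
    ;; compiles down to bytecode that already contains the literal 20.  A cache
    ;; keyed only on the call site's own contents would keep serving that stale
    ;; 20 after defs.scm is edited, until the user deletes the cache by hand.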

With that in mind, whether to compile the code before executing
it (thus recompiling at every compilation of the .ly file) or to
simply primitive-eval it is a matter of seeing whether the
byte-compilation is fast enough and whether it actually
enhances usability. Speed should really not be an issue for
code from .ly files (whether LilyPond's own or the users').
On the other hand, byte-compiled code tends to give better
error messages when it fails. I have not experimented yet
(usually I try to do the experiments before I write to the
list, but I'm swamped right now), though I think it shouldn't
be too hard a decision.

The status quo for now is that LilyPond's .scm files are
run by the virtual machine whereas Scheme code in # and $
is run through the evaluator (i.e., primitive-eval). That
is in fact exactly what you describe as "where we ultimately
want to end up".



Of course your timings are quite encouraging here for seeing a path
forward, but working out the details of which combinations make the best
sense both in the short and the long run is likely going to need quite a
bit more experimentation.


At least, I think the status quo is acceptable. It will be interesting
to see whether other strategies turn out better, but that can be done
in a later step.



[Jonas]
Heh, I always thought auto-compilation didn't optimize! 😕 Now don't
tell me Guile also applies optimizations while just reading and
supposedly interpreting code...


I don't think it does. At least, you don't usually call eval or
primitive-eval on code to run it repeatedly, just for one-shot
code, so I don't see what sense it would make. Also, it seems to me
that the Guile developers are much more focused on compilation.



[ skipping over the part regarding Guile 3, since I don't think it's
relevant here ]


Perhaps I should have changed the title, but I do think it
is relevant -- it gives hope for a state where development
cycles will be easier. When we want Guile 3 is another question.
I'm in favor of not making the move to Guile 2 wait any longer
than it already has, but I wonder if at some point in this release
cycle (by which I mean before the next stable release) we will want
to switch to Guile 3. Of course, that will depend on how easy it
turns out to be to fix issues like

https://gitlab.com/lilypond/lilypond/-/merge_requests/1230#note_855980027


Yes, it looks like we should do this! On the patch, I think it would be
better to apply the strategy from module/scripts/compile.scm and just get
all available optimizations from the concatenation of
tree-il-default-optimization-options and cps-default-optimization-options
instead of hard-coding the list.


Yes, that is what should be done if we decide to turn them all
off. I still have to do detailed benchmarking to see which
optimizations might save a few percent and which are expensive,
in order to decide on the final list. Again, that is just waiting
on me having more time (but of course, speaking to everyone, feel
free to beat me to it and experiment yourselves).
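
Concretely, I imagine something along these lines (an untested sketch;
I'm assuming the two procedures are the zero-argument ones exported by
(language tree-il optimize) and (language cps optimize), and that the
options come back as #:keyword #t pairs):

    (use-modules (language tree-il optimize)
                 (language cps optimize))

    ;; Every optimization option the tree-il and CPS passes know about ...
    (define all-optimizations
      (append (tree-il-default-optimization-options)
              (cps-default-optimization-options)))

    ;; ... and the same list with every flag turned off, which is roughly
    ;; what we would hand to the compiler if we disable them all.
    (define no-optimizations
      (map (lambda (x) (if (keyword? x) x #f))
           all-optimizations))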



[Luca]
Jean,
how many times did you run these tests?
Eyeballing your numbers it seems there's effectively no difference in execution time between opt/no-opt and 2.2/3.0.
Is the 5% a stable figure, or is it just a one-sample thing?



It's a one-sample thing. I was just trying to get orders
of magnitude (for Guile 3 it's 1m30 vs. 4s, no need for
precise benchmarks to see which is faster :-).




Would it be a passable inference that the reason the optimizer has effectively no measurable impact in either runtime is that the Scheme source code runs are comparatively short and contain large amounts of code the optimizer can't change? I'm imagining that if your source is largely an alternation of Scheme language built-ins (or Scheme library code) interspersed with fairly frequent calls into LilyPond's brain, the optimizer can't do much about the latter. At the same time, you might be sitting in front of gains coming from making these API calls more efficient, which could be interesting (albeit largely orthogonal to the discussion at hand). I'm not sure how the division of cost works out, if there is overhead in preparing for invoking the callbacks vs. executing them, for example. I guess insight into that
could help focus effort, if any was warranted.



I am no performance expert, but LilyPond's performance-critical parts
are written in C++, so in general I would think the cost of Scheme code
is spread a bit all over the code base (unlike C++ where you can find
some rather critical parts taking lots of time, like page breaking,
or skyline algorithms). I am not sure what you mean by effort spent
in "preparing" callbacks; could you elaborate?

At any rate, yes, this is rather orthogonal to this discussion (right
now, the performance of Guile 2 executables with compiled bytecode is
on par with Guile 1 ones, except for the oddity I mentioned on macOS,
for which I have still not had the time to try to build executables
with debugging symbols).



I thought the -O0 compilation time in Guile 3.0 was _really_ cool; I guess it indicates the front-end of the
3.x compiler is vastly more efficient?


See the article I linked. In short, they started to actually focus
on the compiler's performance, and were able to make a more minimal
version that does less optimization work.


Seems like it could be an interesting way forward for the dev group to run
3.x with -O0 for iteration cycles, and then do what David is saying: ship the .scm files with optimizations on
and have the in-score Scheme just built with -O0.


Hm, note that -O0 is different from what is currently done, which
is not another -O level, but running code through the evaluator. There
are two parts in Guile >= 2: the evaluator, which is still there
for running code passed to eval and primitive-eval (think Python's
exec()), and the byte-compiler plus virtual machine. Optimization is
only done during bytecode generation. Currently, code embedded
in .ly files is routed through the evaluator. So whether we want
to byte-compile the code from .ly files, and what its optimization
level should be if we do, are two different questions (but the
desirable level would probably be -O0).
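
To make the distinction concrete, a rough sketch (the
#:optimization-level keyword is the Guile 3 spelling; with 2.2 I believe
you would pass #:opts instead):

    (use-modules (system base compile))

    (define expr '(let loop ((i 0))
                    (if (< i 1000) (loop (+ i 1)) i)))

    ;; Evaluator: what # and $ code gets today; no bytecode at all.
    (primitive-eval expr)

    ;; Byte-compiler at -O0: bytecode is still generated and run on the VM,
    ;; only the optimization passes are skipped.
    (compile expr #:optimization-level 0)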



Briefly reading the message you posted, it seems -O1 might be a better way to go still, which might regain a teeny tiny bit of speed at a potentially very modest cost (say, even if your 3.5s become 4 or 5, you still come out
with a net win compared to 2.2's 20s).


That is possible. In fact, the blog post mentions -O1 being
_faster_ than -O0. We need to experiment.


I agree these results are a very cool finding.


Yeah, I think we're on the right track.

Best,
Jean



