emacs-devel

Re: Filtering out process filters


From: Eli Zaretskii
Subject: Re: Filtering out process filters
Date: Thu, 05 Jun 2025 12:02:39 +0300

> From: Daniel Colascione <dancol@dancol.org>
> Cc: Dmitry Gutov <dmitry@gutov.dev>,  arstoffel@gmail.com,  
> emacs-devel@gnu.org
> Date: Wed, 04 Jun 2025 17:01:21 -0700
> 
> Eli Zaretskii <eliz@gnu.org> writes:
> 
> > If we want to look into the reasons seriously, these 3 cases should be
> > profiled, preferably with 'perf', and we should examine the profiles.
> 
> Come to think of it, we really should have LTTng/

I think perf does the job nicely and produces useful information
(assuming you didn't strip the Emacs binary).

> > My conclusions from the numbers are that the significant differences
> > between GUI and TTY sessions happen only on NS, which is a known
> > outlier wrt both ns_select and redisplay.  On GNU/Linux, according to
> > Dmitry's numbers, the only significant difference I see is for 'term'
> > (which perhaps shouldn't surprise, since 'term' is all about display,
> > which is more expensive in GUI mode due to fonts and stuff; but
> > that's a speculation ATM).
> 
> GNU/Linux isn't one thing though. I'd love to see numbers of pgtk
> vs. GTK vs. the legacy toolkits.

Sure.

> (Do we have any actual users left of the non-GTK X11 toolkits, BTW? It'd
> be nice to delete them.)

Yes, there are still people using non-GTK toolkits.  GTK is
problematic in more than one way; several known bugs can be worked
around only by not using it.

> I'm also curious how the MS-Windows build does here. It's NS-like in
> that it has a pretty hairy event loop to accommodate both the win32
> message pump and the FD-centric IO stuff.

If you or someone else could post a recipe that I can try on Windows
without too much tinkering (like without using /dev/zero), for which I
don't have time, I will run it and report the results.

(We should really have a single portable benchmark for this, and we
should also agree on the Emacs version and perhaps the optimization
options we use for running the benchmarks.  Otherwise we are confusing
ourselves by potentially comparing apples to oranges.)

> > In batch mode, quite a lot of display stuff is bypassed entirely, and
> > the only frame we have there is 10x10, much smaller than anything in
> > interactive sessions.  That should account for some differences
> > between batch and TTY cases.
> 
> Depends on where the profile lands.

Sure, that's why I suggested 'perf'.  profiler.el is not accurate
enough, and cannot look inside C code.

> That said, regardless of what we find, we should consider not doing
> modeline updates when a buffer 1) has a no-background-ui-update
> buffer-local set, and 2) isn't actually displayed.  We'd say
> no-background-ui-update by default for buffers starting with space just
> like we set buffer-undo-list to t.

When redisplay is called, it isn't called per-buffer, and it isn't
told which buffers changed and which windows are related to each
buffer.  It needs to figure that out all by itself, and as a result
decide which windows, if any, need to be updated, and how to update
each one of them in the cheapest possible manner.

The "update mode lines" flag and other similar ones we maintain are
set by certain primitives, and tell redisplay several important things,
like whether only the selected window should be updated, whether each
frame and window must be considered for update, etc.  The result
should be that if no displayed buffer has changed in any way,
redisplay does almost nothing, once it considers all those flags and
finds none of them set.  That includes not updating the mode lines.
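
As a rough illustration of that flag logic, here is a toy model in C.
It is not the xdisp.c code: the struct names, their fields, and
toy_redisplay are all hypothetical, and the real flags carry more
information than a boolean.  It only shows why a change in a buffer
that no window displays should cost little more than a few flag tests.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical global flags, set by primitives between redisplay cycles.  */
struct toy_flags
{
  bool update_mode_lines;     /* every mode line must be reconsidered */
  bool consider_all_windows;  /* not only the selected window */
};

/* Hypothetical window.  Buffers shown in no window have no entry at all.  */
struct toy_window
{
  bool selected;
  bool buffer_changed;        /* text of the displayed buffer changed */
};

/* One redisplay cycle; returns how many windows were actually updated.  */
size_t
toy_redisplay (struct toy_flags *flags, struct toy_window *wins, size_t nwins)
{
  assert (flags != NULL);
  size_t updated = 0;

  for (size_t i = 0; i < nwins; i++)
    {
      struct toy_window *w = &wins[i];

      /* With no global flag set, only the selected window is examined.
         In this model, a primitive that changes what another window
         shows is responsible for setting consider_all_windows.  */
      if (!flags->consider_all_windows && !flags->update_mode_lines
          && !w->selected)
        continue;

      if (w->buffer_changed || flags->update_mode_lines)
        {
          w->buffer_changed = false;
          updated++;
        }
    }

  /* The flags are consumed by the cycle that honored them.  */
  flags->consider_all_windows = false;
  flags->update_mode_lines = false;
  return updated;
}
```

With both flags clear and nothing changed in the selected window, the
loop above updates no window and no mode line.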

I see no sign in the posted profiles that any mode line is being
updated on each redisplay, or that enough of them are updated to be a
concern.
(In fact, I'm not yet sure the processing by redisplay is at all a
significant factor in the scenarios we are discussing and analyzing.)
So I'm not sure we have a problem here.  But if you see evidence to
the contrary, we could look closer.

> Come to think of it, though, we can make it go even faster:
> 
> 1. Get rid of the SAFE_ALLOCA bounce buffer (which in practice mallocs
> for large receive sizes) in favor of a persistent thread-local bounce
> buffer owned by process.c that we keep allocated. Saves a heap lock
> (although most good heaps have good thread-local caches nowadays).

Let's first make sure these allocations are high on the execution
profile.  Otherwise we will increase complexity for no good reason.

> 2. Get rid of the bounce buffer by read(2)-ing directly into the
> buffer. Saves a memcpy.

Again, does memcpy take any significant percentage of CPU cycles?

Also, reading into the buffer will be less memory-efficient because
you don't know in advance how many bytes will be read, so we will
need to enlarge the gap by the value of read-process-output-max, which
could be large.
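
To make the trade-off concrete, here is a minimal POSIX sketch of the
two strategies.  It is not the process.c code: struct toy_buf stands in
for buffer text with its free space, toy_reserve for enlarging the gap,
and the max argument for read-process-output-max; all of these names
are hypothetical.  The bounce variant always uses malloc for
simplicity, whereas SAFE_ALLOCA would use the stack for small sizes.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical stand-in for buffer text plus its free space.  */
struct toy_buf
{
  char *data;
  size_t used;
  size_t cap;
};

/* Make sure at least EXTRA free bytes follow the used part.  */
static int
toy_reserve (struct toy_buf *b, size_t extra)
{
  if (b->cap - b->used >= extra)
    return 0;
  size_t ncap = b->used + extra;
  char *p = realloc (b->data, ncap);
  if (!p)
    return -1;
  b->data = p;
  b->cap = ncap;
  return 0;
}

/* Read into a temporary buffer, then copy: the destination grows only
   by the number of bytes that actually arrived.  */
ssize_t
read_via_bounce (int fd, struct toy_buf *b, size_t max)
{
  assert (max > 0);
  char *bounce = malloc (max);
  if (!bounce)
    return -1;
  ssize_t n = read (fd, bounce, max);
  if (n > 0)
    {
      if (toy_reserve (b, (size_t) n) == 0)
        {
          memcpy (b->data + b->used, bounce, (size_t) n);
          b->used += (size_t) n;
        }
      else
        n = -1;
    }
  free (bounce);
  return n;
}

/* Read straight into the destination: no memcpy, but MAX bytes must be
   reserved before we know how many will arrive.  */
ssize_t
read_direct (int fd, struct toy_buf *b, size_t max)
{
  assert (max > 0);
  if (toy_reserve (b, max) != 0)
    return -1;
  ssize_t n = read (fd, b->data + b->used, max);
  if (n > 0)
    b->used += (size_t) n;
  return n;
}
```

If 5 bytes are pending and max is 65536, the first variant grows an
empty destination to 5 bytes, the second to 65536.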

I also have a vague recollection that we already tried these
techniques, and decided against them.  Perhaps Dmitry remembers the
details.

> 3. io_uring IO directly into a per-channel bounce buffer (not the buffer
> content, because concurrency). Doing so would at least ~halve the number
> of system calls we need to do IO. (You'd use the same thing with
> MS-Windows IO completion ports.)

I don't think I understand this idea.

> > The only two
> > significant factors I'm aware of are consing of strings and decoding.
> > The latter we could perhaps disable in the other two methods, so we'd
> > see the effect of consing strings alone.  Again, an accurate profile
> > should probably answer these questions better and more clearly.
> 
> My benchmark is all-unibyte, FWIW.

So the only slowdown factor is consing of strings?  I'd love to see
that in the profile, because it sounds surprising that it could
explain such a significant slowdown.


