
Re: Filtering out process filters


From: Daniel Colascione
Subject: Re: Filtering out process filters
Date: Wed, 04 Jun 2025 17:01:21 -0700
User-agent: mu4e 1.12.10; emacs 31.0.50

Eli Zaretskii <eliz@gnu.org> writes:

>> Date: Wed, 4 Jun 2025 05:41:17 +0300
>> Cc: arstoffel@gmail.com, emacs-devel@gnu.org
>> From: Dmitry Gutov <dmitry@gutov.dev>
>> 
>> On 03/06/2025 23:58, Daniel Colascione wrote:
>> > Default gc size for all of these.
>> > 
>> > Baseline (batch):
>> > 
>> > buffer+acf: 5037 MB/s
>> > term: 314 MB/s
>> > jsonrpc: 577 MB/s
>> > 
>> > NS -nw:
>> > 
>> > buffer+acf: 5046 MB/s
>> > term: 108 MB/s
>> > jsonrpc: 208 MB/s
>> > 
>> > NS:
>> > 
>> > buffer+acf: 1353 MB/s
>> > term: 93 MB/s
>> > jsonrpc: 149 MB/s
>> > 
>> > The NS slowness persists if I set mode-line-format (both locally and
>> > globally) to "".
>> > 
>> > At least for me, the rank order of buffer+acf
>> > vs. filters persists, but GUI Emacs is just much slower for some reason.
>> > ns_select does a bunch of random things I haven't totally grokked, but
>> > it shouldn't be _that_ bad.
>> 
>> I'm seeing similar with the GNU/Linux build, so it's not just NS:
>> 
>> batch:
>> 
>> buffer+acf: 2741 MB/s
>> term: 87 MB/s
>> jsonrpc: 139 MB/s
>> 
>> nw:
>> 
>> buffer+acf: 2556 MB/s
>> term: 57 MB/s
>> jsonrpc: 94 MB/s
>> 
>> gui:
>> 
>> buffer+acf: 2036 MB/s
>> term: 55 MB/s
>> jsonrpc: 97 MB/s
>> 
>> That's with the default gc-cons-threshold.
>> 
>> The specific numbers seem to fluctuate. The buffer/filter ratio seems 
>> sharper than in your results, though.
>> 
>> With 10GB, it's like this:
>> 
>> batch:
>> 
>> buffer+acf: 2689 MB/s
>> term: 172 MB/s
>> jsonrpc: 921 MB/s
>> 
>> nw:
>> 
>> buffer+acf: 2615 MB/s
>> term: 232 MB/s
>> jsonrpc: 885 MB/s
>> 
>> gui:
>> 
>> buffer+acf: 2132 MB/s
>> term: 140 MB/s
>> jsonrpc: 715 MB/s
>
> If we want to look into the reasons seriously, these 3 cases should be
> profiled, preferably with 'perf', and we should examine the profiles.

Come to think of it, we really should have LTTng/UST tracepoints for
this kind of investigation.

> My conclusions from the numbers are that the significant differences
> between GUI and TTY sessions happen only on NS, which is a known
> outlier wrt both ns_select and redisplay.  On GNU/Linux, according to
> Dmitry's numbers, the only significant difference I see is for 'term'
> (which perhaps shouldn't surprise, since 'term' is all about display,
> which is more expensive in GUI mode due to fonts and stuff; but
> that's speculation ATM).

GNU/Linux isn't one thing, though. I'd love to see numbers for pgtk
vs. GTK vs. the legacy toolkits.

(Do we have any actual users left of the non-GTK X11 toolkits, BTW? It'd
be nice to delete them.)

I'm also curious how the MS-Windows build does here. It's NS-like in
that it has a pretty hairy event loop to accommodate both the win32
message pump and the FD-centric IO stuff.

> In batch mode, quite a lot of display stuff is bypassed entirely, and
> the only frame we have there is 10x10, much smaller than anything in
> interactive sessions.  That should account for some differences
> between batch and TTY cases.

Depends on where the profile lands.

That said, regardless of what we find, we should consider skipping
mode-line updates when a buffer 1) has a no-background-ui-update
buffer-local set, and 2) isn't actually displayed.  We'd turn on
no-background-ui-update by default for buffers whose names start with a
space, just like we set buffer-undo-list to t for them.
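In toy form, the check would be something like this (not Emacs code;
the flag and field names are invented stand-ins for the proposed
buffer-local and the "shown in a window" test):

#include <stdbool.h>

struct toy_buffer
{
  const char *name;
  bool no_background_ui_update;  /* the proposed opt-out flag */
  bool displayed_p;              /* visible in some window?   */
};

/* Default the flag on for " "-prefixed (internal) buffers, the
   same way buffer-undo-list defaults to t for them.  */
static void
toy_buffer_init (struct toy_buffer *b, const char *name)
{
  b->name = name;
  b->displayed_p = false;
  b->no_background_ui_update = (name[0] == ' ');
}

static bool
toy_should_update_mode_line (const struct toy_buffer *b)
{
  /* Update when displayed, or when the buffer hasn't opted out.  */
  return b->displayed_p || !b->no_background_ui_update;
}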

> Btw, I still don't have a clear idea why the ACF method is so much
> faster, since it basically does the same stuff.

The ACF method, in the unibyte case (which is what you want for JSON
and terminals alike), boils down to a memcpy() in process.c from the
`chars` buffer straight into the buffer contents, followed by a cheap
Lisp call touching cache-hot conses (the ACF list).
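In skeleton form, that path is just this (a toy model, not the actual
process.c code; names invented):

#include <stdlib.h>
#include <string.h>
#include <unistd.h>

enum { READ_CHUNK = 4096 };

struct text { char *beg; size_t len, cap; };

static void
insert_from_bounce (struct text *t, const char *chars, size_t n)
{
  if (t->len + n > t->cap)
    {
      t->cap = t->cap ? t->cap * 2 : READ_CHUNK;
      while (t->cap < t->len + n)
        t->cap *= 2;
      t->beg = realloc (t->beg, t->cap);
      if (!t->beg)
        abort ();
    }
  memcpy (t->beg + t->len, chars, n);   /* the one memcpy */
  t->len += n;
  /* ...then run the (cheap) after-change hook on the new text.  */
}

int
main (void)
{
  struct text buf = { 0 };
  char chars[READ_CHUNK];               /* the bounce buffer */
  ssize_t n;
  while ((n = read (STDIN_FILENO, chars, sizeof chars)) > 0)
    insert_from_bounce (&buf, chars, n);
  free (buf.beg);
  return n < 0;
}

The payload gets touched exactly twice: once by the kernel copying
into `chars`, once by the memcpy.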

Come to think of it, though, we can make it go even faster:

1. Get rid of the SAFE_ALLOCA bounce buffer (which in practice mallocs
for large receive sizes) in favor of a persistent thread-local bounce
buffer owned by process.c that we keep allocated; a sketch follows after
this list. Saves a heap lock (although most good heaps have good
thread-local caches nowadays).

2. Get rid of the bounce buffer entirely by read(2)-ing directly into
the buffer. Saves a memcpy.

3. io_uring IO directly into a per-channel bounce buffer (not into the
buffer contents, because of concurrency); a rough liburing sketch
follows below. Doing so would at least ~halve the number of system
calls we need to do IO. (You'd do the same thing with MS-Windows IO
completion ports.)
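Item 1 could be as simple as this (invented names, not actual Emacs
code; real code would signal memory_full instead of aborting):

#include <stdlib.h>

static _Thread_local char *bounce_buf;
static _Thread_local size_t bounce_cap;

/* Return a bounce buffer of at least NBYTES.  It stays allocated
   across reads, so steady-state reads never touch the heap lock.  */
static char *
get_bounce_buffer (size_t nbytes)
{
  if (nbytes > bounce_cap)
    {
      size_t cap = bounce_cap ? bounce_cap : 1 << 16;
      while (cap < nbytes)
        cap *= 2;
      char *p = realloc (bounce_buf, cap);
      if (!p)
        abort ();
      bounce_buf = p;
      bounce_cap = cap;
    }
  return bounce_buf;
}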

Granted, ACF is already pretty fast, so I don't think these
optimizations are really worth it, but they're there if we want.
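For the record, here's roughly the item-3 shape with liburing (link
with -luring); the channel struct and buffer size are made up, stdin
stands in for a subprocess pipe, and hooking this into the real event
loop is the hard part:

#include <liburing.h>
#include <stdio.h>

struct channel { int fd; char buf[65536]; };

/* Keep one read perpetually queued per channel, so completions
   replace the select+read syscall pair.  */
static void
queue_read (struct io_uring *ring, struct channel *ch)
{
  struct io_uring_sqe *sqe = io_uring_get_sqe (ring);
  io_uring_prep_read (sqe, ch->fd, ch->buf, sizeof ch->buf, 0);
  io_uring_sqe_set_data (sqe, ch);
}

int
main (void)
{
  struct io_uring ring;
  if (io_uring_queue_init (64, &ring, 0) < 0)
    return 1;

  struct channel ch = { .fd = 0 };      /* stdin as the channel fd */
  queue_read (&ring, &ch);
  io_uring_submit (&ring);

  for (;;)
    {
      struct io_uring_cqe *cqe;
      if (io_uring_wait_cqe (&ring, &cqe) < 0)
        break;
      struct channel *c = io_uring_cqe_get_data (cqe);
      int n = cqe->res;
      io_uring_cqe_seen (&ring, cqe);
      if (n <= 0)
        break;
      fwrite (c->buf, 1, n, stdout);    /* stand-in for buffer insert */
      queue_read (&ring, c);            /* re-arm the channel */
      io_uring_submit (&ring);
    }
  io_uring_queue_exit (&ring);
  return 0;
}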

> The only two
> significant factors I'm aware of are consing of strings and decoding.
> The latter we could perhaps disable in the other two methods, so we'd
> see the effect of consing strings alone.  Again, an accurate profile
> should probably answer these questions better and more clearly.

My benchmark is all-unibyte, FWIW; decoding is another story.  I agree
that consing strings shouldn't be _that_ slow, but as you can see from
the effectively-no-GC (10 GB gc-cons-threshold) numbers above, most of
the penalty with the default GC setup is just repeated non-generational
collection.

(I started looking at this because term.el was _so_ slow that it was
causing ssh connection timeouts when tailing log files: rendering
couldn't keep up with the flow of log lines from the remote host.)


