G'day. I've been trying to work out whether the fzf "fuzzy finder" could be used effectively within Emacs; it has the closest thing to the "right" matching algorithm.
Unfortunately it is written in Go, and upstream are not interested in making it usable either as a pipe process, à la ispell's pipe mode, or as a library, so that such a front-end could be developed.
It does support batch matching, though, by invoking the command with the filter string, and sending the possible completions through it. Given that, I figured I'd test it out and see if the cost of running that external process was too high.
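For concreteness, here is a minimal sketch of that batch-matching approach. The function name and the use of `call-process-region` are my illustration only; fzf's `-f` (filter) mode reads candidates on stdin and writes matches to stdout.

```elisp
;; Sketch: run fzf in batch ("filter") mode over a set of candidates.
;; Everything here is illustrative, not the benchmarked code.
(defun my/fzf-filter (query candidates)
  "Return the members of CANDIDATES matching QUERY, ranked by fzf."
  (with-temp-buffer
    (insert (mapconcat #'identity candidates "\n"))
    ;; DELETE=t replaces the input region with fzf's output.
    (call-process-region (point-min) (point-max)
                         "fzf" t t nil "-f" query)
    (split-string (buffer-string) "\n" t)))
```

Note this is synchronous; the benchmarks below instead write to an already-running asynchronous process, which is where the slowness shows up.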
The problem seems to be that Emacs writes to (or reads from) the piped process unreasonably slowly. Sending output to a buffer, or to a process filter (#'ignore), makes no difference, so I'm assuming write performance, rather than read-and-store performance, is the issue.
My question here is: is there something I'm doing wrong, or is this the very best I could expect from Emacs with an external process?
Emacs itself is built from git at commit 44a086e, without debugging options enabled. This is on a macOS x86-64 system; I have not tested on GNU/Linux or elsewhere at this time.
It doesn't look to me like pipes are inherently the problem on this platform: the following test shows that the same command in the same situation (pipe input and output) performs significantly faster than Emacs does in the tests below:
] mkfifo in out
] cat obarray.data > in & wc < out & time fzf -f "hook mode" < in > out ; wait
 - 45534 done gunzip -c obarray.data.gz > in
604 604 15841
fzf -f "hook mode" < in > out 0.05s user 0.01s system 184% cpu 0.033 total
 + 45535 done wc < out
] wc obarray.data
59176 59708 955548 obarray.data
Named pipes and cat are used to ensure with certainty that we have the same basic throughput model, without, e.g., the shell doing anything too clever.
That obarray.data file contains the result of inserting every symbol in the default obarray of my Emacs into a file, newline separated, and saving it. This is a hard, but not impossible, test of the filtering system, intended to expose bottlenecks; smaller data sets were next in line to test, but...
Unfortunately, Emacs's performance is shockingly slow; all code is byte-compiled and warmed up before the benchmarks run. These are the results of 5 warmup rounds, followed by 10 measured rounds, through the external process. GC is effectively disabled, and forced before each benchmark pass, to keep it from shaking things up.
Note that fzf itself costs ~0.05s, so, best case, we attribute 1.747 seconds to Emacs.
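To make the labels below concrete, here is roughly what the three write strategies look like. These are hedged reconstructions from the function names; the attached code is authoritative. `proc` is the running fzf process and `data` the pre-computed sequence of candidate strings.

```elisp
;; Reconstructed sketches of the three strategies being benchmarked.
(defun write-with-temp-buffer (proc data)
  ;; Build the payload in a temp buffer, then send it as a region.
  (with-temp-buffer
    (mapc (lambda (s) (insert s "\n")) data)
    (process-send-region proc (point-min) (point-max))))

(defun write-with-two-calls (proc data)
  ;; Two `process-send-string' calls per candidate: the string,
  ;; then the delimiter.
  (mapc (lambda (s)
          (process-send-string proc s)
          (process-send-string proc "\n"))
        data))

(defun write-with-one-call (proc data)
  ;; Concatenate everything up front; a single send call.
  (process-send-string proc (mapconcat #'identity data "\n")))
```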
========== pass 1 ==========
write-with-temp-buffer: 5.1272s with 0 GCs taking 0.0000s: (5.127218 0 0.0)
write-with-two-calls: 1.9974s with 0 GCs taking 0.0000s: (1.99745 0 0.0)
write-with-one-call: 1.6549s with 0 GCs taking 0.0000s: (1.6549049999999998 0 0.0)
========== pass 2 ==========
write-with-temp-buffer: 5.0828s with 0 GCs taking 0.0000s: (5.082764 0 0.0)
write-with-two-calls: 2.0536s with 0 GCs taking 0.0000s: (2.053559 0 0.0)
write-with-one-call: 1.7901s with 0 GCs taking 0.0000s: (1.790106 0 0.0)
========== pass 3 ==========
write-with-temp-buffer: 5.1175s with 0 GCs taking 0.0000s: (5.117541 0 0.0)
write-with-two-calls: 2.1159s with 0 GCs taking 0.0000s: (2.115893 0 0.0)
write-with-one-call: 1.7970s with 0 GCs taking 0.0000s: (1.797042 0 0.0)
The code behind those results is attached, and the obarray content is available on request; it is still 375k gzipped, and isn't particularly interesting, as most any "production" Emacs instance will have an obarray large enough to reproduce this.
I also pre-calculate the set of strings, to avoid that being part of the benchmark; handling the output is also skipped for the same reason.
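For reference, the candidate set can be built up front along these lines (a sketch; the variable name is mine), so that string construction is excluded from the measurements:

```elisp
;; Collect every symbol name in the default obarray, once, before
;; benchmarking.  Illustrative only.
(defvar my/fzf-data
  (let (acc)
    (mapatoms (lambda (sym) (push (symbol-name sym) acc)))
    (vconcat acc))
  "Pre-computed vector of candidate strings.")
```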
Using `tail -n 604` to produce roughly the same amount of output, and to force the command to run identically (i.e., consume all input before producing output), makes no difference, as expected given the performance of fzf observed above. (Middle run shown; the others are identical and omitted for brevity.)
write-with-temp-buffer: 5.4254s with 0 GCs taking 0.0000s: (5.425443 0 0.0)
write-with-two-calls: 2.3092s with 0 GCs taking 0.0000s: (2.309152 0 0.0)
write-with-one-call: 2.0219s with 0 GCs taking 0.0000s: (2.021912 0 0.0)
Using a newline rather than a null byte as the delimiter makes no difference, as you would probably expect. Using a list rather than a vector for `data`, the input, consistently makes the temp buffer case ~1.3 seconds faster, but leaves the other two cases roughly equal (maybe a fraction slower), though this is ad hoc enough that I don't think it is statistically significant.