Re: RFE: enable buffering on null-terminated data
From: Carl Edquist
Subject: Re: RFE: enable buffering on null-terminated data
Date: Mon, 11 Mar 2024 06:54:11 -0500 (CDT)
On Sun, 10 Mar 2024, Zachary Santer wrote:
On Sun, Mar 10, 2024 at 4:36 PM Carl Edquist <edquist@cs.wisc.edu> wrote:
Out of curiosity, do you have an example command line for your use case?
My use for 'stdbuf --output=L' is to be able to run a command within a
bash coprocess.
Oh, cool, now you're talking! ;)
(Really, a background process communicating with the parent process
through FIFOs, since Bash prints a warning message if you try to run
more than one coprocess at a time. Shouldn't make a difference here.)
(Kind of a side-note ... bash's limited coprocess handling was a
long-standing annoyance for me in the past, to the point that I wrote a
bash coprocess management library to handle multiple active coprocesses
and give
convenient methods for interaction. Perhaps the trickiest bit about
multiple coprocesses open at once (which I suspect is the reason support
was never added to bash) is that you don't want the second and subsequent
coprocesses to inherit the pipe fds of prior open coprocesses. This can
result in deadlock if, for instance, you close your write end to coproc1,
but coproc1 continues to wait for input because coproc2 also has a copy of
a write end of the pipe to coproc1's input. So you need to be smart about
subsequent coprocesses first closing all fds associated with other
coprocesses.
Word to the wise: you might encounter this issue (coproc2 prevents coproc1
from seeing its end-of-input) even though you are rigging this up yourself
with FIFOs rather than bash's coproc builtin.)
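(For the curious, a minimal sketch of the fix, with made-up fd numbers and
hypothetical filter1/filter2 commands:)

mkfifo in1 out1 in2 out2
filter1 <in1 >out1 &
exec 3>in1 4<out1                 # parent's handles on coproc1
# close coproc1's fds before starting coproc2, so coproc2 does not
# hold a stray copy of the write end of coproc1's input pipe:
{ exec 3>&- 4<&-; filter2; } <in2 >out2 &
exec 5>in2 6<out2
exec 3>&-                         # coproc1 now actually sees EOF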
See coproc-buffering, attached.
Thanks!
Without making the command's output either line-buffered or unbuffered,
what I'm doing there would deadlock. I feed one line in and then expect
to be able to read a transformed line immediately. If that transformed
line is stuck in a buffer that's still waiting to be filled, then
nothing happens.
I swear doing this actually makes sense in my application.
Yeah makes sense! I am familiar with the problem you're describing.
(In my coprocess management library, I effectively run every coproc with
--output=L by default, by eval'ing the output of 'env -i stdbuf -oL env',
because most of the time for a coprocess, that's what's wanted/necessary.)
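(Roughly equivalent to something like this, assuming the usual GNU
coreutils mechanism where stdbuf just sets LD_PRELOAD and _STDBUF_O in the
environment, and assuming no newlines in the values:)

# export the variables stdbuf would set, so every child spawned from
# this shell gets line-buffered stdout by default:
while IFS= read -r kv; do export "$kv"; done < <(env -i stdbuf -oL env)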
... Although, for your example coprocess use, where the shell both
produces the input for the coproc and consumes its output, you might be
able to simplify things by making the producer and consumer separate
processes. Then you could do a simpler 'producer | filter | consumer'
without having to worry about buffering at all. But if the producer and
consumer need to be in the same process (eg they share state and are
logically interdependent), then yeah that's where you need a coprocess for
the filter.
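(Schematically, with hypothetical produce_lines/consume_lines commands:)

# no coprocess and no buffering tweaks needed;
# each stage simply reads its input through to EOF:
produce_lines | expand | consume_lines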
... On the other hand, if the issue is that someone is producing one line
at a time _interactively_ (that is, inputting text or commands from a
terminal), then you might argue that the performance hit for unbuffered
output will be insignificant compared to time spent waiting for terminal
input.
$ ./coproc-buffering 100000
Line-buffered:
real 0m17.795s
user 0m6.234s
sys 0m11.469s
Unbuffered:
real 0m21.656s
user 0m6.609s
sys 0m14.906s
Yeah, this makes sense in your particular example.
It looks like expand(1) uses putchar(3), so in unbuffered mode this
translates to one write(2) call for every byte. sed(1) is not quite as
bad - in unbuffered mode it appears to output the line and the newline
terminator separately, so two write(2) calls for every line.
So in both cases (but especially for expand), line buffering reduces the
number of write(2) calls.
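(Easy to verify on Linux, if you want to count the write(2) calls
yourself; strace -c prints a per-syscall summary:)

printf 'a\tb\nc\td\n' | strace -c -e trace=write stdbuf -o0 expand >/dev/null
printf 'a\tb\nc\td\n' | strace -c -e trace=write stdbuf -oL expand >/dev/null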
(Although given your time output, you might say the performance hit for
unbuffered is not that huge.)
When I initially implemented this thing, I felt lucky that the data I
was passing in were lines ending in newlines, and not null-terminated,
since my script gets to benefit from 'stdbuf --output=L'.
:thumbsup:
Truth be told, I don't currently have a need for --output=N.
Mmm-hmm :)
Of course, sed and all sorts of other Linux command-line tools can
produce or handle null-terminated data.
Definitely. So in the general case, it seems theoretically just as useful
to buffer output on nul bytes.
Note that for gnu sed in particular, there is a -u/--unbuffered option,
which will effectively give you line-buffered output, including buffering
on nul bytes with -z/--null-data.
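(For example, a null-data version of a trivial filter, flushed per
record; the tr at the end is just to make the output readable:)

printf 'foo/bar\0baz/qux\0' | sed -uz 's|/.*||' | tr '\0' '\n'
# foo
# baz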
... I'll be honest though, I am having trouble imagining a realistic
pipeline that filters filenames with embedded newlines using expand(1)
;)
...
But, I want to be a good sport here and contrive an actual use case.
So for fun, say I want to use cut(1) (which performs poorly when
unbuffered) in a coprocess that takes null-terminated file paths on input
and outputs the first directory component (which possibly contains
embedded newlines).
The basic command in the coprocess would be:
cut -d/ -f1 -z
but with the default block buffering for pipe output, that will hang (the
problem you describe) if you expect to read a record back from it after
each record sent.
The unbuffered approach works, but (as discussed) is pretty inefficient:
stdbuf --output=0 cut -d/ -f1 -z
But, if we swap nul bytes and newlines before and after cut, then we can
run cut with regular newline line buffering, and get the desired effect:
stdbuf --output=0 tr '\0\n' '\n\0' |
stdbuf --output=L cut -d/ -f1 |
stdbuf --output=0 tr '\0\n' '\n\0'
The embedded newlines in filenames will be passed by tr(1) to cut(1) as
embedded nul bytes, cut will line-buffer its output, and the second tr
will restore the original embedded newlines & null-terminated records.
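(A quick sanity check of the round trip in a plain pipeline; the example
path 'a\nb/c' has an embedded newline:)

printf 'a\nb/c\0d/e\0' |
stdbuf -o0 tr '\0\n' '\n\0' |
stdbuf -oL cut -d/ -f1 |
stdbuf -o0 tr '\0\n' '\n\0' |
od -c    # records out: 'a', newline, 'b', NUL, then 'd', NUL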
Note that unbuffered tr(1) will still output its translated input in
blocks (with fwrite(3)) rather than a byte at a time, so tr will
effectively give buffered output with the same size as the input records.
(That is, newline or null-terminated input records will effectively
produce newline or null-terminated output buffering, respectively.)
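(Also observable with strace; here unbuffered tr makes a single write for
the whole 5-byte input block, on Linux at least:)

printf 'a\0bb\0' | strace -e trace=write stdbuf -o0 tr '\0\n' '\n\0' >/dev/null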
I'd venture to guess that most of the standard filters could be made to
pass along null-terminated records as line-buffered records the same way.
Might even package it into a convenience function to set them up:
swap_znl () { stdbuf -o0 tr '\0\n' '\n\0'; }
nulterm_via_linebuf () { swap_znl | stdbuf -oL "$@" | swap_znl; }
Then, for example, stand it up with bash's coproc:
$ coproc DC1 { nulterm_via_linebuf cut -d/ -f1; }
$ printf 'a\nb/c\nd/efg\0' >&${DC1[1]}
$ IFS='' read -rd '' -u ${DC1[0]} DIR
$ echo "[$DIR]"
[a
b]
(or however else you manage your coprocesses.)
It's a workaround, and it keeps the kind of buffering you'd get with a
'stdbuf --output=N', but to be fair the extra data shoveling is not
exactly free.
...
So ... again in theory I also feel like a null-terminated buffering mode
for stdbuf(1) (and setbuf(3)) is kind of a missing feature. It may just
be that nobody has actually had a real need for it. (Yet?)
I'm running bash in MSYS2 on a Windows machine, so hopefully that
doesn't invalidate any assumptions.
Ooh. No idea. Your strace and sed might have different options than
mine. Also, I am not sure if there are different pipe and fd duplication
semantics, compared to linux. But, based on the examples & output you're
giving, I think we're on the same page for the discussion.
Setting up strace around the commands within the coprocess, and only
passing in one line, I now have coproc-buffering-strace, attached.
Giving the argument 'L', both sed and expand call write() once. Giving
the argument 0, sed calls write() twice and expand calls it a bunch of
times, seemingly once for each character it outputs. So I guess that's
it.
:thumbsup: Yeah that matches what I was seeing also.
Thanks for humoring the peanut gallery here :D
Carl