From: Carl Edquist
Subject: Re: RFE: enable buffering on null-terminated data
Date: Thu, 14 Mar 2024 09:15:58 -0500 (CDT)


On Mon, 11 Mar 2024, Zachary Santer wrote:

On Mon, Mar 11, 2024 at 7:54 AM Carl Edquist <edquist@cs.wisc.edu> wrote:

(In my coprocess management library, I effectively run every coproc with --output=L by default, by eval'ing the output of 'env -i stdbuf -oL env', because most of the time for a coprocess, that's what's wanted/necessary.)

Surrounded by 'set -a' and 'set +a', I guess? Now that's interesting.

Ah, no - I use the 'VAR=VAL command line' syntax so that it's specific to the command (it's not left exported to the shell).

Effectively the coprocess commands are run with

        LD_PRELOAD=... _STDBUF_O=L command line

This allows running shell functions for the command line, which will all get the desired stdbuf behavior - something you can't get from stdbuf directly, because you can't pass a shell function (within the context of the current shell) as the command to stdbuf.

As far as I can tell, the stdbuf tool sets LD_PRELOAD (to point to libstdbuf.so) and your custom buffering options in _STDBUF_{I,O,E}, in the environment for the program it runs. The double-env thing there is just a way to cleanly get exactly the env vars that stdbuf sets. The values don't change, but since they are an implementation detail of stdbuf, it's a bit more portable to grab them this way rather than hard-code them. This is done only once per shell session to extract the values and save them to a private variable, and then they are used for the command line as shown above.
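
A minimal sketch of the idea, for illustration (the variable and function names here are mine, not the library's):

        # done once per shell session; the output is something like
        # "LD_PRELOAD=/usr/libexec/coreutils/libstdbuf.so _STDBUF_O=L"
        # (the .so path varies by system)
        _stdbuf_ol=$(env -i stdbuf -oL env | tr '\n' ' ')

        # a shell function to run as a coprocess command
        shout () { tr a-z A-Z; }

        # the 'VAR=VAL command line' prefix works for shell functions too,
        # and every program the function runs inherits the stdbuf vars
        eval "$_stdbuf_ol shout"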

Of course, if "command line" starts with "stdbuf --output=0" or whatever, that will override the new line-buffered default.


You can definitely export it to your shell though, either with 'set -a' like you said, or with the export command. After that everything you run should get line-buffered stdio by default.
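
For example, with the export command (the same line used in the demo below):

        export $(env -i stdbuf -oL env)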


I just added that to a script I have that runs another command (generally a build script) and prints its output to the command line, updating the same line over and over again. I want to see if the output updates more continuously that way.

So, a lot of times build scripts run a bunch of individual commands. Each of those commands has an implied flush when it terminates, so you will get the output from each of them promptly (as each command completes), even without using stdbuf.
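
You can see the exit-time flush with a toy version (awk here just stands in for any stdio-using program a build script might run):

        # each awk's stdout is a pipe, so it is block-buffered - but the
        # buffer gets flushed when the process exits, so each "step" line
        # still shows up promptly, about one per second
        for i in 1 2 3; do
                awk "BEGIN { print \"step $i\" }"
                sleep 1
        done | cat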

Where things get sloppy is if you add some stuff in a pipeline after your build script, which results in things getting block-buffered along the way:

        $ ./build.sh | sed s/what/ever/ | tee build.log

And there you will definitely see a difference.


        sloppy () {
                # the pipe into cat makes sed block-buffer its output
                for x in {1..10}; do sleep .2; echo $x; done |
                sed s/^/:::/ | cat
        }

        {
                echo before:
                sloppy          # all ten lines appear at once, at the end
                echo

                # export stdbuf's env vars for everything that follows
                export $(env -i stdbuf -oL env)

                echo after:
                sloppy          # lines appear one at a time, as produced
        }

Yeah, there's really no way to break what I'm doing into a standard pipeline.

I admit I'm curious what you're up to  :)


Of course, using line-buffered or unbuffered output in this situation makes no sense. Where it might be useful is when an earlier command in a pipeline only prints things occasionally, and you want those things transformed and printed to the command line immediately.

Right ... And in that case, losing the performance benefit of a larger block buffer is a smaller price to pay.
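
The classic case is a long-running pipeline with a filter in the middle (the file name and pattern here are made up):

        # without -oL, a match can sit in grep's block buffer for a long
        # time before sed (and your terminal) ever sees it
        tail -f server.log | stdbuf -oL grep ERROR | sed 's/^/!!! /'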

My assumption is that line-buffering through setbuf(3) was implemented for printing to the command line, so its availability to stdbuf(1) is just a useful side effect.

Right, stdbuf(1) leverages setbuf(3).

setbuf(3) tweaks the buffering behavior of stdio streams (stdin, stdout, stderr, and anything else you open with, e.g., fopen(3)). It's not really limited to terminal applications, but yeah, it makes it easier to ensure that your calls to printf(3) actually get output after each line (whether that's to a file or a pipe or a tty), without having to call an explicit fflush(3) of stdout every time.

stdbuf(1) sets LD_PRELOAD to libstdbuf.so for your program, causing it to call setbuf(3) at program startup based on the values of _STDBUF_* in the environment (which stdbuf(1) also sets).

(That's my read of it anyway.)

In the BUGS section in the man page for stdbuf(1), we see: "On GLIBC platforms, specifying a buffer size, i.e., using fully buffered mode will result in undefined operation."

Eheh xD

Oh, I imagine "undefined operation" means something more like "unspecified" here. stdbuf(1) uses setbuf(3), so the behavior you'll get should be whatever the setbuf(3) from the libc on your system does.

I think all this means is that the C/POSIX standards are a bit loose about what is required of setbuf(3) when a buffer size is specified, and there is room in the standard for it to be interpreted as only a hint.

If I'm not mistaken, buffer modes other than 0 and L don't actually work. Maybe I should count my blessings here. I don't know what's going on in the background that would explain glibc not supporting any of that, or stdbuf(1) implementing features that aren't supported on the vast majority of systems where it will be installed.

Hey, try it, right?

Works for me (on glibc-2.23):

        $ for s in 8k 16k 32k 1M; do
            echo ::: $s :::
            # strace's log (on stderr) goes to the pipe into head;
            # tr's actual data is discarded to /dev/null
            { stdbuf -o$s strace -ewrite tr 1 2
            } < /dev/zero 2>&1 > /dev/null | head -3
            echo
          done

        ::: 8k :::
        write(1, "\0\0\0\0\0\0\0\0"..., 8192) = 8192
        write(1, "\0\0\0\0\0\0\0\0"..., 8192) = 8192
        write(1, "\0\0\0\0\0\0\0\0"..., 8192) = 8192

        ::: 16k :::
        write(1, "\0\0\0\0\0\0\0\0"..., 16384) = 16384
        write(1, "\0\0\0\0\0\0\0\0"..., 16384) = 16384
        write(1, "\0\0\0\0\0\0\0\0"..., 16384) = 16384

        ::: 32k :::
        write(1, "\0\0\0\0\0\0\0\0"..., 32768) = 32768
        write(1, "\0\0\0\0\0\0\0\0"..., 32768) = 32768
        write(1, "\0\0\0\0\0\0\0\0"..., 32768) = 32768

        ::: 1M :::
        write(1, "\0\0\0\0\0\0\0\0"..., 1048576) = 1048576
        write(1, "\0\0\0\0\0\0\0\0"..., 1048576) = 1048576
        write(1, "\0\0\0\0\0\0\0\0"..., 1048576) = 1048576



It may just be that nobody has actually had a real need for it. (Yet?)

I imagine if anybody has, they just set --output=0 and moved on. Bash scripts aren't the fastest thing in the world, anyway.

Ouch.  Ouch.  Ouuuuch.  :)

While that's true if you're talking about bash itself doing the actual computation and data processing, the main work of the shell is making it easy to set up pipelines for other (very fast) programs to pass their data around.

The stdbuf tool is not meant for the shell! It's meant for those very fast programs that the shell stands up.

Using stdbuf to tweak a very fast program, causing it to output more often at newlines over pipes rather than at block boundaries, does slow down those programs somewhat. But as we've discussed, this is necessary for certain pipelines that have two-way communication (including coprocesses), or in general any time you want the output immediately.

What may not be obvious is that the shell does not need to get involved with writing input for a coprocess or reading its output - the shell can start other (very fast) programs with input/output redirected to/from the coprocess pipes to do that processing.
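
Something like this, say (a bash sketch; the coprocess and the names are mine):

        # a line-buffered coprocess
        coproc SED { stdbuf -oL sed 's/^/::: /'; }

        echo hello >&"${SED[1]}"     # send a request down the coproc pipe
        head -n1 <&"${SED[0]}"       # head (an external program) reads the
                                     # reply directly; prints "::: hello"
                                     # immediately - without -oL it would
                                     # hang, with sed sitting on its block
                                     # buffer

        # close the write end so sed sees EOF and exits
        fd=${SED[1]}; exec {fd}>&-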

My point earlier, though, was that a null-terminated record buffering mode, as useful as it sounds on the surface (for null-terminated paths), may actually be something _nobody_ has ever needed for a real (not contrived) workflow.

But then again I say "Yet?" - because, never say never.


Happy line-buffering  :)

Carl

