coreutils


From: Assaf Gordon
Subject: line buffering in pipes (was: RFC: Safely using xargs -P$NUM children's output? Need a new tool?)
Date: Thu, 2 May 2019 13:14:13 -0600
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:60.0) Gecko/20100101 Thunderbird/60.6.1

Hello,

On 2019-05-02 7:57 a.m., Denys Vlasenko wrote:
> I'm working on improving a script in rpmbuild:
>
> # Strip static libraries.
> for f in `find "$RPM_BUILD_ROOT" -type f -a -exec file {} \; | \
>          grep -v "^${RPM_BUILD_ROOT}/\?usr/lib/debug"  | \
>          grep 'current ar archive' | \
>          sed -n -e 's/^\(.*\):[  ]*current ar archive/\1/p'`; do
>          $STRIP -g "$f"
> done

[...]

> Stress-testing, however, of the
>   xargs -r -P$NPROC -n16 file | sed 's/: */: /'
> construct revealed that with sufficiently large machines, the pipe
> buffer gets filled and "file" processes experience partial writes,
> garbling the output. Specifically, this:
>   find /usr -print0 | xargs -0r -P199 -n16 file | sed 's/:  */: /' |
>   sort >
> does not produce the same output every time, and diff-ing it clearly
> shows partial writes creating overlapping output.

This is explicitly mentioned in the xargs(1) man-page:

    Please note that it is up to the called processes to properly manage
    parallel access to shared resources.  For example, if more than one
    of them tries to  print to stdout, the output will be produced in an
    indeterminate order (and very likely mixed up) unless the processes
    collaborate in some way to prevent this.  Using
    some kind of locking scheme is one way to prevent such problems.
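
For reference, the "locking scheme" the man page mentions could look
roughly like the sketch below. It assumes flock(1) from util-linux is
installed; LOCKFILE and locked_xargs are hypothetical names chosen for
illustration. Each child buffers the output of its batch and prints it
only while holding an exclusive lock, so batches cannot interleave.

```shell
#!/bin/sh
# Sketch only: serialize output from parallel xargs children with flock(1).
# LOCKFILE and locked_xargs are hypothetical names, not from the thread.

LOCKFILE=${LOCKFILE:-${TMPDIR:-/tmp}/xargs-output.lock}
export LOCKFILE   # the "sh -c" children below read it from the environment

# Usage: find ... -print0 | locked_xargs NPROC COMMAND [ARG...]
# Runs COMMAND on batches of 16 names, up to NPROC batches at a time.
locked_xargs() {
    nproc_=$1
    shift
    xargs -0r -P"$nproc_" -n16 sh -c '
        out=$("$@")                             # buffer the whole batch
        flock "$LOCKFILE" printf "%s\n" "$out"  # print it while holding the lock
    ' sh "$@"
}
```

It could be called as
"find /usr -type f -print0 | locked_xargs 199 file --no-pad".
The trade-off is that the output of a batch appears only after the
whole batch has finished.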

The easiest way to avoid that is to use "stdbuf" (from coreutils) to
force a flush after each line. Assuming each line is short enough to be
written to the pipe in a single atomic write (at most PIPE_BUF bytes,
which is 4096 on Linux; file's output should be well under that),
it should work:

   find /usr -print0 | xargs -0r -P199 -n16 stdbuf -oL file | ...

But in addition, 'file' itself has an option to disable buffering
when piping output (--no-buffer).
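
To check whether a given variant still garbles its output, the
stress test described in the quoted message can be approximated with
a small helper. This is a minimal sketch; same_output_every_time is
a hypothetical name, and it assumes the scanned tree does not change
between runs.

```shell
#!/bin/sh
# Sketch only: run a command several times and compare the sorted output.

# Usage: same_output_every_time RUNS COMMAND [ARG...]
# Returns 0 if every run produced the same (sorted) output, 1 otherwise.
same_output_every_time() {
    runs=$1
    shift
    first=$("$@" | sort | cksum)      # checksum of the first run
    i=1
    while [ "$i" -lt "$runs" ]; do
        [ "$("$@" | sort | cksum)" = "$first" ] || return 1
        i=$((i + 1))
    done
}
```

A pipeline has to be wrapped in "sh -c", for example:

    same_output_every_time 5 sh -c \
      'find /usr -print0 | xargs -0r -P199 -n16 stdbuf -oL file'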

Additionally, you can avoid the need for the "sed 's/:  */: /'" step
if you use file's "--no-pad" option.

Also (IMO), the script becomes clearer if you output the MIME type of
the file instead of the textual description (using "file --mime-type").

Also, the "for f in `find...`" shell construct is wasteful: it runs
$STRIP once per file. You can use "xargs" on the output to run
"strip -g" on many files at once.

The result could be something like:

  find [DIRECTORY] -type f \
    | xargs -r -P$NPROC -n16 \
             stdbuf -oL \
                file --no-buffer --no-pad --mime-type \
    | grep ": application/x-archive" \
    | cut -f1 -d: \
    | xargs -r strip -g

(Note the above does not deal with special characters in filenames,
or file names with ":".)

Lastly,
I assume you want to use "xargs -P" for increased performance.
However, I suspect "file" is not CPU-bound at all, and that this entire
operation is I/O-bound, so running it in parallel will not make things
a lot faster (just a hunch, I didn't actually measure).
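
One rough way to test that hunch is to compare wall-clock times for
the serial and the parallel variant. The helper below is only a
sketch; elapsed_seconds is a hypothetical name, and it assumes a
"date" that supports "+%s". Cache effects matter here: the first run
over a tree warms the page cache, so each variant should be run more
than once before comparing.

```shell
#!/bin/sh
# Sketch only: coarse wall-clock timing of a command, in whole seconds.

# Usage: elapsed_seconds COMMAND [ARG...]
elapsed_seconds() {
    start=$(date +%s)
    "$@" >/dev/null 2>&1              # discard output, only the time matters
    end=$(date +%s)
    echo $((end - start))
}
```

For example:

    elapsed_seconds sh -c 'find /usr -type f -print0 | xargs -0r file'
    elapsed_seconds sh -c \
      'find /usr -type f -print0 | xargs -0r -P8 -n16 file'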

If that's the case, and you do away with "xargs -P", you can use file's "--files-from" option to read the file list directly from "find":

    find [DIRECTORY] -type f \
       | file --no-buffer --no-pad --mime-type --files-from - \
       | grep ": application/x-archive" \
       | cut -f1 -d: \
       | xargs -r strip -g


As such, I don't think this use case justifies a new program in coreutils.


Hope this helps.
regards,
  - assaf

P.S.
If you do need to worry about special characters in filenames (or file
names with ':'), see file's "--print0" option in addition to
"find -print0 | xargs -0".
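
A minimal sketch of that combination follows. It needs bash (for
"read -d") and assumes that "file --print0 --no-pad --mime-type"
writes each record as the file name, a NUL byte, then ": " and the
MIME type on the rest of the line. archive_names and strip_archives
are hypothetical names. How a particular version of "file" prints
names that contain newlines or other non-printable bytes is not
covered here, so treat it as handling spaces, quotes and colons only.

```shell
#!/usr/bin/env bash
# Sketch only. Assumed record format from "file --print0 --no-pad --mime-type":
#   NAME <NUL> ": " MIME-TYPE <newline>

# Reads such records on stdin and prints the NUL-terminated names of
# the entries whose type is application/x-archive.
archive_names() {
    local name type
    while IFS= read -r -d '' name && IFS= read -r type; do
        case $type in
            :*application/x-archive) printf '%s\0' "$name" ;;
        esac
    done
}

# Usage: strip_archives DIRECTORY
# STRIP defaults to "strip", mirroring $STRIP in the original script.
strip_archives() {
    find "$1" -type f -print0 \
        | xargs -0r file --print0 --no-pad --mime-type \
        | archive_names \
        | xargs -0r "${STRIP:-strip}" -g
}
```

Because the name ends at the NUL byte and not at the first ':', a
file called "lib:foo.a" is passed to strip intact.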
