line buffering in pipes (was: RFC: Safely using xargs -P$NUM children's

coreutils

From:	Assaf Gordon
Subject:	line buffering in pipes (was: RFC: Safely using xargs -P$NUM children's output? Need a new tool?)
Date:	Thu, 2 May 2019 13:14:13 -0600
User-agent:	Mozilla/5.0 (X11; Linux x86_64; rv:60.0) Gecko/20100101 Thunderbird/60.6.1

Hello,

On 2019-05-02 7:57 a.m., Denys Vlasenko wrote:

I'm working on improving a script in rpmbuild:

# Strip static libraries.
for f in `find "$RPM_BUILD_ROOT" -type f -a -exec file {} \; | \
         grep -v "^${RPM_BUILD_ROOT}/\?usr/lib/debug"  | \
         grep 'current ar archive' | \
         sed -n -e 's/^\(.*\):[  ]*current ar archive/\1/p'`; do
         $STRIP -g "$f"
done


[...]

Stress-testing, however, of the "xargs -r -P$NPROC -n16 file | sed 's/: */: /'"
construct revealed that on sufficiently large machines the pipe buffer
gets filled and "file" processes experience partial writes, garbling
the output. Specifically, this:
find /usr -print0 | xargs -0r -P199 -n16 file | sed 's/:  */: /' | sort >
does not produce the same output every time, and diff-ing it clearly
shows partial writes creating overlapping output.

This is explicitly mentioned in the xargs(1) man-page:

    Please note that it is up to the called processes to properly manage
    parallel access to shared resources.  For example, if more than one
    of them tries to  print to stdout, the output will be produced in an
    indeterminate order (and very likely mixed up) unless the processes
    collaborate in some way to prevent this.  Using
    some kind of locking scheme is one way to prevent such problems.
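
As one possible reading of the "locking scheme" the man page mentions,
here is a minimal sketch in which each xargs child captures the output
of its whole batch and then prints it while holding a lock. It assumes
flock(1) from util-linux is installed; the function name locked_scan and
the LOCKFILE variable are made up for illustration.

```shell
# Hypothetical helper: run "file" in parallel under directory $1 with
# $2 processes (default 4), but serialize the writes to stdout with
# flock(1) so that batches cannot interleave.
locked_scan() {
    dir=$1
    nproc=${2:-4}
    lock=$(mktemp) || return 1
    find "$dir" -type f -print0 |
        LOCKFILE=$lock xargs -0 -r -P"$nproc" -n16 sh -c '
            # Capture the whole batch first, then print it under the lock.
            out=$(file --no-pad --mime-type -- "$@")
            flock "$LOCKFILE" printf "%s\n" "$out"
        ' sh
    rm -f "$lock"
}
```

The cost is that output only appears once a batch has finished, and the
order of batches is still indeterminate, so a final "sort" is still
needed for reproducible results.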

The easiest way to avoid that is to use "stdbuf" (from coreutils),
forcing a flush after each line. Assuming the lines are short enough
(and file's output should be short enough), it should work:

   find /usr -print0 | xargs -0r -P199 -n16 stdbuf -oL file | ...
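
To check whether line buffering actually removes the garbling, one
option is to repeat the stress test described above: run the pipeline
twice and compare the sorted results. This is only a rough sketch;
scan_types and scan_is_stable are hypothetical names, and the NPROC
default of 8 is an arbitrary assumption.

```shell
# Hypothetical helper: list "name: type" lines for every file under $1,
# using line-buffered output from parallel "file" processes.
scan_types() {
    find "$1" -type f -print0 |
        xargs -0 -r -P"${NPROC:-8}" -n16 stdbuf -oL file --no-pad |
        sort
}

# Run the scan twice and succeed only if both runs agree byte for byte.
scan_is_stable() {
    a=$(mktemp) && b=$(mktemp) || return 1
    scan_types "$1" > "$a"
    scan_types "$1" > "$b"
    cmp -s "$a" "$b"
    rc=$?
    rm -f "$a" "$b"
    return "$rc"
}
```

For example, "scan_is_stable /usr && echo stable". Two identical runs
are evidence rather than proof, so repeating the check a few times on a
busy machine is more convincing.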

But in addition, 'file' itself has an option to disable buffering
when piping output (--no-buffer).

Additionally, you can avoid the need for the "sed 's/:  */: /'" step
if you use file's "--no-pad" option.

Also (IMO), if you output the MIME type of the file instead of the
textual description, it makes the script clearer (using "file --mime-type").


Also, the "for i in `find...`" shell construct is wasteful.
You can use "xargs" on the output to run "strip -g".

The result could be something like:

  find [DIRECTORY] -type f \
    | xargs -r -P$NPROC \
             stdbuf -oL \
                file --no-buffer --no-pad --mime-type \
    | grep ": application/x-archive" \
    | cut -f1 -d: \
    | xargs strip -g

(Note the above does not deal with special characters in filenames,
or file names with ":".)
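
For filenames containing spaces, newlines, or ":", one way to sidestep
the parsing problem is to never put the filename into a text stream at
all: let find hand the names to a small shell loop that asks file for
the bare type (--brief) of each one. Note that this swaps in a different
technique from the pipeline above (one "file" process per file, no
parallelism). The name strip_archives and the fallback to plain "strip"
when STRIP is unset are illustrative assumptions.

```shell
# Hypothetical helper: strip debug symbols from every ar archive under $1
# without ever parsing "name: type" text.
strip_archives() {
    STRIP=${STRIP:-strip} find "$1" -type f -exec sh -c '
        for f in "$@"; do
            # --brief prints only the type, so the filename is never parsed.
            if [ "$(file --brief --mime-type -- "$f")" = application/x-archive ]; then
                # STRIP is left unquoted, as in the original script,
                # so that it may carry extra options.
                $STRIP -g -- "$f"
            fi
        done
    ' sh {} +
}
```

This is slower than a single batched "file" invocation, so it is
probably only worth it when such filenames can really occur.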

Lastly,
I assume you want to use "xargs -P" for increased performance.
However, I suspect "file" is not cpu-bound at all - and this entire
operation is I/O-bound - so running it in parallel does not make things
a lot faster (just a hunch, I didn't actually measure).
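
A rough way to test this hunch is to time the same scan with different
degrees of parallelism. A minimal sketch, assuming GNU xargs and the
whole-second resolution of "date +%s"; time_scan is a hypothetical name.

```shell
# Hypothetical helper: time one pass of "file" over directory $1 using
# $2 parallel processes (1 = serial). Prints "<procs> procs: <n> s".
time_scan() {
    start=$(date +%s)
    find "$1" -type f -print0 |
        xargs -0 -r -P"$2" -n16 file --no-buffer --no-pad --mime-type > /dev/null
    end=$(date +%s)
    echo "$2 procs: $((end - start)) s"
}
```

For example, "time_scan /usr 1; time_scan /usr 8". The first run also
warms the page cache, so it is fairer to run each variant more than once
and compare the later runs.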

If that's the case, and you do away with "xargs -P", you can use file's
"--files-from" option to read the file list directly from "find":


    find [DIRECTORY] -type f \
       | file --no-buffer --no-pad --mime-type --files-from - \
       |  grep ": application/x-archive" \
       | cut -f1 -d: \
       | xargs strip -g


As such, I don't think this use case justifies a new program in coreutils.


Hope this helps.
regards,
  - assaf

P.S.

If you do need to worry about special characters in filenames (or
filenames with ':'), see file's "--print0" option in addition to
"find -print0 | xargs -0".
