bug-parallel



From: Tim Mattison
Subject: Re: GNU Parallel Bug Reports Unexpected behavior when handling binary data with --regexp and --recstart
Date: Thu, 18 Jun 2015 03:26:50 -0700 (PDT)

For dd I added "&> /dev/null" at the end to clean up the on-screen output, not realizing that it would redirect everything. I should have tested that, but dd does write to stdout if you don't specify the "of" option. This has the effect that I'm looking for:

dd if=/dev/random bs=1000 count=1 > test.raw 2> /dev/null
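
To make the redirection issue concrete, here is a minimal bash sketch (printf stands in for dd, and the temporary file names are made up for illustration). Bash applies redirections left to right, so a trailing "&>" re-points stdout away from the file that ">>" had just opened:

```bash
#!/usr/bin/env bash
# Sketch: why '>> file &> /dev/null' leaves the file empty (bash-specific '&>').
tmp=$(mktemp -d)

# '>>' opens the file first, then '&>' sends both stdout and stderr to /dev/null.
printf 'payload' >> "$tmp/both.raw" &> /dev/null

# Redirecting only stderr keeps stdout pointed at the file.
printf 'payload' >> "$tmp/stderr_only.raw" 2> /dev/null

both_size=$(( $(wc -c < "$tmp/both.raw") ))
stderr_only_size=$(( $(wc -c < "$tmp/stderr_only.raw") ))
echo "with &>: $both_size bytes, with 2>: $stderr_only_size bytes"
rm -rf "$tmp"
```

The first file is created but stays empty, which matches what happened to test.raw in the original script.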

However, I did use ggrep in the example, which is GNU grep installed via brew, so I could get support for the -P option. Mac OS ships some really weird versions of these tools.

When I pipe a large amount of data to this script I do get multiple chunks, but seemingly at random I get a chunk that contains two sections instead of one. If I modify the test scripts to use "-kN1" instead of "-k", they seem to do what I want. I've attached a new test script below.

"-N1" forces it to do one job at a time, is that correct? If so, why would that change this behavior? If not, what does -N1 do that fixes this?
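
For anyone who wants to compare the two option sets side by side, here is a small sketch using the printable test data from later in this thread. It assumes the GNU Parallel manual's description that, combined with --pipe, -N sets the number of records handed to each job (concurrency is controlled by -j), while without -N records are grouped into blocks of roughly --block size. The temporary directory and the guard around parallel are additions for illustration only:

```bash
#!/usr/bin/env bash
# Sketch: run the same split with -k and with -kN1 and report the file counts.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Three 16-byte records, each starting with the marker 'ABCx'.
printf 'ABCxaa1234567890ABCxbb1234567890ABCxcc1234567890' > test.raw

if command -v parallel > /dev/null; then
    for opts in -k -kN1; do
        rm -f ./*.test-result
        parallel "$opts" --pipe --regexp --recstart 'ABCx' --recend '' \
            'cat > {#}.test-result' < test.raw 2> /dev/null
        count=$(( $(ls ./*.test-result 2> /dev/null | wc -l) ))
        echo "$opts: $count output file(s)"
    done
else
    echo "GNU Parallel not found; skipping the comparison"
fi
```

The counts may differ between GNU Parallel versions, so treat the output as an observation about the installed release rather than as documented behavior.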

Thanks,
Tim

-- CUT HERE --
#!/usr/bin/env bash

# Remove any old test results
rm -f *.test-result

# Create the test file
echo -n "Creating test file... "
echo -ne '\x00\x00\x00\x01\x67\xaa' > test.raw
dd if=/dev/zero bs=1000 count=1 >> test.raw 2> /dev/null
echo -ne '\x00\x00\x00\x01\x67\xbb' >> test.raw
dd if=/dev/zero bs=1000 count=1 >> test.raw 2> /dev/null
echo -ne '\x00\x00\x00\x01\x67\xcc' >> test.raw
dd if=/dev/zero bs=1000 count=1 >> test.raw 2> /dev/null
echo "done."

# Validate that the test file is the correct size
expected_size=3018
actual_size=$(wc -c <"test.raw")
if [ $actual_size -ne $expected_size ]; then
    echo "Size is incorrect. Expected: $expected_size, actual: $actual_size"
    exit 1
fi

# Count the number of times grep finds this pattern (using ggrep since we're on Mac OS)
echo -n "Instances of pattern found with grep: "
ggrep -obUaP '\x00\x00\x00\x01\x67' test.raw | wc -l

# Have GNU Parallel split up the file based on the given pattern as a regexp
cat test.raw | parallel -k --pipe --regexp --recstart '\x00\x00\x00\x01\x67' --recend '' cat\>{#}.test-result &> /dev/null

# Count the number of output files GNU Parallel created
echo -n "Output files from GNU Parallel with -k option: "
ls -la *.test-result | wc -l

# Remove the test results for the second run
rm *.test-result

# Have GNU Parallel split up the file again, this time with -kN1 instead of -k
cat test.raw | parallel -kN1 --pipe --regexp --recstart '\x00\x00\x00\x01\x67' --recend '' cat\>{#}.test-result &> /dev/null

# Count the number of output files GNU Parallel created
echo -n "Output files from GNU Parallel with -kN1 option: "
ls -la *.test-result | wc -l

# Remove the test results
rm *.test-result




On Thu, Jun 18, 2015 at 3:21 AM, Andreas Bernauer <address@hidden> wrote:

On 18/06/15 1:12, Tim Mattison wrote:
> I have some data that I want to process with GNU Parallel that is pure
> binary data. I saw an example somewhere that showed how I could use
> --regexp and --recstart to specify a binary pattern and it seemed like
> it worked at first but after running it for a while I noticed that it
> appeared to be missing the binary pattern sometimes. I wrote a script
> that reproduces this issue and wanted to see if someone could explain if
> this is expected or not.

While I can reproduce your 'bug', your script does not do what you think
it does. :-)

(dd does not output to stdout, MacOS' grep does not understand the -P
(--perl-regexp) option, and the script should be run with bash (for
echo's '-n').)

Anyhow, I attached an updated script test2.sh, which still shows the
'bug'. parallel seems to split only the last record.

I put 'bug' in quotes, as it does not seem to be related to binary
data. The 'bug' appears with regular (printable) data too; see the attached
test3.sh script. I suppose the record splitting feature is tested, so
perhaps we are not using it properly?

-Andreas
~~~~~~~~~~~~~~~~
$ ls
test.sh* test2.sh*
$ ./test2.sh
Creating test file... done.
Instances of pattern found with grep: 3
Output files from GNU Parallel: 2
test.raw:
0000000 01 67 aa 00 00 00 00 00 00 00 00 00 00 00 00 00
0000010 01 67 bb 00 00 00 00 00 00 00 00 00 00 00 00 00
0000020 01 67 cc 00 00 00 00 00 00 00 00 00 00 00 00 00
0000030
parallel's results:
1.test-result
0000000 01 67 aa 00 00 00 00 00 00 00 00 00 00 00 00 00
0000010 01 67 bb 00 00 00 00 00 00 00 00 00 00 00 00 00
0000020
2.test-result
0000000 01 67 cc 00 00 00 00 00 00 00 00 00 00 00 00 00
0000010
$ parallel --version
GNU parallel 20150522
Copyright (C) 2007,2008,2009,2010,2011,2012,2013,2014,2015 Ole Tange
and Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later
<http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
GNU parallel comes with no warranty.

Web site: http://www.gnu.org/software/parallel

When using programs that use GNU Parallel to process data for publication
please cite as described in 'parallel --bibtex'.
~~~~~~~~~~~~~~~~

With ASCII data in test.raw:
~~~~~~~~~~~~~~~
$ ./test3.sh
Creating test file... done.
Instances of pattern found with grep: 3
Output files from GNU Parallel: 2
test.raw:
0000000 41 42 43 78 61 61 31 32 33 34 35 36 37 38 39 30
0000010 41 42 43 78 62 62 31 32 33 34 35 36 37 38 39 30
0000020 41 42 43 78 63 63 31 32 33 34 35 36 37 38 39 30
0000030
parallel's results:
1.test-result
0000000 41 42 43 78 61 61 31 32 33 34 35 36 37 38 39 30
0000010 41 42 43 78 62 62 31 32 33 34 35 36 37 38 39 30
0000020
2.test-result
0000000 41 42 43 78 63 63 31 32 33 34 35 36 37 38 39 30
0000010
$ cat test.raw
ABCxaa1234567890ABCxbb1234567890ABCxcc1234567890
~~~~~~~~~~~~~~~~
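
The merged-record symptom in the listings above can also be checked per output file. The following sketch counts record starts in each chunk. count_starts is a hypothetical helper, not part of GNU Parallel, and the two demo files simply recreate the test3.sh result shown above. It uses a literal (-F) match, so it only suits printable markers; for the binary marker, GNU grep's -P as used in the test script would be needed instead.

```bash
#!/usr/bin/env bash
# Sketch: report how many record starts each chunk file contains.
count_starts() {
    local pattern=$1 file=$2
    # -o: one line per match, -a: treat binary as text, -F: literal pattern
    echo $(( $(grep -oaF -- "$pattern" "$file" | wc -l) ))
}

# Recreate the two chunks from the test3.sh run shown above.
demo=$(mktemp -d)
printf 'ABCxaa1234567890ABCxbb1234567890' > "$demo/1.test-result"
printf 'ABCxcc1234567890' > "$demo/2.test-result"

for f in "$demo"/*.test-result; do
    echo "$(basename "$f"): $(count_starts 'ABCx' "$f") record start(s)"
done
```

Any file reporting more than one record start is a chunk in which two records were not split apart.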


>
> The script creates a file that has the binary pattern 0000000167 in it
> three times. Each instance of the pattern is followed immediately by
> AA, BB, or CC, and then 1000 bytes of zeroes.
>
> GNU grep reports that it sees this binary pattern three times. GNU
> Parallel splits this up into two files though and the first file has the
> AA and BB instances of the pattern in it.
>
> Do I need to do something else to make sure this pattern is checked for
> in a different way?
>
> This script is written to run on Mac OS. If you are running on Linux
> you'll need to change "ggrep" to "grep".
>
> Thanks,
> Tim
>
> -- CUT HERE --
> # Remove any old test results
> rm -f *.test-result
>
> # Create the test file
> echo -n "Creating test file... "
> echo -ne '\x00\x00\x00\x01\x67\xaa' > test.raw
> dd if=/dev/zero bs=1000 count=1 >> test.raw &> /dev/null
> echo -ne '\x00\x00\x00\x01\x67\xbb' >> test.raw
> dd if=/dev/zero bs=1000 count=1 >> test.raw &> /dev/null
> echo -ne '\x00\x00\x00\x01\x67\xcc' >> test.raw
> dd if=/dev/zero bs=1000 count=1 >> test.raw &> /dev/null
> echo "done."
>
> # Count the number of times grep finds this pattern (using ggrep since we're on Mac OS)
> echo -n "Instances of pattern found with grep: "
> ggrep -obUaP '\x00\x00\x00\x01\x67' test.raw | wc -l
>
> # Have GNU Parallel split up the file based on the given pattern as a regexp
> cat test.raw | parallel -k --pipe --regexp --recstart '\x00\x00\x00\x01\x67' --recend '' cat\>{#}.test-result &> /dev/null
>
> # Count the number of output files GNU Parallel created
> echo -n "Output files from GNU Parallel: "
> ls -la *.test-result | wc -l
>
> # Remove the test results. Comment this out if you want to examine them after the fact.
> rm *.test-result
>
<test2.sh><test3.sh>

