bug-parallel




From: Tim Mattison
Subject: GNU Parallel Bug Reports Unexpected behavior when handling binary data with --regexp and --recstart
Date: Wed, 17 Jun 2015 16:12:34 -0700 (PDT)

I have some pure binary data that I want to process with GNU Parallel.  I saw an example somewhere showing how to use --regexp and --recstart to specify a binary pattern.  It seemed to work at first, but after running it for a while I noticed that it sometimes appears to miss the binary pattern.  I wrote a script that reproduces the issue and would like to know whether this behavior is expected.

The script creates a file that contains the binary pattern 00 00 00 01 67 (hex) three times.  Each instance of the pattern is followed immediately by the byte AA, BB, or CC, and then by 1000 bytes of zeroes.
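The layout described above can be sanity-checked by size: each record is 5 pattern bytes, 1 marker byte, and 1000 zero bytes, so three records should total 3018 bytes. The following is a minimal sketch, not part of the original script. It assumes a POSIX-style printf with octal escapes and uses a hypothetical file name (layout-check.raw) so that test.raw is left alone.

```shell
# Build the described layout: pattern + marker byte + 1000 zero bytes, three times.
# layout-check.raw is a hypothetical name, chosen to avoid clobbering test.raw.
out=layout-check.raw
: > "$out"
for marker in '\252' '\273' '\314'; do      # octal for 0xAA, 0xBB, 0xCC
  printf '\000\000\000\001\147' >> "$out"   # bytes 00 00 00 01 67
  printf "$marker" >> "$out"
  dd if=/dev/zero bs=1000 count=1 2> /dev/null >> "$out"
done

# wc -c pads its output with spaces on some systems, so strip them.
size=$(wc -c < "$out" | tr -d ' ')
echo "size: $size bytes"
```

If the size comes out smaller than 3018, the zero padding did not reach the file, which is worth ruling out before looking at how the input is split.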

GNU grep reports that it sees this binary pattern three times.  GNU Parallel, however, splits the input into only two files, and the first file contains both the AA and BB instances of the pattern.

Do I need to do something else, or specify the pattern in a different way, so that every instance is detected?

This script is written to run on Mac OS.  If you are running on Linux you'll need to change "ggrep" to "grep".

Thanks,
Tim

-- CUT HERE --
# Remove any old test results
rm -f *.test-result

# Create the test file
echo -n "Creating test file... "
echo -ne '\x00\x00\x00\x01\x67\xaa' > test.raw
# Silence only dd's stderr; in bash, "&> /dev/null" after ">> test.raw" would also send the zeroes to /dev/null
dd if=/dev/zero bs=1000 count=1 2> /dev/null >> test.raw
echo -ne '\x00\x00\x00\x01\x67\xbb' >> test.raw
dd if=/dev/zero bs=1000 count=1 2> /dev/null >> test.raw
echo -ne '\x00\x00\x00\x01\x67\xcc' >> test.raw
dd if=/dev/zero bs=1000 count=1 2> /dev/null >> test.raw
echo "done."

# Count the number of times grep finds this pattern (using ggrep since we're on Mac OS)
echo -n "Instances of pattern found with grep: "
ggrep -obUaP '\x00\x00\x00\x01\x67' test.raw | wc -l

# Have GNU Parallel split up the file based on the given pattern as a regexp
cat test.raw | parallel -k --pipe --regexp --recstart '\x00\x00\x00\x01\x67' --recend '' cat\>{#}.test-result &> /dev/null

# Count the number of output files GNU Parallel created
echo -n "Output files from GNU Parallel: "
ls -la *.test-result | wc -l

# Remove the test results.  Comment this out if you want to examine them after the fact.
rm *.test-result

