
bug#73360: Error when a long list is provided to grep with "--binary-files=without-match" option


From: Rodrigo Jorge
Subject: bug#73360: Error when a long list is provided to grep with "--binary-files=without-match" option
Date: Fri, 20 Sep 2024 11:22:17 -0300

Ok, I found out more. Since the problem first appears at exactly
"xargs -n 2872", I ran xargs again with the "-t" flag to print the
command, and noticed that the 2 missing files were exactly the last 2
in the command's file list.

grep -Il . "{ 2870 files }" ./apex/images/apex_ui/psd/apex_5_ui.ai
./apex/images/apex_ui/psd/apex-logo.ai

Now if I run:

[user@server folder]$ cat /tmp/cmd1
grep -Il . ./apex/images/apex_ui/psd/apex_5_ui.ai ./apex/images/apex_ui/psd/apex-logo.ai ... "{ 2870 files }"

[user@server folder]$ wc -c /tmp/cmd1
131049 /tmp/cmd1

[user@server folder]$ cat /tmp/cmd2
grep -Il . "{ 2870 files }" ./apex/images/apex_ui/psd/apex_5_ui.ai
./apex/images/apex_ui/psd/apex-logo.ai
[user@server folder]$ wc -c /tmp/cmd2
131049 /tmp/cmd2


[user@server folder]$ sh /tmp/cmd1 | wc -l
1072
[user@server folder]$ sh /tmp/cmd2 | wc -l
1070

In other words, depending on the location on the command line where those 2
files are provided to grep, we will have a different result.
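
The order dependence can be checked in a more repeatable way with a small
script. This is only a sketch: compare_orders is a hypothetical helper name,
it assumes a list with one path per line and no whitespace in the paths (like
/tmp/file.list), and it relies on tac from GNU coreutils. Both outputs are
sorted so that only the set of reported files is compared.

```shell
#!/bin/sh
# Hypothetical helper: run "grep -Il ." over a file list in its original
# order and in reversed order, then compare the sets of reported files.
# Assumes one path per line and no whitespace inside paths.
compare_orders() {
    list=$1
    fwd=$(mktemp) || return 2
    rev=$(mktemp) || return 2
    # Original order; sort so that only the set of names is compared.
    xargs grep -Il . < "$list" | sort > "$fwd"
    # Reversed order (tac comes from GNU coreutils).
    tac "$list" | xargs grep -Il . | sort > "$rev"
    # diff prints any file reported in one order but not the other.
    diff "$fwd" "$rev"
    status=$?
    rm -f "$fwd" "$rev"
    return "$status"
}
```

Usage would be "compare_orders /tmp/file.list"; an exit status of 0 means both
orders reported the same files. For a list longer than one command line, xargs
splits it into several grep runs, so reversing the list also changes the
batches.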

Can I run those 2 grep commands with some sort of debug flag and send the
output back for analysis? The file list is exactly the same in both; only
the file order changes.
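
Another check that needs no debug flag is to look at where the NUL bytes sit
in the 2 files. This assumes that NUL bytes are what makes "grep -I" treat a
file as binary, which is my assumption here and not something confirmed in
this thread. The helper name has_nul_in_prefix is hypothetical, and "head -c"
is assumed to be available (it is in GNU coreutils).

```shell
#!/bin/sh
# Hypothetical helper: succeed (exit 0) if the first N bytes of FILE
# contain at least one NUL byte, fail (exit 1) otherwise.
has_nul_in_prefix() {
    file=$1
    n=$2
    # Size of the prefix actually read (the file may be shorter than N).
    total=$(( $(head -c "$n" "$file" | wc -c) ))
    # Size of the same prefix after deleting every NUL byte.
    kept=$(( $(head -c "$n" "$file" | tr -d '\000' | wc -c) ))
    # If deleting NULs shrank the prefix, there was at least one NUL.
    [ "$kept" -lt "$total" ]
}
```

For example, "has_nul_in_prefix ./apex/images/apex_ui/psd/apex-logo.ai 32768"
(32768 is an arbitrary size chosen for illustration) would tell whether a NUL
appears near the start of the file or only further in.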

Thanks,
Rodrigo

On Fri, Sep 20, 2024 at 10:54 AM Rodrigo Jorge <rodrigoaraujorge@gmail.com>
wrote:

> I could reproduce the same issue without xargs, so I think we can take it
> out of the picture:
>
> [user@server folder]$ find -type f -not -path "./.patch_storage/*" -not
> -name "tfa_setup" -print > /tmp/file.list
> [user@server folder]$ wc -l /tmp/file.list
> 37443 /tmp/file.list
>
> [user@server folder]$ cat /tmp/file.list | xargs -n 100 grep -Il '.' >
> /tmp/list1.list
> [user@server folder]$ wc -l /tmp/list1.list
> 23405 /tmp/list1.list
>
> [user@server folder]$ grep -Il '.' $(cat /tmp/file.list) > /tmp/list2.list
> [user@server folder]$ wc -l /tmp/list2.list
> 23403 /tmp/list2.list
>
> [user@server folder]$ diff /tmp/list1.list /tmp/list2.list
> 12268,12269d12267
> < ./apex/images/apex_ui/psd/apex_5_ui.ai
> < ./apex/images/apex_ui/psd/apex-logo.ai
> [user@server folder]$
>
> So we can see that running "grep -Il '.' $(cat /tmp/file.list)" also
> skips those 2 files, unless the real problem is the reverse and xargs is
> somehow adding those 2 files.
>
> Those files are PDFs:
>
> [user@server folder]$ file ./apex/images/apex_ui/psd/apex_5_ui.ai
> ./apex/images/apex_ui/psd/apex_5_ui.ai: PDF document, version 1.5
> [user@server folder]$ file ./apex/images/apex_ui/psd/apex-logo.ai
> ./apex/images/apex_ui/psd/apex-logo.ai: PDF document, version 1.5
>
> [user@server folder]$ head ./apex/images/apex_ui/psd/apex_5_ui.ai
> %����1.5
> <</Length 39582/Subtype/XML/Type/Metadata>>stream8 0 R 209 0 R]/ON[6 0 R 7
> 0 R 210 0 R]/Order 211 0 R/RBGroups[]>>/OCGs[6 0 R 7 0 R 5 0 R 208 0 R 210
> 0 R 209 0 R]>>/Pages 3 0 R/Type/Catalog>>
> <?xpacket begin="" id="W5M0MpCehiHzreSzNTczkc9d"?>
> <x:xmpmeta xmlns:x="adobe:ns:meta/" x:xmptk="Adobe XMP Core 5.3-c011
> 66.145661, 2012/02/06-14:56:27        ">
>    <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
>       <rdf:Description rdf:about=""
>             xmlns:dc="http://purl.org/dc/elements/1.1/">
>          <dc:format>application/pdf</dc:format>
>          <dc:title>
>             <rdf:Alt>
>
> I could also find exactly the point where it breaks:
>
> [user@server folder]$ cat /tmp/file.list | xargs -n 100 grep -Il '.' | wc
> -l
> 23405
> [user@server folder]$ cat /tmp/file.list | xargs -n 1000 grep -Il '.' |
> wc -l
> 23405
> [user@server folder]$ cat /tmp/file.list | xargs -n 2000 grep -Il '.' |
> wc -l
> 23405
> [user@server folder]$ cat /tmp/file.list | xargs -n 2871 grep -Il '.' |
> wc -l
> 23405
> [user@server folder]$ cat /tmp/file.list | xargs -n 2872 grep -Il '.' |
> wc -l
> 23403
>
> I will reply shortly with the strace findings.
>
> On Fri, Sep 20, 2024 at 10:32 AM David G. Pickett <dgpickett@aol.com>
> wrote:
>
>> While the output may be bulky, on Linux you can try the strace command to
>> see exactly what it is up to.  It will show the execvp() call, for
>> instance.  You might need a bigger -s!
>>
>> $ strace -f -v -s 262144 <YOUR_CMD>
>>
>> On Thursday, September 19, 2024 at 10:29:30 AM EDT, Rodrigo Jorge <
>> rodrigoaraujorge@gmail.com> wrote:
>>
>>
>> Hello. I'm trying to use grep to get the list of all non-binary files in a
>> given folder. I tried both the 2.20 and the 3.11 releases.
>>
>> For some reason, grep is providing 2 false negatives when the list is
>> huge.
>> This issue does not happen if I break the grep input with "xargs -n X".
>>
>> Check below:
>>
>> [opc@oradiff-core dbhome_1]$ grep -V
>> grep (GNU grep) 3.11
>> Copyright (C) 2023 Free Software Foundation, Inc.
>> License GPLv3+: GNU GPL version 3 or later <
>> https://gnu.org/licenses/gpl.html>.
>> This is free software: you are free to change and redistribute it.
>> There is NO WARRANTY, to the extent permitted by law.
>>
>> Written by Mike Haertel and others; see
>> <https://git.savannah.gnu.org/cgit/grep.git/tree/AUTHORS>.
>>
>> [opc@oradiff-core dbhome_1]$ find -type f -not -path "./.patch_storage/*"
>> -not -name "tfa_setup" -print0 2>> /tmp/error.list | xargs -0 -n 100 grep
>> -Il '.' > /tmp/list1.list
>>
>> [opc@oradiff-core dbhome_1]$ find -type f -not -path "./.patch_storage/*"
>> -not -name "tfa_setup" -print0 2>> /tmp/error.list | xargs -0 grep -Il '.'
>> > /tmp/list2.list
>>
>> [opc@oradiff-core dbhome_1]$ diff /tmp/list1.list /tmp/list2.list
>> 12268,12269d12267
>> < ./apex/images/apex_ui/psd/apex_5_ui.ai
>> < ./apex/images/apex_ui/psd/apex-logo.ai
>>
>> [opc@oradiff-core dbhome_1]$ wc -l /tmp/list1.list /tmp/list2.list
>>   23397 /tmp/list1.list
>>   23395 /tmp/list2.list
>>   46792 total
>>
>> The output should not show any difference.
>>
>> The same issue was also reproduced in grep 2.20.
>>
>> Thanks,
>> Rodrigo
>>
>

