bug-global

Re: Binary recognition is to narrow [new suggestion]


From: Hideki IWAMOTO
Subject: Re: Binary recognition is to narrow [new suggestion]
Date: Tue, 24 Nov 2009 22:24:59 +0900

Hi.
> I would like to make the 512 a customizable variable too.

I examined the performance using the attached test program and
confirmed that execution time does not differ significantly at sizes
up to 512. I think that 512 is an appropriate value.


$ foreach a ( 32 64 128 256 512 1024 2048 4096 8192 )
foreach? rm -fr linux-2.6.31; tar xfj ~/download/linux/linux-2.6.31.tar.bz2; sync
foreach? time sh -c 'find linux-2.6.31 -type f | test_isbinary '$a' > /dev/null'
foreach? end
0.060u 0.348s 0:00.30 133.3%    0+0k 0+0io 0pf+0w
0.048u 0.344s 0:00.31 122.5%    0+0k 0+0io 0pf+0w
0.088u 0.340s 0:00.32 131.2%    0+0k 0+0io 0pf+0w
0.076u 0.364s 0:00.32 134.3%    0+0k 0+0io 0pf+0w
0.084u 0.372s 0:00.34 132.3%    0+0k 0+0io 0pf+0w
0.112u 0.368s 0:00.37 127.0%    0+0k 0+0io 0pf+0w
0.152u 0.368s 0:00.42 121.4%    0+0k 0+0io 0pf+0w
0.260u 0.368s 0:00.51 121.5%    0+0k 0+0io 0pf+0w
0.388u 0.368s 0:00.75 98.6%     0+0k 0+0io 0pf+0w


On Sat, 21 Nov 2009 15:42:11 +0900, Shigio YAMAGUCHI wrote...
> > Instead of counting characters over 127, the only test is that the first
> > 511 bytes don't contain any of the control characters 0-8, 14-31. No
> > normal text file would contain these.
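
As a minimal sketch of the test described above (this is not the attached
patch and not GLOBAL's actual code), the check could look like this in
Python. The names FORBIDDEN, looks_binary and file_looks_binary are made up
for illustration, and the default of 512 bytes is an assumption taken from
the size discussed in this thread.

```python
# Control bytes treated as evidence of binary data: 0-8 and 14-31.
# Tab (9), LF (10), VT (11), FF (12) and CR (13) are deliberately allowed.
FORBIDDEN = frozenset(range(0, 9)) | frozenset(range(14, 32))


def looks_binary(data: bytes, size: int = 512) -> bool:
    """Return True if any of the first `size` bytes is a forbidden control byte."""
    return any(b in FORBIDDEN for b in data[:size])


def file_looks_binary(path: str, size: int = 512) -> bool:
    """Apply looks_binary() to the first `size` bytes of the file at `path`."""
    with open(path, "rb") as f:
        return looks_binary(f.read(size), size)
```

Only the head of the file is read, so the cost per file is bounded by the
chosen size regardless of how large the file is.
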
> > 
> > Assuming that binary data is random, the probability of an incorrectly
> > tagged binary would be
> > 
> > ((256-8-18)/256)^511=.00000000000000000000000170726
> > 
> > just testing 127 bytes would be a bit too little
> > 
> > ((256-8-18)/256)^127=.00000123868
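
The two figures above can be reproduced with a few lines of Python. This is
only a sketch under the same assumption as the quoted formula (uniformly
random, independent bytes); the function name miss_probability and its
`forbidden` parameter are illustrative. The default of 26 mirrors the
"256-8-18" in the formula. Counted inclusively, the ranges 0-8 and 14-31
cover 27 values, so passing 27 gives a slightly smaller probability.

```python
def miss_probability(n_bytes: int, forbidden: int = 26) -> float:
    """Chance that n_bytes uniformly random bytes contain no forbidden value."""
    return ((256 - forbidden) / 256) ** n_bytes


# Same inputs as the quoted formulas: 511 bytes and 127 bytes.
print(f"511 bytes: {miss_probability(511):.3e}")
print(f"127 bytes: {miss_probability(127):.3e}")
```

The results should come out at roughly 1.7e-24 and 1.2e-6, in line with the
values quoted above.
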
> 
> This is a very interesting idea.
> 
> > One of the benefits is that this will correctly tag files in Unicode as
> > text as well, since those control characters never appear in Unicode
> > either.
> 
> This is a big merit.
> Most other multi-byte character sets are sure to be designed like that.
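
The claims about Unicode and other multi-byte encodings can be spot-checked
with a small Python sketch. The sample string and the helper name
has_forbidden_bytes are made up for illustration, and the list of encodings
is only a sample. One caveat worth hedging: UTF-8, EUC-JP and Shift_JIS keep
multi-byte sequences out of the control range, but UTF-16 encodes ASCII
characters with a zero byte, so UTF-16 text would be flagged by this test.

```python
# Control bytes 0-8 and 14-31, as in the proposed test.
FORBIDDEN = frozenset(range(0, 9)) | frozenset(range(14, 32))


def has_forbidden_bytes(text: str, encoding: str) -> bool:
    """Encode `text` and report whether any forbidden control byte appears."""
    return any(b in FORBIDDEN for b in text.encode(encoding))


sample = "gtags: 日本語のテキスト\n"  # hypothetical sample text
for enc in ("utf-8", "euc_jp", "shift_jis", "utf-16-le"):
    print(enc, has_forbidden_bytes(sample, enc))
```
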
> 
> I would like to make the 512 a customizable variable too.
> 
> $ gtags                         ... use conventional test
> 
> [File gtags.conf]
> +----------------------------
> |...
> |       :binarytest_size=512:...  ----------------------------------+
> |                                                                   |
>                                                                     v
> $ gtags                         ... use new test using the first n=512 bytes
> 
> After testing for a while, we can decide what we should do.
> Thank you for your profitable consideration.
> --
> Shigio YAMAGUCHI <address@hidden>
> PGP fingerprint: D1CB 0B89 B346 4AB6 5663  C4B6 3CA5 BBB3 57BE DDA3

----
Hideki IWAMOTO  address@hidden

Attachment: 20091124-test_isbinary.patch
Description: Binary data

