
bug#22838: New 'Binary file' detection considered harmful


From: Eric Blake
Subject: bug#22838: New 'Binary file' detection considered harmful
Date: Mon, 29 Feb 2016 10:54:52 -0700
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:38.0) Gecko/20100101 Thunderbird/38.6.0

On 02/29/2016 10:40 AM, Marcello Perathoner wrote:
>> Wrong, at least according to the POSIX definition of text file.  A text
>> file is one with no encoding errors.
> 
> 
> """
> 3.397 Text File
> 
> A file that contains characters organized into zero or more lines. The
> lines do not contain NUL characters and none can exceed {LINE_MAX} bytes
> in length, including the <newline> character. Although POSIX.1-2008 does
> not distinguish between text files and binary files (see the ISO C
> standard), many utilities only produce predictable or meaningful output
> when operating on text files. The standard utilities that have such
> restrictions always specify "text files" in their STDIN or INPUT FILES
> sections.

http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap03.html

> 
> 3.206 Line
> 
> A sequence of zero or more non-<newline> characters plus a terminating
> <newline> character.
> 
> 3.87 Character
> 
> A sequence of one or more bytes representing a single graphic symbol or 
> control code.
> 
> Note:
> This term corresponds to the ISO C standard term multi-byte character, where 
> a single-byte character is a special case of a multi-byte character. Unlike 
> the usage in the ISO C standard, character here has no necessary relationship 
> with storage space, and byte is used when storage space is discussed.
> 
> See the definition of the portable character set in Portable Character Set 
> for a further explanation of the graphical representations of (abstract) 
> characters, as opposed to character encodings.
> 

Encoding errors are not characters; they are bytes that do not form a
character.  A line is a sequence of characters, so a line cannot contain
encoding errors.  Therefore, a file with encoding errors is not a text file.
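
To make the argument concrete, here is a minimal Python sketch of a check
that follows the quoted definitions. It is only an illustration, not how
grep decides anything. The function name is_posix_text_file is
hypothetical, the encoding is assumed to be UTF-8 (or another
ASCII-compatible encoding, so that splitting on the newline byte is safe),
and 2048 is assumed as the limit because it is the minimum value POSIX
allows for {LINE_MAX}; a real system may use a larger one.

```python
def is_posix_text_file(data: bytes, encoding: str = "utf-8",
                       line_max: int = 2048) -> bool:
    """Rough check of the POSIX 'text file' definition (illustrative only).

    Assumptions: `encoding` is ASCII-compatible and `line_max` stands in
    for the system's {LINE_MAX}.
    """
    # Lines must not contain NUL characters.
    if b"\0" in data:
        return False

    # Every byte must belong to a character: an encoding error means the
    # file holds bytes that are not characters, so it is not text.
    try:
        data.decode(encoding)
    except UnicodeDecodeError:
        return False

    # A line ends with a terminating <newline>; an empty file is zero lines.
    if data and not data.endswith(b"\n"):
        return False

    # No line may exceed line_max bytes, counting the <newline> itself.
    # The final element of split() is the empty tail after the last newline.
    lines = data.split(b"\n")[:-1]
    return all(len(line) + 1 <= line_max for line in lines)
```

Under these assumptions, b"caf\xe9\n" is rejected when checked as UTF-8
because the byte 0xE9 does not form a character, while the same word
encoded as valid UTF-8 is accepted.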

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org
