bug-coreutils

bug#22001: Is it possible to tab separate concatenated files?


From: Linda Walsh
Subject: bug#22001: Is it possible to tab separate concatenated files?
Date: Thu, 26 Nov 2015 15:52:46 -0800
User-agent: Thunderbird





Bob Proulx wrote:

That example shows a completely different problem.  It shows that your
input plain text files have no terminating newline, making them
officially[/sic/] not plain text files but binary files.

Because every plain
text line in a file must be terminated with a newline.
----
   That's only a recent POSIX definition.  It's not related to
real life.  When I looked for a text file definition on Google, nothing
was mentioned about needing a newline on the last line -- except on
1 site -- and that site was clearly not talking about 'text' files, but
Unix-text-record files w/each record terminated by an NL char.

   On a classic Mac, txt files have records separated by 'CR', and on DOS/Win,
txt files have txt records separated by CRLF.  Wikipedia quotes the
Unicode definition of txt files -- which doesn't require the POSIX
txt-record definition.  Also POSIX limits txt format to 'LINE_MAX' bytes --
notice it says 'bytes' and not characters.  Yet a Unicode line of 256
characters can easily exceed 1024 bytes.  Yet never in the history of
the English language have lines been restricted to some number of bytes or
characters.  But one could note that the POSIX definition ONLY refers
to files -- not streams of TEXT (whatever the character set).
   Specifically, note that the definition of 'TEXT COLUMNS' describes text
columns measured in column widths -- yet that conflicts with the
definition of Text File, in that text files use 'bytes' for a maximum
line length, while text columns use 'characters' (which can be
1-4 bytes in Unicode, UTF-8 or UTF-16 encoded).
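
   As a rough illustration of the bytes-vs-characters gap, here is a small
Python sketch.  The helper name line_sizes and the sample lines are made up
for this example; nothing here comes from POSIX or coreutils.  It only shows
that the same 256-character line occupies a different number of bytes
depending on which characters it holds and how they are encoded.

```python
# Hypothetical helper, for illustration only.
def line_sizes(line):
    """Return (characters, UTF-8 bytes, UTF-16 bytes) for one line of text."""
    return (
        len(line),                      # code points ("characters")
        len(line.encode("utf-8")),      # bytes if stored as UTF-8
        len(line.encode("utf-16-le")),  # bytes if stored as UTF-16, no BOM
    )

ascii_line = "a" * 256             # 1 byte per character in UTF-8
accented_line = "\u00e9" * 256     # e-acute: 2 bytes per character in UTF-8
emoji_line = "\U0001F600" * 256    # outside the BMP: 4 bytes per character

for sample in (ascii_line, accented_line, emoji_line):
    print(line_sizes(sample))
```

   A limit stated in bytes therefore allows a different number of characters
per line depending on the text, which is the mismatch described above.
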
   Of specific note -- "text" composed of characters MUST
support 'NUL' (as well as 'the audio bell' (control-g), the
backspace (control-h), vertical tab (U+000B), and form-feed (U+000C)).

   No standard definition outside POSIX includes any of those
characters -- because text characters are supposed to be readable
and visible.  But POSIX compatibility requires that the Portable
Character Set
(http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap06.html#tag_06_01)
must include those characters.

The 'text-files-must-have-NL' group ignores the POSIX 2008 definition of
a portable character set -- but gloms onto the implied definition
of a text line as part of a 'text file'.

   But as already noted, POSIX has conflicting definitions about what text
is (Unicode measured in chars/columns, or ASCII measured in bytes).  But
POSIX 2008 (same url as above) clearly states:
   "A null character, NUL, which has all bits set to zero, shall be in
   the set of [supported] characters."

   In all plain-text definitions, it is mentioned that 'text' is a
set of displayable characters that can be broken into lines with the
text-line separator definition.  The last line of the file needs no
separation character at the end of the line, as it doesn't need to be
separated from anything.

   The GNU standard should not limit itself to an *arcane* (and not well
known outside of POSIX-fans) definition of text, as it makes text files
created before 2008 potentially incompatible.

   POSIX was supposed to be about portability... it certainly doesn't
follow the internet-design-meme of "Accept input liberally, and generate
output conservatively."

If they are
not then it isn't a text line.  Must be binary.
---
   Whereas I maintain that Newlines are required to break plain-text
into records -- but not at the end-of-file, since there is no record
following.
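
   To make the separator-vs-terminator distinction concrete, here is a
minimal Python sketch.  The function name concat_records and its sep
parameter are made up for illustration; this is not how cat or any other
coreutils tool behaves.  It joins the contents of several files and adds a
separator only where the previous file did not already end with one, and
never after the last record.

```python
# Hypothetical helper, for illustration only.
def concat_records(chunks, sep=b"\n"):
    """Join byte strings, using sep as a separator rather than a terminator."""
    out = bytearray()
    for chunk in chunks:
        # Add a separator only if the data so far does not already end
        # with one and there is something to append after it.
        if out and chunk and not out.endswith(sep):
            out += sep
        out += chunk
    return bytes(out)

# Three "files": the first and last have no trailing newline.
files = [b"a\nb", b"c\n", b"d"]
print(concat_records(files))
print(b"".join(files))   # plain concatenation, for comparison
```

   Plain concatenation of the same three inputs runs "b" and "c" together
on one line, which is the symptom discussed in this thread.  Passing
sep=b"\t" would give a tab-separated join instead.
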


Why isn't there a newline at the end of the file?  Fix that and all of
your problems and many others go away.
---
   Didn't use to be a requirement -- it was added because of a broken
interpretation of the POSIX standard.  Please remember that a posixified
definition of 'X' (for any X) may not be the same as a real-life 'X'.

   In this case, we have a file containing *text* by the POSIX
def, which you claim doesn't meet the POSIX definition of "text file".
   It's similar to Orwellian-speak -- redefining common terms to mean
something else, so people don't notice the requirement change, then later
telling others to clean up their old input code/data that doesn't
meet the newly created definition.  Text files have been around a lot
longer than 8 years.  POSIX disqualifies most text files, for example,
those created on the most widely used laptop/desktop/commercial computer OS
in the world (Windows).
   I think what may be true is that 'POSIX text files' describe a data
format that may not be how it is stored on disk.  I find it very
interesting how 'NUL' is defined to be part of any POSIX text character
set definition where such apps claim to support or process 'text'.

   It's sad to see the GNU utils becoming less flexible and more
restricted over time -- much like the trend in computers to steer
the public away from general purpose processing (and computers that
can do such), to a tightly controlled, walled garden where consumers
are only allowed to do what the manufacturer tells them to do.

   I suppose it's like the trend in US government that became federal law
during the Nixon years (use of a product inconsistent with its
labeling is a violation of federal law), whereas before, any usage that
wasn't prohibited by local law was allowed.  It is moving away
from a free society with specific restrictions to a controlled society
with specific, limited freedoms.















