Re: [Gnu-arch-users] Patch Logs vs. character sets


From: Tom Lord
Subject: Re: [Gnu-arch-users] Patch Logs vs. character sets
Date: Fri, 4 Jun 2004 12:12:31 -0700 (PDT)


   Stephen:

    Tom> Too bad.  Welcome to string processing in the 21st century.
    Tom> Get used to it.

    You mean "last quarter of the 20th century".  In _this_ century,
    sane people will use Unicode nearly exclusively for the internals
    of new I18N software, and mostly for I18N external use, too.

I pretty much agree with that, except that I have (I think) a broader
perspective on what it means.

People are used to the various Unicode encoding forms: UTF-8, UTF-16,
UTF-32, and their endian variants.

My view is that there are some additional encoding forms which are
important to support in some cases.   Most of the additional encoding
forms are "degenerate" in the sense that they might not be able to
represent arbitrary Unicode strings, or they might contain codepoints
which are not actually Unicode characters.

Three of the degenerate encoding forms for Unicode that I think about
and support in at least some of my code are:


  iso-8859-1
        Only represents a subset of Unicode, but stores that subset
        as bytes, each containing a Unicode codepoint.

  ascii+non-specific
        An encoding in which programs cannot obtain an answer to
        the question "What is the Nth codepoint?" if the Nth byte
        of a sequence of characters is an integer in the closed range
        128..255.   In other words, in this "encoding form", you know
        that bytes 0..127 represent actual Unicode codepoints,
        but all that you know about the other byte values is that
        they each somehow represent a single codepoint.  (See the
        sketch just after this list.)

  bogus-32
        An encoding that could be described as "unicode+non-specific"
        -- it can handle any 32-bit character set of which Unicode is
        a subset.
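
As a concrete illustration, here is a minimal sketch in C (the names
are mine, not anything from tla or hackerlab) of how the "What is the
Nth codepoint?" question plays out for the first two forms:

    #include <stddef.h>
    #include <stdint.h>

    #define CODEPOINT_UNKNOWN  ((int32_t) -1)

    /* iso-8859-1: every byte is itself a Unicode codepoint
     * (U+0000 .. U+00FF), so the question always has an answer.  */
    static int32_t
    latin1_nth_codepoint (const unsigned char * str, size_t n)
    {
      return (int32_t) str[n];
    }

    /* ascii+non-specific: bytes 0..127 are the ASCII subset of
     * Unicode; each byte in 128..255 stands for some single
     * codepoint, but the encoding form does not say which one.  */
    static int32_t
    ascii_nonspecific_nth_codepoint (const unsigned char * str, size_t n)
    {
      if (str[n] < 128)
        return (int32_t) str[n];
      else
        return CODEPOINT_UNKNOWN;
    }

bogus-32 would be the analogous story with 32-bit code units: every
unit is a codepoint of _some_ 32-bit superset of Unicode, but only the
values that are valid Unicode can be interpreted any further.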

I sometimes think about whether to add the other iso-8859-* variants
to the list but haven't reached any conclusion (other than having not
bothered to do it).

Some rules of thumb:

1) Unless there is a really good reason, don't require anything more than
   ascii+non-specific.   Arch (both tla and the (conceptual)
   specification) is (deliberately) still in this category.

2) Applications inevitably will have to deal with mixed encoding
   forms.

   There is a myth going around that, in the future, everything
   will be UTF-8.   That would be nice if it were true, but it
   surely won't be.   Some applications (e.g., for high-performance
   text processing) will most assuredly _not_ want to use UTF-8
   internally.   For example, such applications may want to trade
   space for speed by using only fixed-width encodings.   On
   the other hand, many _interfaces_ (e.g., to network protocols)
   are very likely to insist on UTF-8 (for example, because of its
   compactness for many kinds of data).

   A simple-minded solution would just impose a thin layer over
   interfaces where that layer converts from one encoding form to
   another.   That would be inefficient and in some cases (e.g.,
   interfaces to side-effecting functions) inadequate.

   Instead, the (relatively small) set of core string-manipulation
   primitives should have interfaces which are encoding-system
   agnostic.   For example, one should be able to (properly)
   concatenate a UTF-8 and UTF-16 string.

   That's a pain in the neck to do --- you need a good library of
   encoding-agnostic string primitives to make it practical.
   But on the other hand, in the end, all of your higher-level code
   is encoding-form agnostic -- expressed just in terms of
   abstract string primitives.   (A sketch of such an interface
   follows this list.)
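
To make that concrete, here is a hypothetical interface sketch in C.
None of these names come from tla or hackerlab; the point is only the
shape of the interface: the encoding form travels with the string, so
a primitive like concatenation can accept mixed forms.

    #include <stddef.h>
    #include <stdint.h>

    enum str_encoding
      {
        str_enc_utf8,
        str_enc_utf16,
        str_enc_utf32,
        str_enc_latin1,
        str_enc_ascii_nonspecific
      };

    typedef struct abstract_string
      {
        enum str_encoding enc;        /* how `bytes' is interpreted */
        unsigned char * bytes;
        size_t size;                  /* size in bytes, not codepoints */
      } abstract_string_t;

    /* Length in codepoints, whatever the encoding form.  */
    extern size_t astr_length (const abstract_string_t * s);

    /* The Nth codepoint (may be "unknown" for ascii+non-specific).  */
    extern int32_t astr_ref (const abstract_string_t * s, size_t n);

    /* Concatenate a and b, which need not share an encoding form;
     * the library picks the result's form (e.g., the wider of the
     * two inputs), not the caller.  */
    extern abstract_string_t * astr_append (const abstract_string_t * a,
                                            const abstract_string_t * b);

Higher-level code -- an archive browser, a log-file formatter -- would
then be written entirely against such primitives and never branch on
the encoding tag itself.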


Applying those rules of thumb to arch, present and future, and to the
log file question in particular:

Log files have two kinds of data: (1) data that is supposed to be
parsable by programs, such as arch itself;  (2) data that is supposed
to be human readable but nothing more.

An easy way to solve (1) without having to think too hard about (2) is
to say that, at least at some layers of arch, all log files are
"ascii+non-specific" text.

What about (2)?   Arch itself doesn't care about how that's encoded
other than that it has to be compatible with ascii+non-specific.
Other programs, such as an archive browser, need to actually care
about data of type (2).

There is a standard solution in this kind of situation: from the
ascii+non-specific point of view there should be some data that says
how the "non-specific" data is encoded.  In this case, that means
adding an encoding header to log messages, picking a namespace for
encodings, making iso-8859-1 the retroactive default, and that's that.
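
A hedged sketch, in C, of what that could look like.  The header name
"Encoding" and the function name are purely illustrative -- arch does
not currently define such a header -- but the fallback shows the
"retroactive default" idea:

    #include <string.h>

    /* Return the declared encoding of a log message, given the raw
     * header block (lines of the form "Name: value", ended by a
     * blank line).  Falls back to iso-8859-1 when no encoding
     * header is present.  */
    static const char *
    log_message_encoding (const char * headers)
    {
      static char buf[64];
      const char * line = headers;

      while (line && *line && *line != '\n')
        {
          if (!strncmp (line, "Encoding: ", 10))
            {
              size_t len = strcspn (line + 10, "\n");
              if (len >= sizeof (buf))
                len = sizeof (buf) - 1;
              memcpy (buf, line + 10, len);
              buf[len] = '\0';
              return buf;
            }
          line = strchr (line, '\n');
          if (line)
            ++line;
        }

      return "iso-8859-1";      /* the retroactive default */
    }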

The funny thing about that solution is that it means arch will work
perfectly well not only for Unicode and subsets of Unicode, but for
_any_ character set that happens to conform to "ascii+non-specific".
I would have to go out of my way to forbid people to use such
character sets for log messages.   

-t







