pspp-dev

Re: even more about character encoding names


From: John Darrington
Subject: Re: even more about character encoding names
Date: Wed, 5 Jan 2011 12:02:25 +0000
User-agent: Mutt/1.5.18 (2008-05-17)

This seems to cover everything.  

A purist might object to calling windows-1252 a "superset" of iso-8859-1 ...
they are just two different encodings, which happen to have large parts of
their mappings identical.
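
To make the "large parts identical" point concrete, here is a minimal Python
sketch. It is not part of PSPP; it assumes Python's built-in `latin-1` and
`cp1252` codecs are acceptable stand-ins for iso-8859-1 and windows-1252, and
the helper name `differing_bytes` is made up for illustration. It decodes every
single-byte value under both codecs and lists the bytes where the results
disagree (a byte that one codec leaves undefined counts as a disagreement).

```python
def differing_bytes(enc_a="latin-1", enc_b="cp1252"):
    """Return the byte values that decode differently under the two codecs."""
    diffs = []
    for value in range(256):
        raw = bytes([value])
        decoded = []
        for enc in (enc_a, enc_b):
            try:
                decoded.append(raw.decode(enc))
            except UnicodeDecodeError:
                # The byte is unassigned in this codec's table.
                decoded.append(None)
        if decoded[0] != decoded[1]:
            diffs.append(value)
    return diffs


if __name__ == "__main__":
    diffs = differing_bytes()
    print("bytes that differ:", " ".join("0x%02X" % b for b in diffs))
```

With these codec tables the two encodings agree everywhere except the 32 bytes
0x80 through 0x9F, where Python's `latin-1` yields C1 control characters and
`cp1252` yields printable characters or leaves the byte undefined.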

J'

On Mon, Jan 03, 2011 at 10:45:12AM -0800, Ben Pfaff wrote:
     
     I think you've told me all of this before.  It's time to write it
     down.  Here's what I have as an update to
     system-file-format.texi.  Can you look it over and verify that it
     looks accurate?  Also, if you have any system files locally that
     have other codepage numbers not already mentioned, please let me
     know which ones and I'll add them to the list.
     
     --8<--------------------------cut here-------------------------->8--
     
     From: Ben Pfaff <address@hidden>
     Date: Mon, 3 Jan 2011 10:43:21 -0800
     Subject: [PATCH] doc: Update description of character encoding information in system files.
     
     Based on information provided by John Darrington and on system files
     obtained freely from the Internet.
     ---
      doc/dev/system-file-format.texi |   66 +++++++++++++++++++++++++++++++++------
      1 files changed, 56 insertions(+), 10 deletions(-)
     
     diff --git a/doc/dev/system-file-format.texi b/doc/dev/system-file-format.texi
     index 972b133..bf376b5 100644
     --- a/doc/dev/system-file-format.texi
     +++ b/doc/dev/system-file-format.texi
     @@ -549,14 +549,46 @@ Compression code.  Always set to 1.
      Machine endianness.  1 indicates big-endian, 2 indicates little-endian.
      
      @item int32 character_code;
     -@anchor{character-code}
     -Character code.  1 indicates EBCDIC, 2 indicates 7-bit ASCII, 3
     -indicates 8-bit ASCII, 4 indicates DEC Kanji.
     -Windows code page numbers are also valid.
     -
     -Experience has shown that in many files, this field is ignored or incorrect.
     -For a more reliable indication of the file's character encoding
     -see @ref{Character Encoding Record}.
     +@anchor{character-code} Character code.  The following values have
     +been actually observed in system files:
     +
     +@table @asis
     +@item 2
     +7-bit ASCII.
     +
     +@item 1250
     +The @code{windows-1250} code page for Central European and Eastern
     +European languages.
     +
     +@item 1252
     +The @code{windows-1252} code page for Western European languages, a
     +superset of ISO 8859-1.
     +
     +@item 28591
     +ISO 8859-1.
     +
     +@item 65001
     +UTF-8.
     +@end table
     +
     +The following additional values are known to be defined:
     +
     +@table @asis
     +@item 1
     +EBCDIC.
     +
     +@item 3
     +8-bit ``ASCII''.
     +
     +@item 4
     +DEC Kanji.
     +@end table
     +
     +Other Windows code page numbers are known to be generally valid.
     +
     +Old versions of SPSS always wrote value 2 in this field, regardless of
     +the encoding in use.  Newer versions also write the character encoding
     +as a string (see @ref{Character Encoding Record}).
      @end table
      
      @node Machine Floating-Point Info Record
     @@ -959,8 +991,22 @@ The name of the character encoding.  Normally this will be an official IANA char
      See @url{http://www.iana.org/assignments/character-sets}.
      @end table
      
     -This record is not present in files generated by older software.
     -See also @ref{character-code}.
     +This record is not present in files generated by older software.  See
     +also the @code{character_code} field in the machine integer info
     +record (@pxref{character-code}).
     +
     +When the character encoding record and the machine integer info record
     +are both present, all system files observed in practice indicate the
     +same character encoding, e.g.@: 1252 as @code{character_code} and
     +@code{windows-1252} as @code{encoding}, 65001 and @code{UTF-8}, etc.
     +
     +If, for testing purposes, a file is crafted with different
     +@code{character_code} and @code{encoding}, it seems that
     +@code{character_code} controls the encoding for all strings in the
     +system file before the dictionary termination record, including
     +strings in data (e.g.@: string missing values), and @code{encoding}
     +controls the encoding for strings following the dictionary termination
     +record.
      
      @node Long String Value Labels Record
      @section Long String Value Labels Record
     -- 
     1.7.1
     
     
     -- 
     Ben Pfaff 
     http://benpfaff.org
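
For readers who want to see how the rules in the patch above might be applied,
here is a rough Python sketch. It is not PSPP code. The function names, the
choice of Python codec names, and the `cpNNNN` fallback for other Windows code
pages are all assumptions made for illustration; only the numeric values and
the before/after split at the dictionary termination record come from the
patch, which itself says that behaviour only "seems" to hold for crafted files.

```python
import codecs

# character_code values the patch lists as observed in real system files,
# mapped to Python codec names (the codec names are this sketch's choice).
OBSERVED_CODES = {
    2: "ascii",
    1250: "cp1250",
    1252: "cp1252",
    28591: "iso-8859-1",
    65001: "utf-8",
}

# Defined but too vague to map to one codec: EBCDIC, 8-bit "ASCII", DEC Kanji.
AMBIGUOUS_CODES = {1, 3, 4}


def codec_for_character_code(code):
    """Return a Python codec name for a character_code value, or None."""
    if code in OBSERVED_CODES:
        return OBSERVED_CODES[code]
    if code in AMBIGUOUS_CODES:
        return None
    # Assumption: treat any other value as a Windows code page number and
    # ask Python whether it has a codec called cpNNNN.
    try:
        return codecs.lookup("cp%d" % code).name
    except LookupError:
        return None


def codec_for_string(character_code, encoding, after_dictionary_termination):
    """Pick a codec for one string, following the split described in the patch.

    `encoding` is the name from the character encoding record, or None when
    that record is absent (files written by older software).
    """
    if encoding is not None and after_dictionary_termination:
        return codecs.lookup(encoding).name
    return codec_for_character_code(character_code)
```

Note that the patch says old versions of SPSS always wrote 2 regardless of the
encoding in use, so a result of `ascii` from `codec_for_character_code(2)` is
better treated as "unknown" than as a guarantee of 7-bit data.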

-- 
PGP Public key ID: 1024D/2DE827B3 
fingerprint = 8797 A26D 0854 2EAB 0285  A290 8A67 719C 2DE8 27B3
See http://pgp.mit.edu or any PGP keyserver for public key.

