Re: [PATCH] encoding-guesser: Fall back to windows-1252 when UTF-8 can't

pspp-dev
[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: [PATCH] encoding-guesser: Fall back to windows-1252 when UTF-8 can't

From:	Ben Pfaff
Subject:	Re: [PATCH] encoding-guesser: Fall back to windows-1252 when UTF-8 can't be right.
Date:	Thu, 01 Mar 2012 21:29:21 -0800
User-agent:	Gnus/5.13 (Gnus v5.13) Emacs/23.2 (gnu/linux)
Thanks.  I pushed this change.

Your further enhancement suggestion makes sense, but I don't want
to work on it right now, so I filed a bug report:
        https://savannah.gnu.org/bugs/index.php?35688

John Darrington <address@hidden> writes:

> It looks fine to me.
>
> Would it be possible to generalize it?  That is to say, could we make
> sure, that when the fallback encoding is X, where X is multi-byte encoding,
> but we know that the input is not X, that it also falls back to windows-1252?
>
> On Wed, Feb 29, 2012 at 10:44:30PM -0800, Ben Pfaff wrote:
>      Until now the encoding-guesser code has used UTF-8 as a fallback in
>      situations where we can tell that the file is not valid UTF-8.  In
>      this kind of situation having a single-byte character set as a
>      fallback makes more sense.  This commit hard-codes windows-1252 as
>      that fallback, since it is a widely encountered encoding (and
>      compatible with ISO-8859-1 as well).
>      
>      John Darrington originally suggested this, if I recall correctly.
>      
>      The bug report that spurred this work was from Harry Thijssen.  With
>      this commit, PSPP properly reads his windows-1252 file when the
>      system locale uses UTF-8 encoding.
>      ---
>      I'm looking for a review of this patch before I push it to master.
>      Thanks!
>      
>       doc/utilities.texi                |   23 +++++++++++-----
>       src/libpspp/encoding-guesser.c    |   55 
> ++++++++++++++++++++++++-------------
>       src/libpspp/encoding-guesser.h    |    8 +++--
>       src/libpspp/i18n.c                |   14 +++++++++
>       src/libpspp/i18n.h                |    2 +
>       src/libpspp/u8-istream.c          |    5 ++-
>       tests/libpspp/encoding-guesser.at |    8 +++++
>       7 files changed, 84 insertions(+), 31 deletions(-)
>      
>      diff --git a/doc/utilities.texi b/doc/utilities.texi
>      index 40648d4..35dd393 100644
>      --- a/doc/utilities.texi
>      +++ b/doc/utilities.texi
>      @@ -313,14 +313,23 @@ are @code{ASCII} (United States), 
> @code{ISO-8859-1} (western Europe),
>       @code{EUC-JP} (Japan), and @code{windows-1252} (Windows).  Not all
>       systems support all character sets.
>       
>      address@hidden @code{Auto}
>       @item @code{Auto,@var{encoding}}
>      -Automatically detects whether a syntax file is encoded in
>      address@hidden or in a Unicode encoding such as UTF-8, UTF-16, or
>      -UTF-32.  The @var{encoding} may be an IANA character set name or
>      address@hidden (the default).  Only ASCII compatible encodings can
>      -automatically be distinguished from UTF-8 (the most common locale
>      -encodings are all ASCII-compatible).
>      +Automatically detects whether a syntax file is encoded in an Unicode
>      +encoding such as UTF-8, UTF-16, or UTF-32.  If it is not, then PSPP
>      +generally assumes that the file is encoded in @var{encoding} (an IANA
>      +character set name).  However, if @var{encoding} is UTF-8, and the
>      +syntax file is not valid UTF-8, PSPP instead assumes that the file
>      +is encoded in @code{windows-1252}.
>      +
>      +For best results, @var{encoding} should be an ASCII-compatible
>      +encoding (the most common locale encodings are all ASCII-compatible),
>      +because encodings that are not ASCII compatible cannot be
>      +automatically distinguished from UTF-8.
>      +
>      address@hidden @code{Auto}
>      address@hidden @code{Auto,Locale}
>      +Automatic detection, as above, with the default encoding taken from
>      +the system locale or the setting on SET LOCALE.
>       @end table
>       
>       When ENCODING is not specified, the default is taken from the
>      diff --git a/src/libpspp/encoding-guesser.c 
> b/src/libpspp/encoding-guesser.c
>      index 27e2cda..bee2978 100644
>      --- a/src/libpspp/encoding-guesser.c
>      +++ b/src/libpspp/encoding-guesser.c
>      @@ -36,22 +36,26 @@
>          of information about encoding detection.
>       */
>       
>      -/* Parses and returns the fallback encoding from ENCODING, which must 
> be in one
>      -   of the forms described at the top of encoding-guesser.h.  The 
> returned
>      -   string might be ENCODING itself or a suffix of it, or it might be a
>      -   statically allocated string. */
>      +/* Returns the encoding specified by ENCODING, which must be in one of 
> the
>      +   forms described at the top of encoding-guesser.h.  The returned 
> string might
>      +   be ENCODING itself or a suffix of it, or it might be a statically 
> allocated
>      +   string. */
>       const char *
>       encoding_guess_parse_encoding (const char *encoding)
>       {
>      +  const char *fallback;
>      +
>         if (encoding == NULL
>             || !c_strcasecmp (encoding, "auto")
>             || !c_strcasecmp (encoding, "auto,locale")
>             || !c_strcasecmp (encoding, "locale"))
>      -    return locale_charset ();
>      +    fallback = locale_charset ();
>         else if (!c_strncasecmp (encoding, "auto,", 5))
>      -    return encoding + 5;
>      +    fallback = encoding + 5;
>         else
>           return encoding;
>      +
>      +  return is_encoding_utf8 (fallback) ? "windows-1252" : fallback;
>       }
>       
>       /* Returns true if ENCODING, which must be in one of the forms 
> described at the
>      @@ -267,16 +271,37 @@ const char *
>       encoding_guess_tail_encoding (const char *encoding,
>                                     const void *data, size_t n)
>       {
>      -  return (encoding_guess_tail_is_utf8 (data, n)
>      +  return (encoding_guess_tail_is_utf8 (data, n) != 0
>                 ? "UTF-8"
>                 : encoding_guess_parse_encoding (encoding));
>       }
>       
>      -/* Same as encoding_guess_tail_encoding() but returns true for UTF-8 or 
> false
>      -   for the fallback encoding. */
>      -bool
>      +/* Returns an encoding guess based on ENCODING and the N bytes of text 
> starting
>      +   at DATA.  DATA should start with the first non-ASCII text character 
> (as
>      +   determined by encoding_guess_is_ascii_text()) found in the input.
>      +
>      +   The return value is:
>      +
>      +       0, if the encoding is definitely not UTF-8 (because the input 
> contains
>      +       byte sequences that are not valid in UTF-8).
>      +
>      +       1, if the encoding appears to be UTF-8 (because the input 
> contains valid
>      +       UTF-8 multibyte sequences).
>      +
>      +       -1, if the input contains only ASCII characters.  (This means 
> that the
>      +       input may be treated as UTF-8, since ASCII is a subset of UTF-8.)
>      +
>      +   See encoding-guesser.h for intended use of this function.
>      +
>      +   N must be at least ENCODING_GUESS_MIN, unless the file has fewer 
> bytes than
>      +   that starting with the first non-ASCII text character. */
>      +int
>       encoding_guess_tail_is_utf8 (const void *data, size_t n)
>       {
>      +  /* If all the bytes are in the ASCII range, it's just ASCII. */
>      +  if (encoding_guess_count_ascii (data, n) == n)
>      +    return -1;
>      +
>         return (n < ENCODING_GUESS_MIN
>                 ? u8_check (data, n) == NULL
>                 : is_all_utf8_text (data, n));
>      @@ -297,15 +322,7 @@ encoding_guess_whole_file (const char *encoding, 
> const void *text, size_t size)
>       
>         guess = encoding_guess_head_encoding (encoding, text, size);
>         if (!strcmp (guess, "ASCII") && encoding_guess_encoding_is_auto 
> (encoding))
>      -    {
>      -      size_t ofs = encoding_guess_count_ascii (text, size);
>      -      if (ofs < size)
>      -        return encoding_guess_tail_encoding (encoding,
>      -                                             (const char *) text + ofs,
>      -                                             size - ofs);
>      -      else
>      -        return encoding_guess_parse_encoding (encoding);
>      -    }
>      +    return encoding_guess_tail_encoding (encoding, text, size);
>         else
>           return guess;
>       }
>      diff --git a/src/libpspp/encoding-guesser.h 
> b/src/libpspp/encoding-guesser.h
>      index 0a7d1f9..2e8cb9a 100644
>      --- a/src/libpspp/encoding-guesser.h
>      +++ b/src/libpspp/encoding-guesser.h
>      @@ -1,5 +1,5 @@
>       /* PSPP - a program for statistical analysis.
>      -   Copyright (C) 2011 Free Software Foundation, Inc.
>      +   Copyright (C) 2011, 2012 Free Software Foundation, Inc.
>       
>          This program is free software: you can redistribute it and/or modify
>          it under the terms of the GNU General Public License as published by
>      @@ -42,7 +42,9 @@
>              encoding"): Requests detection whether the input is encoded in 
> UTF-8,
>              UTF-16, UTF-32, or a few other easily identifiable charsets.  
> When a
>              particular character set cannot be recognized, the guesser falls 
> back to
>      -       the encoding following the comma.  UTF-8 detection works only for
>      +       the encoding following the comma.  When the fallback encoding is 
> UTF-8,
>      +       but the input is invalid UTF-8, then the windows-1252 encoding 
> (closely
>      +       related to ISO 8859-1) is used instead.  UTF-8 detection works 
> only for
>              ASCII-compatible character sets.
>       
>            - NULL or "Auto": As above, with the encoding used by the system 
> locale as
>      @@ -111,7 +113,7 @@ const char *encoding_guess_head_encoding (const char 
> *encoding,
>       /* Refining an initial ASCII coding guess using later non-ASCII bytes. 
> */
>       static inline bool encoding_guess_is_ascii_text (uint8_t c);
>       size_t encoding_guess_count_ascii (const void *, size_t);
>      -bool encoding_guess_tail_is_utf8 (const void *, size_t);
>      +int encoding_guess_tail_is_utf8 (const void *, size_t);
>       const char *encoding_guess_tail_encoding (const char *encoding,
>                                                 const void *, size_t);
>       
>      diff --git a/src/libpspp/i18n.c b/src/libpspp/i18n.c
>      index 9658866..c04dd5a 100644
>      --- a/src/libpspp/i18n.c
>      +++ b/src/libpspp/i18n.c
>      @@ -769,3 +769,17 @@ is_encoding_supported (const char *encoding)
>         return (create_iconv__ ("UTF-8", encoding)->conv != (iconv_t) -1
>                 && create_iconv__ (encoding, "UTF-8")->conv != (iconv_t) -1);
>       }
>      +
>      +/* Returns true if E is the name of a UTF-8 encoding.
>      +
>      +   XXX Possibly we should test not E as a string but its properties via
>      +   iconv. */
>      +bool
>      +is_encoding_utf8 (const char *e)
>      +{
>      +  return ((e[0] == 'u' || e[0] == 'U')
>      +          && (e[1] == 't' || e[1] == 'T')
>      +          && (e[2] == 'f' || e[2] == 'F')
>      +          && ((e[3] == '8' && e[4] == '\0')
>      +              || (e[3] == '-' && e[4] == '8' && e[5] == '\0')));
>      +}
>      diff --git a/src/libpspp/i18n.h b/src/libpspp/i18n.h
>      index 383ff12..d973a81 100644
>      --- a/src/libpspp/i18n.h
>      +++ b/src/libpspp/i18n.h
>      @@ -142,4 +142,6 @@ bool is_encoding_ascii_compatible (const char 
> *encoding);
>       bool is_encoding_ebcdic_compatible (const char *encoding);
>       bool is_encoding_supported (const char *encoding);
>       
>      +bool is_encoding_utf8 (const char *encoding);
>      +
>       #endif /* i18n.h */
>      diff --git a/src/libpspp/u8-istream.c b/src/libpspp/u8-istream.c
>      index c111634..77c1413 100644
>      --- a/src/libpspp/u8-istream.c
>      +++ b/src/libpspp/u8-istream.c
>      @@ -1,5 +1,5 @@
>       /* PSPP - a program for statistical analysis.
>      -   Copyright (C) 2010, 2011 Free Software Foundation, Inc.
>      +   Copyright (C) 2010, 2011, 2012 Free Software Foundation, Inc.
>       
>          This program is free software: you can redistribute it and/or modify
>          it under the terms of the GNU General Public License as published by
>      @@ -34,6 +34,7 @@
>       #include "libpspp/cast.h"
>       #include "libpspp/compiler.h"
>       #include "libpspp/encoding-guesser.h"
>      +#include "libpspp/i18n.h"
>       
>       #include "gl/c-strcase.h"
>       #include "gl/localcharset.h"
>      @@ -120,7 +121,7 @@ u8_istream_for_fd (const char *fromcode, int fd)
>           goto error;
>       
>         encoding = encoding_guess_head_encoding (fromcode, is->buffer, 
> is->length);
>      -  if (!strcmp (encoding, "UTF-8"))
>      +  if (is_encoding_utf8 (encoding))
>           is->state = S_UTF8;
>         else
>           {
>      diff --git a/tests/libpspp/encoding-guesser.at 
> b/tests/libpspp/encoding-guesser.at
>      index a2b0aab..e969a48 100644
>      --- a/tests/libpspp/encoding-guesser.at
>      +++ b/tests/libpspp/encoding-guesser.at
>      @@ -141,3 +141,11 @@ AT_CHECK([printf 
> '\343\201\201\343\201\202\343\201\203\343\201\204\343\201\205\3
>         [0], [UTF-8
>       ])
>       AT_CLEANUP
>      +
>      +AT_SETUP([windows-1252 as Auto,UTF-8])
>      +AT_KEYWORDS([encoding guesser])
>      +AT_CHECK([i18n-test supports_encodings windows-1252])
>      +AT_CHECK([printf 'entr\351e' | encoding-guesser-test Auto,UTF-8 32], 
> [0],
>      +  [windows-1252
>      +])
>      +AT_CLEANUP
>      -- 
>      1.7.2.5
>      
>      
>      _______________________________________________
>      pspp-dev mailing list
>      address@hidden
>      https://lists.gnu.org/mailman/listinfo/pspp-dev

-- 
Ben Pfaff 
http://benpfaff.org
[Prev in Thread]
Current Thread
[Next in Thread]
[PATCH] encoding-guesser: Fall back to windows-1252 when UTF-8 can't be right., Ben Pfaff, 2012/03/01
- Re: [PATCH] encoding-guesser: Fall back to windows-1252 when UTF-8 can't be right., John Darrington, 2012/03/01
  - Re: [PATCH] encoding-guesser: Fall back to windows-1252 when UTF-8 can't be right., Ben Pfaff <=
Prev by Date: Re: PSPP Errors on Vista
Next by Date: Re: PSPP back in Debian
Previous by thread: Re: [PATCH] encoding-guesser: Fall back to windows-1252 when UTF-8 can't be right.
Next by thread: Re: New Lithuanian PO file for 'pspp' (version 0.7.9)
Index(es):
- Date
- Thread