Re: undecided vs utf-8

From: Kenichi Handa
Subject: Re: undecided vs utf-8
Date: Fri, 05 Nov 2010 11:01:58 +0900

In article <address@hidden>, Lars Magne Ingebrigtsen <address@hidden> writes:

> When using erc, it decodes iso-8859-1 fine with the default `undecided'
> into encoding.  However, any utf-8 strings are, sort of, just translated
> into the same coding system:

> (decode-coding-string "u-te-\303\246ff \303\245tte" 'undecided)
>>> "u-te-Ã¦ff Ã¥tte"

That's probably because you are in some iso-8859-1 locale.
As I'm in a ja_JP.UTF-8 locale, the above is decoded as utf-8 here.
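
The dependence on the environment can be reproduced without changing
locale by temporarily raising one coding system's detection priority.
This is a minimal sketch: `my-decode-undecided-preferring' is a
hypothetical name, and it assumes that `with-coding-priority' affects
how `undecided' is resolved in the same way the locale's language
environment does.

```elisp
;; Hypothetical helper (not part of Emacs): decode BYTES as `undecided'
;; while CODING temporarily has the highest detection priority.
;; `with-coding-priority' restores the previous priorities afterwards.
(defun my-decode-undecided-preferring (bytes coding)
  "Decode unibyte string BYTES as `undecided', preferring CODING."
  (with-coding-priority (list coding)
    (decode-coding-string bytes 'undecided)))

;; The same bytes, decoded under two different priorities.
(my-decode-undecided-preferring "u-te-\303\246ff \303\245tte" 'utf-8)
(my-decode-undecided-preferring "u-te-\303\246ff \303\245tte" 'iso-8859-1)
```

If the two calls return different strings, the difference comes only
from the priority order, not from the bytes themselves.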

> (decode-coding-string "u-te-\303\246ff \303\245tte" 'utf-8)
>>> "u-te-æff åtte"

> So, uhm...  Is this meant to be this way?  I know that guessing the
> first thing is, well, correct, sort of -- it's valid iso-8859-1,
> although very strange.  But it's also valid utf-8.  Shouldn't
> `decode-coding-string' prefer utf-8 if it's actually valid?  If it's
> valid utf-8, then it's quite likely that it's meant to be utf-8, even
> though other coding systems are also possible.

I don't want to add such a heuristic to
decode-coding-string/region (the lowest-level functions available
from Lisp).  Please note that the above sequence is also valid
as Big5.  For people in a Big5 locale, it's hard to say
which of utf-8 or big5 should be preferred unless we implement NLP.

Perhaps an upper-layer function that accepts a list of
preferred coding systems would be good; something like
this:

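(defun detect-and-decode-coding-string (str preferred)
  "Decode STR with the first coding system in PREFERRED that is possible.
If none of PREFERRED was detected, use the most likely detected one."
  (let ((detected (detect-coding-string str))
        decided)
    ;; Walk PREFERRED in order until one is among the detected candidates.
    (while (and preferred (not decided))
      (if (memq (car preferred) detected)
          (setq decided (car preferred))
        (setq preferred (cdr preferred))))
    (decode-coding-string str (or decided (car detected)))))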
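
One caveat with comparing symbols by `memq' is that a caller may pass
an alias (for example `iso-8859-1' or `big5') or an end-of-line
variant (for example `utf-8-unix'), while `detect-coding-string' may
report a different name for the same coding system.  The following is
a sketch of a variant that compares base coding systems instead.  The
name `my-decode-with-preference' is hypothetical, and the assumption
that `coding-system-base' resolves both aliases and end-of-line
variants should be checked against your Emacs version.

```elisp
;; Hypothetical variant of the function above: match on the base
;; coding system so that aliases and eol variants are accepted.
(defun my-decode-with-preference (str preferred)
  "Decode unibyte STR, trying each coding system in PREFERRED in order.
Fall back to the most likely detected coding system."
  (let* ((detected (detect-coding-string str))
         (bases (mapcar #'coding-system-base detected))
         (choice (catch 'found
                   (dolist (cs preferred)
                     (when (memq (coding-system-base cs) bases)
                       (throw 'found cs))))))
    (decode-coding-string str (or choice (car detected)))))

;; Usage: prefer utf-8, then Big5, regardless of the current locale.
(my-decode-with-preference "u-te-\303\246ff \303\245tte" '(utf-8 big5))
```

Keeping the preference list as an explicit argument leaves
`decode-coding-string' itself free of heuristics; each caller states
its own expectations.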

Kenichi Handa
