Re: undecided vs utf-8

From: Lars Magne Ingebrigtsen
Subject: Re: undecided vs utf-8
Date: Fri, 05 Nov 2010 03:32:02 +0100
User-agent: Gnus/5.110011 (No Gnus v0.11) Emacs/24.0.50 (gnu/linux)

Kenichi Handa <address@hidden> writes:

> It's perhaps because you are in some iso-8859-1 locale.

I don't think I am, but I might be wrong.  There are so many locale
variables, but I always try to put my machines into the "C" locale.

> I don't want to add such a heuristic in
> decode-coding-string/region (the lowest functions available
> from Lisp).  Please note that the above sequence is also valid
> as Big5.  If people are in a Big5 locale, it's hard to answer
> which of utf-8 or big5 is preferred unless we implement an NLP
> system.

I don't know what the big5 encoding looks like, but when it comes to
iso-8859-1 vs utf-8, nearly every utf-8 string is also a valid
iso-8859-1 string, while few iso-8859-1 strings containing non-ASCII
characters are valid utf-8 strings.  Therefore it seems to make sense
to prefer utf-8 over iso-8859-1 whenever the bytes are valid utf-8.
Perhaps.

> Perhaps making an upper layer function that will accept a
> list of preferred coding systems will be good; something
> like this.
> (defun detect-and-decode-coding-string (str preferred)
>   (let ((detected (detect-coding-string str))
>         decided)
>     (while (and preferred (not decided))
>       (if (memq (car preferred) detected)
>           (setq decided (car preferred))
>         (setq preferred (cdr preferred))))
>     (decode-coding-string str (or decided (car detected)))))

Well, this is about `undecided', and the C layer does DWIM-ish
processing when you ask it to decode `undecided', doesn't it?

The use case that made me look into this -- erc -- is somewhat special.
The irc protocol does no charset tagging, and some clients send some
charsets, and some send others, which is why erc uses `undecided' as the
default coding system.  Typically on a channel you'll see somebody using
a local (iso-8859-* is popular) charset, and others using utf-8.

Perhaps the fix here isn't to do anything with `undecided' per se, but
just to fix erc.  It's trivial enough -- just have the default be, say,
`undecided-or-utf-8', and then handle that by running
`detect-coding-string' over the incoming text, seeing whether utf-8 is
among the candidates, and then either using utf-8 or passing
`undecided' down into the decoding functions.
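
A rough, untested sketch of what that handling might look like.  The
function name is made up for illustration and is not part of erc, and
`undecided-or-utf-8' would be a pseudo value that erc interprets itself,
not a real coding system.  The sketch assumes `detect-coding-string' may
return end-of-line variants such as `utf-8-unix', so it normalizes the
candidates with `coding-system-base' before looking for utf-8.

```elisp
;; Hypothetical helper: decode STR as utf-8 when the detector lists
;; utf-8 among the possible coding systems, else fall back to
;; `undecided' and let the C layer guess as before.
(defun my-decode-undecided-or-utf-8 (str)
  "Decode unibyte STR, preferring utf-8 over whatever `undecided' picks."
  (let ((candidates (mapcar #'coding-system-base
                            (detect-coding-string str))))
    (decode-coding-string str
                          (if (memq 'utf-8 candidates)
                              'utf-8
                            'undecided))))
```

Pure ASCII text is detected as `undecided' and takes the fallback
branch, which is harmless since it decodes the same way either way.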

I don't know.  What do you think?

(domestic pets only, the antidote for overdose, milk.)
  address@hidden * Lars Magne Ingebrigtsen