bug#2497: 23.0.91; Fails to read UTF-8 on Win2k

From: Kenichi Handa
Subject: bug#2497: 23.0.91; Fails to read UTF-8 on Win2k
Date: Mon, 02 Mar 2009 20:43:58 +0900

In article <address@hidden>, Eli Zaretskii <address@hidden> writes:

>   M-: (coding-system-priority-list) RET
>>> (iso-latin-1 utf-8 iso-2022-7bit iso-2022-7bit-lock iso-2022-8bit-ss2 
>>> emacs-mule raw-text iso-2022-jp in-is13194-devanagari chinese-iso-8bit 
>>> utf-8-auto utf-8-with-signature utf-16 utf-16be-with-signature 
>>> utf-16le-with-signature utf-16be utf-16le japanese-shift-jis undecided)

> So UTF-8 is indeed ``pretty high'', but lower than the locale's
> default.

> > So this still looks like a real bug.

> Perhaps it is, but I didn't know Emacs 23 can reliably distinguish
> between Latin-1 and UTF-8, even when UTF-8 sequences are present in
> the text.  Can we do that reliably?  Perhaps Handa-san can shed some
> light on this.

The coding system iso-latin-1 is for the character set
iso-8859-1, and the code-space of iso-8859-1 is 0x00..0xFF
(without a gap, i.e. including 0x80..0x9F) (see
/usr/share/i18n/charmaps/ISO-8859-1.gz).  So, if we follow
it strictly, any byte sequence is a correct iso-8859-1
stream, which means that when iso-latin-1 has the highest
priority, all files are detected as iso-latin-1.
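
To illustrate the point outside Emacs, here is a minimal Python
sketch.  It assumes that Python's "latin-1" codec is an acceptable
stand-in for iso-8859-1 with the full 0x00..0xFF code-space; the
helper name is_valid is made up for this example and is not part
of Emacs.

```python
# Every byte value 0x00..0xFF maps to a code point in ISO-8859-1,
# so decoding arbitrary bytes as Latin-1 never fails.
all_bytes = bytes(range(256))
as_latin1 = all_bytes.decode("latin-1")


def is_valid(data: bytes, encoding: str) -> bool:
    """Return True if `data` decodes without error in `encoding`."""
    try:
        data.decode(encoding)
        return True
    except UnicodeDecodeError:
        return False


utf8_text = "naïve café".encode("utf-8")
latin1_text = "naïve café".encode("latin-1")

# UTF-8 bytes are also a "valid" Latin-1 stream (they just decode
# to the wrong characters), but not the other way round here.
print(is_valid(utf8_text, "latin-1"))    # valid as Latin-1
print(is_valid(latin1_text, "utf-8"))    # not valid as UTF-8
```

Validity alone therefore cannot rule out iso-latin-1; only the
stricter encoding (UTF-8) can reject a stream.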

So, as long as we strictly follow the definition of

In article <address@hidden>, Stefan Monnier <address@hidden> writes:

> That seems to be the source of the problem.  utf-8 should always come
> before latin-1 in that list, since utf-8 streams that are valid latin-1
> streams are not uncommon, whereas latin-1 streams that are valid utf-8
> streams are extremely rare.

I think that is the only solution.
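
The effect of the ordering can be sketched in Python.  This is a
deliberately simplified model that assumes detection picks the
first coding system in the priority list that accepts the bytes;
it is not Emacs's actual detection code, and the function name
detect is hypothetical.

```python
from typing import Sequence


def detect(data: bytes, priority: Sequence[str]) -> str:
    """Return the first encoding in `priority` that decodes `data`."""
    for encoding in priority:
        try:
            data.decode(encoding)
        except UnicodeDecodeError:
            continue
        return encoding
    return "undecided"


utf8_file = "Grüße".encode("utf-8")
latin1_file = "Grüße".encode("latin-1")

# Latin-1 first: it accepts everything, so UTF-8 is never reached.
print(detect(utf8_file, ["latin-1", "utf-8"]))

# UTF-8 first: UTF-8 files are recognised, and Latin-1 files still
# fall through because they are not valid UTF-8.
print(detect(utf8_file, ["utf-8", "latin-1"]))
print(detect(latin1_file, ["utf-8", "latin-1"]))
```

With utf-8 ahead of latin-1, both sample files are classified
correctly in this model; with the reverse order, the UTF-8 file
is misdetected.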

In article <address@hidden>, Uwe Siart <address@hidden> writes:

> Assumed this is not possible right now we should distinguish between
> »high reliability« and »poor reliability«. From my perception it has
> been much more reliable earlier so (as a user with limited viewpoint)
> I vote for reverting the change.

In Emacs 22, the coding system iso-latin-1 was defined as a
variant of the iso-2022-based coding systems, and thus the
bytes 0x80..0x9F were not valid (except for 0x91 etc. in
latin-extra-code-table).  So some UTF-8 texts were not
detected as iso-latin-1.

To recover that behaviour, we can define iso-latin-1 as
before by doing this:

(define-coding-system 'iso-latin-1
  "Emacs 22 iso-latin-1."
  :mnemonic ?1
  :coding-type 'iso-2022
  :charset-list '(ascii latin-iso8859-1)
  :ascii-compatible-p t
  :mime-charset 'iso-8859-1
  :designation [ascii latin-iso8859-1 nil nil])

But even with that, some valid UTF-8 texts will still be
detected as iso-latin-1.  So I don't think this is a
solution for "high reliability".
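
A small Python sketch shows why.  It assumes a simplified version
of the Emacs 22 rule: a stream is rejected as iso-latin-1 only if
it contains a byte in 0x80..0x9F.  ISO-2022 escape sequences and
latin-extra-code-table are ignored here, and the function name
accepted_by_old_rule is hypothetical.

```python
def accepted_by_old_rule(data: bytes) -> bool:
    """Simplified Emacs 22-style check: no byte in 0x80..0x9F."""
    return not any(0x80 <= b <= 0x9F for b in data)


# U+00E9 encodes as C3 A9: both bytes are outside 0x80..0x9F,
# so this valid UTF-8 text still passes as iso-latin-1.
print(accepted_by_old_rule("é".encode("utf-8")))

# U+00C4 encodes as C3 84: the continuation byte 0x84 falls in
# the excluded range, so this text is rejected as iso-latin-1.
print(accepted_by_old_rule("Ä".encode("utf-8")))
```

UTF-8 continuation bytes range over 0x80..0xBF, so any text whose
continuation bytes all happen to lie in 0xA0..0xBF slips through
this check.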

Kenichi Handa
