From: Kevin Rodgers
Subject: Re: coding tags and utf-16
Date: Wed, 08 Feb 2006 17:32:02 -0700
User-agent: Mozilla Thunderbird 0.9 (X11/20041105)
David Kastrup wrote:
> Kenichi Handa <address@hidden> writes:
>
>> In article <address@hidden>, Stefan Monnier <address@hidden> writes:
>>
>>>> So, in any cases, a tag value itself is useless.  Then how to
>>>> detect utf-16 more reliably?  In the current Emacs (i.e. Ver.22),
>>>> I think we can use auto-coding-regexp-alist or auto-coding-alist.
>>>> In the former case, we can register BOM patterns and also
>>>> something like "\\`\\(\0[\0-\177]\\)+" for utf-16be.  In the
>>>> latter case, you can use more complicated heuristics in a
>>>> registered function.
>>>
>>> Can't it be somehow added to detect_coding_utf_16?
>>
>> Yes, but usually it has no effect if, for instance, iso-8859-1 is
>> more preferred.  If only ASCII and Latin-1 characters are encoded in
>> utf-16, all bytes (including BOM) are valid for iso-8859-1.
>
> I thought we had discussed this already.  The BOM-encodings should
> have priority since the likelihood of a misdetection is negligible
> (the character pair does not make sense at the start of a text in
> latin-1 in any language): the only thing that can reasonably be
> expected to happen is that a binary file is detected as utf-16.  Not
> much of an issue, I'd say.
Exactly. So why haven't these entries been added to auto-coding-regexp-alist?
("\\`\xEF\xBB\xBF" . utf-8) ("\\`\xFE\xFF" . utf-16-be) ("\\`\xFF\xFE" . utf-16-le) ("\\`\x00\x00\xFE\xFF" . utf-32-be) ("\\`\xFF\xFE\x00\x00" . utf-32-le)
> Of course, for the BOM-less utf-16 encodings, priority should depend
> on the language environment.

Definitely.
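One way that dependence might be expressed is with a detection function
rather than a plain regexp entry, along the lines Handa mentions above.
This is only an untested sketch: I'm assuming auto-coding-functions is
the right hook, that each function on it is called on the undecoded
text with a single SIZE argument and should return a coding system or
nil, and the function name, the language-environment list, and the
utf-16be coding-system name below are just placeholders:

(defun detect-bomless-utf-16 (size)
  "Sketch of a BOM-less utf-16 heuristic for `auto-coding-functions'."
  ;; Only a guess at the interface: SIZE characters of undecoded text
  ;; are available after point; return a coding system or nil.  The
  ;; language-environment names are examples, not a real policy.
  (let ((head (buffer-substring-no-properties
               (point) (+ (point) (min size 1024)))))
    (when (and (string-match "\\`\\(\0[\0-\177]\\)+" head)
               (not (member current-language-environment
                            '("Latin-1" "German" "French"))))
      'utf-16be)))

(add-to-list 'auto-coding-functions 'detect-bomless-utf-16)

With something like that registered, BOM-less detection could be made
as conservative or as aggressive as a given language environment
warrants.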
-- 
Kevin Rodgers