From: Richard Stallman
Subject: Re: Displaying bytes (was: Inadequate documentation of silly characters on screen.)
Date: Sun, 29 Nov 2009 11:01:21 -0500

We don't want to raise the priority of windows-1252 because it would
cause many other encodings not to be recognized.

If it turns out that windows-1252 files are the main cause of
8-bit-control characters in the buffer, here's another idea.

If visiting a file gives you some 8-bit-control characters,
ask the user "Is this file encoded in Windows encoding (windows-1252)?"
and decode it as windows-1252 if she says yes.
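
As a rough illustration of that prompt-on-suspicion flow, here is a minimal
sketch in Python rather than Emacs Lisp.  It is not how Emacs does it: the
function names and the `ask` callback are hypothetical, and it assumes the
fallback decoding is Latin-1, where bytes 0x80-0x9F come out as control
characters.

```python
from typing import Callable

def has_c1_bytes(data: bytes) -> bool:
    """Return True if any byte lies in 0x80-0x9F.

    Latin-1 treats these as control characters; windows-1252 maps most
    of them to printable punctuation such as curly quotes and dashes.
    """
    return any(0x80 <= b <= 0x9F for b in data)

def decode_with_prompt(data: bytes, ask: Callable[[str], bool]) -> str:
    """Decode data, asking the user only when suspicious bytes appear.

    `ask` is a hypothetical callback that shows a yes/no question and
    returns the answer.  Latin-1 as the fallback is an assumption made
    for this sketch.
    """
    question = "Is this file encoded in Windows encoding (windows-1252)?"
    if has_c1_bytes(data) and ask(question):
        # Five byte values are undefined in Python's cp1252 codec;
        # errors="replace" turns them into U+FFFD instead of raising.
        return data.decode("cp1252", errors="replace")
    return data.decode("latin-1")
```

The question is asked only when such bytes are present, so files without
them are never interrupted by the prompt.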

Here's a further idea.  We could employ some heuristics to see if the
distribution of those characters seems typical of real windows-1252
text.  For instance, some of the punctuation characters
(the ones that represent quotation marks) should always have
whitespace or punctuation on at least one side.  Also, there should be
no ASCII control characters other than whitespace.  Maybe more
specific heuristics can be developed.

These could be used as conditions for recognizing the file as
windows-1252.  If these heuristics are strong enough, they could
reject nearly all false matches, provided the file is long enough.
(A minimum length could be part of the conditions.)  Then we
could increase the priority of windows-1252 without the bad
side effect of using it when it is not intended.
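
The conditions described above might be sketched as follows, in Python
rather than Emacs Lisp.  This is only an illustration under stated
assumptions: the 200-byte minimum is an arbitrary placeholder, byte 0x92 is
left out of the quotation-mark check because it also serves as an
apostrophe inside words, and requiring at least one byte in 0x80-0x9F is a
choice made for the sketch, not part of the proposal.

```python
# Curly quotes in windows-1252: 0x91 left single, 0x93/0x94 double.
# 0x92 is omitted (assumption): as an apostrophe it sits between letters.
QUOTE_BYTES = {0x91, 0x93, 0x94}
# Byte values that windows-1252 leaves undefined.
UNDEFINED_BYTES = {0x81, 0x8D, 0x8F, 0x90, 0x9D}
# ASCII controls that count as whitespace: tab, LF, form feed, CR.
ALLOWED_CONTROLS = {0x09, 0x0A, 0x0C, 0x0D}
MIN_LENGTH = 200  # hypothetical threshold, not taken from the message

def _is_boundary(data: bytes, i: int) -> bool:
    """True if index i is outside data, or holds whitespace or punctuation."""
    if i < 0 or i >= len(data):
        return True
    b = data[i]
    if b in QUOTE_BYTES:
        return True  # an adjacent quotation mark counts as punctuation
    return b < 0x80 and not chr(b).isalnum()

def looks_like_windows_1252(data: bytes, min_length: int = MIN_LENGTH) -> bool:
    """Apply the ad-hoc conditions; any violation rejects the file."""
    if len(data) < min_length:
        return False  # too short for the heuristics to mean much
    saw_c1 = False
    for i, b in enumerate(data):
        if (b < 0x20 and b not in ALLOWED_CONTROLS) or b == 0x7F:
            return False  # ASCII control other than whitespace
        if b in UNDEFINED_BYTES:
            return False  # not a valid windows-1252 byte
        if 0x80 <= b <= 0x9F:
            saw_c1 = True
        if b in QUOTE_BYTES and not (
            _is_boundary(data, i - 1) or _is_boundary(data, i + 1)
        ):
            return False  # quotation mark with letters/digits on both sides
    return saw_c1
```

A detector could run a check like this before the ordinary priority list
and fall through to normal detection whenever it returns False.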

This is ad-hoc, and not elegant.  But the problem is important enough
in practice that an ad-hoc solution is justified if it works well.
