Re: Emacs and UTF-8 locale

From: Markus Kuhn
Subject: Re: Emacs and UTF-8 locale
Date: Tue, 18 Dec 2001 13:20:04 +0000

> Date: Mon, 17 Dec 2001 09:52:20 +0200 (IST)
> From: Eli Zaretskii <address@hidden>
> To: Richard Stallman <address@hidden>
> cc: address@hidden, address@hidden
> Subject: Re: UTF-8 locale
> On Sun, 16 Dec 2001, Richard Stallman wrote:
> >     Recent changes in mule-cmds.el automatically turn on the UTF-8
> >     locale when $LANG says so.
> > 
> > For which values of LANG does Emacs use UTF-8?
> Those which match the regexp ".*utf\\(-?8\\)\\>".

The proper way of determining the encoding used by the current locale is
not to look at a single locale variable, but to query the Single Unix
Specification (and now also POSIX) function nl_langinfo(CODESET), as for
example in

  utf8_mode = (strcmp(nl_langinfo(CODESET), "UTF-8") == 0);

There are UTF-8 locales in use (e.g., vi_VN) that do NOT have UTF-8 in
their name, so testing the locale environment variables directly is
only a less reliable fallback option.
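
The one-line test above can be wrapped as follows. This is only a
minimal sketch: the function names are made up for illustration, and it
assumes the program has not selected a locale itself, so it calls
setlocale(LC_CTYPE, "") first to adopt the user's environment.

```c
#include <langinfo.h>
#include <locale.h>
#include <string.h>

/* Returns 1 if the given codeset name is exactly "UTF-8", else 0.
   (Hypothetical helper, split out so it can be checked in isolation.) */
static int codeset_is_utf8(const char *codeset)
{
    return codeset != NULL && strcmp(codeset, "UTF-8") == 0;
}

/* Returns 1 if the user's LC_CTYPE locale uses UTF-8, else 0. */
int locale_is_utf8(void)
{
    /* Adopt the LC_CTYPE setting from the environment; a program
       starts in the "C" locale until it calls setlocale(). */
    setlocale(LC_CTYPE, "");
    return codeset_is_utf8(nl_langinfo(CODESET));
}
```

If setlocale() is never called, nl_langinfo(CODESET) describes the
default "C" locale rather than the user's choice, so the call order
matters.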

It is my understanding that elisp currently has no direct access to the
output of the API function nl_langinfo(CODESET), and I hope this can be
fixed. Alternatively, you can execute the shell command "locale charmap",
which prints the return value of nl_langinfo(CODESET) followed by a
newline. This could be used from elisp even right now, though it is of
course less elegant.

Fortunately, there exists only one single standard string that
nl_langinfo(CODESET) returns in a UTF-8 locale, and that is "UTF-8".
(For ISO 8859-1, both "ISO-8859-1" and "ISO8859-1" are used by
different manufacturers.)

At the moment, only one widely used system (namely *BSD) does not yet
implement nl_langinfo(3) or locale(1). On such a system, you can fall
back to something like

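  /* needs <stdlib.h> for getenv() and <string.h> for strstr() */
  char *s;
  int utf8_mode = 0;

  /* take the first locale variable that is set, in order of precedence */
  if ((s = getenv("LC_ALL")) ||
      (s = getenv("LC_CTYPE")) ||
      (s = getenv("LANG"))) {
    if (strstr(s, "UTF-8"))
      utf8_mode = 1;
  }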

It is important not to test LANG alone, but to test the first variable
in the sequence LC_ALL, LC_CTYPE and LANG that has a value. Many UTF-8
users strongly prefer LC_CTYPE=en_GB.UTF-8 LANG=C, as this changes only
the encoding but not the sorting order etc., and it also speeds up
program start time, as the C library will load only the LC_CTYPE part
of the locale data, and not all the unwanted rest.
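
The precedence rule can be isolated in a small helper, sketched below
with hypothetical function names. One assumption goes beyond the
fragment above: a variable that is set but empty is skipped, on the
understanding that POSIX treats an empty locale variable as unset.

```c
#include <stdlib.h>
#include <string.h>

/* Return the value that selects the character encoding: the first
   non-empty variable among LC_ALL, LC_CTYPE and LANG, or NULL if
   none of them has a value. */
const char *ctype_locale_name(void)
{
    static const char *const vars[] = { "LC_ALL", "LC_CTYPE", "LANG" };
    size_t i;

    for (i = 0; i < sizeof vars / sizeof vars[0]; i++) {
        const char *s = getenv(vars[i]);

        if (s != NULL && *s != '\0')
            return s;
    }
    return NULL;
}

/* Fallback heuristic only: 1 if the locale name contains "UTF-8". */
int name_suggests_utf8(const char *name)
{
    return name != NULL && strstr(name, "UTF-8") != NULL;
}
```

With LC_CTYPE=en_GB.UTF-8 LANG=C and LC_ALL unset, ctype_locale_name()
returns "en_GB.UTF-8", whereas a test of LANG alone would see only "C".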

If you need an autoconf test for the presence of nl_langinfo(CODESET),
then here is one:

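======================== m4/codeset.m4 ================================
#serial AM1

dnl From Bruno Haible.

AC_DEFUN([AM_LANGINFO_CODESET],
[
  AC_CACHE_CHECK([for nl_langinfo and CODESET], am_cv_langinfo_codeset,
    [AC_TRY_LINK([#include <langinfo.h>],
      [char* cs = nl_langinfo(CODESET);],
      am_cv_langinfo_codeset=yes,
      am_cv_langinfo_codeset=no)
    ])
  if test $am_cv_langinfo_codeset = yes; then
    AC_DEFINE(HAVE_LANGINFO_CODESET, 1,
      [Define if you have <langinfo.h> and nl_langinfo(CODESET).])
  fi
])
=======================================================================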
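
On the C side, the result of such a test would typically select between
the two approaches at compile time. The sketch below assumes that the
configure check defines a preprocessor symbol, here taken to be
HAVE_LANGINFO_CODESET, and that the generated header is called
config.h; the function name detect_utf8_mode is hypothetical.

```c
#ifdef HAVE_CONFIG_H
#include <config.h>   /* assumed to carry HAVE_LANGINFO_CODESET */
#endif

#include <locale.h>
#include <stdlib.h>
#include <string.h>
#ifdef HAVE_LANGINFO_CODESET
#include <langinfo.h>
#endif

/* Returns 1 if the user's locale appears to use UTF-8, else 0. */
int detect_utf8_mode(void)
{
#ifdef HAVE_LANGINFO_CODESET
    /* Preferred: ask the C library for the codeset of the locale. */
    setlocale(LC_CTYPE, "");
    return strcmp(nl_langinfo(CODESET), "UTF-8") == 0;
#else
    /* Fallback: inspect the first locale variable that is set. */
    const char *s;

    if ((s = getenv("LC_ALL")) ||
        (s = getenv("LC_CTYPE")) ||
        (s = getenv("LANG")))
        return strstr(s, "UTF-8") != NULL;
    return 0;
#endif
}
```

Keeping both branches behind one function means callers never need to
know which detection method the platform supports.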

For more information on how applications should activate UTF-8 modes,
please have a look at:



Markus G. Kuhn, Computer Laboratory, University of Cambridge, UK
Email: mkuhn at acm.org,  WWW: <http://www.cl.cam.ac.uk/~mgk25/>
