Re: [Emacs-diffs] master db828f6: Don't rely on defaults in decoding UTF

From: Paul Eggert
Subject: Re: [Emacs-diffs] master db828f6: Don't rely on defaults in decoding UTF-8 encoded Lisp files
Date: Sat, 26 Sep 2015 11:53:09 -0700
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:38.0) Gecko/20100101 Thunderbird/38.2.0

David Kastrup wrote:
> How frequent are you reading Hebrew, Arabic, Chinese, Japanese, and
> Korean texts?  How relevant is your experience?

Hebrew, not so much -- Eli has far more experience with that. Arabic I was just reading last week (not natively; I use a translator). This week I was reading a lot of Turkish. In all cases I was looking at text prepared by others. In all cases my sources used UTF-8 -- not due to my influence, but because that's what's typical these days.

In my previous job I routinely had to deal with CJK text, and did so with lots of different encodings, including monstrosities such as DBCS-Host that Emacs doesn't even support. So my experience is reasonably good in this area -- better than the average random hacker anyway. If you go back 20 years, non-UTF-8 encodings such as Shift-JIS and EUC were by far the most popular in Japan. Nowadays? Sure, Shift-JIS and EUC are still used, but they're going downhill. Of the top 20 web sites in Japan (according to Alexa), 18 use UTF-8, one uses Shift-JIS, and one uses EUC on their home pages. In the w3techs survey of world web sites, 85% use UTF-8; the second most-popular encoding, ISO-8859-1, is at only 7.5%, and it's that high only because the old HTML standard made ISO-8859-1 the default.

So in practice, defaulting to UTF-8 is quite a good choice nowadays. Of course if we can get the proper encoding from the document or its envelope we should prefer that, and that should let us deal with web documents and email.
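To make the policy concrete, here is a minimal sketch in Python (not Emacs code, and not what the commit under discussion does). It prefers an encoding declared by the document or its envelope and falls back to UTF-8 only when no usable declaration exists. The names `pick_encoding` and `decode_text` are hypothetical and exist only for illustration.

```python
import codecs
from typing import Optional


def pick_encoding(declared: Optional[str], default: str = "utf-8") -> str:
    """Return the declared encoding if Python recognizes it, else the default."""
    if declared:
        try:
            # codecs.lookup normalizes aliases such as "Shift_JIS" or "UTF8".
            return codecs.lookup(declared.strip()).name
        except LookupError:
            pass  # unknown label: fall through to the default
    return default


def decode_text(data: bytes, declared: Optional[str] = None) -> str:
    """Decode bytes using the declared encoding, defaulting to UTF-8."""
    encoding = pick_encoding(declared)
    # errors="replace" keeps going on malformed input instead of raising.
    return data.decode(encoding, errors="replace")


# A Shift-JIS document whose envelope names its charset decodes with it.
print(decode_text("日本語".encode("shift_jis"), declared="Shift_JIS"))
# With no declaration, the text is assumed to be UTF-8.
print(decode_text("Türkçe".encode("utf-8")))
```

The design choice here is that a declaration always wins when it names a known encoding, so the UTF-8 default matters only for undeclared text.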
