Re: Casting as wide a net as possible

From: Random832
Subject: Re: Casting as wide a net as possible
Date: Tue, 15 Dec 2015 13:54:20 -0500
User-agent: Gnus/5.13 (Gnus v5.13) Emacs/24.5 (gnu/linux)

Filipp Gunbin <address@hidden> writes:
> I see.  However, this doesn't seem to affect English and American
> English languages, but rather European ones.

English does have occasional accented words borrowed from other
languages, e.g. naïve. There are also typographic punctuation
marks (more common among people whose word processing software
automatically substitutes them for typewriter quotes).

> Honestly, I always thought that those languages do not have many
> encodings in use, probably I'm wrong.

Well, obviously there’s Latin-1 and UTF-8. There’s also
Windows-1252, which is semi-compatible with Latin-1. You can
sometimes end up with the Windows-1252 bytes treated as if they
were Latin-1 C1 controls (and perhaps encoded further into
UTF-8). There are also older encodings that aren’t used much
anymore, e.g. DOS 437/850, MacRoman, etc.
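
As a rough illustration of the Windows-1252/Latin-1 mix-up described
above, here is a minimal Python sketch. The sample word and the
variable names are made up for the example; only the standard
"cp1252", "latin-1" and "utf-8" codecs are used.

```python
# Sample text containing a typographic apostrophe (U+2019).
text = "it\u2019s"

# A Windows program writes it out as Windows-1252: the apostrophe is byte 0x92.
cp1252_bytes = text.encode("cp1252")

# A reader that assumes Latin-1 turns 0x92 into U+0092, a C1 control character.
as_latin1 = cp1252_bytes.decode("latin-1")

# If that string is then saved as UTF-8, the C1 control gets baked in as C2 92.
as_utf8 = as_latin1.encode("utf-8")

print(cp1252_bytes)     # b'it\x92s'
print(repr(as_latin1))  # 'it\x92s'
print(as_utf8)          # b'it\xc2\x92s'
```

The apostrophe is invisible in most displays at this point, which is
part of why the damage tends to go unnoticed.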

I¹ve also seen content that was mechanically translated from one
to another using an 8-bit mapping table, with incompatible
characters mapped arbitrarily. For example, if you ever see
something with quotes/apostrophes replaced with superscripts,
like in this paragraph, this probably means the text originated
in MacRoman and was translated to Latin-1 with the ³André
Pirard² mapping.
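
A sketch of what such a table-driven conversion might look like in
Python. The table here is hypothetical and partial: it contains only
the three substitutions visible in the paragraph above, not the
actual mapping, and the function name is invented for the example.

```python
# Hypothetical partial table: only the substitutions visible in the
# paragraph above. The real mapping table is not reproduced here.
LOSSY_MAP = {
    "\u201c": "\u00b3",  # left double quote  -> superscript three
    "\u201d": "\u00b2",  # right double quote -> superscript two
    "\u2019": "\u00b9",  # right single quote -> superscript one
}

def macroman_to_latin1(data: bytes) -> bytes:
    """Decode MacRoman, swap characters Latin-1 lacks, encode as Latin-1."""
    text = data.decode("mac_roman")
    text = text.translate(str.maketrans(LOSSY_MAP))
    # Anything still outside Latin-1 becomes "?" rather than raising.
    return text.encode("latin-1", errors="replace")

original = "I\u2019ve seen \u201cAndr\u00e9\u201d".encode("mac_roman")
converted = macroman_to_latin1(original).decode("latin-1")
print(converted)  # I¹ve seen ³André²
```

Characters that exist in both encodings, like the é, survive the
trip; only the quotes are replaced, which is why the result still
looks mostly right.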

Anyway, the point is, since non-ASCII characters aren’t
pervasive, it’s easy to miss noticing that something’s wrong
with them. For one last demo, this paragraph features UTF-8,
treated as Windows-1252, and then re-encoded as UTF-8 again.
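
The UTF-8 / Windows-1252 / UTF-8 round trip can be reproduced, and in
simple cases undone, with a few lines of Python. This is only a
sketch with a made-up sample word; the reverse step assumes every
byte involved is defined in Python's "cp1252" codec, which is not
true for 0x81, 0x8D, 0x8F, 0x90 and 0x9D.

```python
# Sample text with a typographic apostrophe (U+2019).
text = "aren\u2019t"

# Correct UTF-8 bytes: the apostrophe is E2 80 99.
utf8_bytes = text.encode("utf-8")

# Misread as Windows-1252: E2 -> â, 80 -> €, 99 -> ™.
mojibake = utf8_bytes.decode("cp1252")

# Re-encoded as UTF-8, the three wrong characters are now stored for real.
double_encoded = mojibake.encode("utf-8")

# Undo the damage by reversing the two steps.
repaired = double_encoded.decode("utf-8").encode("cp1252").decode("utf-8")

print(mojibake)  # arenâ€™t
print(repaired)  # aren’t
```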
