lilypond-devel

Re: utf8 vs. latin1


From: Chris Sawer
Subject: Re: utf8 vs. latin1
Date: Sat, 1 Jan 2005 19:55:32 +0000

On 1 Jan 2005, at 15:22, Han-Wen Nienhuys wrote:

> I thought that you had once told me that Latin1 is a subset of
> UTF-8.

This is not correct. One has to be careful to distinguish between a character set (ASCII / Latin1 / Unicode) and a mapping (encoding) used to represent text written using a particular character set in a binary file.

ASCII (128 characters) and Latin1 (ASCII + a further 128 characters) are easily represented in binary files, as each character is simply stored as one byte. However, Unicode has over 90,000 characters, and there are a number of different mappings used to represent them in binary files.
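
As a quick illustration (a small Python sketch, not part of the original discussion), the same string of Unicode characters comes out as a different byte sequence under each mapping:

text = "café"
for encoding in ("latin-1", "utf-8", "utf-16-be", "utf-32-be"):
    data = text.encode(encoding)
    hex_bytes = " ".join(f"{b:02x}" for b in data)
    print(f"{encoding:10} -> {len(data):2} bytes: {hex_bytes}")

# latin-1    ->  4 bytes: 63 61 66 e9
# utf-8      ->  5 bytes: 63 61 66 c3 a9
# utf-16-be  ->  8 bytes: 00 63 00 61 00 66 00 e9
# utf-32-be  -> 16 bytes: 00 00 00 63 00 00 00 61 00 00 00 66 00 00 00 e9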

UTF-8 is a variable length encoding for Unicode, using possibly several bytes to represent each character. There's a very nice introduction on Wikipedia:
http://en.wikipedia.org/wiki/UTF8

In short, the first 128 Unicode characters (which coincide with the ASCII character set) are represented using one byte, of which the first bit is 0. You could therefore say that ASCII is a subset of UTF-8.

The next 1920 characters are encoded using two bytes, the rules for which are given on the above page.

Unicode characters 128-255 are the same as Latin1; however, when they are encoded in UTF-8, two bytes are used for each of them, following the rules given on the above page.

> However, when I save a file as Latin1 and UTF8 under emacs,
> then the results differ, and latin1 chars are also saved as double
> bytes. Am I missing something?

No, this is expected behaviour. For example:

e = ASCII / Latin1 character 0x65 (101 decimal), Unicode code point U+0065
  = 01100101 in standard ASCII or Latin1
  = 01100101 in UTF-8

However:

é = Latin1 character 0xE9 (233 decimal), Unicode code point U+00E9
  = 11101001 in Latin1
  = 11000011 10101001 in UTF-8

This is why you are getting different results. However, all modern text editors should be able to cope with UTF-8, so the above details are hidden from the user. It is a widely used standard, and is the default encoding for XML documents.
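
If you want to double-check the byte values above yourself, here is a throwaway Python sketch (purely illustrative):

for ch in "eé":
    for enc in ("latin-1", "utf-8"):
        bits = " ".join(f"{b:08b}" for b in ch.encode(enc))
        print(f"{ch} in {enc}: {bits}")

# e in latin-1: 01100101
# e in utf-8: 01100101
# é in latin-1: 11101001
# é in utf-8: 11000011 10101001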

We use UTF-8 internally to store all of the information about the pieces on Mutopia. It allows us to very easily store text which could include characters from any character set in the world. For example, we recently received a contribution from Matevž Jekovec (the Z has a small v on top) - his name is correctly recorded on the website, but at present he is unable to put the correct accent on his name in the footer using LilyPond.

[For anyone who's interested, the math behind the multi-byte encoding is quite interesting. See the above page for details.]
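
As a rough sketch of that math (Python again, purely illustrative, not anything LilyPond or Mutopia actually uses): a code point in the range 0x80-0x7FF is split into its top 5 and bottom 6 bits, which are packed into the two-byte pattern 110xxxxx 10xxxxxx:

def utf8_two_byte(code_point):
    # Only valid for code points 0x80..0x7FF (the 1920 two-byte characters).
    assert 0x80 <= code_point <= 0x7FF
    byte1 = 0b11000000 | (code_point >> 6)          # 110 + top 5 bits
    byte2 = 0b10000000 | (code_point & 0b00111111)  # 10  + low 6 bits
    return bytes([byte1, byte2])

print(utf8_two_byte(0x00E9))   # b'\xc3\xa9' -- matches "é".encode("utf-8")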

> Did you mean that Latin1 is a subset of Unicode?

This statement is indeed correct.

> Or should we be using a different unicode->bytes layout scheme?

I should have thought that UTF-8 is the ideal choice for LilyPond input files, as it allows the whole Unicode character set to be used, while retaining compatibility with ASCII for ease of transition.

Chris

--

Chris Sawer - address@hidden - Mutopia team leader
Free sheet music for all at:  http://www.MutopiaProject.org/




