Re: [Freefont-bugs] Discussion and questions on Unicode Han Unification
Tue, 26 Apr 2011 16:41:53 +0200
You wrote a lot of stuff, and asked a lot of questions, and some of
the topics are to my mind sort of mixed. Furthermore, your recent
reply makes me wonder about my understanding of the questions you were
asking... It may be we are misunderstanding one another.
On Wed, Jan 26, 2011 at 7:52 AM, Ange Gapes <address@hidden> wrote:
> sorry this is not directly about bugs in Freefont, nor direct development
> matters, but I could not find a more generic ml for your project. But I
> think this kind of discussion is still of interest. Hopefully you will think
> I recently came to some interest on the Han unification project and problem
> they implies for texts mixing languages. As you are a font project, I guess
> you know the issues, but for those who don't, I summarize this way:
> typically for the main 3 languages (Chinese, Japanese, and Korean, though
> these last one don't use them much in modern writing, hence CJK) who use
> Chinese-originated characters (Han characters), the Unicode project has
> decided to unite the character from a same origin (Han Unification: Unihan).
> This leads to problem when the actual writing of them is different depending
> on the actual country, sometimes slightly (style), sometimes in a more
> obvious way. The Wikipedia page has good examples on the issue:
> (this is significant only if you have right fonts on the computers which
> will show actually the characters with difference).
> The way it is dealt with is:
> - you use only one of these languages, then you don't care and take only
> fonts which display your chosen language's way.
> - if you read texts of several languages, or even mixed inside a same text,
> the text can have some kind of markup then different fonts are selected.This
> is the way it is done in html, hence you can see different fonts for the
> actually same unicode character in the Wikipedia page I showed before.
> But what when you read raw text file without markup for instance? No sure
> way to tell the language for the editor and mixed characters won't show up.
> So why do I tell this all to you? I would like to know your opinion, if not
> position, towards this Unicode decision. Do you have any remarks on it?
Overall, I think the choice was a good one. There were several issues
that had to be balanced.
From what I understand of your examples, I would say the Unicode
standard allows authors the latitude to handle the issues of shared
characters among scripts in more than one way, depending on their
needs. There may have been specific oversights, but frankly, I don't
In the case of a text file, the author has a choice of either
specifying a shared character that would be understood regardless of
the language context, or else a specially-encoded alternative for the
character. It depends on what they want to do.
What would you propose?
> Also what does it mean for a project like yours?
The policy for FreeFont, and its rationale as a multi-script set of
fonts, is that it provides a set of characters that look OK together
in text of mixed writing systems.
If we were to support CJK, I'm fairly confident that there are
adequate technical means to handle all the issues you raise.
> Is it possible in a same
> font family to provide several different fonts/design for the same character
> with "context" information (= this font is preferably for Chinese display
> only, unless no other choice, this one for Japanese, and so on) and a
> default one maybe (in case no context is available, use this "generic"
> design)? So that a software using your font only may still display different
> designs depending on the displayed language (if it knows it) or a default
> version otherwise...
A TrueType/OpenType "feature" of a font can indicate that the glyph
for a character ought to be replaced by another glyph when it
appears in the context of a specified language or script. The current
FreeFont has some instances of just this kind of thing.
The feature could specify that the glyph of a different Unicode
character, or even a glyph with no Unicode encoding, should serve as
It is up to the font rendering software to actually implement the
And this of course has no bearing on the case of encoding text files.
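As a rough illustration of the glyph-replacement idea described above, here is a toy model in Python. It is not how a font stores its rules (real fonts keep them in OpenType layout tables, for instance under the localized-forms feature `locl`, and the renderer applies them). The glyph names, language tags and lookup table are all hypothetical and are not taken from FreeFont.

```python
from typing import Optional

# Toy model only. Every name below is hypothetical.
# One code point maps to one default glyph, whatever the language.
DEFAULT_GLYPHS = {"\u76f4": "uni76F4"}  # a unified Han character

# (language tag, default glyph) -> regional alternate glyph
LOCALIZED_FORMS = {
    ("ja", "uni76F4"): "uni76F4.jp",
    ("zh-Hans", "uni76F4"): "uni76F4.sc",
}


def select_glyph(char: str, language: Optional[str] = None) -> str:
    """Pick a glyph name for char, preferring a regional form if one exists."""
    glyph = DEFAULT_GLYPHS[char]
    if language is None:
        # Plain text without language information: fall back to the default.
        return glyph
    return LOCALIZED_FORMS.get((language, glyph), glyph)


print(select_glyph("\u76f4"))        # no language known
print(select_glyph("\u76f4", "ja"))  # language supplied by markup
```

The encoded text is the same in both calls. Only the language information supplied from outside the text changes which glyph is chosen, which is why none of this affects how a text file is encoded.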
> On a side note, I read somewhere that there were maybe some other kinds of
> characters where similar problems arise. In particular I read on a website
> about another example of Arabic characters being used in several
> country/languages but displayed slightly differently. Yet after some search,
> I could not find actual information on this specific issue, so I don't know
> if it is true, or maybe it has been fixed since then by the Unicode project
> by assigning specific characters or control characters to change the
> display? (Arabic don't have that many characters as those East Asian
> languages, hence less space issue for duplicating characters)
> Do you know about such specific Arabic-character issue? Or other issues with
> other glyphs in other alphabet?
I'm not the one to ask, although I know this issue, and others, arise.
In the case of Arabic mixed with Latin script, other problems arise, such as
accommodating the vertical range of Arabic while keeping a tight line height.
> Do you participate into Unicode standardization?
I don't at all. Some people have been involved in both FreeFont and
Unicode standardization.
> Do you have details on what conducted to this internally?
> Is it really ONLY a space problem?
Again, I have no inside information, but I see other issues as more
essential than space.
In the case of Chinese, space for all the characters within 16 bits is
a concern. It fits, but doesn't leave a lot of room.
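To put rough numbers on the 16-bit remark, the short sketch below counts the code points in two Han blocks that sit inside the 16-bit range. The block boundaries are those published in the Unicode code charts. Treating just these two blocks as "the Chinese characters" is a simplifying assumption made here, not a figure from this discussion.

```python
BMP_SIZE = 2 ** 16            # code points addressable in 16 bits
UNICODE_SIZE = 0x10FFFF + 1   # the whole Unicode code space

# Block ranges as listed in the Unicode code charts.
cjk_unified = range(0x4E00, 0x9FFF + 1)  # CJK Unified Ideographs
cjk_ext_a = range(0x3400, 0x4DBF + 1)    # CJK Unified Ideographs Extension A

bmp_han = len(cjk_unified) + len(cjk_ext_a)
share = bmp_han / BMP_SIZE

print(f"Han code points in these two blocks: {bmp_han}")
print(f"Share of the 16-bit range: {share:.1%}")
print(f"Total Unicode code space: {UNICODE_SIZE}")
```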
But as to whether it's "ONLY a space problem", I would say no.
Unicode is reluctant, if not opposed as a matter of fixed policy, to
make separate encodings based on purely stylistic differences. There are
cases of this, but I think somebody had to argue that the meaning of
the character was somehow different.
The characters shared among CJK are commonly viewed as being
historically and essentially the same characters, even if sometimes
the style of writing them has drifted in a different way in one region
than another. Furthermore, the style of writing the characters has
also drifted over time, in the same place.
There are western analogs.
For one example, 200 years ago, English typography featured a "long
s", which is no longer commonly used, and is very confusing for modern
readers where it appears.
Unicode encodes a "long s" character (U+017F).
If it's important to the content of the text, the specially-encoded
long s can be used (at the danger that no font on the system has a
glyph for the letter and at the risk that the reader might not
recognize the old letter.)
If not, the author can just use "s" for the "long s", as is usually
done when transcribing these old texts.
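The two options for the long s can be tried out with Python's standard `unicodedata` module. This is only a small sketch: the sample word and the `modernize` helper are made up for illustration.

```python
import unicodedata

LONG_S = "\u017f"  # the separately encoded long s


def modernize(text: str) -> str:
    """Replace the long s with the ordinary s, as a transcriber might."""
    return text.replace(LONG_S, "s")


old_spelling = "Congre" + LONG_S + "s"  # hypothetical sample word

# The character has its own name in the Unicode character database.
print(unicodedata.name(LONG_S))

# A transcriber's manual replacement.
print(modernize(old_spelling))

# Compatibility normalization (NFKC) also folds the long s into "s".
print(unicodedata.normalize("NFKC", old_spelling))
```

Both the manual replacement and NFKC normalization discard the distinction, so a text that needs to preserve the long s has to be stored without that folding.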
> even though there are for sure a lot of characters in these countries, it
> looks to me there are still a lot of slots unassigned, really far enough
> (that's how Unicode has been designed after all: with far enough slots for
> all history, as far as I know). So I don't see the points of keeping them
> for no reason (it's not like suddenly new alphabets of hundred of thousands
> of characters, all new, will be created in the next century).
> And in the worst case, Unicode may still be extended.
> So if you have any particularly interested link to discussion in the Unicode
> project (mailing lists maybe?) about how we came to this, this is
> interesting as well.
> I will also myself ask directly to Unicode guys later, but I first wanted to
> know the opinion of a font project whose goal would be to span on all the
> Unicode. What does that imply for you?
I'm not sure what you're getting at here.
Do you think some characters are unrepresented somehow?
Note the distinction between a "character" and the form in which it is printed.
For example, the Latin letter a has many forms (which sometimes
amounted to regional variants), but these forms don't get different
> And so on second level, why do I ask all this? Simply first of all I am
> interested in Unicode, in such questions, for personal use but also for pure
> intellectual interest (among other reasons, being myself involved in
> standardization processes, though not directly into Unicode, for now at
> least). Also because I think this is pretty sad and when I read about this,
> I didn't agree much with such moves (whereas the prime goal of Unicode was
> to support any existing character, so this looks like a step backwards; and
> also because we know that some countries, Japan at least for what I know, is
> not very into standardization, thus they don't use that much the Unicode
> encodings, like UTF-8, but localized encodings, and this kind of move won't
> make them want to change this).
Again it seems as if you think Unicode is failing to represent a character.
There are surely some omissions, but I expect they're very rare.
If I understand you rightly, I think you're missing something here.
First off, the notion of language distinction is a bit of an illusion.
The big nationalistic efforts of the 19th and 20th centuries enforced
the illusion of a single historical German and French and English, but
200 years ago, the languages across Europe formed a sort of continuum.
Likewise with Eastern languages.
What is worse, languages and their scripts change in time.
Would you have a separate encoding for each valley, for each town?
A separate encoding for each generation?
Or does one buy into these nationalistic notions of separate and
un-mixable peoples and traditions?
In this sense, Unicode represents the historical fact and common
understanding that the shared Eastern ideographs come from a common
And as I said, modern font technology can handle stylistic differences
between regional scripts, if it is called for.
Once again, it's a trade-off. It's the real world.
> And also because I am currently beginning to write what-may-become-a-book,
> in some future, not on this in particular, but this kind of topic may be
> part of it.
In case it does become-a-book, be sure to post a pointer on this list!
> So thanks all. Any opinion and information on the topic would be greatly
De nada. But that's enough for now!
> P.S.: and for personal use, a last question: do you plan on supporting these
> East-Asian characters in some foreseen future? In particular Japanese
> Hiragana-Katakana-Kanjis and Korean basic alphabet?