Re: lynx-dev tech. question: translating strings to different charsets

From: Klaus Weide
Subject: Re: lynx-dev tech. question: translating strings to different charsets
Date: Mon, 6 Sep 1999 06:08:23 -0500 (CDT)

On Tue, 7 Sep 1999, Vlad Harchev wrote:
>  I can add that using unicode won't solve the problem (of hyrules collision
> when trying to hyphenate multilingual documents, say document with german and
> english) since unicode doesn't have separate codes for 'german letter f' and
> 'english letter f' - just "latin capital letter F". 

You are thinking about it the wrong way.  There is no such thing as a
"German letter f" as distinct from an "English letter f".  Language
specificity does not belong at the character level.  The algorithm
(in this case, for hyphenation) has to be language-specific, not the
data it operates on.

> So, there is no help from
> using unicoded documents/d.c.s. for improving hyphenation quality/avoiding
> collisions. 

Not if you are asking it the wrong questions.

There are no "collisions" in the character data.  If you like to use
that term, then the fundamental "collision" is that different people use
different languages with different rules.  So you need different algorithms
(or, in this case, different hyphenation patterns) for different languages.
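
To make that concrete, here is a minimal sketch (in Python rather than
Lynx's C, for brevity) of one routine that picks its hyphenation table by
language while the character data stays the same.  The names BREAKS and
hyphenate are hypothetical, and the break positions are illustrative
placeholders, not checked dictionary data; a real hyphenator would load
full pattern files per language.

```python
from typing import Dict, List

# Hypothetical per-language break tables, keyed by language code.
# The positions are illustrative placeholders, not dictionary data.
BREAKS: Dict[str, Dict[str, List[int]]] = {
    "en": {"administration": [2, 5, 7, 10]},
    "de": {"administration": [2, 4, 7, 10, 12]},
}


def hyphenate(word: str, lang: str) -> str:
    """Insert hyphens at the break points registered for `lang`.

    The same characters go in for every language; only the table that is
    consulted changes.  Unknown languages or words are returned unbroken.
    """
    points = BREAKS.get(lang, {}).get(word.lower(), [])
    pieces = []
    start = 0
    for point in points:
        pieces.append(word[start:point])
        start = point
    pieces.append(word[start:])
    return "-".join(pieces)


if __name__ == "__main__":
    for lang in ("en", "de"):
        print(lang, hyphenate("administration", lang))
```

The identical input string yields different break points depending only
on the language argument, which is the point: nothing about the "f" or
any other character needs to differ.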

If you cannot provide for that in the same program instance, it is a step
backward.  Currently, wrt. character sets at least, Lynx is "international"
in that it doesn't require a German Edition for some users, a Russian Edition
for others, etc.  You can view multiple languages using characters from
multiple scripts in the same program instance, even in the same document
(subject to the limitations of the output medium, of course).

> Only using <span lang=de>Debian</span> will help here, but such
> constructs are not used in the present-day web.

If the language is not specified in the document, then you have to provide
a way for the user to effectively set a default "assumed language".  (Or
perhaps some guessing algorithm, which should always be overrideable.)
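
One possible precedence order for this, as a rough sketch: an explicit
language from the document wins, then the user's assumed default, and a
guesser is consulted only if neither is set, so the user setting always
overrides the guess.  The function name effective_language and its
parameters are hypothetical, not existing Lynx options.

```python
from typing import Callable, Optional


def effective_language(
    doc_lang: Optional[str],
    user_default: Optional[str],
    guess: Optional[Callable[[str], Optional[str]]] = None,
    text: str = "",
) -> Optional[str]:
    """Pick the language whose rules should apply to `text`.

    Assumed order: document-specified language, then the user's assumed
    default, then an optional guessing function.  Returns None when
    nothing is known, in which case no language-specific processing
    should be applied.
    """
    if doc_lang:
        return doc_lang.lower()
    if user_default:
        return user_default.lower()
    if guess is not None:
        return guess(text)
    return None
```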

The LANG attribute is part of HTML and won't go away (after they went to
the trouble of adding it to nearly every element).  Ignoring it completely
while adding language-specific processing is just wrong.  It may be
understandable that you don't want to deal with it, for whatever reason,
but it is still wrong.  Denying its relevance completely a priori is
Completely Wrong.  No matter that most documents currently on the Web don't
use LANG.
That may or may not change to some degree.  But it should concern you
that those authors who _do_ use it according to specs, in multilingual
documents where it makes sense (different from your bogus "Debian"
example), will get the _wrong_ algorithm with your lazy approach.

It wouldn't be difficult in principle to make Lynx keep track of LANG
attributes.  After all we are already keeping track of various properties
of elements (from start to end tag) with the stacks in SGML.c and HTML.c.
It seems you are just refusing to think about it.
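
As a rough illustration of the stack idea (this is not the code in
SGML.c or HTML.c, just a sketch using Python's standard html.parser), each
start tag pushes the language in effect, inheriting from its parent when
it has no LANG of its own, and each end tag pops back to the matching
element.  The class name LangTracker and its fields are hypothetical.

```python
from html.parser import HTMLParser
from typing import List, Optional, Tuple


class LangTracker(HTMLParser):
    """Record which language is in effect for each run of text."""

    def __init__(self, default_lang: Optional[str] = None) -> None:
        super().__init__()
        # Stack of (tag, language in effect); the root entry carries the
        # assumed default language and is never popped.
        self.stack: List[Tuple[str, Optional[str]]] = [("#root", default_lang)]
        # Collected (language, text) pairs, in document order.
        self.runs: List[Tuple[Optional[str], str]] = []

    def handle_starttag(self, tag, attrs):
        # Use the element's own LANG if present, else inherit the parent's.
        lang = dict(attrs).get("lang") or self.stack[-1][1]
        self.stack.append((tag, lang))

    def handle_endtag(self, tag):
        # Pop back to the nearest matching start tag; this also discards
        # unclosed elements (such as <br>) that were pushed above it.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        if data.strip():
            self.runs.append((self.stack[-1][1], data))


if __name__ == "__main__":
    tracker = LangTracker(default_lang="en")
    tracker.feed('<p>Hello <span lang="de">Welt</span> again</p>')
    tracker.close()
    for lang, text in tracker.runs:
        print(lang, repr(text))
```

Each text run then carries the language whose hyphenation rules should
be applied to it, so a multilingual document gets the right rules span
by span.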

More replies (to the other message) later.

