lynx-dev

Re: lynx-dev hyphenation (was tech. question: translating strings)


From: Vlad Harchev
Subject: Re: lynx-dev hyphenation (was tech. question: translating strings)
Date: Tue, 7 Sep 1999 19:06:51 +0500 (SAMST)

On Mon, 6 Sep 1999, Klaus Weide wrote:

> [ I am only replying to some portions now; maybe more later ]
> 
> On Tue, 7 Sep 1999, Vlad Harchev wrote:
> > On Mon, 6 Sep 1999, Klaus Weide wrote:
> > 
> > > [ last part of a series of replies ]
> > > On Sun, 5 Sep 1999, Vlad Harchev wrote:
> > > And that is one good reason why translation to the d.c.s. should be
> > > deferred to a later stage, i.e. it should be done as late as possible
> > > (GridText.c instead of SGML.c) so that various pieces of code that look
> > > at the data stream can assume it is in a standard encoding.
> > 
> >  Better have it in wide characters rather than in utf8 then. 
> 
> Yes, that's one possibility.  And "wide characters" could be either
> 2 or 4 bytes.  And the format for passing text around (SGML.c ->
> HTML.c ->GridText.c) doesn't have to be the same that's used for
> "storing it" where memory usage is important (i.e. mostly HTLine).

 Then you'll spend a month redesigning lynx in this fashion (if you are
serious).
 
> > But I don't see
> > any use of it, really (it would be useful for generalized 'isalpha()',
> > 'tolower()', etc, but this IMO is used only in searching for strings).
> 
> There are lots of things that could become simpler if a Unicode
> representation were used throughout.

 They could be done more simply, but they are already done. Why do you plan
to spend precious time on unnecessary internal redesigns (be pragmatic, not
paranoid) when it could be spent on more useful things?

> What size is your screen (in terms of character cells)?

 80x25 (normal size).

> What is the normal size of fonts you are using on the console?

 8x16 - koi8-8x16. But it seems this depends on the documents you are viewing -
as I understand it, lynx will try to activate the font for the charsets
mentioned in the documents. If you wish, I can send an ls of the fonts available
on my system (though that was on RH5.2, and now I use RH6.0).

> Are you using "svgatextmode" or something similar?

 No.

> 
> Anyway all lynx does is invoke the "setfont" command with various
> arguments (well that, and some escape sequences).  If that breaks your
> system in the way you describe, then your font size is unusual (you'd
> have to adjust the hardwired font filenames in UCAuto.c) or something
> is wrong with your "setfont" command.
> 
> What I don't understand is why this happens (only?) on _exiting_ lynx,
> that should just restore the original state.  You could try to run the
> "setfont" invocations by hand.

 I didn't put much effort into preventing this. I had this problem 3-4 times
and then disabled that functionality.
 
> >  I'm glad that you understand that UTF-8 (and UCS*) doesn't have anything
> > to do with "mixing several languages that use the same repertoire in one
> > document" (I thought you thought that this was a solution).
> 
> Huh?  It was you who seemed to somehow see a connection between "UTF8
> in documents" (i.e. externally) and "mixing languages".  Now you seem
> to change the topic to something else completely.

 Maybe it's my bad English. But I was trying to convince you that the use of
Unicode can't prevent hyrules collisions (or incorrect hyphenation) in documents
that mix languages with non-disjoint repertoires.

> > The 'lang=' is for solving 
> > this. Why do you push "unicode" everywhere?
> 
> It is already used in lynx for the character translations.  Whether you
> know it or not, when you view a cp<something> Russian text with KOI8-R
> you are using it.  Using it as a common lingua franca allows translation
> between N charsets with O(N) instead of O(N**2) tables.  That alone
> should be a good enough reason for using it internally.

 But conversion between 2 given charsets would take much more time if Unicode
is used (and libhnj would have to be rewritten).
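
 The trade-off discussed here can be sketched in a few lines. This is only an
illustration in Python, not lynx code: the function name build_direct_table
and the "?" fallback byte are assumptions made for the example, and Python's
built-in codecs stand in for lynx's own charset tables. The idea is that only
one charset<->Unicode table per charset is maintained, and a direct 256-entry
byte table for any given pair is derived from those once, so the per-byte cost
afterwards is a single table lookup.

```python
def build_direct_table(src, dst, fallback=b"?"):
    """Derive a 256-entry src->dst byte table by pivoting through Unicode.

    Hypothetical helper: bytes that are undefined in src or have no
    equivalent in dst are mapped to the fallback byte.
    """
    table = bytearray(256)
    for code in range(256):
        try:
            char = bytes([code]).decode(src)   # src byte -> Unicode
            table[code] = char.encode(dst)[0]  # Unicode -> dst byte
        except UnicodeError:
            table[code] = fallback[0]
    return bytes(table)


def tables_needed(n_charsets):
    """Return (pivot_tables, direct_tables) needed to cover every pair."""
    return 2 * n_charsets, n_charsets * (n_charsets - 1)


# Built once, then reused for every document in that charset pair.
cp1251_to_koi8r = build_direct_table("cp1251", "koi8_r")
text_cp1251 = "Привет".encode("cp1251")
text_koi8r = text_cp1251.translate(cp1251_to_koi8r)
```

 With 10 charsets, tables_needed(10) gives 20 pivot tables (one in each
direction per charset) against 90 hand-maintained direct tables.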

> [...] 
> >  I assume you mean these letters have equal char.codes in d.c.s.
> 
> No, not at all!
> 
> >  If I was encountering such documents, I'd compose or choose another font -
> > that means that these 2 chars will have different character codes in that
> > d.c.s. 
> 
> The point was that *there is no 8-bit charset* that has them both.

 This makes a difference. Your example is a good illustration of why utf8
d.c.s. are needed. Thanks.
 Well, lynx without hyphenation doesn't look too bad :)
 But it seems Russian is one of the very few languages that don't use Latin
letters - Hebrew, Arabic, Greek, Turkish and Ukrainian are others. So such a
problem is very rare. Lynx is open source software - I believe some hacker
will do this later. Why should other people suffer from the absence of
hyphenation :) ?

> >  "and like" means CJK texts (hyphenation doesn't make sense for J, but for C
> > and K I don't know). As for utf8-encoded hyrules - the hyphenation simply
> > won't work, or libhnj won't load the dictionary. In other words, each single
> > byte in hyrules denotes a single "human letter", and each single byte in
> > d.c.s. denotes a single "human letter" (and not part of a letter) - to make
> > direct table-driven translation possible.
> 
> You could change it to operate on shorts instead of bytes, right?

 Of course, but this will take a lot of my time (5 days of 8-hour hacking to
implement exactly what you want - hacking libhnj, gathering SGML tables,
etc.) - I can't spend so much time (remember - I have to implement lynx.cfg
settings too - this is 3 more days). So I prefer not to deal with Unicode; I
will describe to interested people how to add support for utf8-d.c.s.
hyphenation in lynx. Currently, hyphenation never takes place if d.c.s. is
utf8 or HTCJK != NOCJK, so no crashes, just silent rejection. You won't use it,
so you won't suffer.
I don't set utf8 d.c.s., so I won't suffer. IMO very few people use utf8
d.c.s.
 I'm afraid that if I try to implement utf8 in a limited period of time,
I'll be fired.
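
 The "one byte = one letter" assumption and the silent rejection described
above can be illustrated as follows. This is a Python sketch, not the actual
patch: bytes_per_letter and hyphenation_allowed are hypothetical names, and
the NOCJK constant here is only a stand-in for the value lynx compares HTCJK
against.

```python
import codecs

NOCJK = 0  # stand-in value for this sketch only


def bytes_per_letter(letter, charset):
    """Number of bytes a single letter occupies in the given charset."""
    return len(letter.encode(charset))


def hyphenation_allowed(display_charset, cjk_mode):
    """Return False (silently) whenever a byte may not be a whole letter.

    Mirrors the rule stated above: no hyphenation if the display
    character set is UTF-8 or a CJK mode is active.
    """
    if cjk_mode != NOCJK:
        return False
    # codecs.lookup() normalizes aliases such as "utf8" and "UTF-8".
    return codecs.lookup(display_charset).name != "utf-8"


# A Cyrillic letter is one byte in KOI8-R but two bytes in UTF-8, so a
# per-byte table lookup would see two half-letters in the UTF-8 case.
koi8_len = bytes_per_letter("я", "koi8_r")
utf8_len = bytes_per_letter("я", "utf-8")
```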

 As for the 35 Mb of VSS - colorstyle changes take 1Kb for each HTLine. There
were 22000 lines in that file - that's why the memory usage is so big. I want
it to be fixed.
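
 As a rough check of these figures (a back-of-the-envelope sketch in Python,
assuming "1Kb" means 1024 bytes and "Mb" means 2**20 bytes; the variable names
are made up for the example):

```python
LINES = 22000                # lines in the file, as stated above
STYLE_BYTES_PER_LINE = 1024  # assumption: "1Kb" of colorstyle data per HTLine
REPORTED_VSS_MB = 35         # total virtual size, as stated above

# Memory attributable to colorstyle data alone, in units of 2**20 bytes.
style_mb = LINES * STYLE_BYTES_PER_LINE / (1024 * 1024)

# Fraction of the reported total that the colorstyle data would explain.
share = style_mb / REPORTED_VSS_MB

print(f"colorstyle data: {style_mb:.1f} Mb, {share:.0%} of the reported VSS")
```

 Under these assumptions the colorstyle data alone accounts for well over half
of the reported total.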

> 
>    Klaus
> 

 Best regards,
  -Vlad

