lynx-dev hyphenation (was tech. question: translating strings)


From: Klaus Weide
Subject: lynx-dev hyphenation (was tech. question: translating strings)
Date: Tue, 7 Sep 1999 09:29:30 -0500 (CDT)

On Tue, 7 Sep 1999, Vlad Harchev wrote:

> On Mon, 6 Sep 1999, Klaus Weide wrote:

>  Libhnj builds a finite state machine while reading hyrules. After it's read,
> that state machine is used for hyphenation. Obviously, the characters of the
> word become the 'input' for this state machine. The 'pattern' is associated
> with each state - it's 'yielded' to aggregate into the info about how the
> given word can be hyphenated.
>  Using UTF-8 for non-English languages that don't use Latin letters, such as
> Russian, will increase the number of states in that state machine by the
> length in bytes of the UTF-8-encoded Russian letter, or

So is it possible at all or not, to apply the algorithm in UTF-8 form?
In some other messages I got the impression that it wasn't.  I am just
asking for clarification, not saying it would be a good idea to do it
that way.

>  If wide character strings are used, the input to this state machine will
> be ints - so it would have to be hacked.

I still think that would be the best way in which the algorithm should be
applied.  Not necessarily for storing the rules, or for representing text
within lynx (except temporarily while applying the rules).
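
To make the trade-off concrete, here is a minimal sketch of Liang-style
pattern hyphenation driven by a plain trie. It is not libhnj's code: the
function names and the toy patterns are invented for illustration, and the
trie has no fallback links. It shows that the same patterns work with either
code points or UTF-8 bytes as input symbols, and that the byte form simply
needs more states.

```python
def parse_pattern(pattern):
    """Split a Liang-style pattern such as 'a1b' into letters and weights.

    weights[k] is the weight of the position just before letters[k].
    """
    letters, weights = [], [0]
    for ch in pattern:
        if ch.isdigit():
            weights[-1] = int(ch)
        else:
            letters.append(ch)
            weights.append(0)
    return letters, weights


def codepoint_symbols(letters):
    """Input symbols are whole characters (the 'wide character' variant)."""
    return letters


def utf8_symbols(letters):
    """Input symbols are the individual bytes of the UTF-8 encoding."""
    return list("".join(letters).encode("utf-8"))


def build_trie(patterns, symbols=codepoint_symbols):
    """Build a trie of nested dicts; the key None holds a pattern's weights."""
    root = {}
    for pattern in patterns:
        letters, weights = parse_pattern(pattern)
        node = root
        for symbol in symbols(letters):
            node = node.setdefault(symbol, {})
        node[None] = weights
    return root


def count_states(node):
    """Count trie nodes, i.e. states of this (simplified) machine."""
    return 1 + sum(count_states(child)
                   for key, child in node.items() if key is not None)


def hyphen_points(word, trie):
    """Return indexes i such that a break is allowed before word[i].

    Works on a code-point trie only; '.' marks the word boundaries.
    """
    text = "." + word + "."
    levels = [0] * (len(text) + 1)
    for start in range(len(text)):
        node = trie
        for pos in range(start, len(text)):
            node = node.get(text[pos])
            if node is None:
                break
            for offset, weight in enumerate(node.get(None, [])):
                levels[start + offset] = max(levels[start + offset], weight)
    # Odd weights allow a break; never break before the first letter.
    return [i - 1 for i in range(2, len(text) - 1) if levels[i] % 2 == 1]


if __name__ == "__main__":
    # Toy Cyrillic patterns (U+0430, U+0431, U+0432), not real hyrules.
    toy = ["\u0430" "1" "\u0431", "\u0430" "1" "\u0432"]
    print(count_states(build_trie(toy)))                # 4 states
    print(count_states(build_trie(toy, utf8_symbols)))  # 6 states
```

With two-byte letters every character step becomes two byte steps, so the
byte-level machine is larger but otherwise behaves the same; the wide
character variant keeps the machine small at the cost of a temporary
conversion of the word being hyphenated.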

> > > * IMO we can turn lynx into a powerful charset translator with a very cheap
> > >     hack (I mean adding something like
> > >     'lynx -recode utf-8 koi8-r < in > out')
 
>  Why not provide this script with the lynx distribution?

Who - we, me, or you?
I can't speak for you, but afaiac it's not clean enough, too system (shell)
dependent, and of limited use, and there are better tools for the job.
And providing and documenting this more or less implies that it has to
be kept working in the future.

> (I.e. add a /bin directory
> with scripts, etc.) Some utilities for hyphenation should also be bundled
> with lynx (it seems Raph stopped maintaining his libhnj, so it's not wise
> to include those utilities with libhnj) - like adding an exception to the
> set of hyphenation exceptions.
>  Or maybe another command line option can be used to avoid messing with
> lynx.cfg or .lynxrc - maybe this will be better in this particular case (I
> mean the 'recode' behavior).

If working as a filter for charset translation were one of the normal
modes of lynx (rather than an unsupported hack) then yes, it should
be controlled by command line options.
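
For comparison, this is roughly what such a filter amounts to outside of
lynx. It is only a sketch using Python's standard codecs, not lynx's
translation tables, and the script name and argument order are made up here.

```python
import sys


def recode(data, source, target):
    """Decode bytes from one charset and re-encode them in another.

    Characters that the target charset cannot represent become '?'.
    """
    return data.decode(source).encode(target, errors="replace")


# Hypothetical usage:  python recode.py utf-8 koi8-r < in > out
if __name__ == "__main__" and len(sys.argv) == 3:
    source, target = sys.argv[1], sys.argv[2]
    sys.stdout.buffer.write(recode(sys.stdin.buffer.read(), source, target))
```

Both charsets come from the command line, which is the kind of control
described above: nothing has to be changed in a configuration file to run it
as a filter.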



>  I plan to implement the following configuration settings and command line
> options:

I hope Henry will have something to say here, so I am not going to talk
about the number of new top-level options...

> HYPHENATE:TRUE #or FALSE
> 
>  HYPHENDIR:dirname
> # the name of the directory where hyrules files are located, if their name is
> # not absolute.
> 
>  HYPHENDICT:tag:<FILESPEC>:CHSET
> # each set of files with hyrules can be assigned a tag - a string without
> # ':' in the name - that tag will be used in referring to it.
> # <FILESPEC> specifies the filenames which should be concatenated to get the
> # required set of hyrules. It has the following grammar:
> # <FILESPEC>: <FILE> | <FILE>+<FILESPEC>
> # i.e. a list of file names (some of them can be non-absolute) separated
> # with '+'.
> # CHSET is the charset of the resulting set of hyrules - the name of a charset
> # known to lynx. If omitted, iso-8859-1 will be assumed.

You should use a different separator than '+'.  Didn't we go through this
a while ago wrt. INCLUDE?

You don't explain the nature of the binding between CHSET and TAG (it
should be capitalized).  I.e. it's not just the charset in which the
rules are given, but the rules only apply if that charset is selected
as the display character set (d.c.s.).
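
A minimal sketch of how the HYPHENDICT value proposed above could be split,
taking the quoted description literally. HYPHENDICT and HYPHENDIR are only
proposals in this thread, not existing lynx options, and the function name
and the POSIX-style path handling are assumptions made for the example.

```python
import posixpath


def parse_hyphendict(value, hyphendir="", default_charset="iso-8859-1"):
    """Split 'tag:<FILESPEC>[:CHSET]' into (tag, file list, charset).

    Non-absolute file names are resolved against hyphendir, standing in
    for the proposed HYPHENDIR setting.
    """
    fields = value.split(":")
    if len(fields) not in (2, 3) or not fields[0] or not fields[1]:
        raise ValueError("expected tag:<FILESPEC>[:CHSET], got %r" % value)
    tag, filespec = fields[0], fields[1]
    charset = fields[2] if len(fields) == 3 and fields[2] else default_charset
    files = []
    for name in filespec.split("+"):
        if not name:
            raise ValueError("empty file name in %r" % filespec)
        if not posixpath.isabs(name):
            name = posixpath.join(hyphendir, name)
        files.append(name)
    return tag, files, charset
```

The sketch also shows one cost of in-band separators: a file name that
contains '+' or ':' cannot be expressed in this syntax at all.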

> HYPHENCTL:TAG:<LANGSPEC>:<URLSPEC>
> # specifies the conditions for activating the set of hyrules tagged with TAG.
> # If TAG is '-', then no hyphenation will be applied.
> # LANGSPEC specifies the content-language provided by http or <html lang=>
> # or <META http-equiv ..>. It has the following grammar:
> # <LANGSPEC>: * | <CONCRETE_LANGSPEC>
> # <CONCRETE_LANGSPEC>: LANGNAME | LANGNAME,<CONCRETE_LANGSPEC>

I find your BNF-like syntax hard to understand, and I do understand the
form that is used in RFCs.  This will be incomprehensible for most
people.  As are the prettysrc settings, but I digress.  A form where you use
'LANGNAME[...]' and explain that the '...' means optionally more,
comma separated, has a better chance of being understood.

> # I.e. '*' (that matches unspecified language) or list of language names such as
> # 'en' (defined by RFC1766).
> # <URLSPEC> specifies URLs for which it's applicable:
> # <URLSPEC>: * | <URLSPEC_PATTERNS>
> # <URLSPEC_PATTERNS>: <URLSPEC_PATTERN> | <URLSPEC_PATTERN>,<URLSPEC_PATTERNS>
> # where <URLSPEC_PATTERN> can be one of the following:
> # address@hidden
> # @domain_suffix
> # where path will be matched from the beginning of the remote path, and
> # domain_suffix will be matched from the end of the domain name excluding port
> # number (e.g. "@.edu", "translations/address@hidden")

Yet another and _completely_ unintuitive way for specifying URL matching,
that's just too horrible to be true.  The only possible reason is that you
are too lazy to parse URL patterns that are given in normal URL order,
so you dump it on the user to learn a new syntax.
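
For reference, the matching rules in the quoted proposal can be sketched as
follows. This covers only the LANGSPEC list and the '@domain_suffix' form;
the path form is obscured in the quoted text and is left out. The function
names are invented, and treating '*' as matching any language, including an
unspecified one, is an assumption.

```python
def language_matches(langspec, language):
    """Check a LANGSPEC ('*' or 'en,de,...') against a content language.

    language is None when the document does not specify one.
    """
    if langspec.strip() == "*":
        return True  # assumption: '*' matches everything, even None
    if language is None:
        return False
    wanted = {name.strip().lower() for name in langspec.split(",")}
    return language.strip().lower() in wanted


def domain_matches(pattern, host):
    """Check an '@domain_suffix' pattern (or '*') against a host name.

    The suffix is compared with the end of the host name after any
    ':port' has been dropped (IPv6 literals are not handled).
    """
    if pattern == "*":
        return True
    if not pattern.startswith("@"):
        raise ValueError("only '*' and '@domain_suffix' are handled here")
    hostname = host.partition(":")[0].lower()
    return hostname.endswith(pattern[1:].lower())
```

For example, `domain_matches("@.edu", "www.example.edu:8080")` is true while
`domain_matches("@.edu", "example.com")` is false.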

> # This setting will help to try to avoid collision of hyrules for languages
> # that have common letters used in human words (like German and English).

It still doesn't make sense to talk about collisions.  That seems to
imply that a "collision-free" mode is somehow the normal case.  But
there isn't one, for nearly every combination of a human language with
nearly any other (English without accents being the big exception).
You have to explain how collision-free combination could work if you
talk about collision at all.

   Klaus

