Re: lynx-dev tech. question: translating strings to different charsets

From: Vlad Harchev
Subject: Re: lynx-dev tech. question: translating strings to different charsets
Date: Tue, 7 Sep 1999 14:14:50 +0500 (SAMST)

On Mon, 6 Sep 1999, Klaus Weide wrote:

> On Sun, 5 Sep 1999, Vlad Harchev wrote:
> > On Fri, 3 Sep 1999, Klaus Weide wrote:
> > 
> >  As for translation, here are my thoughts:
> > * to avoid performance decrease due to LYUCFullyTranslateString_1, the
> >   following thing can be used:
> >     the translation of each character used in hydict chset (aka "human
> >     letter")  to d.c.s. can be precalculated (since translation of even
> >     unicode "characters" is zero-state machine) - so seems flexibility is
> What does that remark in parenthesis mean?  I don't understand it
> at all.
1st remark:
 By "human letter" I mean any character that is used in human-language words
(like "a", unlike "%" or "3").
2nd remark:
 I meant that for non-UTF-8 chsets, any character can be translated from the
source chset to the destination chset by a single table lookup; no state
belongs to such a translation process (unlike UTF-8). That's why I called it a
zero-state machine (incorrectly, it should be called a one-state machine).
Sorry for the confusion.

> >     regained - user will have to specify either in hydict (as comment) or in
> >     lynx.cfg the chset used in hydict to make such translation.
> It seems you are talking about translating the patterns at runtime (at
> program start and/or each time the display character set (or something
> else?) is changed?).  That will only be general enough if the the pattern
> input before translation is in a form that is general enough for the
> languages to be covered, which in general means you have to provide
> them as UCS (in whatever encoding, e.g. UTF-8) for some languages
> (or possibly combinations of languages, if that's supposed to be
> covered too).

 IMO such generality is redundant (at least you won't convince me that I
should do this - I don't have that much free time). What I meant is:
 When the d.c.s. is changed, a table of mappings from the (new) d.c.s. to the
c.s. of the hyrules is built. Such a table allows table-driven translation of
each character in a word being aggregated (for hyphenation) without a call to
*FullyTranslate*. The hyrules are not in UTF-8 because they are translated
from TeX hyrules (which are not in UTF-8, since IMO TeX doesn't support
UTF-8). IMO it's better to specify the chset of the hyrules in lynx.cfg
rather than in comments inside the hyrules; this makes it possible to
concatenate hyrules (e.g. for English and Russian) without editing the
resulting file. And I don't see any loss of flexibility in not using UTF-8
(the patterns will be 3 or so times bigger if UTF-8 is used, the conversion
scripts must deal with it ... - this can be completely avoided if the user
provides the correct chset name in lynx.cfg).
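
 To illustrate the table-driven translation described above, here is a
minimal sketch in Python (not lynx code). It assumes both chsets are
single-byte; the chset names "cp1251" (standing in for the d.c.s.) and
"koi8_r" (standing in for the chset of the hyrules), the "?" fallback and the
names build_table and TABLE are only examples chosen for illustration.

```python
def build_table(src: str, dst: str, fallback: bytes = b"?") -> bytes:
    """Precompute a 256-entry byte-to-byte table from chset src to chset dst.

    Both chsets are assumed to be single-byte. Bytes that are undefined in
    src or have no single-byte equivalent in dst map to the fallback byte.
    """
    table = bytearray(256)
    for b in range(256):
        try:
            out = bytes([b]).decode(src).encode(dst)
        except UnicodeError:
            out = fallback
        table[b] = out[0] if len(out) == 1 else fallback[0]
    return bytes(table)


# Rebuilt only when the display character set changes.
TABLE = build_table("cp1251", "koi8_r")

# Translating a word is then one table lookup per byte, with no state.
word_in_dcs = "привет".encode("cp1251")
word_for_hyrules = word_in_dcs.translate(TABLE)
```

 The table costs 256 lookups to build and is reused for every word until the
display character set changes again.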

> >                                                                  As for
> >     Unicode, IMO even at the present state (without modification) libhnj is
> >     suitable for this - simply there will be extra (that can be avoided with
> >     cleverer approach - of using 'int' instead of 'char') states used by UTF
> >     prefixes.
> Again I don't understand.  Are you talking about a specific encoding,
> UTF-8, when you write "Unicode"?  I don't know what kind of "states"
> you mean.

 UTF-8 in that case (since I wrote "UTF prefixes").

 Libhnj builds a finite state machine while reading the hyrules. After they
are read, that state machine is used for hyphenation. The characters of the
word become the input to this state machine. A 'pattern' is associated with
each state; it is 'yielded' and aggregated into the information about how the
given word can be hyphenated.
 Using UTF-8 for non-English languages that don't use Latin letters, such as
Russian, will increase the number of states in that state machine roughly in
proportion to the length in bytes of a UTF-8-encoded Russian letter.
Alternatively, if wide character strings are used, the input to this state
machine will be ints, so libhnj would have to be hacked.
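
 The effect on the number of states can be sketched in Python with a plain
prefix tree. This is a simplification and not libhnj's actual data structure;
the two patterns and the name count_states are made up for illustration.

```python
def count_states(patterns):
    """Count the states (nodes, including the root) of a prefix tree."""
    root = {}
    states = 1
    for pattern in patterns:
        node = root
        for symbol in pattern:  # a character for str, a byte value for bytes
            if symbol not in node:
                node[symbol] = {}
                states += 1
            node = node[symbol]
    return states


patterns = ["при", "пре"]

# Input symbols are whole characters (the 'int' approach).
states_wide = count_states(patterns)

# Input symbols are UTF-8 bytes: every Cyrillic letter takes two bytes, so
# each letter passes through an extra "prefix" state.
states_utf8 = count_states([p.encode("utf-8") for p in patterns])
```

 With these two patterns the byte-oriented tree needs 8 states against 5 for
the character-oriented one.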

> > * IMO we can turn lynx is a powerfull charset translator with a very cheap
> >     hack ( I mean adding something like
> >     'lynx -recode utf-8 koi8-r < in >out')
> >     IMO this worth this.
> Lynx already is a "powerfull charset translator" that one could use
> in place of packages like "recode" etc., although one should expect
> those specific packages to be better (more correct / more general /
> more flexible / more efficient) at the job they were written for.
> Lynx just doesn't have a convenient syntax to invoke it as a filter
> for this (maybe to encourage to use "the right tool for the right
> job").
> But try the appended script.  It will only work right if there is
> no ~/.lynxrc.  (It would probably better to temporarily mess with
> ~/.lynxrc instead of messing with lynx.cfg, and just using -cfg=/dev/null
> for speed.)  Yes it requires bash, won't work with any Bourne-like shell.
>  Klaus
> --------------- -----------------------------------
> #! /bin/bash
> if [ $# -ne 3 -a $# -ne 2 ]; then
>    echo "Usage: $0 cs_in cs_out [file]" >&2
>    exit 1
> fi
> LYNX="${LYNX:-lynx}"
> LYNX_CFG="${LYNX_CFG:-/usr/local/lib/lynx.cfg}"
> file="${3:-/dev/stdin}"
> if [ $# = 3 -a "$file" != "/dev/stdin" ]; then
>    cat "$file" | $0 $1 $2
> else
>    $LYNX -assume_charset="$1" -assume_local_charset="$1" \
>         -cfg <(sed -e "s/^#\?CHARACTER_SET:.*/CHARACTER_SET:$2/" "$LYNX_CFG") \
>         -dump "$file"
> fi

 Thanks for the script.

 Why not provide this script with the lynx distribution (i.e. add a /bin
directory with scripts, etc.)? Some utilities for hyphenation should also be
bundled with lynx, such as one for adding an exception to the set of
hyphenation exceptions (it seems Raph stopped maintaining his libhnj, so it's
not wise to include those utilities with libhnj).
 Or maybe another command-line option can be used to avoid messing with
lynx.cfg or .lynxrc; maybe this would be better in this particular case (I
mean 'recode' behavior).

 I plan to implement the following configuration settings and command-line
options:


# The name of the directory where hyrules files are located, if their names
# are not absolute.

# Each set of files with hyrules can be assigned a tag (a string without
# ':' in it); that tag will be used to refer to it.
# <FILESPEC> specifies the file names which should be concatenated to get the
# required set of hyrules. It has the following grammar:
# i.e. a list of file names (some of them can be non-absolute) separated by '+'.
# CHSET is the chset of the resulting set of hyrules - the name of a chset
# known to lynx. If omitted, iso-8859-1 will be assumed.

# Specifies the conditions for activating the set of hyrules tagged with TAG.
# If TAG is '-', then no hyphenation will be applied.
# LANGSPEC specifies the content language provided by HTTP or <html lang=>
# or <META http-equiv ..>. It has the following grammar:
# i.e. '*' (which matches an unspecified language) or a list of language names
# such as 'en' (defined by RFC 1766).
# <URLSPEC> specifies the URLs to which it applies:
# where <URLSPEC_PATTERN> can be one of the following:
# address@hidden
# @domain_suffix
# where path will be matched from the beginning of the remote path, and
# domain_suffix will be matched from the end of the domain name, excluding the
# port number (e.g. "", "translations/address@hidden").
# This setting will help to avoid collisions between hyrules for languages
# that have common letters used in human words (like German and English).
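
 As a rough sketch of how two of the rules above could behave (in Python, not
lynx code): resolving a <FILESPEC> against the hyrules directory, and matching
a URL by path prefix and domain suffix. The function names, the example file
names and directory, and the choice to ignore the leading '/' of the remote
path are assumptions made only for illustration.

```python
import posixpath
from urllib.parse import urlsplit


def resolve_filespec(filespec, hyrules_dir):
    """Split a '+'-separated file list; prefix non-absolute names with the dir."""
    names = [name for name in filespec.split("+") if name]
    return [
        name if posixpath.isabs(name) else posixpath.join(hyrules_dir, name)
        for name in names
    ]


def url_matches(url, path_prefix, domain_suffix):
    """Match path from its beginning and the domain from its end (no port)."""
    parts = urlsplit(url)
    host = parts.hostname or ""     # hostname excludes the port number
    path = parts.path.lstrip("/")   # assumption: leading '/' is not matched
    return host.endswith(domain_suffix.lower()) and path.startswith(path_prefix)


files = resolve_filespec("en.hyp+/opt/hyrules/ru.hyp", "/usr/local/lib/hyrules")
hit = url_matches("http://www.example.ru:8080/translations/a.html",
                  "translations/", ".ru")
miss = url_matches("http://www.example.com/translations/a.html",
                   "translations/", ".ru")
```

 An empty path prefix matches every path, which corresponds to the ""
example above.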

 Commandline options:

 Best regards,
