
Re: [groff] The hyphenation algorithm produces wrong results

From: Bjarni Ingi Gislason
Subject: Re: [groff] The hyphenation algorithm produces wrong results
Date: Sun, 18 Mar 2018 19:15:11 +0000
User-agent: Mutt/1.5.20 (2009-12-10)

```
On Sun, Mar 04, 2018 at 08:39:44AM +0100, Werner LEMBERG wrote:
>
> > .ll 1n
> > .hy 48
>
> You *must not* use such values if the patterns don't allow it!  From
> groff.texi:
>
I may; that is what testing is about.  And I must, otherwise the
testing is insufficient.

>[...]
>
>   instead of the correct `split-ting'.  US-English patterns as
>   distributed with groff need two characters at the beginning and
>   three characters at the end; this means that address@hidden of
>   should be avoided.
>

The pattern file has patterns of the type '[135]xy.'.  The stated number
(three at the end) could thus include the period (.) at the end of
the pattern.  But it looks to me as if the pattern file was created
with a right margin limit of 2.

Dictionaries list many hyphenation points that split off two letters
at the beginning of a word, and also at the end.

If one wants to use these patterns, hy=4 and hy=8 are not to be used!

>[...]
>
> > The algorithm
> >
> > 1) uses pattern in the wrong places, at the beginning of a word
> >    although no period is in the pattern
>
> You have a too simplistic view how patterns work...
>

Yes, I do.

>[...]
> OK, let's look at the word `splitting', using the `patternize.lua'
> demo program from the padrinoma project
>
>   > texlua patternize.lua -p hyphen.us -l 1 -t 1 -m 1 -v
>   pattern file: hyphen.us (4555 patterns read)
>   spot mins, special characters: 1 1 '-=.'
>

Good to know!  I have installed it, so I can use it and experiment
with it.

>[...]
>
> I'm not sure whether I should classify groff's behaviour of restarting
> the hyphenation process a feature or a bug (I tend to the latter).
> However, I don't have time to work on that.
>

The wrong hyphenations can easily be corrected by creating a pattern
file that bans such cases: a file (or one file for each type) in which
every line matches one of the following regular expressions:

1) .[a-z]4

2) 4[a-z].

> > The cases '16' and '32' (for .hy) may not add hyphenation points,
> > just allow already found ones, if otherwise forbidden.
>
> Nice idea, but impossible to implement without meta-knowledge.  As
> mentioned above, the hyphenation patterns are constructed with certain
> \lefthyphenmin and \righthyphenmin values.  However, those values are
> *not* present in the hyphenation patterns -- you have to know them (I
> consider this a design bug in TeX).  In other words, only the user
> knows that values 16 or 32 are valid for a given language's
> hyphenation patterns or not.
>

Donald E. Knuth hard-coded the "search limits" to 2 and 3 in his TeX
software 30 years ago.  The "hyphen.us" file used in "groff" is simply
too old.  It should contain these "pattern matching limits" so that the
algorithm knows where to begin and where to end.

And there should be more than just one hyphenation file for some
languages, like one without restrictions and one or two with different
restrictions, if that makes sense.  People's preferences differ, and
writers of software should provide choices for their users.

>[...]
>
> The algorithm works as expected, there is nothing to fix.  Barring
> still hidden bugs, the problem *is* fixed.

It works as you expected, but not as I expected.  The algorithm goes
too far, as it does not know what kind of string it is dealing with:

a) one word

b) part of a word.  In that case it tries to find a pattern for that
part, which is already an "invariable string" that is not to be
hyphenated but output unchanged to the line.

>[...]

>It probably doesn't meet
>

Not quite, but currently fixable from the outside with additional
("theoretical") hyphenation patterns.

--
Bjarni I. Gislason

```
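The "ban" patterns proposed in the message above can be illustrated with a
small sketch.  It reads the two regular expressions with a literal period
(the word-boundary marker), generates one pattern per lowercase letter for
each type, and runs them through a minimal Liang-style pattern matcher.  The
function names, the `left_min`/`right_min` parameters, and the toy patterns
are all hypothetical and are not taken from hyphen.us or from groff's source;
groff's own implementation may behave differently.

```python
import re
import string


def parse_pattern(pattern):
    """Split a pattern such as '.a4' into its letters and inter-letter values."""
    letters = re.sub(r"\d", "", pattern)
    values = [0] * (len(letters) + 1)
    position = 0
    for ch in pattern:
        if ch.isdigit():
            values[position] = int(ch)  # value sits before letters[position]
        else:
            position += 1
    return letters, values


def hyphenation_points(word, patterns, left_min=1, right_min=1):
    """Return break positions k, meaning a break between word[k-1] and word[k]."""
    text = "." + word.lower() + "."       # periods mark the word boundaries
    scores = [0] * (len(text) + 1)        # scores[i] is the value before text[i]
    for pattern in patterns:
        letters, values = parse_pattern(pattern)
        start = text.find(letters)
        while start != -1:
            for offset, value in enumerate(values):
                scores[start + offset] = max(scores[start + offset], value)
            start = text.find(letters, start + 1)
    # Odd values allow a break, even values forbid it.
    return [k for k in range(left_min, len(word) - right_min + 1)
            if scores[k + 1] % 2 == 1]


def ban_patterns():
    """Patterns of the forms '.x4' and '4x.' for every lowercase letter x."""
    after_first = ["." + c + "4" for c in string.ascii_lowercase]
    before_last = ["4" + c + "." for c in string.ascii_lowercase]
    return after_first + before_last


# Toy patterns for illustration only (NOT from hyphen.us).
TOY = ["a1b", "b1c", "c1d"]

if __name__ == "__main__":
    print(hyphenation_points("abcd", TOY))
    print(hyphenation_points("abcd", TOY + ban_patterns()))
```

With the toy patterns alone every inner position of "abcd" is a break point;
once the ban patterns are added, the breaks after the first letter and before
the last letter are suppressed, because the even value 4 outranks the odd
value 1 at those positions.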