Re: minor hyphenation issue

From: Werner LEMBERG
Subject: Re: minor hyphenation issue
Date: Wed, 24 May 2017 06:32:52 +0200 (CEST)

> I've no interest in trying to unseat tradition.  What I wondered was
> whether it's practical to create a superset file that, when
> processed to remove non-ASCII lines, generates the historical
> Knuthian pattern file.  This allows unchanged historical
> functionality while not impeding modern relevancy.

In general, this is not possible.  patgen, the program that creates
hyphenation patterns, tends to generate patterns that are very
different from the previous ones as soon as you add a few words.[*]
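
To illustrate why patterns cannot simply be layered on top of an
existing set, here is a minimal sketch of how Liang-style patterns are
applied to a word.  It is not groff's or TeX's implementation, and the
patterns `a1b` and `a2bl` are made up for this example; they are not
taken from the Knuthian file.  The function names and the
`left_min`/`right_min` defaults are assumptions as well.

```python
import re


def parse_pattern(pat):
    """Split a pattern such as 'a2bl' into its letters and gap weights."""
    letters = re.sub(r"\d", "", pat)
    weights = [0] * (len(letters) + 1)
    pos = 0
    for ch in pat:
        if ch.isdigit():
            weights[pos] = int(ch)  # weight of the gap before letters[pos]
        else:
            pos += 1
    return letters, weights


def hyphenate(word, patterns, left_min=2, right_min=3):
    """Apply every pattern; the highest weight per gap wins, odd means break."""
    padded = "." + word.lower() + "."
    scores = [0] * (len(padded) + 1)  # scores[j] is the gap before padded[j]
    for pat in patterns:
        letters, weights = parse_pattern(pat)
        start = padded.find(letters)
        while start != -1:
            for i, w in enumerate(weights):
                scores[start + i] = max(scores[start + i], w)
            start = padded.find(letters, start + 1)
    pieces, last = [], 0
    for i in range(left_min, len(word) - right_min + 1):
        if scores[i + 1] % 2 == 1:  # +1 skips the leading '.'
            pieces.append(word[last:i])
            last = i
    pieces.append(word[last:])
    return "-".join(pieces)


# Hypothetical toy patterns, for illustration only.
base = ["a1b"]
extended = base + ["a2bl"]  # added by hand to suppress the break in "table"

for w in ("table", "stable", "cabin"):
    print(w, hyphenate(w, base), hyphenate(w, extended))
```

With the toy set, the hand-added pattern removes the break in "table"
as intended, but it also silently removes the break in "stable", a
word nobody was looking at.  Real pattern sets contain thousands of
such interacting entries.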

>> 1) Gerard Kuiken created, years ago, a huge set of additional patterns
>> for US English.  [...]
> Does it make sense for groff to use a pattern list that can be
> updated as needed, rather than one frozen by tradition?  Is the one
> cited above a good choice?

It's easy to replace the Knuthian patterns with other US-English
patterns.  And yes, it seems a good idea to use Gerard's patterns.

> On my system,
> texmf-dist/tex/generic/hyph-utf8/patterns/txt/hyph-en-us.pat.txt
> contains only ASCII, while many other files in this directory have
> UTF-8 characters.  This implies to me that there's no technical
> limitation to adding non-ASCII patterns to hyph-en-us.pat.txt -- is
> that accurate?

Yes.  But as mentioned above, you usually have to regenerate the
patterns completely, since adding patterns manually is a black art and
can have unwanted side effects for other words.[*] It's *far* easier
to add another hyphenation exception file that holds words with
non-ASCII characters – I assume there aren't that many, right?
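
The exception-file approach can be sketched as a lookup that is
consulted before the patterns are tried at all.  This is only an
illustration under assumed names: `load_exceptions`,
`hyphenate_with_exceptions`, the one-word-per-line format with `%`
comments, and the sample entries (including their break points) are
all hypothetical and do not reproduce groff's actual file format.

```python
def load_exceptions(lines):
    """Map each unhyphenated word to its explicitly hyphenated form."""
    table = {}
    for line in lines:
        entry = line.strip().lower()
        if not entry or entry.startswith("%"):  # skip blanks and comments
            continue
        table[entry.replace("-", "")] = entry
    return table


def hyphenate_with_exceptions(word, exceptions, fallback):
    """Use the exception entry if there is one, else the pattern-based fallback."""
    key = word.lower()
    if key in exceptions:
        return exceptions[key]
    return fallback(word)


# Hypothetical exception entries with non-ASCII letters.
sample = [
    "% illustrative entries only",
    "na-ïve",
    "fa-çade",
]
exceptions = load_exceptions(sample)

# Stand-in for the pattern-based hyphenator: returns the word unbroken.
no_breaks = lambda w: w

print(hyphenate_with_exceptions("naïve", exceptions, no_breaks))
print(hyphenate_with_exceptions("table", exceptions, no_breaks))
```

Since the exception list is checked first and touches nothing else,
adding a word to it cannot change how any other word is hyphenated,
which is what makes it safer than editing the patterns by hand.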


[*] I speak from experience with the German hyphenation patterns, which
    are generated from a list of words with about 465000 entries.  It
    really surprises me that no similar effort exists
    (i.e., basing the patterns on a known word list) for US English.

