groff
[Top][All Lists]

RE: [Groff] TeX's 'line breaking algorithm'

 From: Ted Harding Subject: RE: [Groff] TeX's 'line breaking algorithm' Date: Sat, 10 Mar 2001 12:23:40 -0000 (GMT)

```On 10-Mar-01 Thomas Baruchel wrote:
> On Fri, Mar 09, 2001 at 01:17:34PM +0100, Werner LEMBERG wrote:
>> > Huh?  When groff was first announced one of the big things was the
>> > use of TeX's 'line breaking algorithm', which is by definition a
>> > paragraph algorithm.
>>
>> This is definitely not used.
>
> What is this algorithm ?

The TeX line-breaking algorithm (which really is paragraph-based,
in that the input to the algorithm is the entire text of a
paragraph) is described in detail in Chapter 14 of Knuth's
"The TeXbook". And the best of luck.

One element of line-breaking is where to put the hyphen when
breaking a word at the end of a line. Especially in English,
this can be very difficult to decide algorithmically without
generating ribald laughter (Knuth's example is "the-rapists
who pre-ached on wee-knights").

TeX's hyphenation algorithm is based on a language-specific
hyphenation table, consisting of 1-letter, 2-letter, 3-letter,...
strings interspersed with integers which indicate the acceptability
of hyphenation at the integer's position, e.g.

% Hyphenation patterns for US English.
% These are the standard Plain TeX hyphenation patterns (in hyphen.tex).
.ach4
.af1t
.al3t
.am5at
.an5c
.ang4
.ani5m

A word is broken down into single-letter tokens, which are then
aggregated into longer strings, with interspersed acceptabilities
computed on the basis of the table. When the full word has been
reconstituted, the result will contain indicators of acceptable
hyphenation points. E.g. "concatenation" generates

0c0o2n1c0a0t0e1n2a1t2i0o0n0

and hyphenation is acceptable at odd integers: con-cate-na-tion.

This algorithm is explained in Appendix H of "The TeXbook" and
is, believe it or not, easier to follow than the line-breaking
algorithm.

Because even something as fancy as this will give bad results
for some specific words, there is a table of exceptions which
is consulted before embarking on the above.

It is the TeX hyphenation algorithm, not the TeX line-breaking
algorithm, which is used in groff (as Werner pointed out).
Groff's hyphenation files (e.g. "hyphen.us" from which the
above was copied) are basically a simplified form of the
corresponding TeX hyphenation file, and a groff file can be derived
from a TeX file.

Groff handles the hyphenation exceptions by means of the
".hw" request (basically the same as in classical UNIX troff):
if you wanted to force "con-cate-na-tion" to hyphenate differently
you could enter, for instance,

.hy con-ca-te-na-tion

After saying all that, I have not directly checked whether the
computational algorithm used in groff is identical to the one
described in Knuth's Appendix H.

The reason groff does not use anything similar to TeX's line-breaking
algorithm is that groff does not embody anything resembling the
concept of a paragraph. Gtroff formats text line by line, using only
the current line length and indent, and the current font metrics, along
with the results from the hyphenation algorithm, and this -- at this
strategic level -- is exactly how UNIX troff also always did it.

The basic algorithm is: Given the printable width (line length
minus indent), pull in words until you first have more than will fit.
Then look at the last word pulled in, and try to hyphenate it.
If it will hyphenate so that the first part fits, output the formatted
line including this part-word, and the remaining part-word is carried
over to the next line. Otherwise the whole non-fitting word is
carried over, and the line (without this word) is formatted to fit
and is output.

The requests in the macro packages which the user _thinks_of_ as
paragraphs do not really embody the paragraph concept either.

What such a request really does is to define a temporary environment
of page offset, line length and indent, and possibly other things,
with possible exceptions for the first line, so that gtroff can
use these in the algorithm just described; and usually a diversion
is set up to receive the line-by-line formatted text subsequently
output by gtroff. When the macro is closed (usually by invoking a
new "paragraph" macro), this diversion is output to the printed
page according to the closing sequence for the macro.

This is why gtroff does not use the TeX line-breaking algorithm.
Knuth says: "TeX chooses breakpoints in an interesting way that
considers each paragraph in its entirety; the closing words of
a paragraph can actually influence the appearance of the first
line." On the other hand, once troff has formatted a line
that's it; apart from the word or part-word which is carried
over to the next line, troff performs no look-ahead, and the
formatted line is immediately output and is not available
for re-formatting.

On the other hand, I believe it is not quite impossible to
implement groff macros which would be capable of much greater
look-ahead: this could be useful for widow-and-orphan control,
and for "vertical filling" (expanding inter-line or inter-paragraph
spacing so that suppressing Ws & Os did not leave blank lines),
and even for improving paragraph layout.

However, there is no direct mechanism in [g]troff for saving
_un_formatted input for re-cycling: the contents of a diversion
are lines which have already been formatted, and a diversion
is really a form of delayed output of formatted text.

The only way I can think of would involve writing out the input
for a paragraph (or page or other unit) to a temporary file
(using the .open/.opena .write .close and .so requests), so
that the temporary file could be repeatedly re-read as input,
under varying troff environments (including track-kerning, changing
minimum inter-word space, ... ), until the formatted results were
satisfactory.

Nevertheless, writing a macro on these lines which could even begin
to approach the sophistication of TeX's approach would be a