[Groff] hyphenation problems


From: Werner LEMBERG
Subject: [Groff] hyphenation problems
Date: Sun, 04 Feb 2001 20:34:05 +0100 (CET)

Dear friends,


more than a year ago I began to maintain groff, and the following
problem was the very reason why I did so:

> None of this avoids the central issue, which is: why does groff
> suppress hyphenation break at "-" in the context of words with
> escapes like "\ " in them?

To be more specific, consider the following German input file `foo':

  Eingabe-Kodepunkt\ 0xABCD
  Eingabe-Kodepunkt\ 0xABCD
  Eingabe-Kodepunkt\ 0xABCD
  Eingabe-Kodepunkt\ 0xABCD
  Eingabe-Kodepunkt\ 0xABCD

If you say `groff -Tlatin1 foo', you get this:

  Eingabe-Kodepunkt 0xABCD     Eingabe-Kodepunkt 0xABCD    Eingabe-
  Kodepunkt 0xABCD                         Eingabe-Kodepunkt 0xABCD
  Eingabe-Kodepunkt 0xABCD

As you can see, the fourth word (in the second line) isn't hyphenated,
whereas the third word is.  This is definitely a bug, which I believe
is fixed now -- the current groff snapshot produces

  Eingabe-Kodepunkt 0xABCD     Eingabe-Kodepunkt 0xABCD    Eingabe-
  Kodepunkt 0xABCD        Eingabe-Kodepunkt 0xABCD         Eingabe-
  Kodepunkt 0xABCD

(Ruslan, this also fixes the hyphenation problem with boxes you've
encountered).

Nevertheless, the applied changes might have side effects, so I urge
you to test the snapshot rigorously with large volumes of text and to
check whether hyphenation has changed unexpectedly.
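
For a quick check of the example above, the input file can be generated
and formatted with a small script.  The following Python sketch is only
a convenience and is not part of groff; the helper names are made up
for this example, and the only groff invocation used is the
`groff -Tlatin1 foo' command shown earlier.  No expected output is
given, since it depends on the groff version installed.

```python
import shutil
import subprocess
import tempfile
from pathlib import Path

# One input line of the example; the raw string keeps the `\ ' escape intact.
LINE = r"Eingabe-Kodepunkt\ 0xABCD"


def write_test_input(directory, copies=5):
    """Write the example input file `foo' into `directory' and return its path."""
    path = Path(directory) / "foo"
    path.write_text((LINE + "\n") * copies, encoding="ascii")
    return path


def format_with_groff(path):
    """Run `groff -Tlatin1 <path>' and return its output, or None if groff is missing."""
    if shutil.which("groff") is None:
        return None
    result = subprocess.run(
        ["groff", "-Tlatin1", str(path)],
        capture_output=True,
        encoding="latin-1",  # the latin1 device may emit non-ASCII bytes
        check=False,
    )
    return result.stdout


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        output = format_with_groff(write_test_input(tmp))
        print(output if output is not None else "groff not found on PATH")
```

Comparing the output of an old and a new groff binary on the same
input shows whether the line breaks have changed.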


It is probably of interest to know exactly when and how GNU troff
hyphenates a line, so here are the rules (this will eventually go into
groff.texinfo).  This will also help you to identify hyphenation
problems.

======================================================================

GNU troff calls the routine environment::possibly_break_line() in the
following cases:

  1. If a space is encountered.

  2. If a newline is encountered (not preceded by `\c').

  3. If a `br' request has been seen.

  4. If a token node is found in the input stream.

     a. This happens for the following objects: "\ ", \:, \|, \^, \?,
        \0, "\,", \a, \b, \d, \D, \h, \l, \L, \o, \r, \t, \u, \v, \x,
        \X, \Y, \z, and \Z.

     b. A diversion or box is inserted into the text.  Usually, all
        input in diversions and boxes has already been converted to
        nodes.  The reality is a bit more complicated, since there are
        ways to avoid or undo this conversion (e.g. using `\!' or
        `.asciify').

possibly_break_line() does nothing if not in fill mode, or if a tab or
field is active, or if inside of a `dummy' environment (e.g. within
.if "..."..." or \w'...').

possibly_break_line() will call environment::hyphenate_line() in the
following cases:

  5. `\p' is found in the input stream in cases 1. and 2.

  6. If the total length of the nodes processed so far minus the width
     of the last node is larger than the text length.  This is the
     normal situation at the end of a line.
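
The condition in case 6 can be written down as a small check.  The
following Python sketch is only an illustration, not groff's code: the
function name and the idea of passing node widths as a plain list are
assumptions made for this example.

```python
def line_is_overfull(node_widths, text_length):
    """Return True if the nodes collected so far no longer fit the line.

    Mirrors case 6: the total width of all nodes minus the width of the
    last node is compared against the available text length.
    """
    if not node_widths:
        return False
    return sum(node_widths) - node_widths[-1] > text_length


# Hypothetical widths in arbitrary units.
print(line_is_overfull([30, 30, 10], 65))      # 60 > 65 does not hold
print(line_is_overfull([30, 30, 10, 5], 65))   # 70 > 65 holds
```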

hyphenate_line() will do the following:

  7. It searches backwards from the current position for a boundary
     node used as a starting point.  Most of the escape sequences
     listed in 4.a are considered boundary nodes, together with
     other horizontal and vertical space nodes.

  8. It continues searching backwards for usable nodes until it finds
     another boundary, checking for a leading `\%'.

  9. If no leading `\%' has been encountered, hyphenation codes are
     adjusted if necessary so that the nodes can be added to the
     breakpoint list.

 10. If no leading `\%' has been encountered and the hyphenation flags
     fit, the hyphenation algorithm is applied to the found sequence
     of nodes, building a breakpoint list.

 11. The nodes are chained into a new list, and discretionary hyphens
     are inserted according to the breakpoint list, or where the
     current character within the scanned node sequence is a
     hyphenation character (usually `-') or `\%'.

Finally, possibly_break_line() will call
environment::choose_breakpoints() to find the best breakpoint from the
new list, subject to constraints such as the hyphenation flags or the
hyphenation margin.
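
Step 11 can be pictured with a toy function that collects the
positions at which a break is allowed.  This is a simplified Python
sketch, not the troff implementation: it works on a plain string
instead of a node list, the function name and parameters are
hypothetical, the breakpoint list from step 10 is simply passed in,
and `\%' handling is left out.

```python
def discretionary_breaks(text, breakpoints=(), hyphenation_chars="-"):
    """Return sorted positions in `text` after which a line break is allowed.

    `breakpoints` are positions supplied by a hyphenation algorithm
    (step 10); here they are simply passed in by the caller.  A position
    p means "a break is allowed between text[:p] and text[p:]".
    """
    positions = set(breakpoints)
    for index, char in enumerate(text):
        # A break is allowed directly after a hyphenation character,
        # but not at the very end of the sequence.
        if char in hyphenation_chars and index + 1 < len(text):
            positions.add(index + 1)
    return sorted(positions)


print(discretionary_breaks("Eingabe-Kodepunkt"))  # one break, after "Eingabe-"
print(discretionary_breaks("0xABCD"))             # no break position at all
```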

======================================================================

These are the old rules.  The bug which I've fixed is in rule 7.

Let's look again at the above example: Here I've marked the points
where GNU troff searches for breakpoints:

  Eingabe-Kodepunkt\ 0xABCD
                   ^       ^

The third word is hyphenated because troff calls possibly_break_line()
at `\ ' which happens to be a boundary character.  The space before
the word is another boundary character, so the sequence
`Eingabe-Kodepunkt' is checked for breakpoints.

The fourth word isn't hyphenated because troff calls
possibly_break_line() after the space which follows the word, and the
previous boundary character is `\ '.  Thus only the sequence `0xABCD'
is scanned for breakpoints, which of course fails.

To fix this, I've changed rule 7 as follows:

  7. It searches backwards from the current position for a boundary
     node used as a starting point.  If hyphenate_line() has been
     called via 4.a or 4.b, the current node is used as the starting
     point instead.  Only some of the escape sequences listed in 4.a
     (usually causing vertical movement) and horizontal space nodes
     are taken as boundary nodes.

In the above example, `\ ' no longer counts as a boundary, which
gives the improved result.
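
The difference between the old and the new rule 7 can be replayed on
the example with a toy model.  This Python sketch is an illustration
only and does not reflect troff's real data structures: the node
kinds (`space', `escaped_space', `word'), the function names, and the
reduction of "can be hyphenated" to "contains a `-'" are all
assumptions made for this example.

```python
OLD_BOUNDARIES = {"space", "escaped_space"}  # old rule 7: `\ ' is a boundary
NEW_BOUNDARIES = {"space"}                   # new rule 7: `\ ' is not


def scanned_sequence(nodes, pos, boundaries, start_at_current=False):
    """Return the nodes that would be checked for breakpoints.

    `pos` is the node at which possibly_break_line() was triggered.
    With start_at_current=True (new rule, call via 4.a or 4.b) the
    current node itself is the starting point; otherwise we search
    backwards for a boundary node first.
    """
    end = pos
    if not start_at_current:
        while end >= 0 and nodes[end][0] not in boundaries:
            end -= 1
    if end < 0:
        return []
    start = end - 1
    while start >= 0 and nodes[start][0] not in boundaries:
        start -= 1
    return nodes[start + 1:end]


def can_break_at_hyphen(sequence):
    """Toy stand-in for hyphenation: is there a `-' inside a word node?"""
    return any(kind == "word" and "-" in text for kind, text in sequence)


# One copy of the example input, surrounded by ordinary spaces.
NODES = [
    ("space", " "),
    ("word", "Eingabe-Kodepunkt"),
    ("escaped_space", "\\ "),
    ("word", "0xABCD"),
    ("space", " "),
]

# Triggered at the trailing space (index 4), as for the fourth word.
old = scanned_sequence(NODES, 4, OLD_BOUNDARIES)
new = scanned_sequence(NODES, 4, NEW_BOUNDARIES)
print([text for _, text in old], can_break_at_hyphen(old))
print([text for _, text in new], can_break_at_hyphen(new))
```

Under the old boundaries only `0xABCD' is scanned, so no break is
found; under the new boundaries the scan reaches back over `\ ' to
`Eingabe-Kodepunkt', where the `-' provides a breakpoint.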


    Werner
