Re: [groff] Regularize (sub)section cross references.

From: G. Branden Robinson
Subject: Re: [groff] Regularize (sub)section cross references.
Date: Mon, 17 Dec 2018 13:19:38 -0500
User-agent: NeoMutt/20180716

At 2018-12-17T18:49:31+0100, Tadziu Hoffmann wrote:
> > A 1-to-2 character mapping of course is beyond the ability of .tr.
> You can define a special character:
>   .char \(SS SS
>   .tr mMaAß\(SS
>   Maß
> results in
>   MASS

Nice!  "Typesetting assembly language" once again proves its worth as a
description of Ossanna's roff.

Okay, so maybe this _could_ be rolled out without prerequisites in

My Debian buster system has almost 1600 non-English man pages:

$ find /usr/share/man/!(man*) -type f -and -not -type l|wc -l

Fortunately Chinese, Japanese, and Korean have no case distinctions,
leaving only about 1400 pages:

$ find /usr/share/man/!(man*|ja|ko|zh_*) -type f -and -not -type l|wc -l

and fortunately, _none_ of these use the section-name-on-the-next-line

$ zgrep '^\.[[:space:]]*SH$' $(find /usr/share/man/!(man*) -type f \
        -and -not -type l) || echo NONE

So what do we see in the section headings we _do_ have?  Suppressing
filenames, stripping double-quotes, and just counting occurrences, I
found 1,241 distinct section titles.

$ zgrep -h '^\.[[:space:]]*SH' \
        $(find /usr/share/man/!(man*|ja|ko|zh_*) -type f \
                -and -not -type l) \
        | sed 's/"//g' | sort | uniq -c | wc -l

Interestingly, some are already mixed-case.  This could be true of
English pages as well but that's not the goal I'm chasing right now.

The next step would be to bust these down character-by-character, but
that is slightly frustrated by the fact that some people enter their
non-ASCII codepoints as-is and others use (more portable and "correct")
character escapes to obtain them.  Collating and counting these to find
a minimal set of characters to feed .tr requests is going to take a bit
more work.
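
A rough starting point for that tally, offered as a sketch rather than a
finished tool: treat each groff special-character escape of the form
\(xx or \[name] as a single token, so that escaped and literal spellings
are at least counted separately and visibly.  It does not yet unify the
two spellings.  The function name count_heading_chars is hypothetical; it
expects .SH lines on standard input and assumes a grep that supports -o,
run in a UTF-8 locale so that literal non-ASCII characters are not split
into bytes.

```shell
# Hypothetical helper: tally the characters (and groff character
# escapes) that appear in .SH lines read from standard input.
count_heading_chars() {
    # Drop the leading ".SH" request and any double quotes, then emit one
    # token per line: a \(xx escape, a \[name] escape, or a single
    # non-blank character.  Finally count and rank the tokens.
    sed -e 's/^\.[[:space:]]*SH[[:space:]]*//' -e 's/"//g' \
        | grep -oE '\\\(..|\\\[[^]]*\]|[^[:space:]]' \
        | sort | uniq -c | sort -nr
}
```

Fed the output of the zgrep -h command above, this lists each character
or escape with its frequency, most common first.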

The file of non-English section headings is attached for the curious.  I
added a sort -nr to the above pipeline and removed the wc -l, of course.
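
For reference, the tail of that modified pipeline can be sketched as a
small function.  The name rank_headings is hypothetical; it reads the
output of the zgrep -h command shown earlier from standard input.

```shell
# Hypothetical helper: strip double quotes from .SH lines on standard
# input, count identical headings, and list the most frequent first.
rank_headings() {
    sed 's/"//g' | sort | uniq -c | sort -nr
}
```

Each output line is a count followed by the heading line it belongs to.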


Attachment: section_headers.txt
Description: Text document

Attachment: signature.asc
Description: PGP signature
