[Groff] Groff, Grohtml and Encodings
From: Anton Shepelev
Subject: [Groff] Groff, Grohtml and Encodings
Date: Thu, 14 Oct 2010 19:59:09 +0400
Hello all,
I thought I had solved all my encoding problems until I
tried to export my documents to HTML.
It seems that my understanding of how groff maps
input characters to its internal characters and then
to output glyphs is incomplete. Below I
describe what I did and what results I got.
I have a KOI8-R encoded file that contains the following
letters, given here in hex notation:
F0, C5, D2, D7, D9, CA
I am using the koi8-r.tmac file, which maps these
letters as follows:
----------------------------------
Char hex   Char dec   Mapped char
----------------------------------
F0         240        \[u041F]
C5         197        \[u0435]
D2         210        \[u0440]
D7         215        \[u0432]
D9         217        \[u044B]
CA         202        \[u0439]
----------------------------------
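The table can be cross-checked outside groff. The following is a minimal sketch in Python, assuming that Python's built-in koi8_r codec agrees with koi8-r.tmac for these six bytes; the names RAW and koi8r_table are made up for illustration and have nothing to do with groff itself.

```python
# Decode six KOI8-R bytes and list the Unicode code point of each,
# written in groff's \[uXXXX] notation.
# Assumption: Python's "koi8_r" codec matches koi8-r.tmac for these bytes.
RAW = bytes([0xF0, 0xC5, 0xD2, 0xD7, 0xD9, 0xCA])


def koi8r_table(raw):
    """Return (hex, decimal, groff-style name) for every input byte."""
    rows = []
    for byte in raw:
        char = bytes([byte]).decode("koi8_r")
        rows.append((f"{byte:02X}", byte, f"\\[u{ord(char):04X}]"))
    return rows


for row in koi8r_table(RAW):
    print(*row)
```

Each printed row should match the corresponding row of the table above.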
The values in the third column match the Unicode
code points of the corresponding Russian
letters. When I process this file with the following
MS-DOS batch script
type %1 | groff -mkoi8-r -t -Thtml > %2
groff outputs six warning messages (one per character)
of the form:
stdin:1: warning: can't find special character '<SYMBOL>'
where <SYMBOL> takes, in order, the following
values:
u041F, u0435, u0440,
u0432, u044B, u0438_0306,
which is exactly what the corresponding input characters
map to, except for the last one, which turned
into a composite code for a reason unknown to me.
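One observation that may bear on this: u0438_0306 has the same code points as the Unicode canonical decomposition (NFD) of U+0439, namely U+0438 followed by the combining breve U+0306. The sketch below only demonstrates that Unicode fact with Python's unicodedata module. That groff performs such a decomposition internally is an assumption here, and decomposed_name is a hypothetical helper, not a groff function.

```python
import unicodedata


def decomposed_name(char):
    """Build a uXXXX_XXXX style name from the NFD decomposition of char.

    Hypothetical helper: it mimics the shape of the names seen in the
    warnings and is not taken from groff.
    """
    decomposed = unicodedata.normalize("NFD", char)
    return "u" + "_".join(f"{ord(c):04X}" for c in decomposed)


print(decomposed_name("\u0439"))  # u0438_0306
print(decomposed_name("\u041F"))  # u041F (no decomposition exists)
```

Of the six letters, only U+0439 has a canonical decomposition, which would be consistent with it being the only name that changed.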
The resulting HTML file looks correct and contains
the following:
<p>&#1055;&#1077;&#1088;&#1074;&#1099;&#1081;</p>
These decimal values correspond to the values of the
internal characters in the table above.
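The correspondence between the table's code points and decimal values can be checked with a short sketch. It assumes the HTML output uses decimal numeric character references of the form &#N;, and to_decimal_refs is a name invented for this illustration.

```python
# Code points from the third column of the table above.
CODE_POINTS = [0x041F, 0x0435, 0x0440, 0x0432, 0x044B, 0x0439]


def to_decimal_refs(code_points):
    """Render code points as decimal HTML numeric character references."""
    return "".join(f"&#{cp};" for cp in code_points)


print(to_decimal_refs(CODE_POINTS))
# &#1055;&#1077;&#1088;&#1074;&#1099;&#1081;
```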
The -mkoi8-r option does take effect: I verified this
by removing it and comparing the output.
Here's what I do not understand and would appreciate
your help with:
1. I tried to define glyphs for the characters
reported in the warnings above, in
the ...\font\devhtml\r file like this:
u041F 24 0 0x041F
but this affected neither the output nor
the warning messages. Aren't these warnings
about missing glyphs in the font file? If they
are, then why didn't defining the glyphs
for those characters work?
2. Why did the last warning mention the composite
character u0438_0306 instead of the original
u0439, to which it is mapped by the
koi8-r.tmac file?
3. I saw the line "unicode" in the
...\font\devhtml\desc file, but the description
of the DESC format does not mention the
possibility of such a line. What does it do?
4. How do I set up groff to accept KOI8-R encoded
files and output HTML pages
a. in the same encoding,
b. in the UTF-8 encoding?
Thank you in advance,
Anton