[Texmacs-dev] Importing ASCII grave from HTML
Fri, 25 Jul 2003 11:18:56 +0200
There is currently a problem importing HTML pages using the ASCII
GRAVE ACCENT character.
Since that problem is really intricate and has _many_ ramifications, I
wrote the whole history down (to help me in my quest for the Right
Thing). Here is what I found. Please share your thoughts and comments.
For easy reference, meaningful paragraphs in this message are
separated into _facts_, ordered with numerals, and _conclusions_, ordered
with letters.
1. In Unicode this is U+0060. The comments read:
* this is a spacing character
--> 02CB modifier letter grave accent
--> 0300 combining grave accent
--> 2035 reversed prime
So this character is intended to mean "space with grave accent".
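The Unicode data quoted in fact 1 can be checked with the unicodedata module from the Python standard library. This is only an illustrative sketch for the reader, not something TeXmacs does; the helper name `describe` is made up for this example.

```python
import unicodedata

# U+0060 and the three characters its Unicode entry cross-references.
CODE_POINTS = [0x0060, 0x02CB, 0x0300, 0x2035]

def describe(code_point):
    """Return (Unicode name, general category) for a code point."""
    ch = chr(code_point)
    return unicodedata.name(ch), unicodedata.category(ch)

for cp in CODE_POINTS:
    name, category = describe(cp)
    print(f"U+{cp:04X}  {category}  {name}")
```

U+0060 is reported as a modifier symbol (category Sk), not as quotation punctuation, which agrees with the reading given above.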
2. However, in many pages (notably, including the www.gnu.org pages)
this character is used as LEFT SINGLE QUOTATION MARK (U+2018), and
two successive occurrences of this character signify LEFT DOUBLE
QUOTATION MARK (U+201C).
3. We can also note that TeXmacs has the semantic "twice U+2018 is
U+201C" implemented as a ligature. It is also possible to operate
on U+201C as a single character since it is mapped to code #10 in
the Cork encoding.
3b. However, this ligature is only supported in some fonts. For
example, it does not seem to work in Times.
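The convention of fact 2 and the ligature semantic of fact 3 can be modelled on plain strings as follows. This is a minimal sketch under the assumption that both steps are simple substitutions; the function names are hypothetical, and the real TeXmacs ligature is a font-level mechanism, not a string rewrite.

```python
GRAVE = "\u0060"        # GRAVE ACCENT
LEFT_SINGLE = "\u2018"  # LEFT SINGLE QUOTATION MARK
LEFT_DOUBLE = "\u201C"  # LEFT DOUBLE QUOTATION MARK

def grave_as_quote(text):
    """Fact 2: read GRAVE ACCENT as LEFT SINGLE QUOTATION MARK."""
    return text.replace(GRAVE, LEFT_SINGLE)

def apply_ligature(text):
    """Fact 3: two successive U+2018 stand for a single U+201C."""
    return text.replace(LEFT_SINGLE * 2, LEFT_DOUBLE)

sample = "``GNU''"
print(apply_ligature(grave_as_quote(sample)))
```

Only the opening quotes are affected here; the two trailing apostrophes are left exactly as they were in the input.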
A. So, one could say: okay, let us import GRAVE ACCENT in
non-preformatted text as LEFT SINGLE QUOTATION MARK and enjoy the
nice ligature logic of TeXmacs. This was actually the initially
However this solution has some problems:
4. The GRAVE ACCENT character is #00 in Cork, and TeXmacs has many
bugs when handling strings with NULL chars in them. I fixed all
those bugs in the GUILE glue (otherwise the GNU page would not be
displayed...), but I just found there are still problems with the
Copy command. And probably more will show up in the future.
B. Clearly those bugs need to be fixed.
5. In Cork, LEFT SINGLE QUOTATION MARK is #60 (that is the code of
GRAVE ACCENT in ASCII). So up to now, TeXmacs always behaved as if
they were the same character when using cut-and-paste.
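The code positions named in facts 3 to 5 can be laid out side by side to show both the collision with ASCII and why code #00 is awkward. This is a sketch only: the table holds just the three positions quoted in this message, and `c_string_view` is a hypothetical helper that assumes the bugs come from NUL-terminated string handling, which is one plausible cause and not a confirmed diagnosis.

```python
# Cork code positions as stated in facts 3, 4 and 5 (hexadecimal).
CORK = {
    "GRAVE ACCENT": 0x00,
    "LEFT DOUBLE QUOTATION MARK": 0x10,
    "LEFT SINGLE QUOTATION MARK": 0x60,
}
ASCII_GRAVE = 0x60  # GRAVE ACCENT in ASCII

def c_string_view(data: bytes) -> bytes:
    """Mimic a NUL-terminated C string: stop at the first 0x00 byte."""
    return data.split(b"\x00", 1)[0]

# The same byte value means different characters in the two encodings.
print(CORK["LEFT SINGLE QUOTATION MARK"] == ASCII_GRAVE)

# Cork-encoded "a", GRAVE ACCENT, "b": a NUL-terminated reader sees only "a".
cork_text = bytes([ord("a"), CORK["GRAVE ACCENT"], ord("b")])
print(c_string_view(cork_text))
```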
6. To further aggravate the confusion, the GRAVE ACCENT character is
commonly referred to as BACKQUOTE (that is actually LEFT SINGLE
QUOTATION MARK) in programming language parlance. Quasiquoting in
Scheme and command substitution in shell are often called
7. The Cork #00 (GRAVE ACCENT) character does not render at all in
many fonts (notably the Adobe collection).
8. The ligature mechanism appears to be disabled in non-proportional
fonts. HTML <PRE> elements ought to be displayed in a
non-proportional font.
C. So, on second thought, we would need to convert GRAVE ACCENT to
LEFT SINGLE QUOTATION MARK in all visible text.
9. I know of no reasonable use of the GRAVE ACCENT character in
human-readable text. The "character chart" corner case may be
neglected in the case of HTML import. In all other cases, LEFT
SINGLE QUOTATION MARK is the intended character.
D. In round trip HTML conversion (import to TeXmacs, then export to
HTML again) it might seem acceptable to replace all occurrences of
GRAVE ACCENT by LEFT SINGLE QUOTATION MARK.
10. It may not be acceptable to convert GRAVE ACCENT to LEFT SINGLE
QUOTATION MARK in HTML attribute values, since that may break
references (in the admittedly pathological case of a URI or ID
attribute containing a GRAVE ACCENT character).
11. It is not acceptable to modify the corktounicode.scm conversion
table because that would affect import as well as export (we only
have a problem with import) for all Unicode based document formats
(we only have a problem with HTML). In addition, we would not be
able to claim "unicode compliance" with a clear conscience.
E. A solution would be to add a string substitution for all
element content string nodes in the HTML->TM converter which
replaces GRAVE ACCENT by LEFT SINGLE QUOTATION MARK. Attribute
content (and all other special text if we ever come to use it)
would be converted using the standard-compliant utf8->cork
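The split proposed in conclusion E, rewriting element content while leaving attribute values alone, can be sketched with the html.parser module from the Python standard library. This is an illustration under stated assumptions, not the TeXmacs HTML->TM converter: the class `GraveRewriter` and the function `convert` are hypothetical names, and the Cork conversion step is left out.

```python
from html.parser import HTMLParser

GRAVE = "\u0060"        # GRAVE ACCENT
LEFT_SINGLE = "\u2018"  # LEFT SINGLE QUOTATION MARK

class GraveRewriter(HTMLParser):
    """Collect text nodes and attribute values, rewriting only the former."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.text_nodes = []
        self.attributes = []

    def handle_starttag(self, tag, attrs):
        # Attribute values are kept verbatim so that URIs and IDs survive.
        for name, value in attrs:
            self.attributes.append((tag, name, value))

    def handle_data(self, data):
        # Element content: GRAVE ACCENT becomes LEFT SINGLE QUOTATION MARK.
        self.text_nodes.append(data.replace(GRAVE, LEFT_SINGLE))

def convert(html):
    """Return (rewritten text nodes, untouched attribute triples)."""
    parser = GraveRewriter()
    parser.feed(html)
    parser.close()
    return parser.text_nodes, parser.attributes

text, attrs = convert('<a href="x`y">``GNU\'\'</a>')
print("".join(text))
print(attrs)
```

In this run the GRAVE ACCENT inside the href value is preserved, which is the pathological case of fact 10, while the two in the link text are replaced.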
12. The status of U+0027 (APOSTROPHE-QUOTE) is not very clear
either. Unicode clearly states that this character has a
"mixed usage", and that U+2019 is preferred for matching quotes as
well as for the apostrophe. Yet ASCII #27 is used to mean "single
quote" as well as "apostrophe".
13. LEFT SINGLE QUOTATION MARK has a glyph variant named SINGLE
HIGH-REVERSED-9 QUOTATION MARK (U+201B) and LEFT DOUBLE QUOTATION MARK
has a glyph variant DOUBLE HIGH-REVERSED-9 QUOTATION MARK.
Actually, that is a bit weird because Unicode's clearly stated purpose
is to be an encoding for characters, not glyphs. It just happens
that those alternate glyphs more closely match what is displayed by
F. Let us just not care and do as if the DOUBLE HIGH-REVERSED-9 do
14. Gecko and KHTML display GRAVE ACCENT, LEFT SINGLE QUOTATION MARK,
RIGHT SINGLE QUOTATION MARK, etc. in a similar way as TeXmacs.
15. When copy-pasting from Konqueror or Mozilla, non-ASCII
punctuation marks are not converted to the nearest ASCII punctuation.
As a consequence, copy-pasting will generally only work correctly
when using GRAVE ACCENT and APOSTROPHE-QUOTE. This is especially
important for code snippets.
G. Considering how other software handles those quoting punctuation marks,
there is also a case for not losing the distinction between GRAVE
ACCENT, APOSTROPHE-QUOTE and their U+20xx counterparts.
H. If the screen display is not satisfactory, a better solution might
be to fix the way the character GRAVE ACCENT is displayed to make
it graphically equivalent to LEFT SINGLE QUOTATION MARK.
I really cannot find a satisfying solution to this problem.
- [Texmacs-dev] Importing ASCII grave from HTML,
David Allouche <=