[Texmacs-dev] Importing ASCII grave from HTML

From: David Allouche
Subject: [Texmacs-dev] Importing ASCII grave from HTML
Date: Fri, 25 Jul 2003 11:18:56 +0200
User-agent: Mutt/1.5.4i

There is currently a problem importing HTML pages that use the ASCII
"grave" character.

Since that problem is really intricate and has _many_ ramifications, I
wrote the whole history down (to help me in my quest for the Right
Thing). Here is what I found. Please share your thoughts and comments.

For easy reference, meaningful paragraphs in this message are
separated into _facts_, ordered with numerals, and _conclusions_,
ordered with letters.

 1. In Unicode this character is U+0060 GRAVE ACCENT. The chart
    comments read:

       * this is a spacing character
       --> 02CB modifier letter grave accent
       --> 0300 combining grave accent
       --> 2035 reversed prime

So this character is intended to mean "space with grave accent".
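
The code points named in fact 1 can be checked against the Unicode
character database. The following is a minimal sketch using Python's
standard unicodedata module; the helper name describe() is only an
illustration and is not part of TeXmacs.

```python
import unicodedata


def describe(code_point: int) -> str:
    """Return 'U+XXXX NAME' for a Unicode code point."""
    return f"U+{code_point:04X} {unicodedata.name(chr(code_point))}"


# U+0060 and the three cross-references from its chart entry,
# plus the two quotation marks discussed in the facts that follow.
for cp in (0x0060, 0x02CB, 0x0300, 0x2035, 0x2018, 0x201C):
    print(describe(cp))
```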

 2. However, in many pages (notably, including the www.gnu.org pages)
    this character is used as LEFT SINGLE QUOTATION MARK (U+2018), and
    two successive occurrences of this character signify LEFT DOUBLE
    QUOTATION MARK (U+201C).

 3. We can also note that TeXmacs has the semantic "twice U+2018 is
    U+201C" implemented as a ligature. It is also possible to operate
    on U+201C as a single character since it is mapped to code #10 in
    the Cork encoding.

 3b. However, this ligature is only supported in some fonts. For
     example, it does not seem to work in Times.

 A. So, one could say: okay, let us import GRAVE ACCENT in
    non-preformatted text as LEFT SINGLE QUOTATION MARK and enjoy the
    nice ligature logic of TeXmacs. This was actually the initially
    proposed solution.

However this solution has some problems:

 4. The GRAVE ACCENT character is #00 in Cork, and TeXmacs has many
    bugs when handling strings with NULL chars in them. I fixed all
    those bugs in the GUILE glue (otherwise the GNU page would not be
    displayed...), but I just found there are still problems with the
    Copy command. And probably more will show up in the future.
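
Fact 4 only says that strings containing NULL chars are mishandled.
One plausible cause, and this is an assumption rather than a diagnosis
of the TeXmacs code, is NUL-terminated C string handling somewhere
along the path. The sketch below uses Python's ctypes to show how such
code would see a Cork string that starts with #00.

```python
import ctypes

# A Cork-encoded byte string whose first character is #00 (GRAVE ACCENT).
cork_text = b"\x00GNU's Not Unix'"

buf = ctypes.create_string_buffer(cork_text)

# .value reads up to the first NUL byte, like strlen()-based C code.
seen_by_c_string_code = buf.value

# .raw exposes the whole buffer; slice off the terminator ctypes appended.
full_content = buf.raw[:len(cork_text)]

print(len(seen_by_c_string_code), len(full_content))
```

Any layer that relies on the terminator instead of an explicit length
loses everything from the grave accent onwards.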

 B. Clearly those bugs need to be fixed.

 5. In Cork, LEFT SINGLE QUOTATION MARK is #60 (that is the code of
    GRAVE ACCENT in ASCII). So up to now, TeXmacs always behaved as if
    they were the same character when using cut-and-paste.
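
The clash in fact 5 can be made concrete with the three Cork codes
named so far. This is a sketch only: the table below is not the full
Cork encoding, and falling back to ASCII for every other byte is a
simplifying assumption of this example.

```python
# Only the Cork code points named in this message.
CORK_TO_UNICODE = {
    0x00: "\u0060",  # GRAVE ACCENT
    0x10: "\u201c",  # LEFT DOUBLE QUOTATION MARK
    0x60: "\u2018",  # LEFT SINGLE QUOTATION MARK
}


def decode_cork(data: bytes) -> str:
    """Decode bytes as Cork, treating unlisted codes as ASCII (assumption)."""
    return "".join(CORK_TO_UNICODE.get(byte, chr(byte)) for byte in data)


raw = b"\x60quoted'"
as_ascii = raw.decode("ascii")  # byte #60 read as GRAVE ACCENT
as_cork = decode_cork(raw)      # byte #60 read as LEFT SINGLE QUOTATION MARK
print(ascii(as_ascii), ascii(as_cork))
```

The same byte yields two different characters depending on which
encoding the receiving side assumes, which is why cut-and-paste has so
far treated them as one.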

 6. To further aggravate the confusion, the GRAVE ACCENT character is
    commonly referred to as BACKQUOTE (that is actually LEFT SINGLE
    QUOTATION MARK) in programming language parlance. Quasiquoting in
    Scheme and command substitution in shell are often called

 7. The Cork #00 (GRAVE ACCENT) character does not render at all in
    many fonts (notably the Adobe collection).

 8. The ligature mechanism appears to be disabled in non-proportional
    fonts. HTML <PRE> elements ought to be displayed in a
    non-proportional font.

 C. So, on second thought, we would need to convert GRAVE ACCENT to
    LEFT SINGLE QUOTATION MARK in all visible text.

 9. I know of no reasonable use of the GRAVE ACCENT character in
    human-readable text. The "character chart" corner case may be
    neglected in the case of HTML import. In all other cases, LEFT
    SINGLE QUOTATION MARK is the intended character.

 D. In round-trip HTML conversion (import to TeXmacs, then export to
    HTML again) it might seem acceptable to replace all occurrences of

10. It may not be acceptable to convert GRAVE ACCENT to LEFT SINGLE
    QUOTATION MARK in HTML attribute values, since that may break
    references (in the admittedly pathological case of a URI or ID
    attribute containing a GRAVE ACCENT character).
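
The distinction between visible text and attribute values can be
sketched with Python's standard html.parser module. This is not the
TeXmacs HTML importer; the class GraveImporter and the function
import_html() are hypothetical names, and the sketch only separates
the two kinds of strings without building a document tree.

```python
from html.parser import HTMLParser

GRAVE_ACCENT = "\u0060"
LEFT_SINGLE_QUOTATION_MARK = "\u2018"


class GraveImporter(HTMLParser):
    """Convert GRAVE ACCENT in element content, never in attributes."""

    def __init__(self):
        super().__init__()
        self.text = []         # element content, converted
        self.attr_values = []  # attribute values, left untouched

    def handle_starttag(self, tag, attrs):
        # Attribute values may be URIs or IDs, so they pass through as-is.
        self.attr_values.extend(v for _name, v in attrs if v is not None)

    def handle_data(self, data):
        self.text.append(
            data.replace(GRAVE_ACCENT, LEFT_SINGLE_QUOTATION_MARK))


def import_html(source: str):
    parser = GraveImporter()
    parser.feed(source)
    parser.close()
    return "".join(parser.text), parser.attr_values


text, attr_values = import_html('<a href="a`b.html">`free software\'</a>')
print(ascii(text), attr_values)
```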

11. It is not acceptable to modify the corktounicode.scm conversion
    table because that would affect import as well as export (we only
    have a problem with import) for all Unicode-based document formats
    (we only have a problem with HTML). In addition, we would not be
    able to claim "unicode compliance" with a clear conscience.

 E. A solution would be to add a string substitution for all
    element content string nodes in the HTML->TM converter which
    content (and all other special text if we ever come to use it)
    would be converted using the standard-compliant utf8->cork

12. The status of U+0027 (APOSTROPHE-QUOTE) is not very clear either.
    Unicode clearly states that this character has a "mixed usage" and
    that U+2019 is preferred for matching quotes as well as for the
    apostrophe, yet ASCII #27 is used to mean "single quote" as well
    as "apostrophe".

13. LEFT SINGLE QUOTATION MARK has a glyph variant named SINGLE
    HIGH-REVERSED-9 QUOTATION MARK, and LEFT DOUBLE QUOTATION MARK
    has a glyph variant DOUBLE HIGH-REVERSED-9 QUOTATION MARK.
    Actually, that is a bit weird because Unicode's clearly stated
    purpose is to be an encoding for characters, not glyphs. It just
    happens that those alternate glyphs more closely match what is
    displayed by

 F. Let us just not care and act as if the HIGH-REVERSED-9 variants
    did not exist.

    RIGHT SINGLE QUOTATION MARK, etc. in a similar way as TeXmacs.

15. When copy-pasting from Konqueror or Mozilla, non-ASCII
    punctuation characters are not converted to the nearest ASCII
    punctuation. As a consequence, copy-pasting will generally only
    work correctly when using GRAVE ACCENT and APOSTROPHE-QUOTE. This
    is especially important for code snippets.
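
For comparison, the "nearest ASCII punctuation" conversion that these
browsers do not perform could look like the sketch below. The mapping
is an assumption of this example (it follows the grave/apostrophe
convention described in fact 2), and fold_quotes() is a hypothetical
helper, not an existing TeXmacs or browser function.

```python
# Typographic quotes folded back to the ASCII convention of fact 2.
ASCII_FOLD = str.maketrans({
    "\u2018": "`",   # LEFT SINGLE QUOTATION MARK  -> GRAVE ACCENT
    "\u2019": "'",   # RIGHT SINGLE QUOTATION MARK -> APOSTROPHE
    "\u201c": "``",  # LEFT DOUBLE QUOTATION MARK  -> two GRAVE ACCENTs
    "\u201d": "''",  # RIGHT DOUBLE QUOTATION MARK -> two APOSTROPHEs
})


def fold_quotes(text: str) -> str:
    """Replace typographic quotation marks by ASCII approximations."""
    return text.translate(ASCII_FOLD)


print(fold_quotes("\u201cfree\u201d as in \u2018freedom\u2019"))
```

Without such a step on the pasting side, a snippet imported with
U+2018 in place of GRAVE ACCENT no longer pastes as valid code.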

 G. Considering how other software handles these quotation characters,
    there is also a case for not losing the distinction between GRAVE
    ACCENT, APOSTROPHE-QUOTE and the U+20xx counterparts.

 H. If the screen display is not satisfying, a better solution might
    be to fix the way the character GRAVE ACCENT is displayed to make
    it graphically equivalent to LEFT SINGLE QUOTATION MARK.

I really cannot find a satisfying solution to this problem.

Comments anyone?

                                                            -- ddaa
