Re: GNU recode 3.6: invalid HTML entity references
From: Kevin Rodgers
Subject: Re: GNU recode 3.6: invalid HTML entity references
Date: Wed, 14 Jan 2004 16:27:58 -0700
> 1. HTML 2.0 - 4.01 are all defined as SGML applications, and in SGML
> documents "&" is not recognized as an entity reference delimiter
> unless it is immediately followed by a name start character.
> Similarly, "&#" is not recognized as a character reference delimiter
> unless it is followed by a name start character or a digit.
>
> 2. SGML defines name start characters as lowercase or uppercase letters,
> but XML (and thus XHTML) adds underscore and colon. The
> XML additions aren't relevant, though, because that spec also
> requires "&" to be interpreted as a markup delimiter (except within
> comments, processing instructions, and CDATA sections).
recode-3.6/src/html.c:transform_html_ucs2() contains this code to check
the character following '&':

    else if ((input_char >= 'A' && input_char <= 'Z')
             || (input_char >= 'a' && input_char <= 'z'))

which isn't correct on systems whose execution character set doesn't
assign consecutive integers to letters, e.g. EBCDIC (see 2.1.3 Character
Encoding, C: A Reference Manual). The usual way around that is to use
isalpha() and the related predicates declared in <ctype.h>, but POSIX
defines those functions to be dependent on the locale. So should recode use the
code above or isalpha(), and should it call setlocale (LC_CTYPE, "C")
right off the bat to make sure non-ASCII characters aren't considered to
be letters?
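
To make the two options concrete, here is a rough sketch of what each
might look like. None of this is recode's actual code: the function
names is_name_start(), is_name_start_ctype() and starts_reference() are
made up for illustration. The first variant looks the character up in a
string literal, so it doesn't depend on the letters having consecutive
codes or on the locale; the second relies on isalpha() and assumes the
caller has already pinned LC_CTYPE to "C".

```c
#include <ctype.h>
#include <limits.h>
#include <locale.h>
#include <string.h>

/* Hypothetical helper (not from recode): true if c is one of the 52
   basic Latin letters, however the execution character set orders them. */
int
is_name_start (int c)
{
  static const char letters[] =
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz";

  /* strchr() converts its argument to char and also matches the
     terminating NUL, so reject 0 and out-of-range values first. */
  return c > 0 && c <= UCHAR_MAX && strchr (letters, c) != NULL;
}

/* Hypothetical alternative: use <ctype.h>.  Only equivalent to the
   function above if LC_CTYPE is "C" when it is called. */
int
is_name_start_ctype (int c)
{
  return c >= 0 && c <= UCHAR_MAX && isalpha (c) != 0;
}

/* Hypothetical sketch of the SGML rule quoted above: "&" opens an entity
   reference only before a name start character, and "&#" opens a
   character reference only before a name start character or a digit. */
int
starts_reference (const char *s)
{
  if (s[0] != '&')
    return 0;
  if (s[1] == '#')
    {
      int c = (unsigned char) s[2];
      /* C guarantees that '0'..'9' have consecutive values. */
      return is_name_start (c) || (c >= '0' && c <= '9');
    }
  return is_name_start ((unsigned char) s[1]);
}
```

With the second variant, the program would call
setlocale (LC_CTYPE, "C") once before the first test, or simply never
call setlocale() for LC_CTYPE at all, since a C program starts in the
"C" locale.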
--
Kevin