Re: internationalizing syntax files
Fri, 16 Jun 2006 18:32:32 +0800
On Thu, Jun 15, 2006 at 10:29:36AM -0700, Ben Pfaff wrote:
[Pet peeve: of course I know what you mean, but in fact a
"quantum" is the smallest possible amount of something.]
Quantum mechanics never was my forte. However, as I understand it,
the metaphor stems from the fact that a quantum is the smallest
possible amount of energy necessary to move an electron from one shell
in the Bohr model of the atom to the next; and that amount, in
atomic terms, is quite a large amount of energy. Anyway, when I use
the expression "quantum leap", I normally want to convey the idea of:
"operating at a different level".
OK, stipulate for the moment that we decide to move to wide
characters and strings for syntax files. The biggest issue in my
mind is, then, deciding how many assumptions we want to make
about wchar_t. There are several levels. In rough order of
increasingly strong assumptions:
1. Don't make any assumptions. There is no benefit to
this above using "char", because C99 doesn't actually
say that wide strings can't have stateful or
multi-unit encodings. It also doesn't say that the
encoding of wchar_t is locale-independent.
2. Assume that wchar_t has a stateless encoding.
3. Assume that wchar_t has a stateless and
4. Assume that wchar_t is Unicode (one of UCS-2, UTF-16,
UTF-32), and for UTF-16 ignore the possibility of
surrogate pairs. C99 recommends but does not require
use of Unicode for wchar_t. (There's a standard macro
__STDC_ISO_10646__ that indicates this.)
5. Assume that wchar_t is UTF-32.
GCC and glibc conform to level 5. Native Windows conforms to
In the above, I'm assuming that when you say "wchar_t has a stateless
encoding", you mean that the entity reading the stream is
stateless. wchar_t is (on my machine at least) just a typedef to int,
so can't contain any "state" except its face value.
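To see where a given platform sits among the levels listed above, something like the following could be compiled on it. This is only a sketch: the function names are made up for illustration, and it relies on nothing beyond CHAR_BIT, WCHAR_MAX and the standard __STDC_ISO_10646__ macro mentioned earlier.

```c
#include <limits.h>
#include <stddef.h>
#include <wchar.h>

/* Width of wchar_t in bits on this platform. */
static size_t wchar_bits(void)
{
    return sizeof(wchar_t) * CHAR_BIT;
}

/* Nonzero if the implementation promises that wchar_t values are
   ISO 10646 (Unicode) code points. */
static int wchar_is_iso10646(void)
{
#ifdef __STDC_ISO_10646__
    return 1;
#else
    return 0;
#endif
}

/* Hypothetical helper: nonzero if a single wchar_t can hold every
   code point up to U+10FFFF, i.e. no surrogate pairs are needed. */
static int wchar_holds_any_code_point(void)
{
    return wchar_is_iso10646() && WCHAR_MAX >= 0x10FFFF;
}
```

A configure-time test could call these and warn when wchar_holds_any_code_point() returns 0.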
So, that being so, I don't think we need to make any assumptions
beyond level 3. See below for elaboration:
I'm saying that we can't blindly translate syntax files to UTF-8
or UTF-32 unless we also translate all of the string and
character literals that we use in conjunction with them to UTF-8
or UTF-32 also. If the execution character set is Unicode, then
no translation is needed; otherwise, we'd have to call a function
to do that, which is inconvenient and relatively slow.
Surely the string and character literals are converted to UTF-32 by the
compiler? Just by saying:
const wchar_t *str = L"foo";
str points to a UTF-32 string (or whatever the wchar_t encoding for that
platform happens to be). We'd have to change strings like
"REGRESSION" to L"REGRESSION" in command.def and other files in
language/lexer, but that doesn't involve any function calls.
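A minimal sketch of what that comparison might look like, assuming the token has already been read as a wide string. The names regression_keyword and is_regression are invented here and are not taken from command.def or the lexer; the match is also case-sensitive, which may not be what the real lexer wants.

```c
#include <wchar.h>

/* Hypothetical keyword: the compiler emits this literal directly in
   the platform's wchar_t encoding, so no run-time conversion occurs. */
static const wchar_t *const regression_keyword = L"REGRESSION";

/* Nonzero if the wide-string token is exactly the keyword. */
static int is_regression(const wchar_t *token)
{
    return wcscmp(token, regression_keyword) == 0;
}
```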
Currently, syntax is read one line at a time, using ds_read_line from
str.c. The way I see it working is that a wchar_t counterpart to
str.c is created (call it wstr.c). In dws_read_line, the call to
getc(stream) is replaced by getwc(stream). Now the man page for
The behaviour of fgetwc depends on the LC_CTYPE category of the current locale.
In the absence of additional information passed to the fopen call, it
is reasonable to expect that fgetwc will actually read a multibyte
sequence from the stream and then convert it to a wide character.
This "reasonable" expectation seems to be a statement of your
assumption #3 above.
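A rough sketch of the kind of wide line reader I have in mind. It is not the real str.c code: read_wide_line and its fixed-size buffer interface are assumptions made for illustration. The caller would be expected to call setlocale(LC_CTYPE, "") beforehand so that getwc converts according to LANG or LC_CTYPE, and the stream must not already have been used for byte-oriented I/O.

```c
#include <stdio.h>
#include <wchar.h>

/* Hypothetical wide counterpart to a byte-oriented line reader.
   Stores at most cap - 1 wide characters in buf, stopping after a
   newline, and always null-terminates when cap > 0.  Returns the
   number of wide characters stored; 0 means end of file or error. */
static size_t read_wide_line(FILE *stream, wchar_t *buf, size_t cap)
{
    size_t n = 0;
    wint_t wc;

    if (cap == 0)
        return 0;
    while (n + 1 < cap && (wc = getwc(stream)) != WEOF) {
        buf[n++] = (wchar_t) wc;   /* getwc already decoded the multibyte sequence */
        if (wc == L'\n')
            break;
    }
    buf[n] = L'\0';
    return n;
}
```

The only change from the byte version is getc becoming getwc and char becoming wchar_t; the decoding itself is left to the C library.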
So, let us assume that I'm running PSPP on a machine whose wchar_t
happens to be UTF-32 encoded, and its native charset is EBCDIC. So
long as my LC_CTYPE encoding specifies EBCDIC, syntax files will be
dutifully converted to UTF-32 and, during parsing, compared with
UTF-32 string constants. If I'm provided with a syntax file that
is encoded in UTF-8, I can use this file simply by changing LANG (or
LC_CTYPE) to en_AU.UTF-8 (or similar).
> Similarly I cannot conceive that there would be many platforms today
> that have a sizeof(wchar_t) of 16 bits. If it does, let's just issue
> a warning at configure time.
The elephant in the room here is Windows. If we ever want to
have native Windows support, its wchar_t is 16 bits and that's
unlikely to change as I understand it.
I'm treading outside the bounds of my understanding of Unicode
now. But I read a bit of the web site, and from what I can infer,
almost all the glyphs for modern natural languages are located below
65536. The "code points" above that are for ancient languages and
math symbols etc.
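To make the 16-bit case concrete, here is a small sketch of how a code point maps onto UTF-16 units. The function names are made up for illustration; the arithmetic is the standard UTF-16 surrogate-pair encoding. Anything below 65536 fits in one 16-bit unit, and anything above needs two.

```c
#include <stdint.h>

/* Number of 16-bit units needed for code point cp in UTF-16:
   1 below U+10000, 2 up to U+10FFFF, 0 if cp is not encodable
   (a surrogate value or beyond U+10FFFF). */
static int utf16_units_needed(uint32_t cp)
{
    if (cp >= 0xD800 && cp <= 0xDFFF)
        return 0;
    if (cp < 0x10000)
        return 1;
    if (cp <= 0x10FFFF)
        return 2;
    return 0;
}

/* Encodes cp into out[0..1] and returns the number of units written. */
static int utf16_encode(uint32_t cp, uint16_t out[2])
{
    int n = utf16_units_needed(cp);

    if (n == 1) {
        out[0] = (uint16_t) cp;
    } else if (n == 2) {
        uint32_t v = cp - 0x10000;                   /* 20 significant bits */
        out[0] = (uint16_t) (0xD800 + (v >> 10));    /* high surrogate */
        out[1] = (uint16_t) (0xDC00 + (v & 0x3FF));  /* low surrogate */
    }
    return n;
}
```

So code that ignores surrogate pairs on a 16-bit wchar_t still handles every code point below 65536 correctly; only the two-unit cases are affected.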
PGP Public key ID: 1024D/2DE827B3
fingerprint = 8797 A26D 0854 2EAB 0285 A290 8A67 719C 2DE8 27B3
See http://pgp.mit.edu or any PGP keyserver for public key.