Re: internationalizing syntax files
Date: Fri, 16 Jun 2006 07:57:55 -0700
User-agent: Gnus/5.110004 (No Gnus v0.4) Emacs/21.4 (gnu/linux)
John Darrington <address@hidden> writes:
> On Thu, Jun 15, 2006 at 10:29:36AM -0700, Ben Pfaff wrote:
> > OK, stipulate for the moment that we decide to move to wide
> > characters and strings for syntax files. The biggest issue in my
> > mind is, then, deciding how many assumptions we want to make
> > about wchar_t. There are several levels. In rough order of
> > increasingly strong assumptions:
> >
> >     1. Don't make any assumptions. There is no benefit to
> >        this above using "char", because C99 doesn't actually
> >        say that wide strings can't have stateful or
> >        multi-unit encodings. It also doesn't say that the
> >        encoding of wchar_t is locale-independent.
> >
> >     2. Assume that wchar_t has a stateless encoding.
> >
> >     3. Assume that wchar_t has a stateless and
> >        locale-independent encoding.
> >
> >     4. Assume that wchar_t is Unicode (one of UCS-2, UTF-16,
> >        UTF-32), and for UTF-16 ignore the possibility of
> >        surrogate pairs. C99 recommends but does not require
> >        use of Unicode for wchar_t. (There's a standard macro
> >        __STDC_ISO_10646__ that indicates this.)
> >
> >     5. Assume that wchar_t is UTF-32.
> >
> > GCC and glibc conform to level 5. Native Windows conforms to
> > level 4.
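For what it's worth, a program can probe part of this itself.
Levels 2 and 3 are pure assumptions that can't be tested from
within C, but the Unicode levels can be. A minimal sketch (my own
illustration, not PSPP code):

    /* Probe the wchar_t guarantees the implementation advertises.
       Only distinguishes the Unicode cases (levels 4 and 5);
       levels 2 and 3 cannot be detected mechanically. */
    #include <stdio.h>
    #include <wchar.h>

    int
    main (void)
    {
    #ifdef __STDC_ISO_10646__
      printf ("wchar_t holds ISO 10646 (Unicode) code points "
              "(__STDC_ISO_10646__ = %ld).\n",
              (long) __STDC_ISO_10646__);
    #else
      printf ("no guarantee that wchar_t is Unicode\n");
    #endif
      if (sizeof (wchar_t) >= 4)
        printf ("wchar_t is wide enough for UTF-32 (level 5 possible)\n");
      else
        printf ("16-bit wchar_t: UTF-16 at best, beware surrogates\n");
      return 0;
    }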
> In the above, I'm assuming that when you say "wchar_t has a stateless
> encoding", you mean that the entity reading the stream is
> stateless. wchar_t is (on my machine at least) just a typedef to int,
> so can't contain any "state" except its face value.
A stateful encoding is one that needs potentially unbounded
look-behind to interpret. For example, ISO-2022 has escape
sequences that change the interpretation of all following bytes.
So, yes, it's the reader of a stateful encoding that needs to
maintain the state, but it's still a fairly common jargon term
despite the minor misnaming.
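To make the jargon concrete: in the C99 multibyte interface, the
decoder's state is an explicit mbstate_t object that the caller
carries from call to call. A minimal sketch (the input string is
just a placeholder; with an ISO-2022-style locale encoding the
mbstate_t would hold real shift state):

    /* Decode a locale-encoded multibyte string one character at a
       time.  The mbstate_t carries any shift state between calls. */
    #include <locale.h>
    #include <stdio.h>
    #include <string.h>
    #include <wchar.h>

    int
    main (void)
    {
      const char *s = "example";        /* placeholder input */
      mbstate_t state;
      wchar_t wc;
      size_t n;

      setlocale (LC_CTYPE, "");
      memset (&state, 0, sizeof state); /* initial shift state */
      while ((n = mbrtowc (&wc, s, strlen (s), &state)) > 0)
        {
          if (n == (size_t) -1 || n == (size_t) -2)
            break;                      /* invalid or incomplete */
          /* Printing as U+xxxx is only meaningful if wchar_t is
             Unicode (__STDC_ISO_10646__). */
          printf ("U+%04lX\n", (unsigned long) wc);
          s += n;
        }
      return 0;
    }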
> So, that being so, I don't think we need to make any assumptions
> beyond level 3. See below for elaboration:
> > I'm saying that we can't blindly translate syntax files to UTF-8
> > or UTF-32 unless we also translate all of the string and
> > character literals that we use in conjunction with them to UTF-8
> > or UTF-32 also. If the execution character set is Unicode, then
> > no translation is needed; otherwise, we'd have to call a function
> > to do that, which is inconvenient and relatively slow.
> Surely, the string and character literals are converted to UTF-32 by the
> compiler? Just by saying:
> const wchar_t *str = L"foo";
> then str points to a UTF-32 string (or whatever the wchar_t
> encoding for that platform happens to be).
It definitely contains a wchar_t encoding for the "C" locale. If
we make the assumption of a locale-independent encoding for
wchar_t, then it contains a wchar_t encoding of the string for
the current locale too.
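To spell out what assumption #3 buys us: the lexer can then
compare wide input directly against wide string literals, with no
conversion at parse time. A minimal sketch (the helper name is
made up; COMPUTE is just an example keyword):

    /* Under assumption #3, a wide token read from a syntax file is
       directly comparable with a compiler-encoded wide literal. */
    #include <stdio.h>
    #include <wchar.h>

    static int
    is_compute_keyword (const wchar_t *token)
    {
      return wcscmp (token, L"COMPUTE") == 0;
    }

    int
    main (void)
    {
      printf ("%d\n", is_compute_keyword (L"COMPUTE"));   /* 1 */
      printf ("%d\n", is_compute_keyword (L"LIST"));      /* 0 */
      return 0;
    }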
> Currently, syntax is read one line at a time, using ds_read_line from
> str.c. The way I see it working is that a wchar_t counterpart to
> str.c is created (call it wstr.c). In dws_read_line, the call to
> getc(stream) is replaced by getwc(stream).
> This "reasonable" expectation seems to be a statement of your
> assumption #3 above.
Yes, that's the best way to do it.
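For concreteness, here's a minimal sketch of the getwc loop such a
dws_read_line might contain. The function name is made up, and a
fixed-size buffer stands in for the dynamic string that a real
wstr.c counterpart of ds_read_line would grow:

    /* Read one line of wide characters; getwc converts from the
       LC_CTYPE encoding to wchar_t as it reads.  Overlong lines
       are silently truncated in this sketch. */
    #include <stdbool.h>
    #include <stdio.h>
    #include <wchar.h>

    bool
    read_wide_line (FILE *stream, wchar_t *buf, size_t size)
    {
      size_t len = 0;
      wint_t wc;

      while ((wc = getwc (stream)) != WEOF && wc != L'\n')
        if (len + 1 < size)
          buf[len++] = (wchar_t) wc;
      buf[len] = L'\0';
      return len > 0 || wc != WEOF;
    }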
> So, let us assume that I'm running PSPP on a machine whose wchar_t
> happens to be UTF-32 encoded, and its native charset is EBCDIC. So
> long as my LC_CTYPE encoding specifies EBCDIC, syntax files will be
> dutifully converted to UTF-32, and during parsing, compared with
> UTF-32 string constants. If I'm provided with a syntax file which
> is encoded in UTF-8, I can use this file simply by changing LANG (or
> LC_CTYPE) to en_AU.UTF-8 (or similar).
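One caveat about that scenario: a C program only sees LANG or
LC_CTYPE if it adopts the environment's locale; by default it runs
in the "C" locale. A minimal sketch of the initialization the
whole plan depends on:

    /* Adopt the environment's character encoding.  After this,
       getwc and mbrtowc convert from whatever charset LC_CTYPE
       names (EBCDIC, UTF-8, ...) into wchar_t. */
    #include <locale.h>
    #include <stdio.h>

    int
    main (void)
    {
      const char *loc = setlocale (LC_CTYPE, "");
      printf ("LC_CTYPE locale: %s\n", loc != NULL ? loc : "(failed)");
      return 0;
    }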
> > > Similarly I cannot conceive that there would be many platforms today
> > > that have a sizeof(wchar_t) of 16 bits. If it does, let's just issue
> > > a warning at configure time.
> > The elephant in the room here is Windows. If we ever want to
> > have native Windows support, its wchar_t is 16 bits and that's
> > unlikely to change as I understand it.
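For the record, this is the extra work that a 16-bit wchar_t
forces on us at level 4 if we don't ignore surrogates. A minimal
sketch (my own illustration) of combining a surrogate pair into a
code point:

    /* Combine a UTF-16 surrogate pair (e.g. on native Windows,
       where wchar_t is 16 bits) into one code point.  Returns -1
       if the arguments are not a valid high/low surrogate pair. */
    #include <stdio.h>

    static long
    combine_surrogates (unsigned hi, unsigned lo)
    {
      if (hi < 0xD800 || hi > 0xDBFF || lo < 0xDC00 || lo > 0xDFFF)
        return -1;
      return 0x10000L + ((long) (hi - 0xD800) << 10) + (lo - 0xDC00);
    }

    int
    main (void)
    {
      /* U+1D11E MUSICAL SYMBOL G CLEF is encoded as D834 DD1E. */
      printf ("U+%lX\n", combine_surrogates (0xD834, 0xDD1E));
      return 0;
    }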
> I'm treading outside the bounds of my understanding of Unicode
> now. But I read a bit of the web site, and from what I can infer,
> almost all the glyphs for modern natural languages are located below
> 65536. The "code points" above that are for ancient languages,
> math symbols, etc.
It seems to depend on who you ask. I've seen claims that some of
the high-plane code points are important, and I've seen claims of
the opposite.

OK, now I have some higher-level i18n issues to raise, so stay
tuned.
"...I've forgotten where I was going with this,
but you can bet it was scathing."