Re: internationalizing syntax files
Thu, 15 Jun 2006 10:48:48 +0800
On Wed, Jun 14, 2006 at 05:59:31PM -0700, Ben Pfaff wrote:
Either of these formats has the potential pitfall that the host
machine's multibyte string encoding might not be based on ASCII
(or its wide string encoding might not be based on Unicode), so
that (wide) character and (wide) string literals would then be in
the wrong encoding. However, in practice there's only one real
competitor to ASCII, which is EBCDIC. On those systems, if we
chose to support them at all, we could use UTF-EBCDIC.
I don't understand this. Even if pspp is running on some host that has
a totally weird, esoteric character set, the compiler will interpret
literals in that charset. So if I have a line like:
int x = 'A';
Then in ASCII, x == 65; in EBCDIC, x == something else (193). Similarly,
int x = L'A';
The only time it'll fall down is if for some reason somebody has
decided to use numeric literals where character or string literals
should have been used.
Here's a summary.
UTF-8:
- Needs multibyte support (but at least it's easy)
- Some code needs to be rewritten (but which?)
+ Efficient storage of European characters
+ Easy interface to existing libraries
UTF-32:
+ Less need for multibyte support (well, except that
wchar_t might only be 16 bits)
- All string-handling code must be rewritten (but at
least you can't miss important parts)
- European characters expand 2x to 4x
- Difficult interfaces to existing libraries.
What do you think? I am leaning toward UTF-8, not least because
it is possible to convert to using it in phases. If we switch to
UTF-32, then we have to convert pretty much everything all at
once, because code will not compile or, if it does, will not
work, when char pointers become wchar_t pointers.
Personally, I'm leaning the other way. Largely because, although it
may be more of a quantum leap, I think that any problems that are
introduced are going to be much more obvious with UTF-32. In fact, I
suggest that LESS code will need to be rewritten (much of it will be
simple substitution of typenames and function call names), but like
you say, it does have to be done all at once. With the UTF-8
approach, I predict that subtle problems will remain undiscovered for
a long time, whereas with UTF-32 most will be caught at compile time.
For example flip.c contains code similar to:
    make_new_var (const char *name)
    char *cp = strchr (name, '\0');
    if (lex_is_id1 (*cp))
In this case, if the first byte in name happens to be part of a
multi-byte sequence, then there's no way the compiler can know that
dereferencing cp this way is inappropriate. There's a lot of pointer
arithmetic and array indexing in the string parsing code, and it'd
have to be carefully audited to have confidence it'll all work for
multibyte input.
We don't currently have any developers who use pspp in a non-European
language, so we'd probably only know about bugs when a Japanese user
complains. --- Like you say, at least with UTF-32 one cannot miss the
important parts.
I don't think that the storage inefficiency of UTF-32 is an issue
these days. Even if it means that 4 times the size of the syntax file
is needed, syntax files are not huge like casefiles. Today memory is
cheap.
Similarly, I cannot conceive that there are many platforms today
on which wchar_t is only 16 bits wide. If there are, let's just issue
a warning at configure time.
That leaves the question of interfacing to existing libraries. All
the stdio/stdlib/ctype functions (e.g. printf) have existing wchar_t
counterparts. Which particular libraries are you concerned about?
PGP Public key ID: 1024D/2DE827B3
fingerprint = 8797 A26D 0854 2EAB 0285 A290 8A67 719C 2DE8 27B3
See http://pgp.mit.edu or any PGP keyserver for public key.