Re: i18n proposal
Mon, 19 Jun 2006 09:29:40 +0800
On Sun, Jun 18, 2006 at 02:50:37PM -0700, Ben Pfaff wrote:
Based on the ongoing discussion here, I'm trying to come up with
an acceptable proposal for i18n of syntax files, messages,
output, and data files. Here is what I have so far.
I agree with nearly everything you proposed. Some comments:
* String data that occurs in cases is primarily treated as opaque
octets. Even procedures like SORT CASES that could easily do
better (by using language-specific collation rules via, e.g.,
wcscoll()) are documented to use bytewise comparison.
It's probably documented that way because it's easier to implement.
It makes sense to me that SORT CASES should use the collation of the
"data locale". Let's at least look into the implications of doing so,
and perhaps offer it under "enhanced" mode.
A German wanting to, say, select all cities from 'N' to 'Z' might be
very annoyed to find that pspp omitted 'Öhringen' (where they had the
world cup match last week).
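To make the difference concrete, here is a minimal sketch (not pspp
code) that sorts strings either bytewise or with strcoll(). The
function name sort_strings() and its "enhanced" flag are made up for
illustration, and the example assumes the data is UTF-8 encoded.

```c
#include <assert.h>
#include <locale.h>
#include <stdlib.h>
#include <string.h>

/* Bytewise comparison, as SORT CASES is currently documented to do.
   strcmp() compares the bytes as unsigned char values. */
static int compare_bytewise(const void *a, const void *b)
{
    const char *const *sa = a;
    const char *const *sb = b;
    return strcmp(*sa, *sb);
}

/* Locale-aware comparison: strcoll() follows the LC_COLLATE category
   of the current locale. */
static int compare_collated(const void *a, const void *b)
{
    const char *const *sa = a;
    const char *const *sb = b;
    return strcoll(*sa, *sb);
}

/* Hypothetical helper: sort N strings in place.  A nonzero ENHANCED
   selects locale collation; zero selects bytewise order. */
void sort_strings(const char **v, size_t n, int enhanced)
{
    assert(v != NULL);
    qsort(v, n, sizeof *v, enhanced ? compare_collated : compare_bytewise);
}
```

With bytewise order, "Öhringen" in UTF-8 starts with byte 0xC3 and so
sorts after every name starting with 'Z'. If a German locale is
installed (the name "de_DE.UTF-8" is an assumption about the system),
calling setlocale(LC_COLLATE, "de_DE.UTF-8") before an enhanced sort
should instead place it among the names starting with 'O'.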
* The interface to the output subsystem (that is, primarily the
functions in output.h and tab.h) should use multibyte strings,
for these reasons. First, strings passed to the tab_*()
functions are often fed through gettext() along the way, so
wide strings would be inconvenient. Second, tables can get
very large, so wide strings would be wasteful.
(The ASCII driver might want to change its representation of
the page to wide strings, though, because this would be an easy
way for it to support Asian character sets.)
According to texts on the Unicode website, this would not be the
case. Apparently, even in "monospace" fonts, the number of
characters is not, in general, proportional to the width required to
render them. The advice there is to use a multibyte representation
for all input/output operations.
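A small sketch of why bytes, characters and columns can all differ.
It is not pspp code: measure_utf8() and the tiny width table are
invented for illustration, the input is assumed to be well-formed
UTF-8, and a real driver would more likely rely on wcwidth() or a
complete width table.

```c
#include <assert.h>
#include <stddef.h>

struct text_metrics {
    size_t bytes;    /* Length in bytes. */
    size_t chars;    /* Number of code points. */
    size_t columns;  /* Display columns, by the toy table below. */
};

/* Decode one UTF-8 sequence at S (assumed well-formed), store the
   code point in *CP, and return the number of bytes consumed. */
static size_t decode_utf8(const unsigned char *s, unsigned long *cp)
{
    if (s[0] < 0x80) {
        *cp = s[0];
        return 1;
    }
    if ((s[0] & 0xE0) == 0xC0) {
        *cp = ((s[0] & 0x1FUL) << 6) | (s[1] & 0x3FUL);
        return 2;
    }
    if ((s[0] & 0xF0) == 0xE0) {
        *cp = ((s[0] & 0x0FUL) << 12) | ((s[1] & 0x3FUL) << 6)
              | (s[2] & 0x3FUL);
        return 3;
    }
    *cp = ((s[0] & 0x07UL) << 18) | ((s[1] & 0x3FUL) << 12)
          | ((s[2] & 0x3FUL) << 6) | (s[3] & 0x3FUL);
    return 4;
}

/* Toy width table: combining diacritics (U+0300..U+036F) take no
   column, CJK unified ideographs (U+4E00..U+9FFF) take two, and
   everything else is treated as one column. */
static size_t toy_width(unsigned long cp)
{
    if (cp >= 0x0300 && cp <= 0x036F)
        return 0;
    if (cp >= 0x4E00 && cp <= 0x9FFF)
        return 2;
    return 1;
}

/* Hypothetical helper: measure a NUL-terminated UTF-8 string. */
struct text_metrics measure_utf8(const char *s)
{
    const unsigned char *p = (const unsigned char *) s;
    struct text_metrics m = { 0, 0, 0 };

    assert(s != NULL);
    while (*p != '\0') {
        unsigned long cp;
        size_t n = decode_utf8(p, &cp);
        m.bytes += n;
        m.chars += 1;
        m.columns += toy_width(cp);
        p += n;
    }
    return m;
}
```

For "e" followed by a combining acute accent (U+0301) this reports
three bytes, two characters and one column, so neither the byte count
nor the character count tells the driver where the next cell starts.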
Incidentally, if the ASCII driver is going to support other character
sets, then it should probably be given a more appropriate name.
* Each "struct variable" is split between multibyte and wide
strings. Variable names are used as part of syntax processing,
so we will probably want to change "name" to a wide string.
But I think short_name has to remain as it is.
* Finally, what should we pass to setlocale()? I think that we
should select, with LC_ALL, the "output locale".
As you say, there's going to be a lot of locale switching going on,
and with that comes the potential for mistakes that might easily go
unnoticed. I suggest that we avoid direct calls to setlocale() and
implement some wrappers instead.
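One possible shape for such a wrapper, as a rough sketch only: the
names with_locale() and count_call() are invented here and nothing
like them exists in the tree as far as I know. The wrapper selects a
locale, runs a callback, and always restores the previous setting, so
callers cannot forget to switch back.

```c
#include <assert.h>
#include <locale.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical wrapper: run FN(AUX) with CATEGORY temporarily set to
   the locale NAME, then restore the previous setting.  Returns 0 on
   success, or -1 if the locale could not be selected or memory ran
   out, in which case FN is not called. */
int with_locale(int category, const char *name,
                void (*fn)(void *aux), void *aux)
{
    const char *current;
    char *saved;

    assert(name != NULL && fn != NULL);

    /* The string returned by setlocale() may be overwritten by a
       later call, so keep a private copy of the current setting. */
    current = setlocale(category, NULL);
    if (current == NULL)
        return -1;
    saved = malloc(strlen(current) + 1);
    if (saved == NULL)
        return -1;
    strcpy(saved, current);

    if (setlocale(category, name) == NULL) {
        free(saved);
        return -1;
    }

    fn(aux);

    setlocale(category, saved);
    free(saved);
    return 0;
}

/* Example callback: count how many times it was invoked. */
void count_call(void *aux)
{
    int *counter = aux;
    ++*counter;
}
```

A call such as with_locale(LC_ALL, "C", count_call, &n) then leaves
the process locale as it found it. Note that setlocale() affects the
whole process, so a wrapper like this does nothing for thread safety.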
LC_ALL is probably correct.
I've been wondering why pspp currently sets the LC_MONETARY category.
As far as I can tell, there are no functions anywhere in the code that
depend upon it. The obvious place that it might be used is to control
the DOLLAR format, in which case, "DOLLAR" would need to be a
translatable string. On the other hand, if a system file containing
monetary data passes from one locale to another, the currency might
get silently changed, which could cause somebody's fiscal
Another option would be to preset the CCA format based upon the lconv
struct, and leave the DOLLAR format as is. But this would mean that
DOLLAR is an unmitigated nuisance in countries with a non-dollar
currency. I wonder what SPSS does in a European locale.
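For reference, reading the currency symbol out of the lconv struct
looks roughly like this. The helper name currency_prefix() and its
fallback argument are invented for the example; how the result would
be turned into a CCA setting is left open.

```c
#include <assert.h>
#include <locale.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: copy the current locale's currency symbol
   into BUF (of SIZE bytes), using FALLBACK when the locale does not
   define one, as is the case in the "C" locale.  Returns the value
   returned by snprintf(). */
int currency_prefix(char *buf, size_t size, const char *fallback)
{
    const struct lconv *lc = localeconv();
    const char *symbol = lc->currency_symbol;

    assert(buf != NULL && size > 0 && fallback != NULL);

    /* The lconv strings may be overwritten by later calls to
       localeconv() or setlocale(), so copy rather than keep the
       pointer. */
    if (symbol == NULL || *symbol == '\0')
        symbol = fallback;
    return snprintf(buf, size, "%s", symbol);
}
```

Because the symbol is read from whatever LC_MONETARY happens to be in
effect at the time of the call, the same code would yield a different
prefix on another machine.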
PGP Public key ID: 1024D/2DE827B3
fingerprint = 8797 A26D 0854 2EAB 0285 A290 8A67 719C 2DE8 27B3
See http://pgp.mit.edu or any PGP keyserver for public key.
- i18n proposal, Ben Pfaff, 2006/06/18
- Re: i18n proposal,
John Darrington <=