Re: i18n proposal
Sun, 18 Jun 2006 19:09:15 -0700
Gnus/5.110004 (No Gnus v0.4) Emacs/21.4 (gnu/linux)
John Darrington <address@hidden> writes:
> On Sun, Jun 18, 2006 at 02:50:37PM -0700, Ben Pfaff wrote:
> * String data that occurs in cases is primarily treated as opaque
> octets. Even procedures like SORT CASES that could easily do
> better (by using language-specific collation rules via, e.g.,
> wcscoll()) are documented to use bytewise comparison.
> It's probably documented that way because it's easier to implement.
> It makes sense to me that SORT CASES should use the collation of the
> "data locale". Let's at least look into the implications of doing so,
> and perhaps offer it under "enhanced" mode.
> A German wanting to, say, select all cities from 'N' to 'Z' might be
> very annoyed to find that pspp omitted 'Öhringen' (where they had the
> world cup match last week).
I have thought a little about that. I have a few ideas.
First, I don't think changing the default behavior is a good
idea, because it seems like it could be a surprising change. But
I can think of a few other options:
* Add a COLLATE keyword to SORT CASES that tells it to
use proper locale-specific collation rules.
* Add a COLLATE('a','b') function to the expression
syntax and extend SORT CASES to allow an arbitrary
expression to be used.
* Add an XFRM('string') function to the expression
syntax, then document that you can sort based on
locale-specific rules using
SORT CASES BY collate.
(XFRM would be implemented via strxfrm().)
The last of those is kind of nice since you don't actually have
to change the sort algorithm at all.
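To illustrate why the XFRM approach leaves the sort algorithm alone,
here is a minimal sketch in C. The names xfrm_key() and
compare_by_key() are hypothetical, not existing PSPP functions. The
point is only that strxfrm() produces keys whose bytewise strcmp()
order matches the strcoll() order of the original strings, so a
bytewise sort on the keys gives locale-specific collation.

```c
#include <assert.h>
#include <locale.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: returns a malloc'd collation key for S in the
   current LC_COLLATE locale, or NULL if allocation fails.  The caller
   frees the result. */
char *
xfrm_key (const char *s)
{
  size_t n;
  char *key;

  assert (s != NULL);
  /* With a zero-length buffer, strxfrm() only reports the key length. */
  n = strxfrm (NULL, s, 0) + 1;
  key = malloc (n);
  if (key != NULL)
    strxfrm (key, s, n);
  return key;
}

/* Hypothetical helper: compares A and B by the bytes of their
   transformed keys.  Returns -1, 0, or 1.  For simplicity this sketch
   also returns 0 if a key cannot be allocated. */
int
compare_by_key (const char *a, const char *b)
{
  char *ka = xfrm_key (a);
  char *kb = xfrm_key (b);
  int cmp = 0;

  if (ka != NULL && kb != NULL)
    cmp = strcmp (ka, kb);      /* Plain bytewise comparison. */
  free (ka);
  free (kb);
  return (cmp > 0) - (cmp < 0);
}
```

A real implementation would presumably compute each key once per case
rather than once per comparison, which is what storing the XFRM result
in a variable before sorting would achieve.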
> * The interface to the output subsystem (that is, primarily the
> functions in output.h and tab.h) should use multibyte strings,
> for these reasons. First, strings passed to the tab_*()
> functions are often fed through gettext() along the way, so
> wide strings would be inconvenient. Second, tables can get
> very large, so wide strings would be wasteful.
> (The ASCII driver might want to change its representation of
> the page to wide strings, though, because this would be an easy
> way for it to support Asian character sets.)
> Reading from the unicode website, there are texts which suggest that
> this would not be the case. Apparently, even in "monospace fonts"
> in the general case, the number of characters is not necessarily
> proportional to the width required to render them. The advice there
> is to use multi-byte representation for all input/output operations.
Are you talking about Unicode Standard Annex #11 (East Asian
Width)? I'm aware of the need to deal with single- and
double-width characters. It would not be too hard to do, seeing
as the wcwidth() function will tell you the width of a character.
I don't think that multi-byte representation would work well for
the ASCII driver's internal representation, because it's
difficult to index a multibyte string based on the number of
(single-)character widths from the left margin, which the ASCII
driver does all the time.
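As a rough sketch of what I mean (column_to_index() is a hypothetical
name, not something in the current driver), finding the character at
a given column in a wide string is a simple loop over wcwidth()
results. Treating nonprintable characters as one column wide is an
assumption made only for this example.

```c
#define _XOPEN_SOURCE 700       /* For wcwidth(). */
#include <assert.h>
#include <stddef.h>
#include <wchar.h>

/* Hypothetical helper: returns the index in wide string S of the
   character that occupies display column COL (counting from 0 at the
   left margin), or -1 if S does not reach that column.  A
   double-width character occupies two columns, and both of them map
   to its index.  Zero-width characters occupy no column. */
long
column_to_index (const wchar_t *s, int col)
{
  size_t i;
  int x = 0;                    /* Column where s[i] starts. */

  assert (s != NULL && col >= 0);
  for (i = 0; s[i] != L'\0'; i++)
    {
      int w = wcwidth (s[i]);
      if (w < 0)
        w = 1;                  /* Assumption: nonprintable = 1 column. */
      if (col < x + w)
        return (long) i;
      x += w;
    }
  return -1;
}
```

The widths wcwidth() reports depend on the LC_CTYPE locale in effect,
so the driver would need the output locale selected when it calls
this.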
Of course, the output format of the ASCII output driver should be
> Incidentally, if the ASCII driver is going to support other character
> sets, then it might want to be changed to a more appropriate name.
Yes, "text" or "plain text" is what I have in mind.
> * Each "struct variable" is split between multibyte and wide
> strings. Variable names are used as part of syntax processing,
> so we will probably want to change "name" to a wide string.
> But the short_name has to remain as it is I think.
> * Finally, what should we pass to setlocale()? I think that we
> should select, with LC_ALL, the "output locale".
> Like you say, there's going to be a lot of locale switching going on,
> and with that comes potential for mistakes; mistakes that might
> easily go unnoticed. I suggest that we avoid direct calls to
> setlocale, and implement some wrappers.
Yes, but I want to keep locale switching to as much of a minimum
as we can. I suspect that on some systems it actually causes
libc to go out and read a locale file.
On systems that have newlocale()/uselocale()/freelocale(), we
should use those.
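A wrapper along those lines might look like the sketch below. It
assumes a POSIX.1-2008 system; with_locale() and decimal_point_char()
are hypothetical names used only for illustration. The locale object
would be created once with newlocale() and reused, so switching is
just a uselocale() call on the current thread and the global locale
is never touched.

```c
#define _POSIX_C_SOURCE 200809L /* For newlocale() and friends. */
#include <assert.h>
#include <locale.h>
#include <stddef.h>

/* Hypothetical wrapper: calls FUNC (AUX) with the calling thread's
   locale temporarily set to LOC, then restores the previous locale.
   Returns whatever FUNC returns. */
int
with_locale (locale_t loc, int (*func) (void *aux), void *aux)
{
  locale_t old;
  int result;

  assert (func != NULL);
  old = uselocale (loc);        /* Returns the previous thread locale. */
  result = func (aux);
  uselocale (old);              /* Restore it, even if it was global. */
  return result;
}

/* Example callback: returns the first byte of the decimal point
   string in whatever locale is in effect when it is called. */
int
decimal_point_char (void *aux)
{
  (void) aux;
  return (unsigned char) localeconv ()->decimal_point[0];
}
```

A caller would create the locale once, for example with
newlocale (LC_ALL_MASK, "C", (locale_t) 0), pass it to with_locale()
as often as needed, and release it with freelocale() at exit.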
> I've been wondering why pspp currently sets the LC_MONETARY category.
I don't recall. Probably, it seemed harmless, so I chose to set it.
> Another option would be to preset the CCA format based upon the lconv
> struct, and leave the DOLLAR format as is. But this would mean that
> DOLLAR is an unmitigated nuisance in countries with a non-dollar
> currency. I wonder what spss in a European locale does?
"In the PARTIES partition there is a small section called the BEER.
Prior to turning control over to the PARTIES partition,
the BIOS must measure the BEER area into PCR."
--TCPA PC Specific Implementation Specification