pspp-dev

Re: i18n


From: Ben Pfaff
Subject: Re: i18n
Date: Sun, 19 Mar 2006 20:48:32 -0800
User-agent: Gnus/5.110004 (No Gnus v0.4) Emacs/21.4 (gnu/linux)

John Darrington <address@hidden> writes:

> On Sun, Mar 19, 2006 at 05:26:47PM -0800, Ben Pfaff wrote:
>      >      
>      >      I don't know about the unac library.  What are its advantages
>      >      over iconv?
>      >
>      > Iconv is only useful if we know the source encoding. If we don't know
>      > it we have to guess.  If we guess it wrong, then iconv will fail.
>      > Also, it won't convert between encodings where data would be lost.
>      > Unac on the other hand is a (more) robust but lossy thing.  For
>      > example, given character 0xe1 (acute a) in iso-8859-1 it'll convert
>      > to 'a' in ascii.  I don't know how it would handle converting from
>      > Japanese characters to ascii....
>      
>      I do not understand how unac could remove accents from text
>      without knowing the source encoding.  I don't see any indication
>      that it can do so, now that I have read the unac manpage from the
>      webpage you pointed out.  In fact, the first argument to the
>      unac_string() function is the name of the source encoding, and
>      unac is documented to use iconv internally to convert to UTF-16.
>      
>      (Why would we want to remove accents, by the way?)
>
> Ideally we wouldn't.  I've only looked very briefly at the unac web
> page.  As I understood it, it was supposed to convert a string from an
> arbitrary encoding into a reasonable approximation of that string
> which could be represented in plain ascii.  Perhaps I need to read
> the web page more closely.

I think you should.  As I read it, unac can remove accents from a
string in an arbitrary encoding, but only as long as you can tell
it which encoding that is, and only as long as your system's iconv
can convert between that encoding and UTF-16.  Once it has the string
in UTF-16, removing accents is straightforward, although not
trivial, because the Unicode standard comes with a data table
that explains how to do it.
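
Something like this is what I gather a unac call would look like (an
untested sketch; the signature is from the man page, and I'm assuming
the output comes back malloc'd in the source charset, so double-check
both):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unac.h>

int
main (void)
{
  const char in[] = "\xe1";     /* "acute a" in ISO-8859-1. */
  char *out = NULL;             /* unac allocates; caller frees. */
  size_t out_len = 0;

  /* Internally this should iconv() IN to UTF-16, strip accents
     using the Unicode decomposition data, and convert back. */
  if (unac_string ("ISO-8859-1", in, strlen (in), &out, &out_len) != 0)
    {
      perror ("unac_string");
      return 1;
    }
  printf ("%.*s\n", (int) out_len, out);        /* Expect "a". */
  free (out);
  return 0;
}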

> I think there is no ideal solution to this problem.  

Definitely agree.

> Your proposal might be as good as any other and certainly is
> simpler than what I had suggested.  However, I'm worried about
> what happens if our assumption at (ii) turns out to be wrong.
> We need to ensure some sensible behaviour (hence my idea
> of unac).

(ii) is: "All string data in all casefiles and dictionaries is in
the PSPP locale, or at least we make that assumption."

Sure: we need to be able to convert from whatever encoding it's
actually in to the PSPP locale encoding.  But we can't do that
without knowing (or guessing) the encoding it started in.
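
In code, the conversion step itself is the easy part, something like
this (a sketch; I'm hard-coding ISO-8859-1 as the guessed source
encoding and skipping the usual iconv reset and E2BIG loop):

#include <iconv.h>
#include <langinfo.h>
#include <locale.h>
#include <stdio.h>

int
main (void)
{
  setlocale (LC_ALL, "");

  char in[] = "\xe1";           /* "acute a" in ISO-8859-1. */
  char out[64];
  char *inp = in, *outp = out;
  size_t in_left = 1, out_left = sizeof out - 1;

  /* nl_langinfo(CODESET) names the locale's encoding, e.g. "UTF-8". */
  iconv_t cd = iconv_open (nl_langinfo (CODESET), "ISO-8859-1");
  if (cd == (iconv_t) -1)
    {
      perror ("iconv_open");
      return 1;
    }
  if (iconv (cd, &inp, &in_left, &outp, &out_left) == (size_t) -1)
    perror ("iconv");           /* EILSEQ here means we guessed wrong. */
  *outp = '\0';
  printf ("%s\n", out);
  iconv_close (cd);
  return 0;
}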

I don't know how to guess an encoding.  There does seem to be
some small amount of work on this, based on the results of a web
search for "guess encoding".
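
The only cheap heuristic I can think of offhand is to check whether
the data is well-formed UTF-8, since random non-UTF-8 bytes rarely
are.  A sketch of the idea, abusing iconv as a validator:

#include <errno.h>
#include <iconv.h>
#include <stdbool.h>
#include <stddef.h>

/* Returns true if the LEN bytes at S are well-formed UTF-8, by asking
   iconv to "convert" them from UTF-8 to UTF-8.  Only a heuristic:
   plenty of 8-bit data happens to be well-formed UTF-8 too. */
bool
looks_like_utf8 (const char *s, size_t len)
{
  iconv_t cd = iconv_open ("UTF-8", "UTF-8");
  if (cd == (iconv_t) -1)
    return false;

  char buf[256];
  char *inp = (char *) s;       /* iconv wants a non-const pointer. */
  bool ok = true;
  while (len > 0)
    {
      char *outp = buf;
      size_t out_left = sizeof buf;
      if (iconv (cd, &inp, &len, &outp, &out_left) == (size_t) -1
          && errno != E2BIG)    /* E2BIG just means "refill BUF". */
        {
          ok = false;           /* EILSEQ or EINVAL: not UTF-8. */
          break;
        }
    }
  iconv_close (cd);
  return ok;
}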

It's a separate issue that, once we have a source encoding, the
conversion from it to the target encoding might be lossy.
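
glibc at least lets you opt into a lossy "best fit" by appending
//TRANSLIT to the target encoding, which gets your 0xe1 -> 'a'
behaviour out of iconv itself (glibc-specific, so just a sketch):

#include <iconv.h>
#include <stdio.h>

int
main (void)
{
  char in[] = "\xe1", out[16];
  char *inp = in, *outp = out;
  size_t in_left = 1, out_left = sizeof out - 1;

  /* Without //TRANSLIT this conversion would fail outright. */
  iconv_t cd = iconv_open ("ASCII//TRANSLIT", "ISO-8859-1");
  if (cd == (iconv_t) -1)
    return 1;
  iconv (cd, &inp, &in_left, &outp, &out_left);
  *outp = '\0';
  printf ("%s\n", out);         /* Prints "a" on glibc. */
  iconv_close (cd);
  return 0;
}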

> Regarding (vi), I don't think SPSS would complain (at least not
> loudly) about unrecognised records.  But all hell might break loose if
> we commandeered an unused record type for this purpose, and a later
> version of SPSS chose to use it for another purpose.

Yes, it would be a calculated risk that we might not want to take.

> Incidentally, SPSS V14 writes system files with a Type 7, Subtype 16
> record.  I haven't been able to determine the purpose of this record.
> Perhaps it specifies the encoding?

Do you have any specimens?  I don't have SPSS v14 to experiment
on.
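
In the meantime, the generic type 7 layout at least makes the record
skippable.  Here's a sketch of what I mean, assuming the layout our
reader already uses for other subtypes (four int32s, then size*count
bytes of data) and native byte order:

#include <stdint.h>
#include <stdio.h>

/* Reads a type 7 extension record header from F and skips its
   payload.  Assumes native byte order; a real reader must honor
   the byte order the file declares. */
int
skip_type_7 (FILE *f)
{
  int32_t h[4];                 /* rec_type (7), subtype, size, count. */
  if (fread (h, sizeof h[0], 4, f) != 4 || h[0] != 7)
    return -1;
  fprintf (stderr, "type 7, subtype %d: %d elements of %d bytes\n",
           (int) h[1], (int) h[3], (int) h[2]);
  /* A subtype 16 record from SPSS v14 would land here, unparsed. */
  return fseek (f, (long) h[2] * (long) h[3], SEEK_CUR);
}
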
-- 
"To the engineer, the world is a toy box full of sub-optimized and
 feature-poor toys."
--Scott Adams



