Re: status of UTF-8 support?
From: John Darrington
Subject: Re: status of UTF-8 support?
Date: Tue, 26 Oct 2010 15:46:49 +0000
User-agent: Mutt/1.5.18 (2008-05-17)
On Tue, Oct 26, 2010 at 10:40:29AM +0000, John Darrington wrote:
On Mon, Oct 25, 2010 at 07:51:56PM -0700, Ben Pfaff wrote:
Rob Messer <address@hidden> writes:
> What is the current status of support for including UTF-8 characters
> in PSPP output? My company is using the Perl interface to import
> survey data into PSPP, and generally it works very well. However,
> we've never been able to use it when our dataset includes labels and
> records in languages like Japanese and Chinese. I know there have
> been some recent updates to PSPP, so last week we upgraded to 0.7.5
> and tried that, but it still didn't seem to work for our test Japanese
> and Chinese data. Is it supposed to be supported? And if not in
> 0.7.5, perhaps in the latest development snapshot? Thanks,
John Darrington and I talked about this briefly in IRC this
morning. We didn't know a reason that UTF-8 shouldn't work.
I had another look today and have to modify my opinion. Currently, non-ascii
characters will not work with the perl module. :(
OK. I've just pushed a quick fix which should address this problem. I tested
this new version by writing UTF-8 strings in:
Variable Names;
Variable Labels;
Value Labels (both the key and the value);
Values of string variables.
So now, assuming you have a string variable defined, you can write a string
value using a literal UTF-8 string like:
# German word for "Cylindrical concrete billboard"
$sysfile->append_case (["Litfaßsäule"]);
or using escape sequences like:
# The Chinese representation of the name of the city of Taipei
$sysfile->append_case ( ["\x{53F0}\x{5317}"]);
However, in most real-life uses, I imagine you will not be using string literals,
but will be receiving the data from some other Perl module. In this case, what
needs to be done is:
use Encode;
$s = get_string_data_from_some_source ();
$enc = get_encoding_of_string_data ();
$sysfile->append_case ([decode ($enc, $s)]);
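To make the decode step concrete, here is a minimal sketch that uses only the core Encode module. The two byte strings are hypothetical stand-ins for what get_string_data_from_some_source () might return, and the encoding names are assumptions about where such data came from. The append_case call is left commented out because it needs an open $sysfile as in the snippet above.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical input: "Litfaßsäule" as raw ISO-8859-1 bytes.
my $latin1_bytes = "Litfa\xDFs\xE4ule";

# Hypothetical input: the Chinese name of Taipei as raw UTF-8 bytes.
my $utf8_bytes = "\xE5\x8F\xB0\xE5\x8C\x97";

# decode() turns bytes in a named encoding into a Perl character string.
my $german  = decode('ISO-8859-1', $latin1_bytes);
my $chinese = decode('UTF-8', $utf8_bytes);

# After decoding, length() counts characters rather than bytes.
printf "german:  %d characters\n", length $german;
printf "chinese: %d characters\n", length $chinese;

# With an open system file, the decoded strings would then be written as:
# $sysfile->append_case ([$german]);
# $sysfile->append_case ([$chinese]);
```

The point of the sketch is that the source encoding has to be known and named explicitly; the decoded value is the same character string whichever encoding the bytes arrived in.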
As always with i18n, things are never without caveats. In particular:
* You must remember that a variable's "width" is the maximum number of BYTES
(not characters).
* For rather convoluted reasons (you need to read "man Encode" in order to
understand them), the code ...
use utf8;
use Encode;
$sysfile->append_case ([decode ('UTF-8', "some-utf8-encoded-string")]);
.... won't work. Instead, you would have to write:
$sysfile->append_case ([decode ('UTF-8', encode ('UTF-8',
"some-utf8-encoded-string"))]);
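Both caveats can be checked without PSPP at all. The following is a minimal sketch using only the core Encode module. It assumes the string will be stored as UTF-8, and it uses the Taipei example from above as a string that is already a Perl character string, which is the situation a literal under "use utf8" would be in. The variable names are illustrative only.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# A character string (not bytes): the two characters of the Taipei example.
my $taipei = "\x{53F0}\x{5317}";

# Caveat 1: the variable width must cover the encoded BYTES, not the characters.
# Assumption: the value is stored as UTF-8.
my $chars = length $taipei;
my $bytes = length encode('UTF-8', $taipei);
print "$chars characters, $bytes bytes\n";    # prints "2 characters, 6 bytes"

# Caveat 2: decode() expects bytes. Handing it a string that already holds
# characters above U+00FF makes Encode die, so the call is wrapped in eval.
my $direct        = eval { decode('UTF-8', $taipei) };
my $direct_failed = defined $direct ? 0 : 1;
print "direct decode failed\n" if $direct_failed;

# The workaround from the message: encode to bytes first, then decode again.
my $round_trip = decode('UTF-8', encode('UTF-8', $taipei));
print "round trip preserved the string\n" if $round_trip eq $taipei;
```

So for this example a string variable of width 2 would be too narrow; under the UTF-8 assumption it needs a width of at least 6.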
I haven't had a chance to look at reading non-ASCII data from a .sav file into Perl.
J'
--
PGP Public key ID: 1024D/2DE827B3
fingerprint = 8797 A26D 0854 2EAB 0285 A290 8A67 719C 2DE8 27B3
See http://pgp.mit.edu or any PGP keyserver for public key.