Re: Filename Encoding
From: John Darrington
Subject: Re: Filename Encoding
Date: Wed, 11 Dec 2013 20:23:20 +0100
User-agent: Mutt/1.5.21 (2010-09-15)
On Wed, Dec 11, 2013 at 07:38:46AM -0800, Ben Pfaff wrote:
On Wed, Dec 11, 2013 at 09:05:16AM +0100, John Darrington wrote:
> On Tue, Dec 10, 2013 at 12:38:04PM -0800, Ben Pfaff wrote:
>
> I understand now. However, in other places in PSPP, and in particular
> in syntax and the output engine, we tend to convert everything we
> receive externally into UTF-8 for internal processing, and then convert
> back to other encodings as necessary. It would be convenient for some
> purposes to do this for filenames also (e.g. to include file names in
> output), and it would avoid needing to keep around two pieces of
> information (file name plus encoding) when one (UTF-8 file name) would
> do.
>
> Do you think that storing file name plus encoding is superior?
>
> Both solutions have advantages and disadvantages.
>
> The converting-all-filenames-to-utf8 solution has two disadvantages that I
> can see:
>
> *. Unnecessary recoding - often it will be necessary to convert from
> "filename encoding" to utf8 and then, back to "filename encoding".
Is the concern here about performance, or something else? I doubt that
there is a real performance problem with doing one or two conversions of
a file name, once per file open. Also, on GNU/Linux the filename
encoding is UTF-8 anyway, so there is no actual conversion.
Performance wouldn't be an issue. I was more concerned about clean code and
programming effort, the possibility of memory leaks ... and general elegance.
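To make the programming-effort point concrete, here is a rough sketch of what
one recoding step might look like with POSIX iconv(). The helper name
recode_filename is hypothetical (it is not an existing PSPP function), and the
sketch assumes stateless encodings and that four output bytes per input byte
is enough:

```c
#include <assert.h>
#include <iconv.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper, not a PSPP function: recodes the null-terminated
   string IN from encoding FROM to encoding TO.  Returns a malloc'd string
   that the caller must free, or NULL on failure. */
static char *
recode_filename (const char *to, const char *from, const char *in)
{
  assert (in != NULL);

  iconv_t cd = iconv_open (to, from);
  if (cd == (iconv_t) -1)
    return NULL;

  size_t in_left = strlen (in);
  /* Assumption: 4 output bytes per input byte is enough here. */
  size_t out_size = in_left * 4 + 1;
  char *out = malloc (out_size);
  if (out == NULL)
    {
      iconv_close (cd);
      return NULL;
    }

  char *in_p = (char *) in;   /* iconv() takes char **, not const char **. */
  char *out_p = out;
  size_t out_left = out_size - 1;
  size_t r = iconv (cd, &in_p, &in_left, &out_p, &out_left);
  iconv_close (cd);
  if (r == (size_t) -1)
    {
      free (out);             /* Easy to forget: this is the leak risk. */
      return NULL;
    }
  *out_p = '\0';
  return out;
}
```

Every caller that converts to UTF-8 and back would own two such buffers and
have to free both, which is the bookkeeping I would rather avoid.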
> *. The bigger disadvantage, is that it will be very easy simply to forget
> to do the necessary conversion. If the programmer forgets - the compiler
> won't complain - it is just a char * - Passing a struct file_handle * one
> cannot forget - there'll be a compiler error.
That's true. In data, we use uint8_t instead of char to remind
ourselves that the data is in the dictionary encoding. We could use
int8_t for UTF-8 data, but that doesn't match either libunistring or
glib practice so it would probably cause a lot of friction at
interfaces.
Like you say, I don't think we can do that trick here because of what the
libraries expect.
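
For illustration, the compiler-error idea can be sketched with distinct wrapper
structs in place of bare char *. The type and function names below are
hypothetical (the real struct file_handle holds more than this), and the
conversion is a placeholder that assumes the file system encoding is already
UTF-8:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical types, for illustration only. */
struct utf8_name { char *s; };   /* Always UTF-8. */
struct fs_name { char *s; };     /* Always in the file system encoding. */

/* Placeholder conversion.  Assumption: the file system encoding is UTF-8
   (as on typical GNU/Linux), so this only copies.  A real version would
   recode.  The caller frees the result's S member. */
static struct fs_name
utf8_name_to_fs (struct utf8_name u)
{
  assert (u.s != NULL);
  size_t n = strlen (u.s) + 1;
  struct fs_name f = { malloc (n) };
  if (f.s != NULL)
    memcpy (f.s, u.s, n);
  return f;
}

/* Accepts only a struct fs_name, never a struct utf8_name. */
static size_t
fs_name_length (struct fs_name f)
{
  return strlen (f.s);
}
```

With these types, calling fs_name_length () on a struct utf8_name is rejected
at compile time as an incompatible argument type, whereas with plain char * the
same mistake would compile silently.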
J'
--
PGP Public key ID: 1024D/2DE827B3
fingerprint = 8797 A26D 0854 2EAB 0285 A290 8A67 719C 2DE8 27B3
See http://sks-keyservers.net or any PGP keyserver for public key.