
Re: [pdf-devel] Text module unit tests failing


From: Aleksander Morgado
Subject: Re: [pdf-devel] Text module unit tests failing
Date: Mon, 15 Sep 2008 09:30:29 +0200
User-agent: Thunderbird 2.0.0.16 (X11/20080724)

Hi Karl,

>     The problem here is that we need to know the user's lang/country info 
>     plus the user's encoding, 
> 
> Ok, but why?

Lang/Country information is used right now in some Unicode
uppercase/lowercase conversions with special rules, mainly for Turkish
and Azeri. It can also be used to create 'PDF strings' encoded in
UTF-16BE with embedded language and country information.
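To make the Turkish/Azeri case concrete, here is a minimal sketch of language-sensitive case conversion on single code points. The names `uses_turkic_casing`, `sketch_to_upper` and `sketch_to_lower` are hypothetical and are not taken from the text module; the sketch only covers ASCII letters plus the dotted and dotless i, which is the special rule that differs for "tr" and "az".

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper (not the text module's API): nonzero when the
   two-letter language code selects the Turkish/Azeri casing rules. */
int
uses_turkic_casing (const char *lang)
{
  return lang != NULL
         && (strcmp (lang, "tr") == 0 || strcmp (lang, "az") == 0);
}

/* Uppercase one code point.  Only ASCII and the dotted/dotless i are
   handled; everything else is returned unchanged. */
uint32_t
sketch_to_upper (uint32_t cp, const char *lang)
{
  if (uses_turkic_casing (lang))
    {
      if (cp == 0x0069)  /* i -> U+0130, capital I with dot above */
        return 0x0130;
      if (cp == 0x0131)  /* U+0131, dotless i -> I */
        return 0x0049;
    }
  if (cp >= 'a' && cp <= 'z')
    return cp - ('a' - 'A');
  return cp;
}

/* Lowercase one code point, with the same limitations. */
uint32_t
sketch_to_lower (uint32_t cp, const char *lang)
{
  if (uses_turkic_casing (lang))
    {
      if (cp == 0x0049)  /* I -> U+0131, dotless i */
        return 0x0131;
      if (cp == 0x0130)  /* U+0130 -> i */
        return 0x0069;
    }
  if (cp >= 'A' && cp <= 'Z')
    return cp + ('a' - 'A');
  return cp;
}
```

Without the language code, 'i' would always become 'I', which is the wrong result for Turkish and Azeri text.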

The user's encoding is used in some text module functions, like
`pdf_text_get_host()' and `pdf_text_set_host()', to convert pdf_text_t
variables from and to that specific encoding.

> 
>     and we try to use the locale configuration for that. So, if we get C
>     or POSIX as locales, what should we do?
> 
> What are the consequences of any particular choice?  I mean, say we make
> the default be en_US for the location and US-ASCII for the encoding,
> what does that mean for users in practice?

The problem here is that PDF files (old ones, I guess) can sometimes
store PDF Names (someone correct me if I am wrong, as I read it long
ago) and other variables in the user's encoding. Even if this seems
weird, we should at least try assuming that such a string in the PDF
file is in the same encoding as the user's one, so we somehow need to
know that encoding.

In any case, I think the filesystem module also uses the user's encoding.

> 
> Also, the locale very often does not specify encoding info.  It's far
> more common to use "de_DE" than to use "de_DE.UTF-8" or whatever.  What
> do you do in that case, when the information is partial?

Well, right now, if any part of that information is missing, the text
initialization function fails completely :-|

> 
> Anyway, the point is, you can't just give up and say you don't support
> the C/POSIX locale.  That is the single most common used locale,
> especially in scripts.  I expect you are all well aware of that, but
> since the question arose ...

There are not many places where the user's encoding/lang/country are
used, at least within the text module. Most of the standard operations
of the text module should work without them, so I guess that we could
make those parameters completely optional, or just use a default
configuration if any of them is not found. That should be enough for the
internal operation of the text module, but as that information is also
accessible through the API of the module, I don't know how this problem
will be handled in other modules.
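The "default configuration" idea could look roughly like the sketch below. It is only an illustration: `struct locale_info` and `parse_locale_name` are hypothetical names, not existing text module code, and the en/US/US-ASCII defaults are just one of the candidates being discussed. It accepts names of the form `ll`, `ll_CC`, `ll_CC.ENCODING` (with an optional `@modifier`) and falls back to the defaults for C, POSIX, or anything it does not recognize.

```c
#include <string.h>

/* Hypothetical container for the three pieces of information. */
struct locale_info
{
  char lang[3];       /* e.g. "de" */
  char country[3];    /* e.g. "DE", or "" if the name did not give one */
  char encoding[32];  /* e.g. "UTF-8" */
};

/* Fill OUT from a locale name, never failing: missing parts keep the
   assumed defaults instead of aborting initialization. */
void
parse_locale_name (const char *name, struct locale_info *out)
{
  const char *p;
  size_t n;

  /* Assumed defaults; the thread has not settled on these. */
  strcpy (out->lang, "en");
  strcpy (out->country, "US");
  strcpy (out->encoding, "US-ASCII");

  if (name == NULL || *name == '\0'
      || strcmp (name, "C") == 0 || strcmp (name, "POSIX") == 0)
    return;

  /* Language: exactly two characters before '_', '.', '@' or the end. */
  n = strcspn (name, "_.@");
  if (n != 2)
    return;                   /* unrecognized form: keep all defaults */
  memcpy (out->lang, name, 2);
  out->country[0] = '\0';     /* explicit language, country not known yet */
  p = name + 2;

  /* Optional "_CC" country part. */
  if (*p == '_')
    {
      n = strcspn (p + 1, ".@");
      if (n == 2)
        {
          memcpy (out->country, p + 1, 2);
          out->country[2] = '\0';
        }
      p += 1 + n;
    }

  /* Optional ".ENCODING" part; the default stays if it is absent. */
  if (*p == '.')
    {
      n = strcspn (p + 1, "@");
      if (n > 0 && n < sizeof out->encoding)
        {
          memcpy (out->encoding, p + 1, n);
          out->encoding[n] = '\0';
        }
    }
}
```

With this, "de_DE" would yield de/DE plus the default encoding, and "C" or "POSIX" would yield the full default configuration, so initialization would never fail on partial information.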

> 
> The important thing is what the software does for users (either
> end-users or library users).  Whether tests fail or not is basically
> inconsequential.  Of course if you're testing some lang/enc-specific
> thing, it can't work in C/POSIX.  So if the test reports failure, that
> is expected, and therefore the test was successful :).  Just fix the
> test framework to take care of that.

Well, that's another point. Right now we're getting the locale equal to
'C' in the testing framework, even if the user's locale is fully
configured with lang, country and encoding. That should be changed.
Jose, did you talk to the gnulib maintainers about this?
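One thing worth checking, and this is only a guess rather than a confirmed diagnosis: every C program starts in the "C" locale until it calls `setlocale()` itself, so a test binary that never does so will see 'C' whatever the user's environment says. A minimal sketch of the check follows; `in_c_locale` and `adopt_user_locale` are hypothetical helper names, while `setlocale()` and the POSIX `nl_langinfo(CODESET)` are the standard calls.

```c
#include <langinfo.h>
#include <locale.h>
#include <string.h>

/* Hypothetical helper: nonzero if the program is currently running in
   the C or POSIX locale. */
int
in_c_locale (void)
{
  const char *cur = setlocale (LC_ALL, NULL);  /* query only */
  return cur != NULL
         && (strcmp (cur, "C") == 0 || strcmp (cur, "POSIX") == 0);
}

/* Hypothetical helper: switch to the locale named by the environment
   (LANG, LC_ALL, ...) and return its codeset name, or NULL if the
   environment names a locale that is not available. */
const char *
adopt_user_locale (void)
{
  if (setlocale (LC_ALL, "") == NULL)
    return NULL;
  return nl_langinfo (CODESET);
}
```

If the test framework reports 'C' before a call like `adopt_user_locale()` and the user's locale afterwards, the missing `setlocale (LC_ALL, "")` would be the explanation.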

What do you guys think about making those parameters somehow 'optional'
for the text module? Is it better to assume a default of en_US.UTF-8 or
en_US.US-ASCII? After Karl's hints, it seems clear that one of those two
approaches should be adopted, doesn't it?

Thanks Karl,

-Aleksander



