[pdf-devel] Re: Comments to the Encoded Text API


From: jemarch
Subject: [pdf-devel] Re: Comments to the Encoded Text API
Date: Sun, 20 Jan 2008 17:13:47 +0100
User-agent: Wanderlust/2.14.0 (Africa) SEMI/1.14.6 (Maruoka) FLIM/1.14.8 (Shijō) APEL/10.6 Emacs/23.0.50 (i686-pc-linux-gnu) MULE/5.0 (SAKAKI)

   >    5. The pdf_text_get_best_encoding function will need specific
   >    system functions to get the range of Unicode covered by each host
   >    encoding, and if no such function is available in a given
   >    operating system, a default Unicode encoding will be returned.
   >
   > Remember that this function should return an encoding _actually
   > supported_ by the host. If the host supports a Unicode encoding then
   > it will always be the best encoding available. If that is not the
   > case, the function cannot return a Unicode encoding.
   >
   > I think it would be good to investigate the availability of the
   > functions you need to determine the range covered by a given host
   > encoding in Unix, GNU, Mac OS X and Windows (we need to determine
   > the allowed values for pdf_text_host_encoding_t anyway. An email
   > about this follows).

   I don't really understand why someone would want to create a
   pdf_text_t in a host encoding different from the one currently used
   by the user/system. Is this really needed? I am talking about the
   functions `pdf_text_new_from_host', `pdf_text_get_best_encoding'
   and so on, where a specific host encoding is passed to or returned
   by the function. Shouldn't the function detect which host encoding
   the system is using and just use it? AFAIK, host encodings are used
   only to receive strings from the user and to send strings back to
   the user (not to store anything in the PDF file, at least not in the
   user's encoding), and the user expects strings in a single host
   encoding, which could even be detected once. Am I right?

Note that we are using Unicode encodings to store the text strings
internally. That means we cover the entire 31-bit wide Unicode space.

Now suppose you are using a GNU system. GNU systems support Unicode
encodings, but suppose that your current locale uses a Latin-1
encoding. If your pdf_text_t variable contains Chinese text, for
example, you will need to use a host encoding able to encode Chinese
characters, if one is available.
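
For instance, a quick illustration (hypothetical example code, not
pdf_text code) of why a Latin-1 locale cannot supply the host encoding
for such a string; iconv simply refuses the conversion:

#include <iconv.h>
#include <errno.h>
#include <stdio.h>

int
main (void)
{
  /* Two Chinese characters encoded in UTF-8.  */
  char in[] = "\xE4\xB8\xAD\xE6\x96\x87";
  char out[16];
  char *inp = in, *outp = out;
  size_t inleft = sizeof (in) - 1, outleft = sizeof (out);

  iconv_t cd = iconv_open ("ISO-8859-1", "UTF-8");
  if (cd == (iconv_t) -1)
    return 1;

  /* Latin-1 has no code points for these characters, so the
     conversion fails with EILSEQ.  */
  if (iconv (cd, &inp, &inleft, &outp, &outleft) == (size_t) -1
      && errno == EILSEQ)
    printf ("Latin-1 cannot represent this text\n");

  iconv_close (cd);
  return 0;
}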

   The easiest way to handle these host encoding conversions in
   GNU/Linux is the wchar_t type and the multi-byte functions. The
   problem is that there is no way to get conversions to/from encodings
   different from the one specified in the user's locale. To get those
   other conversions, either the locale would have to be changed at
   runtime (not a good idea) or other utilities like GNU libiconv would
   have to be used explicitly.

Hmm. It is not a problem to use GNU libiconv if we are running on a
GNU system, but it would be a problem when running in a POSIX
environment without the GNU libraries installed.
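
On such a system the only portable tools are the multi-byte functions,
and those are hard-wired to the current locale. A minimal sketch of
that limitation (the string content here is just an assumption):

#include <locale.h>
#include <stdlib.h>
#include <wchar.h>

int
main (void)
{
  /* Whatever encoding this selects is the only one mbstowcs can
     decode; there is no parameter to ask for a different one.  */
  setlocale (LC_CTYPE, "");

  const char *host_str = "text in the locale's own encoding";
  wchar_t wide[128];

  size_t n = mbstowcs (wide, host_str, 128);
  if (n == (size_t) -1)
    return 1;  /* Not valid in the current locale's encoding.  */

  return 0;
}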

   Why not just detect the host encoding once, when the program starts,
   and use that single encoding in all the operations involving host
   encodings (get/set)? That would make it possible to use wchar_t and
   the multi-byte functions, with no need to call iconv.

   The approach given in the Text Encoding API is quite similar to the
   way things are done on Windows, where you first have to ask for the
   specific ANSI code page being used (GetACP) and then use that
   identifier in the MultiByteToWideChar or WideCharToMultiByte
   functions. The equivalent two-step approach in GNU/Linux would use
   nl_langinfo (to get the name of the encoding set in the locale) and
   GNU libiconv for the conversions, so it's possible, but I'm not sure
   it's really needed.
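
For reference, that two-step approach would look roughly like this (a
sketch only; the "UTF-32LE" internal encoding is just an assumption
for the example):

#include <iconv.h>
#include <langinfo.h>
#include <locale.h>
#include <stdio.h>

int
main (void)
{
  setlocale (LC_CTYPE, "");

  /* Step 1: get the name of the locale's encoding (like GetACP).  */
  const char *host_enc = nl_langinfo (CODESET);
  printf ("host encoding: %s\n", host_enc);

  /* Step 2: open a converter from that encoding to the internal one
     (like MultiByteToWideChar).  */
  iconv_t cd = iconv_open ("UTF-32LE", host_enc);
  if (cd == (iconv_t) -1)
    return 1;

  /* ... calls to iconv () would perform the actual conversions ...  */
  iconv_close (cd);
  return 0;
}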

But that again relies on libiconv, so the problem remains for POSIX
systems, doesn't it? What do people think about this?



