[pdf-devel] Re: Comments to the Encoded Text API


From: jemarch
Subject: [pdf-devel] Re: Comments to the Encoded Text API
Date: Tue, 15 Jan 2008 15:33:57 +0100
User-agent: Wanderlust/2.14.0 (Africa) SEMI/1.14.6 (Maruoka) FLIM/1.14.8 (Shijō) APEL/10.6 Emacs/22.0.93 (x86_64-unknown-linux-gnu) MULE/5.0 (SAKAKI)

Hi.

   I have a couple of comments after having checked the API for text
   management (http://gnupdf.org/manuals/gnupdf.html#SEC19).

Many thanks for reviewing the API.

   1. I don't think pdf_text_utf16_val_t, pdf_text_utf32_val_t,
   pdf_text_utf8_val_t and pdf_text_unicode_char_t data types are really
   required, at least in the API. When I implemented conversions to/from
   unicode I defined types like those, but finally decided to skip them
   as they were not really useful.

I agree. Please remove these data types from the Text module
architecture & API.


   2.  Country code and Language code can appear anywhere in a UTF16BE
   encoded PDF string. This means that for a given text more than one
   country code or language code can appear within the data.
   2.a) A first approach would be to define internally the country and
   language code delimiters as end of text markers, so that every
   pdf_text_t handles a single country code/language code. Functions
   involving the creation of pdf_text_t variables from UTF16BE strings
   would need a simple loop to convert the input string chunk by chunk,
   and extra parameters in the API function, something like:
     pdf_text_t pdf_text_new_from_pdf (const char *str,
                                       const pdf_size_t length,
                                       char **remaining,
                                       pdf_size_t *remaining_length);
   In this case, if (*remaining_length) is zero, the iteration will
   conclude; if not, a second call to pdf_text_new_from_pdf would be
   needed to create another pdf_text_t with the data starting in
   (*remaining). Using the same function for UTF16BE encoded strings and
   PDFDocEncoding encoded strings is not a problem: to decide whether an
   input string is encoded in UTF16BE or PDFDocEncoding, both the Byte
   Order Mark for UTF16BE (U+FEFF) and the country/language code
   delimiter (U+001B) will be used (the first one will appear at the
   start of every UTF16BE string, and the second one in any UTF16BE chunk
   after the first one if country/language information is available).
   PDFDocEncoded strings won't have any country/language code associated,
   so there won't be any need to split the input data into different
   pdf_text_t variables.

Please document 2.a) in the architecture and create the appropriate
entries in the reference manual.
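
As a rough illustration of 2.a), the chunk-by-chunk splitting could be
sketched as below. This is only a sketch under assumptions: the names
sketch_is_utf16be and sketch_next_chunk are hypothetical (they are not
part of the GNU PDF API), no pdf_text_t is created, and it assumes that
a country/language escape sequence opens and closes with U+001B and
occupies an even number of bytes.

```c
#include <stddef.h>

/* Hypothetical helper: offset (in bytes) of the first U+001B code unit
   at or after `from`, or `length` if there is none.  The buffer is
   scanned two bytes at a time, so `from` must be even. */
size_t sketch_find_escape (const unsigned char *s, size_t length, size_t from)
{
  size_t i;

  for (i = from; i + 1 < length; i += 2)
    if (s[i] == 0x00 && s[i + 1] == 0x1B)
      return i;
  return length;
}

/* Hypothetical check: a buffer is treated as UTF-16BE if it starts with
   the BOM (FE FF) or with the U+001B delimiter (00 1B). */
int sketch_is_utf16be (const unsigned char *s, size_t length)
{
  if (length < 2)
    return 0;
  return (s[0] == 0xFE && s[1] == 0xFF) || (s[0] == 0x00 && s[1] == 0x1B);
}

/* Hypothetical stand-in for the splitting step of pdf_text_new_from_pdf:
   returns the length in bytes of the first chunk of `str`, and stores
   the start and length of the unprocessed data in the out parameters. */
size_t sketch_next_chunk (const unsigned char *str, size_t length,
                          const unsigned char **remaining,
                          size_t *remaining_length)
{
  size_t start = 0;
  size_t end;

  /* A chunk after the first one opens with an escape sequence (assumed
     to be U+001B, codes, U+001B): skip past its closing delimiter. */
  if (length >= 2 && str[0] == 0x00 && str[1] == 0x1B)
    {
      size_t close = sketch_find_escape (str, length, 2);
      start = (close < length) ? close + 2 : length;
    }

  /* The chunk ends where the next escape sequence begins. */
  end = sketch_find_escape (str, length, start);
  *remaining = str + end;
  *remaining_length = length - end;
  return end;
}
```

A caller would loop, calling sketch_next_chunk on the remaining data
until (*remaining_length) is zero, as in the proposed
pdf_text_new_from_pdf.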

   3. I see the need for an extra parameter specifying the length of the
   data array given as input or output in the following functions:
    * pdf_text_new_from_unicode (length of input data array is needed, as
   UTF encodings can have NUL bytes within the string).
    * pdf_text_get_host (length of output data array is needed, as this
   function can involve UTF encodings with NUL bytes within the string)
    * pdf_text_get_unicode (length of output data array is needed, as UTF
   encodings can have NUL bytes within the string)
    * pdf_text_set_host (length of input data array is needed, as this
   function can involve UTF encodings with NUL bytes within the string)
    * pdf_text_set_unicode (length of input data array is needed, as UTF
   encodings can have NUL bytes within the string)

   4. In the same way, size doesn't seem to be needed in
   pdf_text_set_pdf, as PDFDocEncoding should not have any NUL byte
   other than the end-of-string marker.

Please add the missing parameters.
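
To make the NUL byte problem from point 3 concrete, here is a minimal
sketch. All names in it are hypothetical and only for illustration; it
does not use the GNU PDF API. It shows that a length-less function would
have to rely on something like strlen, which stops at the first NUL byte
of UTF-16BE data, while an explicit length gives the real size.

```c
#include <stddef.h>
#include <string.h>

/* "Hi" encoded as UTF-16BE without BOM: every ASCII character gets a
   0x00 high byte, so the data starts with a NUL byte. */
const char sketch_hi_utf16be[4] = { 0x00, 'H', 0x00, 'i' };

/* What a function without a length parameter would be forced to do:
   guess the size by looking for a terminating NUL byte. */
size_t sketch_length_guessed (const char *data)
{
  return strlen (data);
}

/* With an explicit length in bytes the number of UTF-16 code units is
   known regardless of the NUL bytes inside the data. */
size_t sketch_utf16_units (const char *data, size_t length)
{
  (void) data;                  /* the contents are not inspected */
  return length / 2;
}
```

For sketch_hi_utf16be the guessed length is 0 bytes, although the array
holds 4 bytes (2 code units).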

   5. pdf_text_get_best_encoding function will need specific system
   functions to get the range of unicode covered by each host encoding,
   and if no such function is available in a given operating system, a
   default unicode encoding will be returned.

Remember that this function should return an encoding _actually
supported_ by the host. If the host supports a Unicode encoding then
it will always be the best encoding available. If that is not the case,
the function cannot return a Unicode encoding.

I think it would be good to investigate the availability of the
functions you need to determine the range covered by a given host
encoding in Unix, GNU, Mac OS X and Windows. We need to determine the
allowed values for pdf_text_host_encoding_t anyway (an email about this
follows).

   6. In function pdf_text_new_from_u32, I think the comment about
   leading zeros is useless. If leading zeros are included in an integer
   literal the compiler will assume that the value is given in octal,
   not decimal, so this may be confusing.

Ok. Please remove that comment.
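
For reference, the octal pitfall from point 6 in a few lines of C. The
variable names are hypothetical and only serve the illustration.

```c
#include <stdint.h>

/* Leading zeros are harmless after the 0x prefix: this is U+0041. */
const uint32_t sketch_hex_padded = 0x00000041;  /* 65 in decimal */

/* Without the prefix a leading zero makes the constant octal. */
const uint32_t sketch_octal_trap = 00000041;    /* 33 in decimal */
```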

   7. An additional function like pdf_text_clear(pdf_text_t text) is
   needed to free any allocated memory in the variable initializations,
   to really treat pdf_text_t as a black box.

That function should be called `pdf_text_destroy (pdf_text_t text)'.
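
A minimal sketch of the create/destroy pairing for an opaque type
follows. The sketch_text_* names and the structure layout are
assumptions made up for this example; the real pdf_text_t internals are
not defined in this thread.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical internals; clients would only see the pointer type. */
struct sketch_text_s
{
  unsigned char *data;
  size_t size;
};
typedef struct sketch_text_s *sketch_text_t;

/* Allocates a new text variable holding a private copy of the input.
   Returns NULL if memory cannot be allocated. */
sketch_text_t sketch_text_new (const unsigned char *data, size_t size)
{
  sketch_text_t text = malloc (sizeof *text);

  if (text == NULL)
    return NULL;
  text->data = NULL;
  text->size = 0;
  if (size > 0)
    {
      text->data = malloc (size);
      if (text->data == NULL)
        {
          free (text);
          return NULL;
        }
      memcpy (text->data, data, size);
      text->size = size;
    }
  return text;
}

/* Releases everything owned by the variable, so the caller never needs
   to know what was allocated inside it.  NULL is accepted. */
void sketch_text_destroy (sketch_text_t text)
{
  if (text != NULL)
    {
      free (text->data);
      free (text);
    }
}
```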


Thanks.



