Re: [pdf-devel] Updated tokeniser/parser patch

pdf-devel

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [pdf-devel] Updated tokeniser/parser patch

From:	Michael Gold
Subject:	Re: [pdf-devel] Updated tokeniser/parser patch
Date:	Thu, 22 Jan 2009 16:56:17 -0500
User-agent:	Mutt/1.5.18 (2008-05-17)

On Thu, Jan 22, 2009 at 20:48:24 +0100, address@hidden wrote:
>    The next component to add would be a reader that can read
>    the xref table and resolve indirect object references.
> 
> Before to proceed with the implementation of code pertaining to the
> object layer, we need to complete other activities.

Some parts of the code won't depend too much on the public API. For
example, when opening a PDF file it will be necessary to read the
version comment, check whether it's linearized, read the trailer, etc.

I'm open to changes in the public APIs of anything I've submitted.

> I estimate that by the end of this weekend I will finish the drafts
> for:
...
> - The overall design of the object layer:
> 
>   + What modules are we going to implement, the specific role of each
>     module, and how modules collaborate to implement the public
>     interface. This is not trivial. Among other things, we will have
>     to decide how to implement the garbage collection of indirect
>     objects when saving the document, how to manage the creation of
>     new stream objects (the Acrobat Sdk uses temporary files, for
>     example),

Can you clarify what you mean by the creation of new stream objects? I
don't see what temporary files would be used for. I'd expect the writer
module to read from a pdf_stm_t to get at the filtered data (and use an
indirect integer as the stream dictionary's /Length field). Maybe there
would be a callback function the writer could use to set up that
pdf_stm_t. Of course, it may pull data from a temp file, but I'd think
that would be up to the user of the library.

>     and a large etc. Some decisions in this phase will have
>     a direct impact on the implementation of the parser, for example:
>     do we want to have a separate parser for xref tables?

That would probably be best, since it really isn't in an
operand/operator format like the other parts of the file. The tokeniser
could still be used.

>     can we use
>     the same data type (pdf_obj_t) for both the public interface and
>     to be used by the parser?, etc.

It's obvious to me that we should, since many of the type (like strings)
would end up being identical. One thing I wanted to discuss, though, was
whether it's also OK to store tokens in pdf_obj_t.

The tokeniser produces two types of objects: real objects, like strings
and integers; and tokens like '[' or keywords (such as "stream", and
"null"). I found that it simplifies things to treat them similarly -- a
keyword is implemented in exactly the same way as a name in pdf-obj.c
for example, so I was able to reuse some code and avoid defining a new
struct.

But they're not really the same thing -- maybe it would be better to
define a token structure with a type and value, where the value is a
union including a struct pdf_obj_s.

> Note also that the existing code in src/object/ is by all means
> obsolete and useless. I wrote it too quickly and in the time before we
> started to follow a more "structured" development method.

Some of the changes I made were to make it match the current coding
style, such as _new objects returning a pdf_status_t and filling in a
pointer. Also, I removed the custom array and dict implementations and
changed them to use pdf_list_t and pdf_hash_t.

For now, I removed some of the object copying, and have the container
structures owning their contents and returning references in the _get
methods. I'm not sure whether we'll want to keep this, or to manage the
memory in some other way.

One thing that would reduce memory usage would be to allocate global
objects for null, true, false, and maybe certain keywords and numbers.
pdf_obj_null_new could then just return a pointer to the constant
structure.

>    One annoying thing I noticed was that pdf_list_t needs a heap allocation
>    to use an iterator, which means that pdf_obj_equal_p could fail with an
>    ENOMEM error (but currently has no way to return that error). It would
>    be nice if the iterator could be kept on the stack -- struct
>    pdf_list_iterator_s only contains one member anyway.
> 
> Gerel, what do you think about this? We would of course loose the
> benefits of the opaque pointers in this case, but 'pdf_obj_equal_p'
> throwing PDF_ENOMEM sounds quite weird, and I think that we could make an
> exception and publish the iterator structure.

We wouldn't necessarily lose all the benefits of opaque pointers. Only
the size of the structure needs to be public in order to allocate it on
the stack. The contents don't need to be documented, and we could leave
some padding for future use as recommended in
  http://people.redhat.com/drepper/goodpractice.pdf

>    The gnulib list code will also need to be changed to return ENOMEM when
>    necessary -- currently it just calls abort() when malloc fails.
> 
> Yes, we noticed that. Would be really nice to modify the list module
> to not crash if 'xalloc_die' returns NULL (does nothing). Really, I
> think that it is the only way to use the list module in a library.
> 
> I noticed that you are active in the gnulib development mailing
> list. Would you want to raise the issue there?

Yes, I was planning to raise it once I update c-strtod.

> The (quite simple) parser for type 4 functions is implemented in
> src/base/pdf-fp-func.[ch] and it already provides detection of syntax
> errors and some run time errors.

It's somewhat buggy though. I see numerous places where it will crash on
ENOMEM, and since it uses strtod it will also fail in locales where the
decimal point isn't "." (and I think it will also accept strings like
"NAN" and numbers that use exponent notation, which shouldn't be
accepted). Using common code where possible could make this more
maintainable.

> For 'pdf_fp_func_4_new', since it is only used for type 4 functions,
> the addition of a new parameter of type 'struct
> *pdf_fp_func_4_errors_s' should do it.

Why not just return an error status (pdf_status_t)?

In general, how much detail should be provided when errors are detected?
In my code, I just return PDF_EBADFILE if I detect that the file isn't
valid, since I don't expect users of the library to be interested in
much more -- but I see a number of very specific errors in pdf-errors.h.

-- Michael

signature.asc
Description: Digital signature

[Prev in Thread]

Current Thread

[Next in Thread]

[pdf-devel] Updated tokeniser/parser patch, Michael Gold, 2009/01/21
- Re: [pdf-devel] Updated tokeniser/parser patch, jemarch, 2009/01/22
  - Re: [pdf-devel] Updated tokeniser/parser patch, Michael Gold <=
    - Re: [pdf-devel] Updated tokeniser/parser patch, jemarch, 2009/01/22
    - Re: [pdf-devel] Updated tokeniser/parser patch, Michael Gold, 2009/01/22
    - Re: [pdf-devel] Updated tokeniser/parser patch, jemarch, 2009/01/23
    - Re: [pdf-devel] Updated tokeniser/parser patch, Michael Gold, 2009/01/23
    - Re: [pdf-devel] Updated tokeniser/parser patch, jemarch, 2009/01/25
  - Re: [pdf-devel] Updated tokeniser/parser patch, gerel, 2009/01/22
    - Re: [pdf-devel] Updated tokeniser/parser patch, jemarch, 2009/01/22
    - Re: [pdf-devel] Updated tokeniser/parser patch, Michael Gold, 2009/01/22
    - Re: [pdf-devel] Updated tokeniser/parser patch, gerel, 2009/01/23
    - Re: [pdf-devel] Updated tokeniser/parser patch, jemarch, 2009/01/23

Prev by Date: Re: [pdf-devel] Updated tokeniser/parser patch
Next by Date: Re: [pdf-devel] Updated tokeniser/parser patch
Previous by thread: Re: [pdf-devel] Updated tokeniser/parser patch
Next by thread: Re: [pdf-devel] Updated tokeniser/parser patch
Index(es):
- Date
- Thread