From: Joel E. Denny
Subject: Re: several messages
Date: Mon, 7 Sep 2009 19:42:17 -0400 (EDT)
User-agent: Alpine 1.00 (DEB 882 2007-12-20)

Hi Akim!

On Mon, 7 Sep 2009, Akim Demaille wrote:

> On Sep 3, 2009, at 07:35, Joel E. Denny wrote:

> > By compilation, you mean gcc?  Or is bison slow?
> 
> I meant bison.

Bummer.  But I have noticed canonical LR can be a little slower to 
generate.

> Here are the figures:

Thanks for all that!

> My grammar has:
> - 185 rules
> - 104 terminals
> - 58 nonterminals

How many S/R and R/R conflicts?  I'm also interested in conflicts that are 
resolved by prec/assoc declarations.

> lalr-all
> 
> Execution times (seconds)
>  LALR(1)               :   0.01 ( 2%) usr   0.00 ( 0%) sys   0.00 ( 0%) wall
>  outputing parser      :   0.01 ( 2%) usr   0.00 ( 0%) sys   0.00 ( 0%) wall
>  running m4            :   0.47 (96%) usr   0.00 ( 0%) sys   0.00 ( 0%) wall
>  TOTAL                 :   0.49             0.00             0.00
> Bison 0.55s real  0.50s user  0.02s system
> G++   5.99s real  5.34s user  0.60s system
> state 382
> 164K /tmp/ugrammar.cc
>  52K /tmp/ugrammar.hh
> 836K /tmp/ugrammar.o
> 672K /tmp/ugrammar.output

> ielr-all
> 
> Execution times (seconds)
>  LALR(1)               :   0.01 ( 2%) usr   0.00 ( 0%) sys   0.00 ( 0%) wall
>  IELR(1) Phase 2       :   0.03 ( 6%) usr   0.00 ( 0%) sys   0.00 ( 0%) wall
>  IELR(1) Phase 3       :   0.02 ( 4%) usr   0.00 ( 0%) sys   0.00 ( 0%) wall
>  outputing parser      :   0.02 ( 4%) usr   0.00 ( 0%) sys   0.00 ( 0%) wall
>  running m4            :   0.46 (85%) usr   0.00 ( 0%) sys   0.00 ( 0%) wall
>  TOTAL                 :   0.54             0.00             0.00
> Bison 0.56s real  0.55s user  0.01s system
> G++   5.98s real  5.34s user  0.59s system
> state 382
> 164K /tmp/ugrammar.cc
>  52K /tmp/ugrammar.hh
> 836K /tmp/ugrammar.o
> 672K /tmp/ugrammar.output

It's good to see that the IELR computation time is barely worse than
LALR's.  Because IELR didn't split any states, the output files are of course
exactly the same.  However, that also means there may not have been much 
for IELR to do, and so the timings may not reveal much about IELR's 
complexity.  The number of conflicts would provide some insight here... 
but maybe I'm getting off topic a little.

(By the way, I've been meaning to add a test case for Bison that makes 
sure the LALR and IELR parser tables are exactly the same for 
parse-gram.y.  If they ever become different, we'll need to fix 
parse-gram.y or start using IELR.)

> canonical-lr-all
> 
> Execution times (seconds)
>  IELR(1) Phase 3       :   0.33 (10%) usr   0.01 (10%) sys   0.00 ( 0%) wall
>  IELR(1) Phase 4       :   0.01 ( 0%) usr   0.00 ( 0%) sys   0.00 ( 0%) wall
>  conflicts             :   0.01 ( 0%) usr   0.00 ( 0%) sys   0.00 ( 0%) wall
>  parser action tables  :   2.19 (65%) usr   0.01 (10%) sys 128.00 (100%) wall
>  outputing parser      :   0.16 ( 5%) usr   0.02 (20%) sys   0.00 ( 0%) wall
>  running m4            :   0.65 (19%) usr   0.05 (50%) sys   0.00 ( 0%) wall
>  freeing               :   0.01 ( 0%) usr   0.01 (10%) sys   0.00 ( 0%) wall
>  TOTAL                 :   3.36             0.10           128.00
> Bison 3.49s real  3.37s user  0.11s system
> G++   7.85s real  6.76s user  0.76s system
> state 12941
> 2.7M /tmp/ugrammar.cc
>  52K /tmp/ugrammar.hh
> 1.6M /tmp/ugrammar.o
>  25M /tmp/ugrammar.output

I'm glad to see that most of the extra time does not fall in the original 
parser table computation.  It just takes a really long time to compress 
and transform those huge tables into the output code.

> canonical-lr-consistent
> 
> Execution times (seconds)
>  IELR(1) Phase 3       :   0.36 ( 2%) usr   0.01 ( 4%) sys   0.00 ( 0%) wall
>  IELR(1) Phase 4       :   0.01 ( 0%) usr   0.00 ( 0%) sys   0.00 ( 0%) wall
>  conflicts             :   0.01 ( 0%) usr   0.00 ( 0%) sys   0.00 ( 0%) wall
>  parser action tables  :  16.38 (91%) usr   0.05 (18%) sys   0.00 ( 0%) wall
>  outputing parser      :   0.34 ( 2%) usr   0.08 (29%) sys   0.00 ( 0%) wall
>  running m4            :   0.89 ( 5%) usr   0.13 (46%) sys   0.00 ( 0%) wall
>  freeing               :   0.02 ( 0%) usr   0.01 ( 4%) sys   0.00 ( 0%) wall
>  TOTAL                 :  18.01             0.28             0.00
> Bison 18.64s real  18.02s user  0.30s system
> G++   10.63s real  9.15s user  0.96s system
> state 12941
> 5.6M /tmp/ugrammar.cc
>  52K /tmp/ugrammar.hh
> 2.3M /tmp/ugrammar.o
>  27M /tmp/ugrammar.output

Wow.  What a jump.  I can see why you don't want this.  I have to admit 
that I haven't spent much time exploring the combination of canonical-lr 
and different values of lr.default-reductions.  Maybe it's bad that I 
defaulted it to "accepting"?  I just didn't want anyone to argue that they 
specified canonical LR but didn't get it.

> canonical-lr-accepting
> 
> Execution times (seconds)
>  reader                :   0.01 ( 0%) usr   0.00 ( 0%) sys   0.00 ( 0%) wall
>  IELR(1) Phase 3       :   0.37 ( 1%) usr   0.01 ( 2%) sys   0.00 ( 0%) wall
>  IELR(1) Phase 4       :   0.01 ( 0%) usr   0.00 ( 0%) sys   0.00 ( 0%) wall
>  conflicts             :   0.01 ( 0%) usr   0.00 ( 0%) sys   0.00 ( 0%) wall
>  parser action tables  :  46.08 (95%) usr   0.21 (34%) sys   0.00 ( 0%) wall
>  outputing parser      :   0.61 ( 1%) usr   0.12 (20%) sys   0.00 ( 0%) wall
>  running m4            :   1.28 ( 3%) usr   0.26 (43%) sys   0.00 ( 0%) wall
>  freeing               :   0.01 ( 0%) usr   0.01 ( 2%) sys   0.00 ( 0%) wall
>  TOTAL                 :  48.38             0.61             0.00
> Bison 49.65s real  48.39s user  0.63s system
> G++   13.04s real  11.08s user  1.03s system
> state 12941
> 8.5M /tmp/ugrammar.cc
>  52K /tmp/ugrammar.hh
> 3.2M /tmp/ugrammar.o
>  31M /tmp/ugrammar.output

Ouch.  Well, more motivation to steer clear of canonical LR.

> It would be nice to extend the *.output file with various metrics  
> about the grammar.

Agreed.

> >> The syntax error on "1 1" behaves as expected with both accepting and
> >> consistent (i.e., mute since there are too many possibilities ;).
> >
> > If you could unmute it (by adding a few more possibilities in the
> > skeleton), it would be interesting to see if canonical LR gives the  
> > same
> > expected tokens list.

I was interested in extending it temporarily to compare the expected 
tokens for your specific grammar.  Of course, I also would like to see it 
fixed in general....

> One way out of the fixed arity would be to use "note: " lines, as Alex
> suggested (and I'm slowly changing my mind in this regard).

I was thinking we were both only opposed to dropping the warning label.  
I like the idea of a special function for formatting submessages.  I'm not 
so sure about the "note: " prefix though.

However, in the generated parser rather than in bison itself, strings fed to
yyerror currently don't contain newlines.  I'm not sure how important or 
how long-standing that convention has been.  If we move to a %error-report 
or something like that, that might be a good opportunity to break the 
convention... or to let the user take control.

> >> So I will proceed with default-reductions=all on space-starving
> >> targets (where I use simple error messages anyway, to get rid of
> >> the tables), and "accepting" otherwise.
> >
> > Why not "consistent"?  For error messages, I don't know of any  
> > reason to
> > expect either to typically be better than the other.  But "consistent"
> > produces the traditional Yacc behavior (actually defined in POSIX)  
> > of not
> > requesting a token from yylex until necessary.  This can affect  
> > semantic
> > actions.
> 
> I don't see value in this feature in my case.  I guess it is valuable
> in languages like C, where the token kind of an identifier depends on
> the context.

I can understand that.  Not every parser needs it.

> Wow.  You do have a collection of interesting grammars at hand :)

Well, I came up with those toy grammars for that email.  I didn't mean to 
imply that they were in any way practical or typical.  They just 
demonstrate the existence of a flaw, one that might later prove to affect 
the real world.  But maybe I'm reading too much into what you're saying.  :)

> > My suspicion would have been that disabling default reductions almost
> > always does more good than harm, but I think we need more exploration.
> 
> So if we can't trust static tables, should we actually try each  
> possible token one after the other, and see if one can be shifted at  
> some point?  Of course without touching the stack, that can be quite a  
> pain to write :(  But it should not be really different from our error- 
> recovery code.

You're reading my mind.  :)  I think we've realized for a while that 
%error-verbose is not perfect.  However, soon after our current discussion 
started, the extent of the problems began to sink in for me, and I 
realized the implications for my own work, so I started working on a 
solution very similar to what you describe.  I'm nearly done with the 
coding.  I might be able to push a candidate soon.

However, there's one problem with your suggestion.  By the time the syntax 
error is reported, it's too late.  The parser may have already performed 
some default reductions (or otherwise invalid reductions due to state 
merging) and thus lost the original context recorded on the stack.  The 
stack popping that error recovery does is not necessarily sufficient to 
restore the original stack because reductions may have removed parts of 
the stack permanently.

Thus, immediately upon fetching the token (it's fine if the fetch comes 
after performing some default reductions in consistent states), the parser 
needs to determine whether the token can eventually be shifted.  That is, 
the parser first tries the parse without performing semantic actions, 
without manipulating the value and location stacks, and without losing the 
original state stack.  If it successfully reaches a shift, it then repeats 
the parse on all the normal stacks.  If unsuccessful instead, it 
immediately reports a syntax error and then repeats the exploration for 
every token, as you suggested, so it can report what tokens were expected.  
During each such exploration (during parsing or at error recovery), the 
parser need not duplicate the normal parser state stack in order to avoid 
losing the original context.  Instead, it keeps a temporary stack that 
records only the exploratory states that are not already recorded on the 
normal parser stack.  It also keeps a pointer into the normal parser stack 
in order to record the base of the temporary stack.  That pointer may move 
during exploratory reductions.  I've implemented all this already in 
yacc.c, and it functions correctly so far.  I'm just tidying up memory 
management at the moment, and I need to do more testing.
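
To make the bookkeeping above concrete, here is a minimal standalone
sketch of the exploration step.  Everything in it is hypothetical: the
grammar and tables are hand-built toys, and lac_check is just my name for
the check here, not anything that yacc.c actually contains.

    #include <stdio.h>

    /* Toy LR tables, hand-built for this hypothetical grammar:
         0: $accept -> S $end    1: S -> 'a' S    2: S -> 'b'
       Tokens: 0 = $end, 1 = 'a', 2 = 'b'.  action[s][t] is: positive =
       shift to that state, negative = reduce by rule -n, 0 = error,
       ACC = accept.  States 3 and 4 have default reductions: they
       reduce no matter what the token is.  */
    enum { ACC = 99, NTOKENS = 3, NSTATES = 5 };
    static const int action[NSTATES][NTOKENS] = {
      {   0,  2,  3 },  /* state 0: start */
      { ACC,  0,  0 },  /* state 1: $accept -> S . $end */
      {   0,  2,  3 },  /* state 2: S -> 'a' . S */
      {  -2, -2, -2 },  /* state 3: default reduction by rule 2 */
      {  -1, -1, -1 },  /* state 4: default reduction by rule 1 */
    };
    static const int goto_S[NSTATES] = { 1, 0, 4, 0, 0 };  /* gotos on S */
    static const int rhs_len[3] = { 2, 2, 1 };  /* rule RHS lengths */

    /* Return 1 iff TOK can eventually be shifted starting from the
       committed state stack STACK of size DEPTH.  Only exploratory
       states go onto the temporary stack TMP; BASE marks how much of
       the committed stack is still live and moves down as exploratory
       reductions consume committed entries.  (A real implementation
       must also guard against reduction cycles from nullable rules;
       this toy grammar has none.)  */
    static int
    lac_check (const int *stack, int depth, int tok)
    {
      int tmp[64];
      int tlen = 0;
      int base = depth;
      for (;;)
        {
          int s = tlen ? tmp[tlen - 1] : stack[base - 1];
          int act = action[s][tok];
          if (act == ACC || act > 0)
            return 1;  /* a shift is reachable: the token is expected */
          if (act == 0)
            return 0;  /* dead end: the token is a syntax error */
          /* Exploratory reduction: pop the RHS across both stacks,
             then push the goto state onto the temporary stack only.  */
          int n = rhs_len[-act];
          if (n <= tlen)
            tlen -= n;
          else
            {
              base -= n - tlen;
              tlen = 0;
            }
          int top = tlen ? tmp[tlen - 1] : stack[base - 1];
          tmp[tlen++] = goto_S[top];
        }
    }

    int
    main (void)
    {
      /* After shifting 'b' in state 0, the committed stack is [0, 3].
         State 3 has a default reduction, so without the check the
         parser would perform it (running its semantic action) before
         discovering that the next token can never be shifted.  The
         same loop also yields the expected-tokens list.  */
      static const int stack[] = { 0, 3 };
      static const char *const name[NTOKENS] = { "$end", "'a'", "'b'" };
      for (int t = 0; t < NTOKENS; t++)
        printf ("%s: %s\n", name[t],
                lac_check (stack, 2, t) ? "can be shifted"
                                        : "syntax error");
      return 0;
    }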

I've been calling this lookahead correction technique "LAC" (LookAhead 
Correction) for lack of anything better.  Maybe LR LAC?  The %define 
variable is parse.lac for now, but I realize that's yet another 
abbreviation.

At worst, LAC should no more than double the parsing time.  That is, even 
though it's similar to GLR or backtracking, it tries only one possible 
parse on any token up to its shift before performing the parse for real or 
reporting a syntax error.  (Also unlike GLR and backtracking, we still 
resolve conflicts statically.  The parser is still deterministic.)  My 
suspicion is that, given the time for file I/O, lexical analysis, and 
semantic actions, the total run time of the parser will not be affected 
severely by this doubling.  Of course, building a syntax error message 
will take longer than double, but I'm not as worried about that.  When I 
finish implementing, I'll try collecting some statistics.

Another interesting point is that, with LAC, IELR should always perform 
exactly the same semantic actions as canonical LR even for syntactically 
incorrect sentences.  Previously, IELR could only claim that for 
syntactically correct sentences.  (The same can be said for LALR as long 
as there are no mysterious conflicts, but, unless there are no conflicts 
at all, that's hard to verify without an IELR implementation anyway.)  
One can even imagine using LAC in conjunction with a push parser to 
implement, for example, tab completions.

Keep in mind that, with LAC, the only possible difference between 
lr.default-reductions=all and =consistent for any fixed value of lr.type 
should be parser table sizes and performance (that is, possibly many more 
reductions to explore before permitting normal parsing), which would be 
interesting to collect statistics on.  But there should be no change in 
basic parsing behavior or error messages.  For canonical LR only, LAC is 
redundant unless lr.default-reductions=all.  As always, 
lr.default-reductions=accepting can alter how semantic actions are 
performed because the use of default reductions in consistent states 
affects when a token is fetched and thus when LAC starts for that token.
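
For reference, here is roughly how a grammar would select the
combinations discussed above.  parse.lac is only the tentative name I
mentioned, and any of these names and values could still change before
release:

    %define lr.type ielr                      /* or lalr, canonical-lr */
    %define lr.default-reductions consistent  /* or all, accepting */
    %define parse.lac                         /* tentative name */
    %error-verbose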

The first approach I thought of for correcting lookaheads was a little 
different from what I describe above.  Upon fetching each token, the 
parser would proceed parsing normally but would begin to push all states 
popped from the normal parser stack onto a temporary stack (popped 
semantic values and locations would not be stored) and would again keep a 
pointer to its base in the normal parser stack.  Upon shifting the token, 
the temporary stack would be cleared.  Upon encountering an error instead, 
the temporary stack would enable the parser to remember what the original 
state stack looked like upon fetching the token, and yysyntax_error could 
then explore the parse for all tokens as in the previous approach to see 
which tokens were expected.

The main loss with this approach would be that canonical LR semantic 
actions would no longer be perfectly emulated as in the previous approach.  
The implementation would also be more complicated.  Originally, I thought 
the gain might be a lower performance impact before a syntax error is 
detected, but now I'm not so sure this approach is much different from 
doubling the parser actions as in the previous approach.  The main 
difference seems to be that, in this approach, the parser tables need not 
be decoded a second time in order to manipulate the temporary stack.  
Well, if the previous approach has poor performance, I might try this 
approach just in case it's somehow better.
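
In case it helps, here is a tiny standalone sketch of just that
bookkeeping, again with made-up names.  It records only the original
entries that reductions actually destroy, using the base pointer to
recognize them, which I believe is equivalent to saving every popped
state but cheaper:

    #include <stdio.h>

    /* Entries below BASE are original and intact, so a pop that dips
       below BASE is destroying an original entry, and only those need
       to be saved in order to rebuild the original stack later.  */
    enum { MAX = 64 };
    static int stack[MAX], depth;  /* the normal parser state stack */
    static int saved[MAX], nsaved; /* destroyed originals, in pop order */
    static int base;               /* intact prefix of the original stack */

    static void
    token_fetched (void)
    {
      base = depth;  /* the whole stack is original at this point */
      nsaved = 0;
    }

    /* Perform a reduction: pop N states, push the goto state S.  */
    static void
    reduce_to (int n, int s)
    {
      while (n--)
        {
          depth--;
          if (depth < base)
            {
              saved[nsaved++] = stack[depth];  /* an original dies */
              base = depth;
            }
        }
      stack[depth++] = s;
    }

    /* On a syntax error, rebuild into OUT the stack as it stood when
       the token was fetched: the intact prefix plus the destroyed
       originals in reverse pop order.  Returns its length.  */
    static int
    original_stack (int *out)
    {
      int len = 0;
      for (int i = 0; i < base; i++)
        out[len++] = stack[i];
      for (int i = nsaved; i-- > 0;)
        out[len++] = saved[i];
      return len;
    }

    int
    main (void)
    {
      /* Made-up state numbers, just to exercise the bookkeeping.  */
      stack[0] = 0; stack[1] = 5; stack[2] = 7; stack[3] = 9; depth = 4;
      token_fetched ();
      reduce_to (2, 11);  /* pops 9 and 7 (both original), pushes 11 */
      reduce_to (2, 12);  /* pops 11 (not original) and 5, pushes 12 */
      /* Suppose the token now proves to be a syntax error: */
      int orig[MAX];
      int len = original_stack (orig);
      for (int i = 0; i < len; i++)
        printf ("%d ", orig[i]);  /* prints: 0 5 7 9 */
      printf ("\n");
      return 0;
    }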

BTW, I've not yet done any homework to see if any of this has been done 
before.  Now that I write it down, it seems like such a simple premise, so 
maybe it has been done many times.  I should probably have researched it 
before banging out an implementation.  Maybe there's a better way already 
discovered.

> Do I also understand that we have neither a superset nor a subset,
> just some fuzzy set?

For expected tokens reported in syntax error messages, that's correct.  
It's unsettling, isn't it?

> > so that's
> > what I looked at.
> 
> I can send ugrammar.y privately, if you wish.

I'd like to see that.  Thanks!
