Re: several messages
From: Joel E. Denny
Subject: Re: several messages
Date: Mon, 7 Sep 2009 19:42:17 -0400 (EDT)
User-agent: Alpine 1.00 (DEB 882 2007-12-20)
Hi Akim!
On Mon, 7 Sep 2009, Akim Demaille wrote:
> Le 3 sept. 09 à 07:35, Joel E. Denny a écrit :
> > By compilation, you mean gcc? Or is bison slow?
>
> I meant bison.
Bummer. But I have noticed canonical LR can be a little slower to
generate.
> Here are the figures:
Thanks for all that!
> My grammar has:
> - 185 rules
> - 104 terminals
> - 58 nonterminals
How many S/R and R/R conflicts? I'm also interested in conflicts that are
resolved by prec/assoc declarations.
> lalr-all
>
> Execution times (seconds)
> LALR(1) : 0.01 ( 2%) usr 0.00 ( 0%) sys 0.00 ( 0%) wall
> outputing parser : 0.01 ( 2%) usr 0.00 ( 0%) sys 0.00 ( 0%) wall
> running m4 : 0.47 (96%) usr 0.00 ( 0%) sys 0.00 ( 0%) wall
> TOTAL : 0.49 0.00 0.00
> Bison 0.55s real 0.50s user 0.02s system
> G++ 5.99s real 5.34s user 0.60s system
> state 382
> 164K /tmp/ugrammar.cc
> 52K /tmp/ugrammar.hh
> 836K /tmp/ugrammar.o
> 672K /tmp/ugrammar.output
> ielr-all
>
> Execution times (seconds)
> LALR(1) : 0.01 ( 2%) usr 0.00 ( 0%) sys 0.00 ( 0%) wall
> IELR(1) Phase 2 : 0.03 ( 6%) usr 0.00 ( 0%) sys 0.00 ( 0%) wall
> IELR(1) Phase 3 : 0.02 ( 4%) usr 0.00 ( 0%) sys 0.00 ( 0%) wall
> outputing parser : 0.02 ( 4%) usr 0.00 ( 0%) sys 0.00 ( 0%) wall
> running m4 : 0.46 (85%) usr 0.00 ( 0%) sys 0.00 ( 0%) wall
> TOTAL : 0.54 0.00 0.00
> Bison 0.56s real 0.55s user 0.01s system
> G++ 5.98s real 5.34s user 0.59s system
> state 382
> 164K /tmp/ugrammar.cc
> 52K /tmp/ugrammar.hh
> 836K /tmp/ugrammar.o
> 672K /tmp/ugrammar.output
It's good to see that the IELR computation time is barely worse than LALR.
Because IELR didn't split any states, the output files are of course
exactly the same. However, that also means there may not have been much
for IELR to do, and so the timings may not reveal much about IELR's
complexity. The number of conflicts would provide some insight here...
but maybe I'm getting off topic a little.
(By the way, I've been meaning to add a test case for Bison that makes
sure the LALR and IELR parser tables are exactly the same for
parse-gram.y. If they ever become different, we'll need to fix
parse-gram.y or start using IELR.)
> canonical-lr-all
>
> Execution times (seconds)
> IELR(1) Phase 3 : 0.33 (10%) usr 0.01 (10%) sys 0.00 ( 0%) wall
> IELR(1) Phase 4 : 0.01 ( 0%) usr 0.00 ( 0%) sys 0.00 ( 0%) wall
> conflicts : 0.01 ( 0%) usr 0.00 ( 0%) sys 0.00 ( 0%) wall
> parser action tables : 2.19 (65%) usr 0.01 (10%) sys 128.00 (100%) wall
> outputing parser : 0.16 ( 5%) usr 0.02 (20%) sys 0.00 ( 0%) wall
> running m4 : 0.65 (19%) usr 0.05 (50%) sys 0.00 ( 0%) wall
> freeing : 0.01 ( 0%) usr 0.01 (10%) sys 0.00 ( 0%) wall
> TOTAL : 3.36 0.10 128.00
> Bison 3.49s real 3.37s user 0.11s system
> G++ 7.85s real 6.76s user 0.76s system
> state 12941
> 2.7M /tmp/ugrammar.cc
> 52K /tmp/ugrammar.hh
> 1.6M /tmp/ugrammar.o
> 25M /tmp/ugrammar.output
I'm glad to see that most of the extra time does not fall in the original
parser table computation. It just takes a really long time to compress
and transform those huge tables into the output code.
> canonical-lr-consistent
>
> Execution times (seconds)
> IELR(1) Phase 3 : 0.36 ( 2%) usr 0.01 ( 4%) sys 0.00 ( 0%) wall
> IELR(1) Phase 4 : 0.01 ( 0%) usr 0.00 ( 0%) sys 0.00 ( 0%) wall
> conflicts : 0.01 ( 0%) usr 0.00 ( 0%) sys 0.00 ( 0%) wall
> parser action tables : 16.38 (91%) usr 0.05 (18%) sys 0.00 ( 0%) wall
> outputing parser : 0.34 ( 2%) usr 0.08 (29%) sys 0.00 ( 0%) wall
> running m4 : 0.89 ( 5%) usr 0.13 (46%) sys 0.00 ( 0%) wall
> freeing : 0.02 ( 0%) usr 0.01 ( 4%) sys 0.00 ( 0%) wall
> TOTAL : 18.01 0.28 0.00
> Bison 18.64s real 18.02s user 0.30s system
> G++ 10.63s real 9.15s user 0.96s system
> state 12941
> 5.6M /tmp/ugrammar.cc
> 52K /tmp/ugrammar.hh
> 2.3M /tmp/ugrammar.o
> 27M /tmp/ugrammar.output
Wow. What a jump. I can see why you don't want this. I have to admit
that I haven't spent much time exploring the combination of canonical-lr
and different values of lr.default-reductions. Maybe it's bad that I
defaulted it to "accepting"? I just didn't want anyone to argue that they
specified canonical LR but didn't get it.
> canonical-lr-accepting
>
> Execution times (seconds)
> reader : 0.01 ( 0%) usr 0.00 ( 0%) sys 0.00 ( 0%) wall
> IELR(1) Phase 3 : 0.37 ( 1%) usr 0.01 ( 2%) sys 0.00 ( 0%) wall
> IELR(1) Phase 4 : 0.01 ( 0%) usr 0.00 ( 0%) sys 0.00 ( 0%) wall
> conflicts : 0.01 ( 0%) usr 0.00 ( 0%) sys 0.00 ( 0%) wall
> parser action tables : 46.08 (95%) usr 0.21 (34%) sys 0.00 ( 0%) wall
> outputing parser : 0.61 ( 1%) usr 0.12 (20%) sys 0.00 ( 0%) wall
> running m4 : 1.28 ( 3%) usr 0.26 (43%) sys 0.00 ( 0%) wall
> freeing : 0.01 ( 0%) usr 0.01 ( 2%) sys 0.00 ( 0%) wall
> TOTAL : 48.38 0.61 0.00
> Bison 49.65s real 48.39s user 0.63s system
> G++ 13.04s real 11.08s user 1.03s system
> state 12941
> 8.5M /tmp/ugrammar.cc
> 52K /tmp/ugrammar.hh
> 3.2M /tmp/ugrammar.o
> 31M /tmp/ugrammar.output
Ouch. Well, more motivation to steer clear of canonical LR.
> It would be nice to extend the *.output file with various metrics
> about the grammar.
Agreed.
> >> The syntax error on "1 1" behaves as expected with both accepting and
> >> consistent (i.e., mute since there are too many possibilities ;).
> >
> > If you could unmute it (by adding a few more possibilities in the
> > skeleton), it would be interesting to see if canonical LR gives the
> > same
> > expected tokens list.
I was interested in extending it temporarily to compare the expected
tokens for your specific grammar. Of course, I also would like to see it
fixed in general....
> One way out of the fixed arity would be to use "note: " lines, as Alex
> suggested it (and I'm slowly changing my mind on this regard).
I was thinking we were both only opposed to dropping the warning label.
I like the idea of a special function for formatting submessages. I'm not
so sure about the "note: " prefix though.
However, for the generated parser rather than for bison, strings fed to
yyerror currently don't contain newlines. I'm not sure how important or
how long-standing that convention has been. If we move to a %error-report
or something like that, that might be a good opportunity to break the
convention... or to let the user take control.
> >> So I will proceed with default-reductions=all on space-starving
> >> targets (on
> >> which anyway I use simple error messages to get rid of the tables),
> >> and
> >> "accepting" otherwise.
> >
> > Why not "consistent"? For error messages, I don't know of any
> > reason to
> > expect either to typically be better than the other. But "consistent"
> > produces the traditional Yacc behavior (actually defined in POSIX)
> > of not
> > requesting a token from yylex until necessary. This can affect
> > semantic
> > actions.
>
> I don't see value in this feature in my case. I guess that this is
> precious in languages like C where the token kind of identifiers
> depends on the context.
I can understand that. It doesn't matter for all parsers.
> Wow. You do have a collection of interesting grammars at hand :)
Well, I came up with those toy grammars for that email. I didn't mean to
imply that they were in any way practical or typical. They just demonstrate
the existence of a flaw, one that might later prove to affect the real
world. But maybe I'm reading too much into what you're saying. :)
> > My suspicion would have been that disabling default reductions almost
> > always does more good than harm, but I think we need more exploration.
>
> So if we can't trust static tables, should we actually try each
> possible token one after the other, and see if one can be shifted at
> some point? Of course without touching the stack, that can be quite a
> pain to write :( But it should not be really different from our error-
> recovery code.
You're reading my mind. :) I think we've realized for a while that
%error-verbose is not perfect. However, soon after our current discussion
started, the extent of the problems began to sink in for me, and I
realized the implications for my own work, so I started working on a
solution very similar to what you describe. I'm nearly done with the
coding. I might be able to push a candidate soon.
However, there's one problem with your suggestion. By the time the syntax
error is reported, it's too late. The parser may have already performed
some default reductions (or otherwise invalid reductions due to state
merging) and thus lost the original context recorded on the stack. The
stack popping that error recovery does is not necessarily sufficient to
restore the original stack because reductions may have removed parts of
the stack permanently.
Thus, immediately upon fetching the token (it's fine if the fetch comes
after performing some default reductions in consistent states), the parser
needs to determine whether the token can eventually be shifted. That is,
the parser first tries the parse without performing semantic actions,
without manipulating the value and location stacks, and without losing the
original state stack. If it successfully reaches a shift, it then repeats
the parse on all the normal stacks. If unsuccessful instead, it
immediately reports a syntax error and then repeats the exploration for
every token, as you suggested, so it can report what tokens were expected.
During each such exploration (during parsing or at error recovery), the
parser need not duplicate the normal parser state stack in order to avoid
losing the original context. Instead, it keeps a temporary stack that
records only the exploratory states that are not already recorded on the
normal parser stack. It also keeps a pointer into the normal parser stack
in order to record the base of the temporary stack. That pointer may move
during exploratory reductions. I've implemented all this already in
yacc.c, and it functions correctly so far. I'm just tidying up memory
management at the moment, and I need to do more testing.
I've been calling this lookahead correction technique "LAC" (LookAhead
Correction) for lack of anything better. Maybe LR LAC? The %define
variable is parse.lac for now, but I realize that's yet another
abbreviation.
At worst, LAC should no more than double the parsing time. That is, even
though it's similar to GLR or backtracking, it tries only one possible
parse on any token up to its shift before performing the parse for real or
reporting a syntax error. (Also unlike GLR and backtracking, we still
resolve conflicts statically. The parser is still deterministic.) My
suspicion is that, given the time for file I/O, lexical analysis, and
semantic actions, the total run time of the parser will not be affected
severely by this doubling. Of course, building a syntax error message
will take longer than double, but I'm not as worried about that. When I
finish implementing, I'll try collecting some statistics.
Another interesting point is that, with LAC, IELR should always perform
exactly the same semantic actions as canonical LR even for syntactically
incorrect sentences. Previously, IELR could only claim that for
syntactically correct sentences. (The same can be said for LALR as long
as there are no mysterious conflicts, but, unless there are no conflicts
at all, that's hard to verify without an IELR implementation anyway.)
One can even imagine using LAC in conjunction with a push parser to
implement, for example, tab completions.
Keep in mind that, with LAC, the only possible difference between
lr.default-reductions=all and =consistent for any fixed value of lr.type
should be parser table sizes and performance (that is, possibly many more
reductions to explore before permitting normal parsing), which would be
interesting to collect statistics on. But there should be no change in
basic parsing behavior or error messages. For canonical LR only, LAC is
redundant unless lr.default-reductions=all. As always,
lr.default-reductions=accepting can alter how semantic actions are
performed because the use of default reductions in consistent states
affects when a token is fetched and thus when LAC starts for that token.
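For concreteness, the combinations discussed in this thread might be requested like this. I'm using the %define variable names as they appear in this discussion, and the parse.lac value syntax is only a working guess; all of it may change before anything is released:

```yacc
/* Choose one block per grammar file.  */

/* Space-starved targets: smallest tables, simple error messages.  */
%define lr.type ielr
%define lr.default-reductions all

/* Elsewhere: with LAC, "all" vs. "consistent" should differ only in
   table size and speed, not in parsing behavior or error messages.  */
%define lr.type ielr
%define lr.default-reductions consistent
%define parse.lac full
```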
The first approach I thought of for correcting lookaheads was a little
different from what I describe above. Upon fetching each token, the
parser would proceed parsing normally but would begin to push all states
popped from the normal parser stack onto a temporary stack (popped
semantic values and locations would not be stored) and would again keep a
pointer to its base in the normal parser stack. Upon shifting the token,
the temporary stack would be cleared. Upon encountering an error instead,
the temporary stack would enable the parser to remember what the original
state stack looked like upon fetching the token, and yysyntax_error could
then explore the parse for all tokens as in the previous approach to see
which tokens were expected. The main loss with this approach would be
that canonical LR semantic actions would no longer be perfectly emulated
as in the previous approach. The implementation would also be more
complicated. Originally, I thought the gain might be a lower performance
impact before a syntax error is detected, but now I'm not so sure this
approach is much different from doubling the parser actions as in the
previous approach. The main difference seems to be that, in this
approach, the parser tables need not be decoded a second time in order to
manipulate the temporary stack. Well, if the previous approach has poor
performance, I might try this approach just in case it's somehow better.
BTW, I've not yet done any homework to see if any of this has been done
before. Now that I write it down, it seems like such a simple premise, so
maybe it has been done many times. I should probably have researched it
before banging out an implementation. Maybe there's a better way already
discovered.
> Do I also understand that we neither have a superset nor a subset,
> just some fuzzyset?
For expected tokens reported in syntax error messages, that's correct.
It's unsettling, isn't it?
> > so that's
> > what I looked at.
>
> I can send ugrammar.y privately, if you wish.
I'd like to see that. Thanks!