From: Akim Demaille
Subject: Re: several messages
Date: Wed, 16 Sep 2009 16:32:16 +0200
On Sep 8, 2009, at 01:42, Joel E. Denny wrote:
My grammar has:
- 185 rules
- 104 terminals
- 58 nonterminals
How many S/R and R/R conflicts? I'm also interested in conflicts that
are resolved by prec/assoc declarations.
There are no remaining conflicts. The updated log has the figures on
solved conflicts:
- 1143 in lalr/ielr
- 64650 in full lr
It's good to see that the IELR computation time is barely worse than
LALR's.
Because IELR didn't split any states, the output files are of course
exactly the same. However, that also means there may not have been much
for IELR to do, and so the timings may not reveal much about IELR's
complexity. The number of conflicts would provide some insight here...
but maybe I'm getting off topic a little.
(By the way, I've been meaning to add a test case for Bison that makes
sure the LALR and IELR parser tables are exactly the same for
parse-gram.y. If they ever become different, we'll need to fix
parse-gram.y or start using IELR.)
Good point :)
canonical-lr-all
Execution times (seconds)
IELR(1) Phase 3      :   0.33 (10%) usr   0.01 (10%) sys   0.00 (  0%) wall
IELR(1) Phase 4      :   0.01 ( 0%) usr   0.00 ( 0%) sys   0.00 (  0%) wall
conflicts            :   0.01 ( 0%) usr   0.00 ( 0%) sys   0.00 (  0%) wall
parser action tables :   2.19 (65%) usr   0.01 (10%) sys 128.00 (100%) wall
outputing parser     :   0.16 ( 5%) usr   0.02 (20%) sys   0.00 (  0%) wall
running m4           :   0.65 (19%) usr   0.05 (50%) sys   0.00 (  0%) wall
freeing              :   0.01 ( 0%) usr   0.01 (10%) sys   0.00 (  0%) wall
TOTAL                :   3.36             0.10            128.00
Bison 3.49s real 3.37s user 0.11s system
G++ 7.85s real 6.76s user 0.76s system
state 12941
2.7M /tmp/ugrammar.cc
52K /tmp/ugrammar.hh
1.6M /tmp/ugrammar.o
25M /tmp/ugrammar.output
I'm glad to see that most of the extra time does not fall in the
original parser table computation. It just takes a really long time to
compress and transform those huge tables into the output code.
Yes, indeed.
canonical-lr-consistent
Execution times (seconds)
IELR(1) Phase 3      :   0.36 ( 2%) usr   0.01 ( 4%) sys   0.00 ( 0%) wall
IELR(1) Phase 4      :   0.01 ( 0%) usr   0.00 ( 0%) sys   0.00 ( 0%) wall
conflicts            :   0.01 ( 0%) usr   0.00 ( 0%) sys   0.00 ( 0%) wall
parser action tables :  16.38 (91%) usr   0.05 (18%) sys   0.00 ( 0%) wall
outputing parser     :   0.34 ( 2%) usr   0.08 (29%) sys   0.00 ( 0%) wall
running m4           :   0.89 ( 5%) usr   0.13 (46%) sys   0.00 ( 0%) wall
freeing              :   0.02 ( 0%) usr   0.01 ( 4%) sys   0.00 ( 0%) wall
TOTAL                :  18.01             0.28             0.00
Bison 18.64s real 18.02s user 0.30s system
G++ 10.63s real 9.15s user 0.96s system
state 12941
5.6M /tmp/ugrammar.cc
52K /tmp/ugrammar.hh
2.3M /tmp/ugrammar.o
27M /tmp/ugrammar.output
Wow. What a jump. I can see why you don't want this. I have to admit
that I haven't spent much time exploring the combination of
canonical-lr and different values of lr.default-reductions. Maybe it's
bad that I defaulted it to "accepting"? I just didn't want anyone to
argue that they specified canonical LR but didn't get it.
I have no strong opinion about this. After all, someone reading the
doc about lr.type should be ready to read the section about
lr.default-reductions. We have to warn them. My figures are public; if
you mean to use them in some form in the documentation, that would be
fine with me.
canonical-lr-accepting
Execution times (seconds)
reader               :   0.01 ( 0%) usr   0.00 ( 0%) sys   0.00 ( 0%) wall
IELR(1) Phase 3      :   0.37 ( 1%) usr   0.01 ( 2%) sys   0.00 ( 0%) wall
IELR(1) Phase 4      :   0.01 ( 0%) usr   0.00 ( 0%) sys   0.00 ( 0%) wall
conflicts            :   0.01 ( 0%) usr   0.00 ( 0%) sys   0.00 ( 0%) wall
parser action tables :  46.08 (95%) usr   0.21 (34%) sys   0.00 ( 0%) wall
outputing parser     :   0.61 ( 1%) usr   0.12 (20%) sys   0.00 ( 0%) wall
running m4           :   1.28 ( 3%) usr   0.26 (43%) sys   0.00 ( 0%) wall
freeing              :   0.01 ( 0%) usr   0.01 ( 2%) sys   0.00 ( 0%) wall
TOTAL                :  48.38             0.61             0.00
Bison 49.65s real 48.39s user 0.63s system
G++ 13.04s real 11.08s user 1.03s system
state 12941
8.5M /tmp/ugrammar.cc
52K /tmp/ugrammar.hh
3.2M /tmp/ugrammar.o
31M /tmp/ugrammar.output
Ouch. Well, more motivation to steer clear of canonical LR.
:)
It would be nice to extend the *.output file with various metrics
about the grammar.
Agreed.
Will add this to TODO.
The syntax error on "1 1" behaves as expected with both accepting and
consistent (i.e., mute since there are too many possibilities ;).
If you could unmute it (by adding a few more possibilities in the
skeleton), it would be interesting to see if canonical LR gives the
same expected tokens list.
I was interested in extending it temporarily to compare the expected
tokens for your specific grammar. Of course, I also would like to see
it fixed in general...
I'll keep this as another TODO item :)
One way out of the fixed arity would be to use "note: " lines, as Alex
suggested (and I'm slowly changing my mind in this regard).
I was thinking we were both only opposed to dropping the warning
label. I like the idea of a special function for formatting
submessages. I'm not so sure about the "note: " prefix though.
I changed my mind when I saw how current GCCs talk to me.
$ cat /tmp/foo.cc
void foo (float*) {}
void foo (int*) {}
int
main ()
{
  foo (1);
}
$ g++ -Wall /tmp/foo.cc
/tmp/foo.cc: In function 'int main()':
/tmp/foo.cc:7: error: call of overloaded 'foo(int)' is ambiguous
/tmp/foo.cc:1: note: candidates are: void foo(float*) <near match>
/tmp/foo.cc:2: note: void foo(int*) <near match>
and Emacs correctly understands this as a single error.
However, for the generated parser rather than for bison, strings fed
to yyerror currently don't contain newlines. I'm not sure how
important or how long-standing that convention has been. If we move to
a %error-report or something like that, that might be a good
opportunity to break the convention... or to let the user take
control.
I think we should still keep separate the actions of building the
error message and "sending" the error message. But indeed, maybe
that's wrong. It's clearly "nicer", more modular, but on the other
hand it may push onto the user useless memory-allocation issues that
do not even exist if she uses fprintf directly in her %error-report.
Still... I feel uncomfortable with the fusion of both.
My suspicion would have been that disabling default reductions almost
always does more good than harm, but I think we need more exploration.
So if we can't trust static tables, should we actually try each
possible token one after the other, and see if one can be shifted at
some point? Of course without touching the stack, which can be quite a
pain to write :( But it should not be really different from our
error-recovery code.
You're reading my mind. :) I think we've realized for a while that
%error-verbose is not perfect.
Yes, I agree.
However, there's one problem with your suggestion. By the time the
syntax error is reported, it's too late. The parser may have already
performed some default reductions.
Ah! Of course!
Thus, immediately upon fetching the token (it's fine if the fetch
comes after performing some default reductions in consistent states),
the parser needs to determine whether the token can eventually be
shifted. That is, the parser first tries the parse without performing
semantic actions, without manipulating the value and location stacks,
and without losing the original state stack. If it successfully
reaches a shift, it then repeats the parse on all the normal stacks.
If unsuccessful instead, it immediately reports a syntax error and
then repeats the exploration for every token, as you suggested, so it
can report what tokens were expected.
During each such exploration (during parsing or at error recovery),
the parser need not duplicate the normal parser state stack in order
to avoid losing the original context. Instead, it keeps a temporary
stack that records only the exploratory states that are not already
recorded on the normal parser stack. It also keeps a pointer into the
normal parser stack in order to record the base of the temporary
stack. That pointer may move during exploratory reductions. I've
implemented all this already in yacc.c, and it functions correctly so
far. I'm just tidying up memory management at the moment, and I need
to do more testing.
That sounds exciting!
At worst, LAC should no more than double the parsing time. That is,
even though it's similar to GLR or backtracking, it tries only one
possible parse on any token up to its shift before performing the
parse for real or reporting a syntax error. (Also unlike GLR and
backtracking, we still resolve conflicts statically. The parser is
still deterministic.) My suspicion is that, given the time for file
I/O, lexical analysis, and semantic actions, the total run time of the
parser will not be affected severely by this doubling. Of course,
building a syntax error message will take longer than double, but I'm
not as worried about that. When I finish implementing, I'll try
collecting some statistics.
Now I see what you're using bench.pl for :)
Another interesting point is that, with LAC, IELR should always
perform exactly the same semantic actions as canonical LR, even for
syntactically incorrect sentences.
Nice feature!
Previously, IELR could only claim that for syntactically correct
sentences. (The same can be said for LALR as long as there are no
mysterious conflicts, but, unless there are no conflicts at all,
that's hard to verify without an IELR implementation anyway.)
One can even imagine using LAC in conjunction with a push parser to
implement, for example, tab completions.
I also had this in mind.
Keep in mind that, with LAC, the only possible difference between
lr.default-reductions=all and =consistent for any fixed value of
lr.type should be parser table sizes and performance (that is,
possibly many more reductions to explore before permitting normal
parsing), which would be interesting to collect statistics on.
It would also be nice to explore the runtime merits of the table
compressions. I remember that some other tool (maybe Yacc++) offers
this feature. Space is probably no longer an issue, so maybe there
would be some added runtime performance in using raw tables.
BTW, I've not yet done any homework to see if any of this has been
done before. Now that I write it down, it seems like such a simple
premise, so maybe it has been done many times. I should probably have
researched it before banging out an implementation. Maybe there's a
better way already discovered.
It seems that indeed many, many results were obtained about LR
parsers, but many of them are now somewhat forgotten. Yet there are so
many papers out there that I don't know the right way to fetch only
the relevant ones.
Do I also understand that we have neither a superset nor a subset,
just some fuzzy set?
For expected tokens reported in syntax error messages, that's correct.
It's unsettling, isn't it?
Yes :)
I can send ugrammar.y privately, if you wish.
I'd like to see that. Thanks!
Will do right afterwards. Please, no dirty jokes about my grammar :)
I wish it were nicer in many places...