Re: how to get left hand side symbol in action


From: Christian Schoenebeck
Subject: Re: how to get left hand side symbol in action
Date: Fri, 10 May 2019 15:11:17 +0200

On Friday, 10 May 2019 07:24:51 CEST Akim Demaille wrote:
> Hi Christian,

Hi Akim!

> Aren't you referring to LA correction, as implemented in Bison?
> 
> https://www.gnu.org/software/bison/manual/html_node/LAC.html

Well no, it has something in common with that, but the use cases are different;
see below.

> > That's actually a very common required feature in practice. The main use
> > cases are:
> > 
> > 1. Auto completion features
> > 
> > 2. Human friendly error messages
> 
> I think you are referring to the name of the tokens, not all the symbols.
> For the error messages, it makes sense.  Although I am now more convinced
> that most of the time, error messages are more readable when you quote
> exactly the source than when you print token names.

No, I really mean the non-terminal symbols. Because sometimes [not always ;-)]
I don't use the classic approach of a separate lexer and parser, but rather let
the parser do the lexer's job as well, like:

FOO : 'F''O''O' ;

which can avoid complex and error prone push and pop constructs that would be
required with a separate lexer approach and certain complex grammars. So
effectively some of the non-terminals (from Bison's point of view) are in such
specific use cases "terminals" from the use case's point of view. But one of the
main problems of this approach is that e.g. the default syntax error messages
would no longer be useful at all, because you would get a message like:

        Unexpected 'b', expecting 'F'.

Instead of e.g.

        Unexpected 'bla', expecting 'FOO'.

> > I do need these features for almost all parsers, hence for years (since
> > not
> > available directly with Bison) I have a huge bunch of code on top of the
> > internal skeleton code to achieve them.
> 
> Is there available for reading somewhere?

Well, I asked a couple of years ago on the list whether anybody was interested in
what I am doing to achieve these kinds of features, but got no positive reply,
hence I have not invested the time to write about it in detail yet.

However, I am not sure whether my approach would be of use for you anyway. Basically
what I am doing (roughly described) is C++ code on top of the internally
generated parser tables which replicates what the skeleton LALR(1) parser would
do. So the algorithm takes a parser state as argument:

        std::vector<YYTYPE_INT16>& stack

and then it walks the possible routes starting from there (accessing the auto
generated tables directly), that is, pushing onto the stack for each route/branch,
reducing whenever required, etc. To detect endless recursions while doing
this, I always track the complete parser history with

        typedef std::set< std::vector<YYTYPE_INT16> > YYStackHistory;

So each entry in this history set is a previous parser symbol stack. Whenever
I reach a new parser state, I check whether the current parser stack already
exists in that history set; if it does, then I have run into an endless
recursion, so I abort the algorithm for that individual branch and continue
with the next branch, and so on. That full history tracking might take a lot of
memory of course, but eventually the algorithm ends up returning a result:

        std::map<String,BisonSymbolInfo>& expectedSymbols

where the result map (not a multimap, actually) contains the possible next
grammar rules for the previously supplied parser state; the key is the symbol
name, the value is a struct (BisonSymbolInfo) holding a) the sequence of
characters expected next to satisfy that grammar symbol and b) a flag
indicating whether the symbol is (considered as) a terminal or a "real"
non-terminal (see below).
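
A heavily simplified sketch of that walk (not the actual code: the table lookups
are hidden behind a hypothetical nextSteps() helper, std::string stands in for
the String type used above, and error handling is omitted) could look roughly
like this:

        #include <cstdint>
        #include <map>
        #include <set>
        #include <string>
        #include <vector>

        typedef int16_t YYTYPE_INT16;
        typedef std::set< std::vector<YYTYPE_INT16> > YYStackHistory;

        struct BisonSymbolInfo {
            std::string nextExpectedChars; // characters expected next for this symbol
            bool isTerminal;               // treated as terminal (see classification below)
        };

        // Hypothetical: one possible shift/reduce step derived from the
        // generated tables for the state on top of 'stack'.
        struct Step {
            std::string symbolName;
            std::string expectedChars;
            bool isTerminal;
            std::vector<YYTYPE_INT16> nextStack;
        };

        // Stub: a real implementation would derive the possible steps from the
        // auto generated parser tables.
        static std::vector<Step> nextSteps(const std::vector<YYTYPE_INT16>& stack) {
            (void) stack;
            return {};
        }

        static void walk(std::vector<YYTYPE_INT16> stack,
                         YYStackHistory& history,
                         std::map<std::string, BisonSymbolInfo>& expectedSymbols)
        {
            // Endless recursion check: if this exact symbol stack was already
            // seen, abort this branch; the caller continues with the next one.
            if (!history.insert(stack).second)
                return;

            for (const Step& s : nextSteps(stack)) {
                BisonSymbolInfo& info  = expectedSymbols[s.symbolName];
                info.nextExpectedChars = s.expectedChars;
                info.isTerminal        = s.isTerminal;
                walk(s.nextStack, history, expectedSymbols); // follow this branch
            }
        }

        // Entry point: collect the possible next symbols for a given parser state.
        std::map<std::string, BisonSymbolInfo>
        expectedSymbolsFor(const std::vector<YYTYPE_INT16>& stack)
        {
            YYStackHistory history;
            std::map<std::string, BisonSymbolInfo> result;
            walk(stack, history, result);
            return result;
        }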

So that C++ code is for me the basis for retrieving the next possible grammar
rules for any given parser state, which I use both for error messages as well
as for auto completion features.
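
For illustration (purely hypothetical caller code, building on the sketch
above), both use cases boil down to something like:

        #include <iostream>

        // Build an error message and a completion list from the collected
        // expected symbols for the given parser state.
        void reportAndComplete(const std::vector<YYTYPE_INT16>& stack,
                               const std::string& unexpectedText)
        {
            std::map<std::string, BisonSymbolInfo> expected = expectedSymbolsFor(stack);

            // Error message use case: list symbol names instead of single characters.
            std::cout << "Unexpected '" << unexpectedText << "', expecting ";
            bool first = true;
            for (const auto& p : expected) {
                if (!first) std::cout << " or ";
                std::cout << "'" << p.first << "'";
                first = false;
            }
            std::cout << ".\n";

            // Auto completion use case: offer the characters expected next.
            for (const auto& p : expected)
                std::cout << "completion: " << p.second.nextExpectedChars << "\n";
        }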

Another problem with my pseudo-terminals (example FOO above): from Bison's point
of view, everything in my grammar is now a non-terminal, and I don't want to
manually mark individual grammar rules to be considered either as "real" non-
terminals or as "terminals". So I automated that by checking whether the same
symbol can be resolved in more than one way, in which case it is a "real"
non-terminal, and by checking whether the right hand side of the rule only
contains characters, in which case it is considered a "terminal". And that
fundamental information is then used to provide appropriate error
messages and auto completion in a fully automated way.
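
As a sketch (again with simplified, hypothetical types; the real implementation
reads this information from the auto generated tables instead of an explicit
rule list), that classification boils down to:

        #include <string>
        #include <vector>

        struct Rule {
            std::string lhs;              // left hand side symbol name
            std::vector<std::string> rhs; // right hand side symbol names
        };

        // Bison character tokens appear as quoted single characters, e.g. 'F'.
        static bool isCharToken(const std::string& sym) {
            return sym.size() == 3 && sym.front() == '\'' && sym.back() == '\'';
        }

        // A symbol is a "real" non-terminal if it can be resolved in more than
        // one way; it is considered a "terminal" if its single right hand side
        // only contains character tokens.
        bool isPseudoTerminal(const std::string& symbol, const std::vector<Rule>& grammar)
        {
            int productions = 0;
            bool charsOnly  = true;
            for (const Rule& r : grammar) {
                if (r.lhs != symbol) continue;
                ++productions;
                for (const std::string& s : r.rhs)
                    if (!isCharToken(s)) charsOnly = false;
            }
            return productions == 1 && charsOnly;
        }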

> Was the feature always fitting perfectly?  Never ever did it result in
> something somewhat incorrect?

I did not make a proof of correctness of the algorithm. As you know, you have
to spend a huge amount of time to really prove this, which I did not do.
Especially when you are working on commercial projects you also have to be
pragmatic sometimes and accept the chance of imperfection for the sake of
getting forward, weighing costs against usefulness.

The main issue was catching endless recursions, which I solved as described
above. I have not investigated whether there are cases where the algorithm
would fail; there might be. Keep in mind the use cases for this were error
messages and auto completion, not executing (i.e. fatal) grammar rule
actions, so the potential harm is limited. Even if the algorithm failed in
certain cases, the worst you get is that auto completion does not come up with
a suggestion in that specific case, or that the error message is too short. It
is still an improvement over the overall situation of not having the
possibility for these kinds of features at all.

> >> In addition, tokens have several names: the identifier, and the
> >> string name, like
> >> 
> >> %token <string> ID "identifier"
> >> 
> >> Not to mention that I also want to provide support for
> >> internationalization.  So what name should that be? ID?
> >> identifier? or identifiant in French?
> > 
> > In practice you just need the symbol name as is. Nobody needs the
> > translation,
> I beg to disagree.  Nobody should translate the keyword "break",
> but
> 
> > # bison /tmp/foo.y
> > /tmp/foo.y:1.7: erreur: erreur de syntaxe, : inattendu, attendait char ou
> > identifier ou <tag>
> >     1 | %token: FOO
> >     
> >       |       ^
> 
> looks stupid; "char", "identifier" and "<tag>" should be translated.

Well, my point was that translation is trivial. An average developer is
certainly able to solve the required translation tasks very easily with custom
code if required. The other aspect though, looking ahead on parser states with
custom code, is not trivial. So my point was: when in doubt it might make sense
to concentrate on the mandatory aspect of providing access to the raw symbol
names and to ignore the translation aspects for now.

Best regards,
Christian Schoenebeck


