Re: why are locations dictated by bison?

From: Bruce Lilly
Subject: Re: why are locations dictated by bison?
Date: Fri, 18 Jan 2002 13:58:39 -0500

Hans Aberg wrote:
> In traditional C, \0 is used as a string terminator, so there is no
> point in giving it special consideration.
> But C++ strings do not have that restriction,

Stop! '\0' is used as a string terminator by convention; any
program that wishes to handle ASCII NUL must use a different
convention regardless of the programming language (and flex
provides both yytext and yyleng). The following complete C
program works just fine:

#include <stdio.h>
int main(int argc, char **argv) { /* output a string containing an embedded NUL */
        char *p; int i;
        for (i=0,p="foo\0bar"; i<7; i++,p++) putc(*p, stdout);
        return 0;
}

and the following C++ program doesn't:

#include <iostream>
using namespace std;
int main(int argc, char **argv) { /* Premature exasperation */
        cout << "foo\0bar";
        return 0;
}

so it's clearly not a simple matter of "C++ strings do
not have that restriction". Provided that the lexical
analyzer generator boilerplate code handles ASCII NUL
appropriately there is no reason why it cannot be used
to generate a C language lexical analyzer which handles
NUL.  Flex 2.5.4 works, though some earlier versions
(e.g. 2.5.2) did not.  There is really no difference in
the way C and C++ handle '\0' in strings (i.e. arrays
of char); indeed C++ was first implemented as a
preprocessor which generated C code.  Perhaps by "C++
strings" you didn't really mean strings, but some
structure (called a "class" in order to obfuscate it)
called a "string" (in order to obfuscate IT), but then
you're comparing apples to oranges -- and there's no
reason why similar C structures cannot be used to do
the same thing.  So much for C vs. C++; it's a
non-issue w.r.t. the lexical analyzer/parser interface
return value.
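
A minimal sketch of the "similar C structures" point, assuming a
hypothetical counted_str type (the type and function names are
illustrative only; they are not part of flex, bison, or any standard
library).  The length is carried next to the pointer, much as flex
carries yyleng next to yytext, so '\0' is just another byte:

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical counted-string type: the length is stored explicitly,
 * so a '\0' byte is ordinary data rather than a terminator. */
struct counted_str {
    const char *data;
    size_t      len;
};

/* Write every byte, including embedded NULs; returns the number of
 * bytes actually written. */
size_t counted_write(const struct counted_str *s, FILE *out)
{
    return fwrite(s->data, 1, s->len, out);
}

/* Count the NUL bytes stored inside the string. */
size_t counted_nul_count(const struct counted_str *s)
{
    size_t i, n = 0;

    for (i = 0; i < s->len; i++)
        if (s->data[i] == '\0')
            n++;
    return n;
}
```

Initialized as { "foo\0bar", 7 }, such a structure keeps all seven
bytes; nothing about it requires C++.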

> and Unicode is coming along
> as well.

That also makes no difference to the interface.
And Unicode uses the base ASCII set, including NUL.

> So it might be prudent to treat \0 as a full character in all
> circumstances: Not doing so may cause people to overlook proper treatment
> on this silly little detail.

It's unclear what you mean by a "full" character...
Regarding special handling of characters in a lexical
analyzer generator, that has long been the case; for
example, consider that \n does not match the pattern
. which matches all other characters.  Does that mean
that '\n' is in some way not a "full" character?
Should that part of flex be changed also?

yylex has always been defined as a function that returns
an integer value representing some lexical token or end
of input (and the value zero has been reserved for that
purpose). If a programmer has a need to return a token
for the single character '\0', some integer value is
assigned to a token and that value is returned and used
by the parser. That's what %token and y.tab.h (by that
or any other name) are for.  If '\0' need not be treated
as a special separate token, it can (with a suitable
lexical analyzer generator) simply be placed in yytext
and some appropriate token, e.g. STRING, can be
returned. If you think that the integer return value
from yylex must always map to the value of a single
character, you're thinking too narrowly. The main point
of a lexical analyzer is to return values representing
significant tokens in the input, like STRING, NUMBER,
LESS_THAN, etc., rather than returning input characters
one-by-one.  If you want a yylex() function that
returns a different token for every individual input
character, a simple hand-written lexical analyzer would
be preferable to one generated by a tool like flex, and
so in that case, there's no point in talking about
making changes to flex.
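
To make the token-versus-character distinction concrete, here is a
rough hand-written sketch, not flex output.  The token codes NUL and
STRING, the lex_input structure and next_token() are all assumptions
made up for illustration; in a real project the codes would come from
%token declarations via y.tab.h.  Zero is returned only at end of
input, while a '\0' byte in the input gets its own nonzero token:

```c
#include <stddef.h>

/* Hypothetical token codes, in the style of what %token declarations
 * would produce.  They sit above 255, leaving 0 free for end of input. */
enum { NUL = 258, STRING = 259 };

/* Input is a buffer plus an explicit length, so it may contain '\0'. */
struct lex_input {
    const unsigned char *buf;
    size_t               len;
    size_t               pos;
};

/* Returns 0 at end of input, NUL for a single '\0' byte, and STRING
 * for a run of any other bytes.  *tok_len receives the token length. */
int next_token(struct lex_input *in, size_t *tok_len)
{
    size_t start = in->pos;

    if (in->pos >= in->len) {           /* end of input: the reserved 0 */
        *tok_len = 0;
        return 0;
    }
    if (in->buf[in->pos] == '\0') {     /* NUL byte: an ordinary token */
        in->pos++;
        *tok_len = 1;
        return NUL;
    }
    while (in->pos < in->len && in->buf[in->pos] != '\0')
        in->pos++;
    *tok_len = in->pos - start;
    return STRING;
}
```

The integer handed to the parser for '\0' is whatever value the NUL
token happens to have; it never needs to be zero.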

> But the idea that now springs up is that the Flex mode is tuned together
> to Bison, and Bison only. This does not mean that Flex cannot be used with
> the other ones, only that the fine tuning is with Bison.
> Using the YYEOF macro would be prudent regardless of
> its value, as one should normally use macros instead of values.

So, in your opinion, we should have yyparse() return
instead of 0 if no parse errors are encountered?
And in the case of bison,
instead of 2, etc., etc., etc.

On a more serious note, what applies to boilerplate code
applies equally well to the grammar rulesets in the
parser.  One should therefore use a token like NUL in
those rulesets rather than hard-coding '\0' (which
can't appear there anyway). So it doesn't matter what
integer value is assigned to that token (NUL), and there is
therefore no reason to go through all sorts of contortions
to try to force a particular value (viz. zero).

> So
>   #ifndef YYEOF
>   #define YYEOF
>   #endif
>   ...
>   return YYEOF;
> would not cause any incompatibilities, but would admit those that want to
> change it to do so, plus by using YYEOF explicitly, it would communicate to
> the human reader that this is an end of parser condition.

No, an end of input condition, not end of parser (see, it
doesn't help; it confused you already :-).  Because of
yywrap(), parsing may continue. The human programmer who is
writing an application using a lexical analyzer generator
isn't going to see that macro because it never appears in his
.l file.  It is, however, likely to be seen by the person(s)
maintaining the generator.  In order to do proper regression
testing, the generator will have to be built and tested on
multiple platforms, with multiple compilers using every
significant variation of every macro.  Does the term
"combinatorial explosion" sound familiar?  Adding a single
macro with N possible significant values multiplies the
number of tests by a factor of N (N+1 if you also consider,
as you should, leaving the macro undefined). Flex already
has macros yyterminate and YY_NULL; why do you think yet
another one is necessary?  Here are the relevant definitions:

/* Returned upon end-of-file. */
#define YY_NULL 0

/* No semi-colon after return; correct usage is to write "yyterminate();" -
 * we don't want an extra ';' after the "return" because that will cause
 * some compilers to complain about unreachable statements.
 */
#ifndef yyterminate
#define yyterminate() return YY_NULL
#endif

In what way is "Returned upon end-of-file" not already
clear?  You seem to be suggesting that this should instead
be something like:

/* Warning, warning! Danger, Will Robinson!
   YYEOF *must* have the same value that is used
   internally by bison and which is not visible
   outside of the bison-generated parser.
*/
#include <hans.h>       /* defines YYEOF, which is *not* defined in y.tab.h */
/* Warning, warning! Extreme danger, Will Robinson!
   Bison defines YYEOF as 0, unconditionally!
   It is not wrapped in #ifndef!
   If you attempt to redefine it, your compiler
   will vomit all over your shoes!
*/
/* tuning for use with bison, and bison only; all other uses unaffected */
/* Warning, warning! Ultimate danger, Will Robinson!
   If the behavior of bison is changed to accommodate this silliness,
   an additional test for the *version* of bison will have to be made
   here, in the lexical analyzer generator boilerplate code.
*/
#ifndef YYEOF
#define YYEOF 0
#else
#undef YYEOF
#define YYEOF 0
#endif
/* Returned upon end-of-file. */

with yyterminate unchanged.  In what way does that more
clearly "communicate to the human reader"?  To me it
seems more obtuse because to find out what is *really*
returned one has to track back through additional
macro definitions and files, determine whether or not
the additional macros have been redefined somewhere and
determine what else might be affected by the additional
header file contents.  And it's unclear where hans.h
will reside, how to avoid naming clashes with headers
used by other software, etc., etc., etc.  What happens
when the .c files generated by bison and flex are
transported to another system -- one without bison or
flex -- and is compiled there?  How does hans.h get
there?  Where does it reside in that case?  How do you
ensure that the value assigned to the YYEOF macro
doesn't conflict with those assigned to other tokens?
Where is hans.h when I use flex with yacc? With byacc?

This "silly little detail" as you call it seems to be
getting bigger and bigger (and sillier).  All for the
sake of avoiding the two lines:

%token NUL

\0 { return NUL; }

I asked why introduce difficulties to avoid that, and
you've given me:

1. C vs. C++
A non-issue. And incorrect.

2. Unicode

3. \0 should be a "full" character
Well, it certainly looks more full than that anorexic
\1... which, of course, has nothing to do with an
integer return value.

4. "fine tuning"

5. macros should be used
Fine. Use %token and NUL as above. Change the name if
you wish. That's an argument in favor of the two-line
approach above.

6. no incompatibilities
Where oh where *does* hans.h reside?

7. better communication
It's an end of parser condition, no, wait...

You haven't answered the question; where's the benefit
to offset the difficulties?  You have succeeded only in
raising additional difficulties, a few non sequiturs,
and one point in favor of the simple two-line approach.
