Re: Lexer woes

From: Ben Pfaff
Subject: Re: Lexer woes
Date: Tue, 23 Sep 2008 21:19:26 -0700
User-agent: Gnus/5.11 (Gnus v5.11) Emacs/22.2 (gnu/linux)

John Darrington <address@hidden> writes:

> Sometime in the future though, I think we're going to need something
> better.  In particular, I think it'd be really great to be able to do
> command line completion, both for pspp and psppire (I understand spss
> v17 does it already).  So for example given a  partial syntax like
> if the user hits tab at this point, she'll get a list of the variable
> names in the dictionary (but only those which are appropriate, ie not
> scratch, string etc).  Similarly, for any given command, it'd be good
> to have a list of subcommands valid for that command.

I also think this would be nice to have.  It is a problem that I
did some work on a few years ago.  It is not too hard to write a
parser generator that accepts a context-free grammar for PSPP
syntax and outputs C code to parse it, and in fact I did most of
the work necessary for that.

The tricky part, which was what stymied me at the time, is in
fact how you pass the resulting parse tree back to the command
that wants it in a useful form.  If you do it in any of the ways
that were obvious to me, it takes a lot of code to traverse the
parse tree, verify its semantics, and translate it into a form
that is useful for further processing.  In the cases that I
looked at, it takes about as much code to do this, in fact, as it
does to write a parser by hand.  And that is not much of a win.

But I have some newer ideas now that might make it much easier.
If you have time to work on this and you want to hear some of my
ideas, or to look over the work-in-progress parser generator code
that I wrote, then please say so.

> But back to the current issue, parsing the K-W as three tokens, whilst
> will work for the purpose of syntax verification, obviously falls down
> in the bigger picture.  The obvious solution would have been to allow
> '-' as  a valid character in the T_ID token.  However this means that
> constructs like
> suddenly get misinterpreted.  But so far as I can see, there are only
> a few special places in spss syntax where algebraic expressions like
> that can occur (in an IF, LOOP, COMPUTE, RECODE and a few others). I
> wonder if it might not be a better solution to throw the lexer into a
> different mode when an expression is expected.  Obviously there will
> be complications (like when to switch back to non-expression mode).

I do not think that this is the right solution to this particular
problem.  Keywords that include '-' are very rare, but the use of
identifiers in other circumstances is very common.  We would have
to add special cases for variable names, file handle names,
vector names, etc. to disallow the use of '-', and we would gain
very little.

I think that a better solution would be to write a new function
for the lexer that understands how to match a hyphenated word.
For example (this is untested and can probably be improved):

    /* If the lexer is positioned at the (pseudo)identifier S, which
       may contain a hyphen ('-'), skips it and returns true.  Each
       half of the identifier may be abbreviated to its first three
       characters.  Otherwise, returns false. */
    bool
    lex_match_hyphenated_word (struct lexer *lexer, const char *s)
    {
      const char *hyphen = strchr (s, '-');
      if (hyphen == NULL)
        return lex_match_id (lexer, s);
      else if (lexer->token != T_ID
               /* The second argument was cut off in the archived
                  message; the current token's identifier is assumed. */
               || !lex_id_match (ss_buffer (s, hyphen - s),
                                 ss_cstr (lexer->tokid))
               || lex_look_ahead (lexer) != '-')
        return false;
      else
        {
          lex_get (lexer);
          lex_force_match (lexer, '-');
          lex_force_match_id (lexer, hyphen + 1);
          return true;
        }
    }

Once we have that, it should be easy to teach q2c to use it when
it is necessary.  And then we don't disturb any code that doesn't
need it.
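
As a rough, self-contained illustration of the matching rule described
above (each half of a hyphenated keyword may be abbreviated to its first
three characters), here is a sketch that works on plain strings rather
than on the PSPP lexer.  The names toy_id_match and toy_match_hyphenated
are hypothetical, and the case-insensitive, three-character abbreviation
rule is an assumption taken from the comment on
lex_match_hyphenated_word, not from the real lex_id_match:

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper.  Returns true if the TOK_LEN-byte token TOK
   matches the KW_LEN-byte keyword KW, either in full or as a prefix
   of at least three characters.  Comparison ignores case. */
bool
toy_id_match (const char *kw, size_t kw_len,
              const char *tok, size_t tok_len)
{
  size_t i;

  if (tok_len > kw_len || (tok_len < kw_len && tok_len < 3))
    return false;
  for (i = 0; i < tok_len; i++)
    if (toupper ((unsigned char) kw[i]) != toupper ((unsigned char) tok[i]))
      return false;
  return true;
}

/* Hypothetical helper.  Returns true if INPUT matches keyword KW.
   If KW contains a hyphen, INPUT must contain one too, and each
   half is matched separately with toy_id_match. */
bool
toy_match_hyphenated (const char *kw, const char *input)
{
  const char *kw_hyphen, *in_hyphen;

  assert (kw != NULL && input != NULL);
  kw_hyphen = strchr (kw, '-');
  in_hyphen = strchr (input, '-');
  if (kw_hyphen == NULL)
    return (in_hyphen == NULL
            && toy_id_match (kw, strlen (kw), input, strlen (input)));
  if (in_hyphen == NULL)
    return false;
  return (toy_id_match (kw, kw_hyphen - kw, input, in_hyphen - input)
          && toy_id_match (kw_hyphen + 1, strlen (kw_hyphen + 1),
                           in_hyphen + 1, strlen (in_hyphen + 1)));
}
```

Under these assumptions, "kru-wal" matches KRUSKAL-WALLIS but "kr-wal"
does not, because its first half is shorter than three characters and
is not the whole word.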

Another approach that I would be happy with is this: in
circumstances where we may want to parse a hyphenated word, call
a special lexer routine to try to add a hyphenated part to the
current identifier.  For example (this is also untested):

    /* If LEXER's current token is an identifier and it is followed
       in the input by a hyphen and a series of letters, appends
       those characters to the token.  The effect is that the
       identifier may now be an entire hyphenated name,
       e.g. T-TEST. */
    void
    lex_hyphenate_token (struct lexer *lexer)
    {
      size_t len, n;

      if (lexer->token != T_ID
          || lexer->prog == NULL
          || lexer->prog[0] != '-'
          || !isalpha ((unsigned char) lexer->prog[1]))
        return;

      /* Count number of bytes of hyphenated name. */
      for (n = 2; lexer->prog[n] != '\0'; n++)
        if (!isalpha ((unsigned char) lexer->prog[n]))
          break;

      /* Append hyphenated part to lexer->tokstr. */
      ds_put_substring (&lexer->tokstr, ss_buffer (lexer->prog, n));

      /* Append hyphenated part to lexer->tokid. */
      len = strlen (lexer->tokid);
      str_copy_buf_trunc (&lexer->tokid[len], sizeof lexer->tokid - len,
                          lexer->prog, n);
      lexer->prog += n;
    }

q2c could also be easily modified to call lex_hyphenate_token()
where necessary.
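
To make the second approach easier to try outside PSPP, here is a
minimal sketch of the same idea on a stripped-down lexer state.  The
struct toy_lexer, its two fields, the 65-byte buffer size, and the
function toy_hyphenate_token are all hypothetical stand-ins, not the
real PSPP definitions; the sketch also skips the token-type check and
the separate tokstr copy:

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, stripped-down lexer state. */
struct toy_lexer
  {
    char tokid[65];       /* Current identifier, null-terminated. */
    const char *prog;     /* Unconsumed input just after the token. */
  };

/* If the input after the current identifier is a hyphen followed by
   one or more letters, appends the hyphen and letters to the
   identifier (truncating if the buffer is full) and consumes them
   from the input.  Otherwise does nothing. */
void
toy_hyphenate_token (struct toy_lexer *lexer)
{
  size_t len, room, n;

  assert (lexer != NULL);
  if (lexer->prog == NULL
      || lexer->prog[0] != '-'
      || !isalpha ((unsigned char) lexer->prog[1]))
    return;

  /* Count the hyphen plus the run of letters that follows it.
     isalpha() is false for the terminating null byte. */
  for (n = 2; isalpha ((unsigned char) lexer->prog[n]); n++)
    continue;

  /* Append as much as fits, leaving room for the null terminator. */
  len = strlen (lexer->tokid);
  room = sizeof lexer->tokid - len - 1;
  strncat (lexer->tokid, lexer->prog, n < room ? n : room);
  lexer->prog += n;
}
```

With tokid "T" and remaining input "-TEST x y", the call leaves tokid
as "T-TEST" and the input at " x y".  With remaining input "-1" nothing
changes, but "a-b" would be joined into one identifier, which is why
the routine should be called only where a hyphenated word is expected.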
"Sanity is not statistical."
--George Orwell