Re: Generating the input-parser with flex

From: Bruno Haible
Subject: Re: Generating the input-parser with flex
Date: Sat, 22 Jan 2022 14:12:38 +0100

Hi Viktor,

> I started writing a flex-file for the task and managed to get it quite
> functional with only ~100 lines of code

This is an interesting simplification. Flex being generally accepted as
a build tool in GNU, I don't see an a-priori problem with this approach.

> I looked into the gperf input-parser and found it quite excessive to
> have over 1000 lines of code to parse only 24 variables.

Hmm. The lines 46..227 of input.cc smell like they can be replaced by
Flex code. But what about the rest of input.cc?

> So I started writing a flex-file for the task and managed to get it
> quite functional with only ~100 lines of code

The two things to watch out for, while making this change, are:
  * Preserve the "loose parsing" principle. The gperf input allows sections
    of unspecified form in several places. It would not be good to
    replace it with a "strict parser" that allows only specific syntaxes
    in these sections.
  * Quality of error diagnostics: We should preserve the quality of the
    diagnostics, or make it better. I have a certain feeling that it
    might well be possible to obtain better diagnostics with a Flex-based
    approach, but that will take additional code, which is not contained
    in the ~100 lines that you have so far.

> Flex uses regular expressions everywhere, so it was quite easy to even
> enforce certain patterns for the variables. For example the
> hash-function-name can only be set to something that matches
> [a-zA-Z_][a-zA-Z0-9_]* (the pattern for C/C++ identifiers). So something
> that isn't a valid function-name will be caught already before it gets
> to the compiler and causes errors there.

This is actually NOT desired. You see that currently the
is_define_declaration parser allows the hash-function-name to be just
any word. It may contain non-ASCII characters (which are allowed in
C++ identifiers). What we want is
  - that when a new C or C++ standard comes out, with updated details
    on identifier syntax, we don't need to adjust the gperf source code,
  - that people can generate C code and then transform it to D or Rust
    or JavaScript or whatever, with just a 'sed' script.
You kill these abilities when checking the hash-function-name in a
strict way.
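
To make the difference concrete, here is a minimal Python sketch (not
gperf code). The function names strict_name and loose_name are made up
for illustration, and the sketch assumes that a "word" is simply a
non-empty run of non-whitespace characters; the exact rule used by
is_define_declaration is not reproduced here.

```python
import re

# The strict pattern proposed in the quoted text: ASCII-only identifiers.
STRICT = re.compile(r"[a-zA-Z_][a-zA-Z0-9_]*")

# Assumption for illustration: a "word" is any non-empty run of
# non-whitespace characters.
LOOSE = re.compile(r"\S+")

def strict_name(word):
    """True if the whole word matches the strict identifier pattern."""
    return STRICT.fullmatch(word) is not None

def loose_name(word):
    """True if the whole word is a single whitespace-free token."""
    return LOOSE.fullmatch(word) is not None

# A plain name, a non-ASCII name, and a placeholder meant to be
# rewritten later by a 'sed' script.
for name in ["hash", "größe_hash", "@HASH_NAME@"]:
    print(name, strict_name(name), loose_name(name))
```

Under these assumptions the strict check rejects both the non-ASCII
name and the placeholder, while the loose check passes them through
unchanged to the generated code.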

> I haven't tested the speed

Speed is not relevant here. The vast majority of the computation time
of gperf is spent in search.cc.

> I also wrote a less strict parser in another flex file. This one allows
> multiple syntaxes for the same functionality. For example you can use
> %FOO = BAR or %FOO BAR or %define FOO BAR or %define FOO = BAR
> interchangeably.

This is not a useful feature. In general, a programming tool's input should
have *one* preferred syntax, so that developers can use 'grep' with simple
patterns to find what they are looking for, and so that editors can offer
a syntax highlighting with less code.
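
As a rough illustration of the 'grep' argument, the following Python
sketch searches the four spellings from the quoted text. The two
patterns are my own and only stand in for what a developer might type;
they are not taken from gperf or from the proposed flex files.

```python
import re

# The four interchangeable spellings from the quoted text.
lines = [
    "%define FOO BAR",
    "%FOO = BAR",
    "%FOO BAR",
    "%define FOO = BAR",
]

# The simple pattern someone would naturally type into grep.
simple = re.compile(r"^%define FOO ")

# The pattern needed to catch every permitted spelling.
combined = re.compile(r"^%(define\s+)?FOO(\s*=\s*|\s+)")

simple_hits = [l for l in lines if simple.search(l)]
combined_hits = [l for l in lines if combined.search(l)]

print(len(simple_hits), "of", len(lines), "found by the simple pattern")
print(len(combined_hits), "of", len(lines), "found by the combined pattern")
```

The simple pattern misses the spellings without "define", so every
search (and every syntax highlighter) would have to carry the more
complicated pattern instead.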

For example, in Common Lisp, there are several ways to write an assignment:
  (setq variable value)
  (setf variable value)
  (SETF variable value)
  (setF variable value)
This is a misfeature, not a feature. It would be better if there was only
one way.

Likewise, last time I counted, there were 12 ways to define a function
or function pointer in C++, which is far too much complexity.
