Let's Play: Use the Source, Luke! (was: .ie as target of .if)


From:	G. Branden Robinson
Subject:	Let's Play: Use the Source, Luke! (was: .ie as target of .if)
Date:	Sun, 27 Sep 2020 17:44:28 +1000
User-agent:	NeoMutt/20180716
Hi, Dave!

At 2020-09-17T12:03:31-0500, Dave Kemper wrote:
> Consider the much simpler example:
> 
> .if 0 .if 1 \{\
> .tm foo
> .\}
> .tm bar
> 
> Following your explanation, the interpreter would evaluate ".if 0",
> decide it was false, and ignore the rest of the line, thus missing
> that the line ends in a \{.  Therefore it would go to the next line,
> and -- unaware that it's inside an opening brace, since it never "saw"
> it -- execute the ".tm foo" request.  Proceeding to the next line, it
> encounters an unbalanced closing brace, which it silently ignores (you
> can verify that it doesn't care about mismatched closing braces by
> duplicating that line as many times as you please in the input file).
> Finally, it hits the last line and emits "bar" on stderr.
> 
> But that's not what happens.  Groff does not print "foo" to stderr,
> which can only happen if it does in fact process the opening brace --
> which is associated with a request (the second .if) that it never
> looks at.  This implies that, at least in some circumstances, the
> interpreter recognizes opening braces as flow-control structures, and
> scans for them even in code it would otherwise never examine.
> 
> The .ie request is just as much a language flow-control element as an
> opening brace, yet (per my original question) the interpreter does not
> treat them the same, ignoring the .ie request in a position (after a
> false conditional) where it does not ignore an opening brace.  And the
> opening brace is associated with the ".if 1", not the ".if 0", so it's
> not as simple as a special case of looking for such a brace
> immediately following a false conditional.  It is, in fact, looking
> BEYOND where it would have needed to look just to find the .ie request
> of my first example.
> 
> Again, if this is considered "working as designed," it should be
> documented as such, but it's not clear to me just how to document it.
> Tadziu's suggestion does not account for the opening-brace exception.
> 
> And are there other exceptions?  And why are there exceptions at all?

I'm far from an expert on the groff parser, but I have studied it a bit
and made _small_ changes.

I can think of two reasons there are exceptions to your model:

(1) Ease of maintenance of a hand-written recursive-descent parser; and
(2) No lookahead.  troff has to operate as a Unix filter.  It can store
all the state it wants but it must act on the most recent character it
has read.

> It seems like a more consistent (and, not incidentally, easier to
> document) language design to handle all flow-control constructs the
> same way: it either unilaterally ignores them after an .if that
> evaluates to false, or unilaterally scans ahead to see whether any
> occur later on the line.  Instead, the behavior seems arbitrary and
> capricious -- which *can* be documented, but still isn't a good
> language design.

Well, let's go to the source.  What we need is a few functions from
src/roff/troff/input.cpp:

do_if_request()   (by far the longest)
if_else_request()
if_request()
else_request()

The reason we have two handlers for "if" is that the actual if-handling
logic has two call sites; one, if_request(), is dispatched when an ".if"
request is seen on the input.  The other is called by if_else_request().

A key difference between these two functions is that if_request has no
return value (returns void, in C parlance)--just like all *roff request
handlers in GNU troff.  do_if_request() returns an integer.

Another key design feature is a data structure called "int_stack", which
as you may have guessed is simply a stack for integers.  The one of
interest here is called "if_else_stack".

static int_stack if_else_stack;

Let us consider the short, easy functions first.

void if_request()
{
  do_if_request();
}

...as simple as you can get.

void if_else_request()
{
  if_else_stack.push(do_if_request());
}

This is more revealing.  If we have an .ie request, call do_if_request()
_but push its return value onto the integer stack we set up_.

What about the "else" part of our "if-then-else"?

void else_request()
{
  if (if_else_stack.is_empty()) {
    warning(WARN_EL, "unbalanced .el request");
    skip_alternative();
  }

The above is pretty obvious.  If we hit an .el, we'd better have seen an
.ie first.

  else {
    if (if_else_stack.pop())
      skip_alternative();
    else
      begin_alternative();
  }
}

I think we're getting closer to the heart of the discussion here.

In a well-formed groff document, an .el is only encountered after an
.ie, which as seen above pushed the result of the if-conditional onto
the stack.  So when we see .el, we pop that integer value and test its
truthiness.
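
To make the push/pop pairing concrete, here is a minimal sketch of the
bookkeeping.  It is not groff's code: IntStack, on_ie(), on_el(), and
ElAction are hypothetical names invented for this illustration, and the
real int_stack and request handlers in input.cpp do more than this.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-in for groff's int_stack; only the three
// operations used by the handlers quoted above are modelled.
class IntStack {
public:
  void push(int n) { items.push_back(n); }
  bool is_empty() const { return items.empty(); }
  int pop() {
    assert(!items.empty());  // callers are expected to check is_empty()
    int n = items.back();
    items.pop_back();
    return n;
  }
private:
  std::vector<int> items;
};

// What a model .el decides to do with its body.
enum class ElAction { RunBody, SkipBody, Unbalanced };

// Model of if_else_request(): remember how the condition came out.
void on_ie(IntStack &stack, bool condition) {
  stack.push(condition ? 1 : 0);
}

// Model of else_request(): consult the remembered result.
ElAction on_el(IntStack &stack) {
  if (stack.is_empty())
    return ElAction::Unbalanced;  // groff warns here, then skips the body
  return stack.pop() ? ElAction::SkipBody : ElAction::RunBody;
}
```

Note what this implies for an .ie that is never dispatched, for instance
because it sat on a line that was being skipped: nothing gets pushed, so
the .el that follows finds an empty stack and takes the "unbalanced"
branch.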

If the condition was FALSE, we call begin_alternative:

static void begin_alternative()
{
  while (tok.space() || tok.left_brace())
    tok.next();
}

This just throws away space and left brace tokens until it can return.
That makes sense: if the condition was FALSE, we want to execute the
"body" of the .el.

skip_alternative() has the harder job.  It has to consume the body of
the ELSE in a semi-interpreted way; enough to syntactically find the
end of it, but not actually change the state of the engine with respect
to anything it sees.

Recall that we entered this function from an .el whose body is being
skipped either because the .el was invalid (.el without .ie) or because
the "if" part of an if-else (.ie) was true.  There's one[1] other call
site as we'll get to in a moment.

This is the second-longest function we'll examine in today's excursion.
And it's only 40 lines!

static void skip_alternative()
{
  int level = 0;

We're going to keep track of how many \{ \} escapes are nested.

  // ensure that ".if 0\{" works as expected
  if (tok.left_brace())
    level++;

The above is a special case, as noted.

  int c;
  for (;;) {
    c = input_stack::get(0);
    if (c == EOF)
      break;

That's more malformed-input handling.

    if (c == ESCAPE_LEFT_BRACE)
      ++level;
    else if (c == ESCAPE_RIGHT_BRACE)
      --level;

I _think_ the above refer to the quasi-interned form in which, for
instance, macro definitions are stored.  In other words, if we see
these, we're reading something that was stored in "copy mode".  We're
seeing it because someone called a macro, and its body has been
interpolated into the input stream for us.

The next conditional handles input in...non-copy mode, a thing that no
*roff documentation I have ever seen has a name for.  (This irritates
me.  Inside me there is an Aristotle or a Linnaeus struggling to get
out.)

    else if (c == escape_char && escape_char > 0)
      switch(input_stack::get(0)) {
      case '{':
        ++level;
        break;
      case '}':
        --level;
        break;

At any rate, the last four cases we've seen do obvious things: increase
the nesting level if we've seen some form of open-brace, and decrease it
if we've seen some form of close-brace.

      case '"':
        while ((c = input_stack::get(0)) != '\n' && c != EOF)
        ;

We're still inside that "else if (c == escape_char)", so this is
handling a traditional-style roff comment: \" foo.  It runs until the
next newline.

I don't know why \# isn't handled here.  Someone want to try to break
the parser with a test case before I get around to it?

      }
    /*
      Note that the level can properly be < 0, e.g.

        .if 1 \{\
        .if 0 \{\
        .\}\}

      So don't give an error message in this case.
    */
    if (level <= 0 && c == '\n')
      break;

The DevTeam thinks of everything!

More importantly, this break takes us out of the for loop at the first
newline we see once the nesting level is back to zero or below.  That
covers both the line that closes the last brace scope we entered and
the newline at the end of a braceless one-line body.

  }
  tok.next();

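And there's the magic.  The "for (;;)" that just closed eats input
characters forever until forced to break out of the loop; only then
does tok.next() fetch a fresh token, so normal interpretation resumes
right after the skipped text.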

}

End of function.
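
To see the counting in action on the examples from this thread, here is a
rough, simplified model of that loop working on a std::string instead of
groff's input stack.  skip_scan() is a hypothetical name; the sketch
assumes the escape character is a backslash, ignores the
ESCAPE_LEFT_BRACE/ESCAPE_RIGHT_BRACE forms and the tok.left_brace()
special case, and simply returns the position where normal interpretation
would pick up again.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Simplified model of the skip loop: start scanning `in` at `pos` and
// return the index just past the text that would be thrown away.
std::size_t skip_scan(const std::string &in, std::size_t pos) {
  assert(pos <= in.size());
  int level = 0;                        // nesting depth of \{ ... \}
  while (pos < in.size()) {
    char c = in[pos++];
    if (c == '\\' && pos < in.size()) {
      char e = in[pos++];               // character after the escape
      if (e == '{')
        ++level;
      else if (e == '}')
        --level;
      else if (e == '"') {
        // \" comment: swallow everything through the newline
        while (pos < in.size()) {
          c = in[pos++];
          if (c == '\n')
            break;
        }
      }
      // Any other escape (including backslash-newline) is consumed;
      // c is still the backslash, so it cannot end the scan.
    }
    if (level <= 0 && c == '\n')
      break;                            // end of the skipped alternative
  }
  return pos;
}
```

Feeding it the text that follows ".if 0" in Dave's brace example, that is
" .if 1 \{\", ".tm foo", ".\}", ".tm bar" as four lines, the scan stops
right after the ".\}" line, so only ".tm bar" is left to interpret.
Feeding it " .ie 1 foo" followed by ".el bar", the scan stops at the first
newline and leaves the ".el" line behind.  Under these assumptions the
scan treats braces specially only because it counts them; it never looks
at request names at all.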

At this point I'm finding myself wanting dinner, so I'll be a bit of a
dick and leave the ~140-line do_if_request() as an exercise for the
reader.  But actually I think the above answers the question at hand.

Also, a lot of the following function is tied up with implementing the
*roff conditionals, ".if d", ".if r", and so on, so it's not interesting
from the perspective of resolving when GNU troff fully interprets
conditional input versus when it doesn't.  Skip to the end for the good
bits.

int do_if_request()
{
  int invert = 0;
  while (tok.space())
    tok.next();
  while (tok.ch() == '!') {
    tok.next();
    invert = !invert;
  }
  int result;
  unsigned char c = tok.ch();
  if (c == 't') {
    tok.next();
    result = !nroff_mode;
  }
  else if (c == 'n') {
    tok.next();
    result = nroff_mode;
  }
  else if (c == 'v') {
    tok.next();
    result = 0;
  }
  else if (c == 'o') {
    result = (topdiv->get_page_number() & 1);
    tok.next();
  }
  else if (c == 'e') {
    result = !(topdiv->get_page_number() & 1);
    tok.next();
  }
  else if (c == 'd' || c == 'r') {
    tok.next();
    symbol nm = get_name(1);
    if (nm.is_null()) {
      skip_alternative();
      return 0;
    }
    result = (c == 'd'
              ? request_dictionary.lookup(nm) != 0
              : number_reg_dictionary.lookup(nm) != 0);
  }
  else if (c == 'm') {
    tok.next();
    symbol nm = get_long_name(1);
    if (nm.is_null()) {
      skip_alternative();
      return 0;
    }
    result = (nm == default_symbol
              || color_dictionary.lookup(nm) != 0);
  }
  else if (c == 'c') {
    tok.next();
    tok.skip();
    charinfo *ci = tok.get_char(1);
    if (ci == 0) {
      skip_alternative();
      return 0;
    }
    result = character_exists(ci, curenv);
    tok.next();
  }
  else if (c == 'F') {
    tok.next();
    symbol nm = get_long_name(1);
    if (nm.is_null()) {
      skip_alternative();
      return 0;
    }
    result = check_font(curenv->get_family()->nm, nm);
  }
  else if (c == 'S') {
    tok.next();
    symbol nm = get_long_name(1);
    if (nm.is_null()) {
      skip_alternative();
      return 0;
    }
    result = check_style(nm);
  }
  else if (tok.space())
    result = 0;
  else if (tok.delimiter()) {
    token delim = tok;
    int delim_level = input_stack::get_level();
    environment env1(curenv);
    environment env2(curenv);
    environment *oldenv = curenv;
    curenv = &env1;
    suppress_push = 1;
    for (int i = 0; i < 2; i++) {
      for (;;) {
        tok.next();
        if (tok.newline() || tok.eof()) {
          warning(WARN_DELIM, "missing closing delimiter");
          tok.next();
          curenv = oldenv;
          return 0;
        }
        if (tok == delim
            && (compatible_flag
            || input_stack::get_level() == delim_level))
          break;
        tok.process();
      }
      curenv = &env2;
    }
    node *n1 = env1.extract_output_line();
    node *n2 = env2.extract_output_line();
    result = same_node_list(n1, n2);
    delete_node_list(n1);
    delete_node_list(n2);
    curenv = oldenv;
    have_input = 0;
    suppress_push = 0;
    tok.next();
  }
  else {
    units n;
    if (!get_number(&n, 'u')) {
      skip_alternative();
      return 0;
    }
    else
      result = n > 0;
  }
  if (invert)
    result = !result;
  if (result)
    begin_alternative();
  else
    skip_alternative();
  return result;
}
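
For anyone who wants to poke at the shape of the start and the numeric
tail of that function without building groff, here is a small model of
just the "!" prefix handling and the plain numeric case.
model_numeric_if() is a hypothetical name, and it uses std::strtol() as a
stand-in for get_number(), so it knows nothing about units, expressions,
or any of the letter conditions.

```cpp
#include <cstddef>
#include <cstdlib>
#include <string>

// Simplified model: leading spaces, any number of '!' prefixes, then a
// plain decimal integer.  Returns the truth value of the condition.
bool model_numeric_if(const std::string &cond) {
  std::size_t i = 0;
  while (i < cond.size() && cond[i] == ' ')
    ++i;
  bool invert = false;
  while (i < cond.size() && cond[i] == '!') {
    ++i;
    invert = !invert;                 // each '!' flips the sense again
  }
  const char *start = cond.c_str() + i;
  char *end = nullptr;
  long n = std::strtol(start, &end, 10);
  if (end == start)
    return false;                     // no number: false before inversion
  bool result = n > 0;                // only positive values are true
  return invert ? !result : result;
}
```

Two things carry over from the real function: a doubled "!" cancels
itself out, and a condition that fails to parse returns false early,
before the inversion is ever applied.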

Regards,
Branden