
Re: input.cpp, read_size() error reporting


From: Ingo Schwarze
Subject: Re: input.cpp, read_size() error reporting
Date: Fri, 3 Apr 2020 19:31:03 +0200
User-agent: Mutt/1.12.2 (2019-09-21)

Hi Branden,

i agree with almost all you say here (maybe with all of it, but it's
easy to miss something given how much there was).

G. Branden Robinson wrote on Thu, Apr 02, 2020 at 08:31:09AM +1100:

> This is of course not a novel problem nor unique to groff; parser
> state recovery in error conditions is an old and notorious problem.

Indeed, which is one of the reasons why UTF-8 is so brilliantly
designed and why mbrtowc(3) and mbstowcs(3) for arbitrary encodings
are such a fiasco, to provide one example not totally unrelated
to text processing, nor to groff.

> Part of my frustration arises from a fear that (1) this area is
> underspecified

My experience is that groff documentation really isn't bad.  I think
Werner is among the people who did a terrific job with it.  It
is usually detailed and clear, and i usually find what i'm searching
for just in the place where it should be.

> Now is a good time to point to an IETF draft I encountered recently by
> M. Thomson, "The Harmful Consequences of Postel's Maxim"[1].  *roff
> interpreters had a design principle that input streams should be handled
> on a best-effort basis, rather than "fail fast, fail hard".

For a network protocol, where parser leniency directly translates
into attack surface, i don't think anybody still considers the maxim
"be tolerant in what you accept" even remotely acceptable.  There
isn't the slightest doubt that its main consequence is guaranteeing
an interminable string of security issues, and i think that is
universally accepted wisdom nowadays.  It also provides little
benefit because the implementors of network protocols are usually
experienced programmers who won't be daunted by having to adhere
to strict requirements.  Also, the networking stack is written once
and then used a lot, so thorough design, testing, and maintenance
pay off.

For a text processing system, the situation differs.  The input is
written by users who may not even be programmers at all.  They don't
want to focus on the syntax but on the content.  They write a new
document every day, and nobody wants to spend large amounts of
testing on markup of individual documents.  A text processing system
usually isn't run as root, so security requirements usually aren't
quite as strict.  Yes, admitted, if you format a document from an
untrusted source on your personal user account and it does exploit
a parser bug and runs exploit code with your user ID, that is bad,
but there has to be a tradeoff with usability as a text formatting
system, and that tradeoff is completely different from the tradeoff
for a network stack.

When i do a minor change in my text document and the formatter then
aborts with a fatal error message rather than showing a message
indicating the error *and also showing the formatted result on a
best-effort basis*, then i will certainly be annoyed.  I have
suffered through that a lot, with LaTeX.  So many years ago, i made
the decision that mandoc(1) will never, ever abort formatting with
a fatal error, even though it will show detailed diagnostics when
you ask it to.  To me, that seems a reasonable approach for groff
as well.

> The problem with a best-effort basis is that it can end up turning
> implementation details, dubious decisions, and even bugs, into de facto
> standards.  I think the recent \s39 vs. \s40 debacle is a fantastically
> illustrative example.

Admitted.  However, as long as some behaviour isn't documented, or
is documented as undefined, it can be changed.  The time to apply
the emergency brake is when somebody suggests documenting best-effort
behaviour resulting from invalid syntax as if it were valid syntax,
or when somebody suggests inventing surprising or complicated syntax
in other ways.
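
For readers who did not follow that thread: my understanding is that
the quirk at issue is the classic two-digit rule, which reads a
second digit after \s only when the first digit is 1, 2, or 3.  A
rough sketch of that rule as i remember it (this is not groff's code,
and all names below are invented):

  #include <cctype>
  #include <cstdio>

  /*
   * Classic \sNN rule as i understand it: a second digit is consumed
   * only when the first digit is 1, 2, or 3, so "\s39" selects point
   * size 39 while "\s40" selects size 4 and leaves '0' as text.
   */
  static int
  classic_size(const char *p, int *consumed)
  {
      int n = p[0] - '0';
      if (n >= 1 && n <= 3 && isdigit((unsigned char)p[1])) {
          *consumed = 2;
          return 10 * n + p[1] - '0';
      }
      *consumed = 1;
      return n;
  }

  int
  main(void)
  {
      int used;
      printf("\\s39 -> size %d (%d digit(s))\n",
          classic_size("39", &used), used);
      printf("\\s40 -> size %d (%d digit(s))\n",
          classic_size("40", &used), used);
      return 0;
  }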

> In this specific context, that means that as soon
> as an EOF or invalid byte is encountered in an \s escape
> sequence, throw a diagnostic (as helpfully as possible, including the
> offending token), and get the hell out of size-escape-parsing state,

So far, i agree, but ...

> re-interpreting the illegal token in the enclosing context.

 ... but i expect doing that consistently will require that you
redesign and rewrite half of the groff parser.  (I'm not absolutely
sure, i haven't read most of the parser, but i did reimplement
significant parts in mandoc, so my guess isn't totally blind.)
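
To spell out what doing it consistently would mean in code terms:
every place that rejects a token would have to be able to hand it
back to the input stream so the enclosing context can re-read it,
roughly like the invented push-back lexer below (none of these names
exist in groff):

  #include <deque>
  #include <string>

  /*
   * Invented illustration, not groff's token class: a lexer where a
   * rejected token can be pushed back so the caller's caller sees it
   * again.
   */
  class pushback_lexer {
      std::string src;
      size_t pos = 0;
      std::deque<int> pushed;   /* tokens returned to the stream */
  public:
      explicit pushback_lexer(std::string s) : src(std::move(s)) {}
      int next() {
          if (!pushed.empty()) {
              int c = pushed.front();
              pushed.pop_front();
              return c;
          }
          return pos < src.size() ? (unsigned char)src[pos++] : -1;
      }
      void unget(int c) {
          if (c != -1)
              pushed.push_front(c);
      }
  };

An escape-sequence parser that hits an invalid token would call
unget() and return, and the copy loop would then see the character
again; making every parsing routine agree on who owns the current
token is the part i expect to be invasive.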

> However...
> 
> $ nroff
> .pl 1v
> \sABC
> troff: <standard input>:2: warning: numeric expression expected (got 'B')
> C
> 
> ...that's not where we are today.  We lost not just one but two tokens!

Not true.

  \s - start of an escape sequence
  A  - valid delimiter
  B  - invalid token

So exactly one invalid token is skipped, not two.  As you say, the
parser gets the hell out of size-escape-parsing state and no longer
even looks for the "A" closing delimiter that would be required for
valid syntax.
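
A toy model of that behaviour, with everything invented for
illustration (this is not read_size()), may make it clearer why
exactly one character is lost:

  #include <cctype>
  #include <cstdio>
  #include <string>

  /*
   * Toy model only: after "\s", the next character becomes the
   * delimiter, and the character after that goes to the numeric
   * expression parser.  By the time that parser complains, the
   * character is already consumed, and nobody goes back to look for
   * the closing delimiter.
   */
  static bool
  toy_read_size(const std::string &in, size_t &pos)
  {
      char delim = in.at(pos++);   /* 'A': consumed as the delimiter */
      char c = in.at(pos++);       /* 'B': consumed before validation */
      if (!isdigit((unsigned char)c) && c != '+' && c != '-') {
          fprintf(stderr,
              "warning: numeric expression expected (got '%c')\n", c);
          return false;            /* copying resumes at 'C' */
      }
      (void)delim;                 /* real code would parse the expression
                                      and require the closing delimiter */
      return true;
  }

  int
  main(void)
  {
      std::string rest = "ABC";    /* what follows "\s" in the example */
      size_t pos = 0;
      toy_read_size(rest, pos);
      printf("%s\n", rest.substr(pos).c_str());   /* prints "C" */
      return 0;
  }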

> These three cases have silent token discards
> without diagnostic:
> 
> \sA
> \s[+
> \s[-

Good find; at first sight, my impression is that these might indeed
indicate bug(s).  But that needs detailed inspection.

> Where are (almost) arbitrary delimiters on this form of escape
> documented as accepted?  I'm only accustomed to them in, say, .tl
> requests.  I don't remember this as a general syntactical feature from
> either CSTR #54 or groff(7).

https://www.gnu.org/software/groff/manual/html_node/Escapes.html

  In such cases the argument is traditionally enclosed in single
  quotes (and quotes are always used in this manual for the definitions
  of escape sequences).
  [...]
  Note that the quote character can be replaced with any other
  character that does not occur in the argument (even a newline or
  a space character) in the following escapes:
  [...]
  Finally, the escapes \D, \h, \H, \l, \L, \N, \R, \s, \S, \v, and
  \x can't use the following characters as delimiters:

It would certainly have been possible (and desirable) to design all
that in a simpler way, but it *is* documented, and it is documented
in the proper place.

>> If a token is invalid at some point, it (almost?) always gets eaten,
>> not just in this function.  I think that is a basic design principle
>> of how roff(7) parsing is supposed to work.

> I think that's a guess on your part.

I admit that's indeed what it is.  I formed that opinion while
implementing substantial subsets of the roff(7) language for the
roff-preparser in mandoc, aiming for byte-by-byte output compatibility
with groff -T ascii and groff -T utf8.  Note that i deliberately
refrained from documenting most of the quirks i implemented in the
mandoc documentation, precisely because i think people should not
rely on many of these quirks.

> Another personal commandment I have is to never key the same diagnostic
> string literal into the source code twice.

I agree with that for messages intended for debugging the program
emitting the messages, which means that the audience of the messages
is programmers intending to fix bugs in the program.  The canonical
example is a kernel panic(9) message.

I disagree with that for messages intended to debug *input* to a
program.  In that case, the audience is users, many of whom may not
even be programmers at all, and none of whom care at which point
in the code the message is generated.  If, for internal technical
reasons, the same kind of user error is sometimes detected at one
point in the code, but sometimes at another, both places in the code
have to emit exactly the same error message, or the difference will
merely make the user wonder why the messages differ even though
they made the same mistake.  Also, users would be forced to learn
the meanings of more distinct messages for no additional benefit.
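
In code terms, that simply means routing every detection point for
the same user error through one shared message, along the lines of
the invented helper below (this is not groff's actual warning
interface):

  #include <cstdarg>
  #include <cstdio>

  /*
   * Invented illustration: a single helper owns the wording, and every
   * place in the code that detects this class of user error calls it,
   * so the user sees one message for one mistake.
   */
  static void
  input_warning(const char *fmt, ...)
  {
      va_list ap;
      fputs("troff: warning: ", stderr);
      va_start(ap, fmt);
      vfprintf(stderr, fmt, ap);
      va_end(ap);
      fputc('\n', stderr);
  }

  /* two independent detection points, one wording */
  static void
  reject_in_escape(char got)
  {
      input_warning("numeric expression expected (got '%c')", got);
  }

  static void
  reject_in_request(char got)
  {
      input_warning("numeric expression expected (got '%c')", got);
  }

  int
  main(void)
  {
      reject_in_escape('B');    /* same wording from ...           */
      reject_in_request('B');   /* ... both detection points       */
      return 0;
  }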

> the idea of defining a local preprocessor macro to achieve this
> flitted through my head.  I intuited,
> perhaps wrongly, that this would meet resistance.
> 
> I mean something like (please disregard C/C++ style issues):
> 
> #define FETCH_AND_VALIDATE_TOKEN(expected_stuff, error_template) do { \
>    tok.next(); \
>    /* yadda yadda yadda */ \
> } while(0)
> 
> static int read_size(...)
> {
> ...
>    /* many uses of FETCH_AND_VALIDATE_TOKEN() */
> ...
> }
> 
> #undef FETCH_AND_VALIDATE_TOKEN
> 
> where "expected_stuff" is a C++ expression to match on desirable values
> of the token, and "error_template" is a string so we have maximal
> context to tell the user what went wrong if the expectation fails.
> 
> The nice thing about embedding a macro rather than calling to another
> function is that we don't have to pass any state around.
> 
> Was I wrong to conjecture that folks on this list would find the above
> gross?

You were very right.  I would certainly oppose such use of macros.
Not only are they much harder to read and audit than functions,
they also compromise the type safety that the C language provides
and are very harmful when it comes to debugging the code with a
tool like gdb(1).  Keeping code correct requires keeping it simple
and auditable, and keeping complexity and abstraction down.
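
The alternative i would suggest is a small helper function that
receives the needed state explicitly; all names below are invented
for illustration and none of this is groff's actual token interface:

  #include <cstdio>

  /*
   * Invented illustration of the function-shaped alternative: the
   * state the macro would capture implicitly is passed as arguments,
   * so the helper stays type-checked, readable at the call site, and
   * easy to step through in gdb(1).
   */
  struct toy_token {
      int ch = -1;
      void next() { ch = getchar(); }   /* stand-in for tok.next() */
  };

  static bool
  expect_token(toy_token &tok, bool (*is_valid)(int), const char *what)
  {
      tok.next();
      if (is_valid(tok.ch))
          return true;
      fprintf(stderr, "warning: expected %s (got '%c')\n", what, tok.ch);
      return false;
  }

  /* example call site, reading one character from stdin */
  int
  main(void)
  {
      toy_token tok;
      expect_token(tok,
          [](int c) { return c >= '0' && c <= '9'; }, "a digit");
      return 0;
  }

The call sites read much like the macro version would, but without
hiding control flow or losing type checks.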

In my experience, error reporting is among the two or three most
dangerous sources of bloat and excessive complexity.  Even in my
own projects, it happened to me several times that i ended up with
message output frameworks or error handling frameworks that had
slowly festered until it became apparent that they were severely
over-engineered.  In this area, the KISS principle and avoiding
wrappers and abstraction matter even more than in average contexts.

Yours,
  Ingo


