From: G. Branden Robinson
Subject: Counterexamples in C programming and library documentation (was: [PATCH v3] NULL.3const: Add documentation for NULL)
Date: Tue, 2 Aug 2022 14:06:45 -0500

[content warning: yet another long software engineering rant]

At 2022-08-02T13:38:22+0200, Alejandro Colomar wrote:
> On 7/27/22 15:23, Douglas McIlroy wrote:
> > 
> > Incidentally, I personally don't use NULL. Why, when C provides a
> > crisp notation, 0, should one want to haul in an extra include file
> > to activate a shouty version of it?

While I don't endorse the shoutiness of the name selected for the macro
(and also don't endorse the use of a macro over making the null pointer
constant a bona fide language feature as was done in Pascal[1]), as
stronger typing has percolated into the C language--slowly, and against
great resistance from those who defend its original character as a
portable assembly language--distinctions have emerged between ordinal
types, reference types (pointers), and Booleans.

I feel this is unequivocally a good thing.

Yes, people will say that a zero (false) value in all of these types has
exactly the same machine representation, even when it's not true[2], so
a zero literal somehow "tells you more".

But let me point out two C programming practices I avoid and advocate
against, and explain why writing code that way tells us less than many of
us suspect.

(1) Setting Booleans like this.

        nflag++;

    First let me point out how screamingly unhelpful this variable name
    is.  I know the practice came from someone in Bell Labs and was
    copied with slavish dedication by many others, so I'm probably
    slandering a deity here, but this convention is _terrible_.  It
    tells you almost _nothing_.  What does the "n" flag _mean_?  You
    have to look it up.  It is most useful for those who already know
    the program's manual inside out.  Why not throw a bone to people who
    don't, who just happen across the source code?

        numbering++;

    Uh-oh, that's more typing.  And if I got my way you'd type even
    more.

        want_numbering++;

    There, now you have something that actually _communicates_ when
    tested in an `if` statement.

    But I'm not done.  The above exhibit abuses the incrementation
    operator.  Firstly this makes your data type conceptually murky.
    What if `nflag` (or whatever you called it) was already `1`?  Now
    it's `2`.  Does this matter?  Are you sure?  Have you checked every
    path through your code?  (Does your programming language help you to
    do this?  Not if it's C.  <bitter laugh>)  Is a Boolean `2`
    _meaningful_, without programming language semantics piled on to
    coerce it?  No.  Nobody answers "true or false" questions in any
    other domain of human experience with "2".  Nor, really, with "0" or
    "1", except junior programmers who think they're being witty.  This
    is why the addition of a standard Boolean type to C99 was an
    unalloyed good.
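
    Here is a minimal sketch of the style I'm arguing for, using the
    illustrative name from above (nothing here is from a real program):

        #include <stdbool.h>
        #include <stdio.h>

        static bool want_numbering = false;   /* a real type, a readable name */

        static void handle_option(char opt)
        {
            if (opt == 'n')
                want_numbering = true;        /* assignment, not ++ */
        }

        static void emit_line(const char *text, int lineno)
        {
            if (want_numbering)               /* reads like the question it answers */
                printf("%6d  ", lineno);
            puts(text);
        }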

    Kernighan and Plauger tried to convince people to select variable
    names with a high information content in their book _The Elements of
    Programming Style_.  Kernighan tried again with Pike in _The
    Practice of Programming_.  It appears to me that some people they
    knew well flatly rejected this advice, and _bad_ programmers have
    more diligently aped poor examples than good ones.  (To be fair,
    fighting this is like fighting entropy.  Ultimately, you will lose.)

    I already hear another objection.

    "But the incrementation operator encodes efficiently!"

    This is a reference to how an incrementation operation is a common
    feature of instruction set architectures, e.g., "INC A".  By
    contrast, assigning an explicit value, as in "ADD A, 1", is in some
    machine languages a longer instruction because it needs to encode
    not just the operation and a destination register but an immediate
    operand.  There are at least two reasons to stop beating this drum.

    (A) Incredibly, a compiler can be written such that it recognizes
    when a subexpression in an addition operation equals one, and can
    branch in its own code to generate "INC A" instead of "ADD A, 1".
    Yes, in the days when you struggled to fit your compiler into
    machine memory this sort of straightforward conditional might have
    been skipped.  (I guess.)  Those days are past.
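
    (A trivial way to check: compile a function like the one below,
    spelled each way, and compare the generated code; on any compiler
    I'd expect to meet today the object code comes out identical.)

        unsigned int bump(unsigned int n)
        {
            n++;        /* or n += 1, or n = n + 1; the compiler does not care */
            return n;
        }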

    (B) If you want to talk about the efficiency of instruction encoding
    you need to talk about fetching.  Before that, actually, you need to
    talk about whether the ISA you're targeting uses constant- or
    variable-length instructions.  The concept of a constant-length
    instruction encoding should ALREADY give the promulgators of the
    above quoted doctrine serious pause.  How much efficiency are you
    gaining in such a case, Tex?  Quantify how much "tighter" the
    machine code is.  Tell the whole class.

    Let's assume we're on a variable-length instruction machine, like
    the horrible, and horribly popular, x86.  Let's get back to
    fetching.  Do you know the memory access impact of (the x86
    equivalent of) "INC A" vs. "ADD A, 1"?[3]  Are you taking
    instruction cache into account?  Pipelining?  Speculative execution,
    that brilliant source of a million security exploits?  (Just
    understanding the basic principle of how speculative execution works
    should, again, give pause to the C programmer concerned with the
    efficiency of instruction encoding.  Today's processors cheerfully
    choke their cache lines with _instructions whose effects they know
    they will throw away_.[4])

    If you can't say "yes" to all of these, have the empirical
    measurements to back them up, had those measurements' precision and
    _validity_ checked by critical peers, _and_ put all this information
    in your code comments, then STOP.

    As a person authoring a program, the details of how a compiler
    translates your code are seldom your business.  Yes, knowledge of
    computer architecture and organization is a tremendous benefit, and
    something any serious programmer should acquire--that's why we teach
    it in school--but _assuming_ that you know with precision how the
    compiler is going to translate your code is a bad idea.  If you're
    concerned about it, you must _check_ your assumption about what is
    happening, and _measure_ whether it's really that important before
    doing anything clever in your code to leverage what you find out.
    If it _is_ important, then apply your knowledge the right
    way: acknowledge in a comment that you're working around an issue,
    state what it is, explain how what you're doing instead is
    effective, and as part of that comment STATE CLEARLY THE CONDITIONS
    UNDER WHICH THE WORKAROUND CAN BE REMOVED.  The same thing goes for
    optimizations.
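
    To make that concrete, here is a hypothetical example of such a
    comment--the library name and versions are invented for
    illustration:

        #include <stdlib.h>

        /* WORKAROUND: libfrob < 2.4 (a hypothetical library) writes one byte
         * past the buffer it is handed, so we over-allocate by one.  Remove
         * the "+ 1" once libfrob >= 2.4 is the oldest version we support. */
        static char *alloc_frob_buffer(size_t len)
        {
            return malloc(len + 1);
        }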

    Oh, let me rant about optimizations.  First I'll point the reader to
    this debunking[5] of the popularly misconstrued claim of Tony Hoare
    about it (the "root of all evil" thing), which also got stuffed into
    Donald Knuth's mouth, I guess by people who not only found the math
    in TAOCP daunting to follow, but also struggled with the English.
    (More likely, they thoughtlessly repeated what some rock star cowboy
    programmer said to them in the break room or in electronic chat.)

    In my own experience the best optimizations I've done are those
    which dropped code that wasn't doing anything useful at all.
    Eliminating _unnecessary_ work is _never_ a "premature"
    optimization.  It removes sites for bugs to lurk and gives your
    colleagues less code to read.  (...says the guy who thinks nothing
    of dropping a 20-page email treatise on them at regular intervals.)

    And another thing!

    I'd say that our software engineering curriculum is badly deficient
    in teaching its students about linkage and loading.  As a journeyman
    programmer I'd even argue that it's more important to learn about
    this than comp arch & org.  Why?  Because every _successful_ program
    you run will be linked and loaded.  And because you're more likely
    to have problems with linkage or loading than with a compiler that
    translates your code into the wrong assembly instructions.  Do you
    need both?  Yes!  But I think in pedagogy we tend to race from
    "high-level" languages to assembly and machine languages without
    considering the question of how, in a hosted environment, programs
    _actually run_.

(2) char *myptr = 0;

    Yes, we're back to the original topic at last.  A zero isn't just a
    zero!  There is something visible in assembly language that isn't
    necessarily so in C code, and that is the addressing mode that gets
    applied to the operand!

    Consider the instruction "JP (HL)".  We know from the assembler
    syntax that this is going to jump to the address stored in the HL
    register.  So if HL contains zero, it will jump to address zero.  As
    it happens, this was a perfectly valid instruction and operation to
    perform back when I was a kid (it would simulate a machine reset).

    So people talk even to this day about C being a portable assembly
    language, but I think they aren't fully frank about what gets
    concealed in the process.  Addressing modes are inherently
    architecture-specific but also fundamental.  I submit that using `0`
    as a null pointer constant, whether explicitly or behind the veil of
    a preprocessor macro, hides _necessary_ information from the
    programmer.

    For me personally, this fact alone is enough to justify a
    within-language null pointer constant.
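
    Concretely, all three of the following declare the same null
    pointer, but only the last says so within the language itself, and
    it took until C23 to get it:

        #include <stddef.h>

        char *p1 = 0;        /* legal, but reads like an ordinal value  */
        char *p2 = NULL;     /* the shouty macro from <stddef.h>        */
        char *p3 = nullptr;  /* C23 finally makes it a language keyword */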

    Maybe people will find the point easier to swallow if they're asked
    to defend why they ever use an enumeration constant that is equal to
    zero instead of a literal zero.  I will grant that some of them--
    particularly users of the Unix system call interface--often don't.
    But, if you do, and you can justify it, then you can also justify
    "false" and "nullptr".  (I will leave my irritation with the scoping
    of enumeration constants for another time.  Why, why, why, was the
    '.' operator not used as a value selector within an enum type or
    object, and 'typedef' employed where it could have done some good
    for once?)
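
    The canonical case in point (SEEK_SET is a macro rather than an
    enumerator on most systems, but it plays the same role):

        #include <unistd.h>

        static off_t rewind_fd(int fd)
        {
            /* versus lseek(fd, 0, 0): which zero is the offset and which
               is the whence?  You have to go look it up. */
            return lseek(fd, 0, SEEK_SET);
        }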

    I don't think it is an accident that there are no function pointer
    literals in the language either (you can of course construct one
    through casting, a delightfully evil bit of business).  The lack
    made it easier to paper over the deficiency.  Remember that C came
    out of the Labs without fully-fledged struct literals ("compound
    literals"), either.  If I'd been at the CSRC--who among us hasn't
    had that fantasy?--I'd have climbed the walls over this lack of
    orthogonality.
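
    A sketch of both gaps, for the record; the address is hypothetical,
    and the cast is exactly the delightfully evil business I mean
    (shown, not called):

        struct point { int x, y; };

        int main(void)
        {
            /* C99 compound literal: the struct "literal" early C lacked. */
            struct point origin = (struct point){ .x = 0, .y = 0 };

            /* The closest thing to a function pointer literal: conjure one
               from an integer through a cast.  Calling it on a hosted
               system is asking for trouble. */
            void (*jump_target)(void) = (void (*)(void))0x1000;

            (void)origin;
            (void)jump_target;
            return 0;
        }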

I will grant that the inertia against the above two points was, and
remains, strong.  C++, never a language to reject a feature[6], resisted
the simple semantics and obvious virtues of a Boolean data type for
nearly the first 20 years of its existence, and only acquired `nullptr`
in C++11.  Prior to that point, C++ was militant about `0` being the
null pointer constant, in line with Doug's preference--none of this
shouty "NULL" nonsense.

I cynically suspect that the feature-resistance on these two specific
points was a means of reinforcing people's desires, prejudices, or sales
pitches regarding C++ as a language that remained "close to the
machine", because they were easy to conceptualize and talk about.  And
of course anything C++ didn't need, C didn't need either, because
leanness.  Eventually, I guess, WG21 decided that the battle front over
that claim was better defended elsewhere.  Good!

> Because I don't know what foo(a, b, 0, 0) is, and I don't know from
> memory the position of all parameters to the functions I use (and
> checking them every time would be cumbersome, although I normally do,
> just because it's easier to just check, but I don't feel forced to do
> it so it's nicer).

Kids these days will tell you to use an IDE that pops up a tool tip with
the declaration of 'foo'.  That this is necessary or even as useful as
it is discloses problems with API design, in my opinion.

> Was the third parameter to foo() the pointer and the fourth a length,
> or was it the other way around?  bcopy(3) vs memcpy(3) come to mind,
> although of course no-one would memcpy(dest, NULL, 0) with hardcoded
> 0s in their Right Mind (tm) (although another story is to call
> memcpy(dest, src, len) with src and len both 0).

The casual inconsistency of the standard C library has many more
manifestations than that.
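
For anyone who hasn't been bitten by that particular pair, here are the
argument orders side by side:

        #include <string.h>
        #include <strings.h>   /* bcopy() is declared here on most systems */

        void copy_twice(char *dst, const char *src, size_t len)
        {
            memcpy(dst, src, len);   /* ISO C:      destination first */
            bcopy(src, dst, len);    /* BSD legacy: source first      */
        }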

> Knowing with 100% certainty that something is a pointer or an integer
> just by reading the code and not having to go to the definition
> somewhere far from it, improves readability.

I entirely agree.

> Even with a program[...] that finds the definitions in a big tree of
> code, it would be nice if one wouldn't need to do it so often.

Or like Atom[7], which has been a big hit among Millennial programmers.
Too bad that, thanks to the company that acquired GitHub, it's being
killed off and replaced by Visual Studio Code now.  We call this market
efficiency.

> Kind of the argument that some use to object to my separation of
> man3type pages, but for C code.

Here I disagree, because while we are clearly aligned on the null
pointer literal...er, point, I will continue to defend the position that
a man page for a header file--where an API has been designed rather than
accreted--is a preferable place to document the data types and constants
it defines, as well as either an overview or a detailed presentation of
its functions, depending on its size and/or complexity.

I think the overwhelming importance of the C library to your documentary
mission in the Linux man-pages project is obscuring the potential of
elegant man pages for _other_ libraries.  What passes for the C standard
library on _any_ system is quite a crumbcake.

The fact that you can't cram the C library into the shape I espouse
without making the pages huge and unwieldy is the fault of the library,
I humbly(!) suggest, not the fault of the principle.  But I'd be a long
way from the first person to note that the C library is schizophrenic
upon even cursory inspection.

Is it true that Plauger never wrote another book after his (dense but
incredibly illuminating, A+++, would read again) _The Standard C
Library_?  Did writing his own libc and an exposition of it "break" him
in some way?  I imagine that he got perhaps as good a look as anyone's
ever had at how much better a standard library C could have had, if only
attention had been paid at the right historical moments.  He saw how
entrenched the status quo was, and decided to go sailing...forever.
Just an irresponsible guess.

I would like to find a good place to state the recommendation that
documentors of _other_ libraries should not copy this approach of
trifurcating presentations of constants, data types, and functions.
In a well-designed API, these things will be clearly related,
mechanically coupled, and should be considered together.

What do you think?

As always, spirited contention of any of my claims is welcome.

Regards,
Branden

[1] I say that, but I don't see it in the 1973 "Revised Report" on the
    language.  I guess "nil" came in later.  Another Algol descendant,
    Ada 83, had "null".  And boy am I spoiling to fight with someone
    over the awesomeness of Ada, the most unfairly maligned of any
    language known to me.
[2] Like some CDC Cyber machines, I gather.  Before my time.
[3] Pop quiz: what assembly language did Branden grow up on?  It's hard
    to escape our roots in some ways.
[4] I won't go so far as to say speculative execution is _stupid_,
    though I wonder if it can ever actually be implemented without
    introducing side channels for attack.  Spec ex is performed, as I
    understand it, because Moore's Law is dead, and CPU manufacturers
    are desperately trying to satisfy sales engineers and certain
    high-volume customers who are only willing to pay for cores that the
    premium prices (profit margins) that the industry has been
    accustomed to for 50 years.  Purchasing decisions are made by suits,
    and suits like to get to the bottom line, like "number go up".
    https://davidgerard.co.uk/blockchain/2019/05/27/the-origin-of-number-go-up-in-bitcoin-culture/
[5] https://ubiquity.acm.org/article.cfm?id=1513451
[6] Stroustrup would disagree and can support his argument; see his
    writings on the history of the language.
[7] https://en.wikipedia.org/wiki/Atom_(text_editor)
