Re: [Tinycc-devel] TinyCC REPL

On Thu, May 14, 2015 at 10:39 PM, Jared Maddox <address@hidden> wrote:

> Date: Thu, 14 May 2015 09:35:12 -0400
> From: David Mertens <address@hidden>
> To: address@hidden
> Subject: Re: [Tinycc-devel] TinyCC REPL
> Message-ID:
> <address@hidden>
> Content-Type: text/plain; charset="utf-8"
>

Out of curiosity, have you added an assert() to confirm that the
character set is ASCII instead of something else? I think ASCII is the
norm, but main-frames aren't the only place I've heard of non-ASCII
encodings (not saying you should handle it, just test it; handling can
be implemented by someone with a weird machine - I think it was
embedded).

Not a bad idea, thanks. I've added an issue to my repo on github so I don't forget this.

You refer to what I understand to be the symbgol table as an Array
Mapped Trie. I'm not familiar with the structure, but it looks like an
array that someone shoved a bunch of trie data structures into. How
exactly does ::filled_bits work? Is it a bit field indicating which
children (I assume placed in relation to their parent) have data, and
which don't?

I think your understanding is essentially correct. I'm not really sure why Phil Bagwell used the term "Array Mapped Trie", but I'm just following his choice. His data structures are discussed in the paper at http://lampwww.epfl.ch/papers/triesearches.pdf.gz. In my implementation, the compressed trie is declared as

struct compressed_trie {

unsigned long long filled_bits;

struct compressed_trie children[1];

}

(Notice that I always have at least one child.) As I came from C++, this sort of construction was new to me, but I like it. For example, a node with three children would essentially have the layout

struct compressed_trie {

unsigned long long filled_bits;

struct compressed_trie * first;

struct compressed_trie * second;

struct compressed_trie * third;
};

The letter for the trie that corresponds to the first, second, and third child depends on which bits are occupied in filled_bits. The point is that any node only allocates exactly the number of pointers for child nodes that it needs, rather than having room for 63 child nodes, most of which go unused.

> 4. As implemented, this cuts the maximum number of token symbols in
> half. (I needed to use one of the bits to indicate "extended symbol".)
> 5. The current token lookup is based on a compressed trie that
> explicitly only supports A-Z, a-z, 0-9, and _. It does not support $ in
> identifiers and would need to be revised in order to do so. I chose to
> implement Phil Bagwell's Array Mapped Trie
> <http://lampwww.epfl.ch/papers/triesearches.pdf.gz> in the belief that
> it would perform better than a hash for lookup. Once I add symbol table
> caching, I hope to add (or switch to) Array Compressed Tries for even
> better cache utilization. But currently, I rely on have 63 allowed
> characters in identifiers.

Assuming that my understanding above is right (or at least the
bitfield part is), how about reducing the allowed number of children
for a single trie node, and use a custom encoding to indicate that
some particular trie child node (let's say the first) is a
"continuation" node, containing the OTHER half (or third, or whatever)
of the children? The source already mentions the assumption that the
upper-case letters are used more rarely. Depending on choosen method
(something derived from UTF-8, perhaps? Nice and simple, after all,
implementable purely via comparisons: instead of indicating extension
bytes, the highest set bits would indicate "subsidiary nodes"),
extension to full Unicode might be possible (if improbable, and
ill-advised).

As an example, assuming that only one continuation node was allowed
per parent, the parent would have 62 - 2 bits, for 60 symbols if the
highest bit was set, and 61 otherwise. Meanwhile, the continuation
node would always have 62. The result is up to 122 - "extended symbol"
== 121 available symbols: even if you add $, an ASCII C++ operator
declaration doesn't allow even as many as 100 characters, much less C,
so for relatively little cost (if the assumptions about character
occurance ratios are "close enough") you can easily cover more than is
needed, thereby providing space for e.g. other languages at a later
point in time.

Excellent idea! Using this idea, I could even map all 256 possibilities through the trie and substantially simplify _c_trie_bit_offset_for_char. In fact, this could lead to transparent support for unicode.

David

"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it." -- Brian Kernighan

From:	David Mertens
Subject:	Re: [Tinycc-devel] TinyCC REPL
Date:	Fri, 15 May 2015 09:32:21 -0400