
Re: [Tinycc-devel] TinyCC REPL

From: Jared Maddox
Subject: Re: [Tinycc-devel] TinyCC REPL
Date: Thu, 14 May 2015 21:39:20 -0500

> Date: Thu, 14 May 2015 08:57:30 -0400
> From: David Mertens <address@hidden>
> To: address@hidden
> Subject: Re: [Tinycc-devel] TinyCC REPL
> Message-ID:
>         <address@hidden>
> Content-Type: text/plain; charset="utf-8"

>> I pushed to the mob your changes to the win32/include/winapi
> Those changes were the minimum necessary so that tcc could compile perl.h
> on Windows. Is that a sensible criterion for including them in tcc's source
> code?
> David

I agree with Sergey; compiling Perl is a good reason for header changes.

> Date: Thu, 14 May 2015 09:35:12 -0400
> From: David Mertens <address@hidden>
> To: address@hidden
> Subject: Re: [Tinycc-devel] TinyCC REPL
> Message-ID:
>         <address@hidden>
> Content-Type: text/plain; charset="utf-8"

Out of curiosity, have you added an assert() to confirm that the
character set is ASCII rather than something else? I think ASCII is the
norm, but mainframes aren't the only place I've heard of non-ASCII
encodings (I'm not saying you should handle it, just test for it;
handling can be implemented later by someone with such a machine).

> Hey Sergey,
> Thanks for this! I am impressed at your efforts! Unfortunately, I think
> this is a bit premature. I have more work to do on my exsymtab fork before
> I was going to bring it up for inclusion in tcc itself. (Indeed, I am not
> entirely sure that it is appropriate for inclusion in tcc.) Here are a
> couple of reasons we should wait for a little while longer:
>    1. The extended symbol table API is not documented at all.
>    2. The symbol table copy takes O(N_sym^2) to run. It might be possible
>    to speed this up, but for the moment I plan to work around it by
>    implementing symbol table caching. I have not yet completed that work, so I
>    consider the exsymtab project to be incomplete at the moment.

You refer to what I understand to be the symbol table as an Array
Mapped Trie. I'm not familiar with the structure, but it looks like an
array that someone has shoved a bunch of trie data structures into. How
exactly does ::filled_bits work? Is it a bit field indicating which
children (placed, I assume, in relation to their parent) have data and
which don't?

>    4. As implemented, this cuts the maximum number of token symbols in
>    half. (I needed to use one of the bits to indicate "extended symbol".)
>    5. The current token lookup is based on a compressed trie that
>    explicitly only supports A-Z, a-z, 0-9, and _. It does not support $ in
>    identifiers and would need to be revised in order to do so. I chose to
>    implement Phil Bagwell's Array Mapped Trie
>    <http://lampwww.epfl.ch/papers/triesearches.pdf.gz> in the belief that
>    it would perform better than a hash for lookup. Once I add symbol table
>    caching, I hope to add (or switch to) Array Compressed Tries for even
>    better cache utilization. But currently, I rely on having 63 allowed
>    characters in identifiers.

Assuming that my understanding above is right (or at least the
bitfield part is), how about reducing the allowed number of children
for a single trie node, and using a custom encoding to indicate that
some particular trie child node (let's say the first) is a
"continuation" node, containing the OTHER half (or third, or whatever)
of the children? The source already mentions the assumption that
upper-case letters are used more rarely. Depending on the chosen method
(something derived from UTF-8, perhaps? Nice and simple, after all, and
implementable purely via comparisons: instead of indicating extension
bytes, the highest set bits would indicate "subsidiary nodes"),
extension to full Unicode might even be possible (if improbable).

As an example, assuming that only one continuation node was allowed
per parent, the parent would have 62 - 2 bits, for 60 symbols if the
highest bit was set, and 61 otherwise. Meanwhile, the continuation
node would always have 62. The result is up to 122 - "extended symbol"
== 121 available symbols: even if you add $, an ASCII C++ operator
declaration doesn't allow even as many as 100 characters, much less C,
so for relatively little cost (if the assumptions about character
occurrence ratios are "close enough") you can easily cover more than is
needed, thereby providing space for e.g. other languages at a later
point.

>    7. A separate idea that I plan to pursue on my fork is to extend how tcc
>    pulls data in from file handles. I would like to make it hookable so that I
>    could write hooks in my Perl module and have it interact directly with
>    Perl's parser, rather than pulling all of the C code into a temporary
>    buffer. This may go beyond the wishes of the community and merits further
>    discussion.

It's been talked about before. I believe that Grischka vetoed it at
the time. Still, for e.g. shipping a "boot environment" with TCC (for
crazy no-filesystem experimenting), it would be invaluable.
