Rambles around computer science

Diverting trains of thought, wasting precious time

Wed, 16 Sep 2015

Partially evaluating a bytecode interpreter using C++ templates

I should start this by saying that I'm definitely not a C++ metaprogramming genius. People have done outrageously clever and diverting things with C++ templates, and I am not about to outdo anyone. Nevertheless, this might be interesting.

I was facing the task of taking code in a simple but obscure stack machine language—the stack machine from DWARF, if you couldn't guess—and turning it into vaguely efficient machine code. My thinking was spurred on by systems like Truffle, or PyPy's tracing JIT, which produce compiled code from only source code and an interpreter for the source language. Unlike those systems, I didn't need dynamic compilation—offline is fine. But also, I wanted a system with easy support for C and C++ intrinsics, zero runtime baggage, and something I could prototype in a day or so. It turned out that writing the interpreter in C++ templates, and using the C++ compiler as a partial evaluator, was reasonably successful at all these. Here's how it works.

To start with, I simplified right down from DWARF to an analogous but much simpler stack machine where all operands are passed on the stack. We model instructions using simply an enumeration. This is not essential, or even desirable, except for exposition.

/* instruction -- a simple enumeration */
enum instr
{
    PLUS, /* add top two stack slots and push result */
    BZ,   /* pop val and dest off stack; if val is zero, branch to dest */
    PEEK, /* pop an address off the stack and push its contents */
    DUP2  /* clone the top two elements of the stack */
};

[I've edited this slightly from my initial version of this post, to add the DUP2 instruction. This allows me to include a slightly more meaningful example program—thanks to Ross Bencina for prompting that.]

The four instructions are representative of the classes of operation I care about: arithmetic, control, C intrinsic and stack manipulation. They're not a very complete machine, but they suffice to illustrate things. To witness the power of this somewhat-operational stack machine, a simple program which combines all four opcodes is as follows.

typedef program< PLUS, program< DUP2, program< BZ, program< PEEK, void > > > > prog;

This program add two numbers taken as input, and if they sum to something non-zero, it interprets that as an address and peeks it, returning what's stored there. Otherwise it returns zero. Note that since we only have a computed branch instruction, not a direct branch, the branch target exists as data in the initial stack. From top to bottom, our initial stack must be [arg1, arg2, 4], where arg1 and arg2 are the two input values, 4 is the program counter to branch to (the end of the program). We don't need to store the literal zero in this case (I'll let you figure out why).

So, clearly, a program is a sequence of instructions. We model it in the C++ compiler as a template-level sequence, with the opcode as one template argument and the program “tail”, i.e. the following instructions, as another template argument.

/* program -- a sequence of opcodes */
template <unsigned Op, class Next>
struct program
{
    static const long val = Op;
    typedef Next tail;
};

In effect, this gives us a “linked list” structure but at the template level. We can write some simple helper functions over this kind of structure, using a functional style in which template specialization is how we do case-splitting and template instantiation is how we do recursion.

template <class Seq> 
struct sequence_length
{
    static const long value = 1 + sequence_length<Seq::tail>::value;
};

template <>
struct sequence_length<void>
{
    static const long value = 0;
};

/* Utility for getting the tail of a type-level sequence. */
template <typename Seq>
struct tail_of
{
    typedef typename Seq::tail type;
};
template <>
struct tail_of<void>
{
    typedef void type;
};

/* Utility for getting the nth of a sequence, then the rest;
 * -- it's like "let s = drop n Seq in (hd s, tl s)" */
template <unsigned n, class Seq> // n > 0 case
struct nth
{
    static const long val = nth<n-1, typename tail_of<Seq>::type>::val;
    typedef typename tail_of< nth<n-1, typename tail_of<Seq>::type> >::type tail;
};
/* termination case: n == 0 */
template <class Seq>
struct nth<0, Seq>
{
    static const long val = Seq::val;
    typedef typename tail_of<Seq>::type tail;
};

The main use of sequences is to model the program itself, a sequence of instructions. We also use them to model the program counter, encoding it simply as the “tail” of the program, i.e. the subsequence of instructions from the current position to the end of the program. (This won't stop us from jumping backwards, because we'll always pass around the whole program too. I'll come on to jumps a bit later.)

What's our program state? It's a program counter and an operand stack. Since, in general, our stack will not be statically known, we have to model it at the value level, passing it around as “real” arguments at run time rather than template arguments. (I did create a version where a parallel family of template specializations could be used to encode statically known stacks. That's neat because it you can get the C++ compiler to partially evaluate the stack-machine program for known input. But I'm skipping that extra complexity here.)

To actually represent the stack at run time we'll use a plain old array of long. I'll show you that in a second. Before I do, let's start to piece together what our evaluator looks like. It's a template that takes arguments standing for the whole program and the program tail, i.e. program counter. The whole program is always statically known, so is a template argument. There are finitely many possible values of the program counter, so we can make the compiler enumerate them; this means the program tail can be a template argument too. (This works nicely with how we do branches; I'll come on to that.)

/* States in a computation */
template <
    /* Program */ typename Program,
    /* ProgramTail */ typename ProgramTail    /* i.e. a suffix of the Program, 
                                                 beginning with the next instruction to run. */
    >
struct state
{
    static long eval(/* some args... */)
    {
        /* figure out our next state and delegate to it... */
    }
};

We also want each state to support an eval() method which executes the program's continuation from that state. Most of the time this will work by calculating the next state and delegating to its eval() method. For states that are “end states”, like when we hit the end of the program, we'll instead return something—the value on the top of the stack. We can handle this case using another specialization (which we'll define later).

A simple opcode's eval() method looks something like the following, which implements PLUS. We pass the stack as an argument to the eval() method, in the form of two pointers, bottom and top.

template <
    /* Program */ typename WholeProgram,
    /* ProgramTail */ typename ProgramTail
    >
struct state< WholeProgram, program<PLUS, ProgramTail> >
{
    static long eval(long *bottom, long *top)
    {
        top[-1] = top[0] + top[-1];
        return state<WholeProgram, ProgramTail >::eval(bottom, top - 1);
    }
};

Here we destructively update the stack, decrementing the top. Note that the whole interpreter works in continuation-passing style, and the current stack is always passed around explicitly and linearly—all our eval() calls are tail calls. Currently we malloc() the stack eagerly to a large-ish size, and abort on overflow. But linearity means we could even start small and then realloc() it if it overflowed. A neater trick would be to make it an actual stack that is lazily allocated by the OS, using mmap()—this would give us gradual allocation without cluttering our instruction stream with overflow checks or realloc() calls (but I haven't implemented this yet).

That's the basic idea of our evaluator: each opcode can be implemented by a template specialization which gets elaborated differently according to its position in the program. What about branches? We can handle branches by a single specialization for the case where the PC is not known statically. As before, we deal with the not-statically-known value by accepting it as a method (value) argument rather than a template (type) argument. The simplest and fastest way to do this is to code up a jump table. For this to work, we also have to bound the size of programs.

#define MAX_PROGRAM_SIZE 64

struct run_time_pc {};

template <
    /* Program */ typename WholeProgram
    >
struct state< WholeProgram, run_time_pc >
{
    static long eval(long *bottom, long *top, long pc)
    {
        switch (pc)
        {
#define CASE(n) \
case n: return state< \
        WholeProgram, \
        program< \
            /* We build the tail starting from the nth, using the definitions in `nth<>'. */ \
            nth<(n), WholeProgram>::val, \
            typename nth<(n), WholeProgram>::tail \
        > \
    >::eval(bottom, top);
            CASE(0)
            CASE(1)
            CASE(2)
            CASE(3)
            /* ... */
            /* MAX_PROGRAM_SIZE */
            default: 
                assert(pc &gt;= MAX_PROGRAM_SIZE); 
                warnx("branched out of bounds");
                abort();
        }
    }
};

Now, to implement a branch opcode, all we need to do is have it call this run_time_pc specialization's eval(), which takes an extra argument: the destination program counter, passed as a value. Passing it as a value is what allows the jumping instruction to actually decide the destination at run time. At worst, the C++ compiler compiles this switch statement down to a jump table which can jump to the code implementing any of the program's instructions. If this happens, the compiler must enumerate all these destinations up-front in order to build the jump table, hence why we had to bound the program size. At best, though, if we're doing a direct branch, the branch target will be statically known in the caller of this eval() template, allowing the C++ compiler to optimise away the dead switch cases, replacing the jump table with a direct branch. (In our current stack machine, all branch destinations are encoded on the stack, so are potentially data-dependent, meaning we don't hit this direct-branch case. But in the actual DWARF stack machine, we can encode direct branch targets into the instruction, so we will hit this case.)

Since each entry in this jump table jumps to a specific part of the program, we immediately transition back to having a statically known PC value. Our implementation of branches is completely self-contained within this one mechanism.

Actually, make that two mechanisms, because I want to add an alternative. We can avoid hackily bounding the size of the program, if we don't mind rewriting our jump table as a chain of ifelse branches. We can use templates to generate this chain of branches, for any length. To do this, we parameterise our run_time_pc type on the upper bound of what program counter value we might be jumping to. Initially this is the length of the program; recursive elaborations' ifelse-tests whittle this down to zero, the lowest-numbered place we could jump to. In this way, the compiler effectively builds our switch statement (as an if–else chain) for us. As before, if it's a direct branch, all the tests will get elaborated away. If it's not, whether we get a nice jump table depends on how clever the compiler is. For me, sadly g++ seems not to generate a jump table, even at -O3, but perhaps it can be coaxed into it somehow (let me know if you know!).

template <unsigned UpperBound> 
struct run_time_pc {};

template <
    /* Program */ typename WholeProgram,
                  unsigned MaxPc
    >
struct state< WholeProgram, run_time_pc<MaxPc> >
{
    /* This is fast only if the C++ compiler can turn if--else chains back into jump table. */
    static long eval_if_else(long *bottom, long *top, long pc)
    {
        if (pc == MaxPc) return state<
            WholeProgram,
            program<
                /* We build the tail starting from the nth, using the definitions in nth<>. */
                nth<MaxPc, WholeProgram>::val,
                typename nth<MaxPc, WholeProgram>::tail
            > 
        >::eval(bottom, top);
        else return state<WholeProgram, run_time_pc<MaxPc - 1> >::eval_if_else(bottom, top, pc);
    }
};
template <
    /* Program */ typename WholeProgram
    >
struct state< WholeProgram, run_time_pc<0> >
{
    static long eval_if_else(long *bottom, long *top, long pc)
    {
        if (pc == 0) return state<
            WholeProgram,
            program<
                /* We build the tail starting from the nth, using the definitions in nth<>. */
                nth<0, WholeProgram>::val,
                typename nth<0, WholeProgram>::tail
            > 
        >::eval(bottom, top);
        else abort();
    }
};

I promised to show how we deal with the end of the program. We just specialize for the case where the program tail is void, i.e. the empty sequence.

/* This specialization handles "end of program". */
template <
    /* Program */ typename Program,
    >
struct state<Program, void>
{
    static long eval(long *bottom, long *top)
    {
        if (top <= bottom) abort();
        return top[0];
    }
};

There are a few more fiddly things to figure out to make it all compile, but I've shown you all the interesting stuff. A working snapshot of the implementation is here. It compiled only with g++, not clang++, the last time I tried; fixes are welcome. It implements everything I've mentioned. In the near future I'll turn it into an actual DWARF compiler, and it will appear in liballocs. You might ask: why? In short, compiled DWARF expressions are hopefully a way to achieve faster introspection on complex memory layouts. The liballocs metadata toolchain already “compiles” much DWARF content down to an efficient in-memory representation. By compiling not only manifest DWARF structure but also computational DWARF expressions, we can efficiently introspect on structures whose layout can only be described precisely via a computation—think variable-length arrays, hash tables, and so on. Eventually I also want to support “compiled frame information”, which I hope will allow faster stack walking.

[/research] permanent link contact


Powered by blosxom

validate this page