Rambles around computer science

Diverting trains of thought, wasting precious time

Mon, 19 Apr 2010

Separating computation from storage

One of the nice things about laziness is that it eliminates the distinction between stored and computed values. You can write the same code without caring whether the values it manipulates were discovered by on-demand computation or were retrieved from some storage location. In purely functional languages, this works by throwing away all explicit notion of storage, then relying on the runtime to come up with a sensible execution strategy which exploits the underlying machine's storage. Experience shows that runtimes aren't yet clever enough to do especially good jobs of this: Haskell programs tend to use lots of memory, and/or to include programmer-inserted strictness annotations which hint at a better strategy.

The distinction between computation and storage in languages is well known to be problematic. Stored data representations are a very change-prone design decision, hence justifying the common practice in object-oriented code (and elsewhere!) of using getters to access remote modules' state, rather than having them expose member variables directly. The computation-oriented interface is more general in that intervening computation can be inserted, or not---if the “got” representation matches what's in memory, the getter can just redirect to storage. Conversely, interposing on storage accesses, while possible using memory protection techniques (like pairing mprotect() with an appropriate SIGSEGV handler on Unix platforms), is inefficient on conventional hardware, violates most languages' abstraction boundaries, and is not easy for user-level programmers to get working. This interposability motivation for getters and setters (and the change-resilience it enables) is far stronger than a purely abstraction-oriented motivation. The argument is essentially the same as Parnas's from 1972, but this line of thinking still evades some bloggers.
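
To make the interposability point concrete, here is a minimal sketch (all names hypothetical): callers use the same syntax whether the value they ask for is redirected to storage or computed on demand, so the stored representation is free to change behind it.

// Hypothetical example: same call syntax whether the value is stored or
// computed, so the stored representation can change without touching callers.
class Temperature
{
    double celsius_;  // the stored representation (a change-prone decision)
public:
    explicit Temperature(double c) : celsius_(c) {}
    double celsius() const    { return celsius_; }                  // redirects to storage
    double fahrenheit() const { return celsius_ * 9.0/5.0 + 32.0; } // interposed computation
};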

Recently I've been writing some code using ptrace() and libunwind and finding myself with a particular requirement: implementing data structures that can work with various implementations of storage. Specifically, one feature of libunwind is that it can unwind the stack in your current process, or in another, using the same interface in each case. This kind of design is a Good Thing in much runtime infrastructure and debugging support generally, because you may or may not want a process separation in the picture: separation is good for isolation, but badly affects performance. Now, libunwind abstracts storage using a read--write pair of memory access functions. This is fine for simple reads and writes. Unfortunately I want something more demanding: I want to traverse some data structure residing in the target process. (As it happens, this data structure is some heap bookkeeping information that is maintained by a set of glibc malloc hooks I wrote as part of the Cake runtime.)
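
For concreteness, the storage abstraction boils down to the access_mem accessor: one callback which handles both reads and writes, selected by a flag. Below is a minimal sketch of a read-only version built on ptrace(). The callback signature is libunwind's; the choice of PTRACE_PEEKDATA, and the assumption that the target's pid arrives through the opaque arg parameter, are just one plausible arrangement (libunwind also ships ptrace-based accessors of its own), not the code I'm actually using.

#include <libunwind.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <cerrno>

// Sketch only: read one word from the target process; refuse writes.
static int remote_access_mem(unw_addr_space_t as, unw_word_t addr,
                             unw_word_t *valp, int write, void *arg)
{
    pid_t pid = *static_cast<pid_t*>(arg); // assumption: arg points to the target's pid
    if (write) return -UNW_EINVAL;         // this sketch doesn't support writes
    errno = 0;
    long word = ptrace(PTRACE_PEEKDATA, pid,
                       reinterpret_cast<void*>(addr), static_cast<void*>(0));
    if (word == -1 && errno != 0) return -UNW_EINVAL;
    *valp = static_cast<unw_word_t>(word);
    return 0;
}

Installed in a unw_accessors_t and handed to unw_create_addr_space(), a callback like this is the single point at which “storage” gets defined for the whole unwind.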

Generic programming ought to be great for this. Unfortunately, at least in the form of contemporary C++, it isn't enough. In C++, the notion of memory is so pervasive that it can't be fully abstracted. That's not to say you can't try to get most of the way there, and the STL's allocators go some way---but not far enough. Although we can alter how containers allocate storage, STL containers are not parameterised in the pointer types they use internally, so we can't make them access their own implementation-specific data structures in a customised way. (See this snippet about overloading the & operator, from an article by Andrei Alexandrescu, for more evidence and examples.)
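
To see why allocators don't get us there, here is a hedged sketch (remote_allocator is a hypothetical name, written against the C++03 STL): we get to choose how the container obtains raw storage, but C++03 containers are permitted to assume the allocator's pointer typedef is plain T*, and their internal links stay raw local pointers regardless.

#include <memory>
#include <list>

// Sketch: an allocator that would *like* to hand the container a smart,
// remote-reading pointer type, but cannot. C++03 containers may assume
// 'pointer' is plain T*, and keep their internal links (e.g. list node
// 'next' fields) as raw local pointers in any case.
template <typename T>
struct remote_allocator : std::allocator<T>
{
    typedef T *pointer; // we would like unw_read_ptr<T> here, but it won't be honoured
    template <typename U> struct rebind { typedef remote_allocator<U> other; };
    remote_allocator() {}
    template <typename U> remote_allocator(const remote_allocator<U>&) {}
};

int main()
{
    // Compiles and runs, but every link the list follows is a local pointer:
    // there is no hook through which remote memory accessors could be applied.
    std::list<int, remote_allocator<int> > l;
    l.push_back(42);
    return 0;
}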

If we fall back on hand-coding data structures and algorithms ourselves, we can make some headway. The first step is to define a C++ “pointer-like thing” which actually uses the accessor functions. Here's my first attempt. Among other limitations, it can only read, not write, its pointed-to data.

#include <libunwind.h>  // unw_addr_space_t, unw_word_t, unw_get_accessors()
#include <sys/types.h>  // off_t
#include <cstddef>      // ptrdiff_t, NULL
#include <cassert>

template <typename Target>
class unw_read_ptr
{
    unw_addr_space_t as;
    Target *ptr;
    mutable Target buf; // HACK: temporary to make operator-> work
public:
    typedef unw_read_ptr<Target> self_type;
    unw_read_ptr(unw_addr_space_t as, Target *ptr) : as(as), ptr(ptr) {}
    Target operator*() const 
    { 
        Target tmp; 
        assert(sizeof tmp % sizeof (unw_word_t) == 0); // simplifying assumption
        // copy the pointed-to object out of the target address space,
        // one unw_word_t at a time, through libunwind's memory accessor
        unw_word_t *tmp_base = reinterpret_cast<unw_word_t*>(&tmp);
        for (unw_word_t *tmp_tgt = reinterpret_cast<unw_word_t*>(&tmp);
            tmp_tgt - tmp_base < sizeof tmp / sizeof (unw_word_t);
            tmp_tgt++)
        {
            off_t byte_offset 
             = reinterpret_cast<char*>(tmp_tgt) - reinterpret_cast<char*>(tmp_base);
            unw_get_accessors(as)->access_mem(as, 
                reinterpret_cast<unw_word_t>(reinterpret_cast<char*>(ptr) + byte_offset), 
                tmp_tgt,
                0,
                NULL);
        }
        return tmp;
    }
    // HACK: operator-> brokenly demands return of a real pointer...
    // ... so use a per-object temporary. FIXME
    Target *operator->() const { this->buf = this->operator*(); return &this->buf; } 
    self_type& operator++() // prefix ++
    { ptr++; return *this; }
    self_type  operator++(int) // postfix ++: return our old value
    { Target *tmp = ptr; ptr++; return self_type(as, tmp); }
    self_type& operator--() // prefix --
    { ptr--; return *this; }
    self_type  operator--(int) // postfix --: return our old value
    { Target *tmp = ptr; ptr--; return self_type(as, tmp); }
    
    // we have two flavours of equality comparison: against ourselves,
    // and against unadorned pointers (risky, but useful for NULL testing)
    bool operator==(const self_type arg) {
        return this->as == arg.as
            && this->ptr == arg.ptr;
    }
    bool operator==(void *arg) { return this->ptr == arg; }
    
    bool operator!=(const self_type arg) { return !(*this == arg); }
    bool operator!=(void *arg) { return !(this->ptr == arg); }

    // default operator= and copy constructor work for us
    // but add another: assign from a raw ptr
    self_type& operator=(Target *ptr) { this->ptr = ptr; return *this; }

    self_type operator+(int arg)
    { return self_type(as, ptr + arg); }

    self_type operator-(int arg)
    { return self_type(as, ptr - arg); }

    ptrdiff_t operator-(const self_type arg)
    { return this->ptr - arg.ptr; }
    
    operator void*() { return ptr; }
};

This has got me as far as being able to traverse a linked list residing in the target process using the same code you'd use to traverse a local one (sketched below). Unfortunately, a linked list doesn't cut it for my performance requirements: the target process heap contains many thousands of allocated blocks, and I need to be able to resolve a heap address to a particular block quickly. So, perhaps a hash table or a red--black tree would be a good choice. This is where the pain hits: I really don't want to create my own implementations of either of these. I could cannibalise the source of an existing one (and I think that's just what I'm going to do), but it'd be nice to take an STL-like container and just use it as-is. (I am planning to use google-sparsehash, and create a hacked version of the lookup function, using my special pointer class above, for the “separate process” case.)
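
For concreteness, the traversal looks something like the following, where block_record is a hypothetical stand-in for my real bookkeeping record. The same template body compiles against a plain block_record* in the local case and against an unw_read_ptr<block_record> in the cross-process case.

#include <cstddef>

// Hypothetical layout of one record in the target's heap metadata.
struct block_record
{
    void *begin;          // start of the allocated block
    size_t size;          // length of the block in bytes
    block_record *next;   // next record -- an address in the *target's* space
};

// Find the record covering addr. PtrLike may be block_record* (local case)
// or unw_read_ptr<block_record> (remote case); the body is identical.
template <typename PtrLike>
PtrLike find_block(PtrLike p, void *addr)
{
    while (p != 0)
    {
        char *begin = static_cast<char*>(p->begin);
        if (static_cast<char*>(addr) >= begin
            && static_cast<char*>(addr) < begin + p->size) break;
        p = p->next; // for unw_read_ptr, operator=(Target*) keeps the address space
    }
    return p; // the matching record, or a null pointer if none
}

Locally this is called as find_block(head, addr); remotely it becomes find_block(unw_read_ptr<block_record>(as, head), addr), where head is the list head's address in the target. (The sketch also leans on operator*'s simplifying assumption that the record size is a multiple of the word size.)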

A final conclusion is that polymorphism is all very well, but only when the programmer can be oblivious to it. Polymorphism is after all a very low-level concept. Why should we require separate implementations of a function for operating on different kinds of data? From low-level programming we have got used to the idea that data comes in different forms, like ints and floats and lists and arrays and so on, and that these are treated separately unless some polymorphic cleverness unifies them. But in a truly high-level programming language, it should be a given that when your code is abstract with respect to your data structures' representations, or with respect to any other underlying logic (such as memory access, in my example), then you can mix-and-match any implementation you like for that underlying logic.

Under this criterion, ML-style parametric polymorphism wins nicely, because type inference means that the programmer doesn't need to care about the machinery surrounding polymorphism. In languages where programmer anticipation is required---such as by adding particular type parameters in the case of C++ templates, or by writing particular type annotations as one might in Haskell---we are forcing the programmer to be aware of these low-level distinctions, so we have not yet delivered obliviousness. (I distinguish Haskell from ML because, idiomatically, my limited experience suggests that Haskell programs contain an awful lot more type annotations than ML programs do. I will stand corrected if this turns out not to be so, or not to be important!) Even ML inflicts polymorphism on the programmer in its indecipherable compile-time type errors, but maybe someone can write a compiler which makes things comprehensible.
