Rambles around computer science

Diverting trains of thought, wasting precious time

Thu, 19 May 2011

Memtables

At the MMNet workshop in Glasgow last week, I talked about memtables. A memtable is an efficient associative data structure, built using virtual memory support on modern OSes (currently implemented for Linux only), and is useful whenever you want to key a table on addresses in memory. See my slides for more.
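
To make the virtual memory trick concrete, here is a minimal sketch of the idea in C (illustrative only, not the library's actual interface): reserve a huge region of address space up front and index it directly by the key. Pages of the index that are never touched are never backed by physical memory, so the reservation itself costs almost nothing.

    #include <stdint.h>
    #include <stddef.h>
    #include <sys/mman.h>

    /* One entry per 4 KiB slice of a 47-bit user address space.  Both
     * numbers are illustrative choices, not anything the design mandates. */
    #define KEY_GRANULARITY 4096UL
    #define TABLE_ENTRIES ((1UL << 47) / KEY_GRANULARITY)

    struct entry { void *head; };   /* e.g. the head of a linked list ("bin") */
    static struct entry *table;

    int memtable_init(void)
    {
        /* MAP_NORESERVE: don't commit memory up front; untouched pages of
         * the table stay unallocated, so its real size tracks key usage. */
        table = mmap(NULL, TABLE_ENTRIES * sizeof (struct entry),
                     PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
        return (table == MAP_FAILED) ? -1 : 0;
    }

    static inline struct entry *memtable_entry(const void *key)
    {
        /* No hashing: the key (here, an address) indexes the table directly. */
        return &table[(uintptr_t) key / KEY_GRANULARITY];
    }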

Since entries with numerically similar keys are stored close to each other, memtables are, like certain other associative data structures, amenable to searching within a key range as well as exact-match lookups. By contrast, hash tables can't do this. (That said, a hash table supporting duplicate keys can be used to store items grouped into small equivalence classes. This is sometimes good enough, and could be made to work in my case. Nonuniform key duplication will mess up the O(1) nature of hash tables though.)

Memtables seem like they could be useful in lots of places. I invented them for DwarfPython as a fast way of storing and retrieving metadata given a key that may be an interior pointer (hence the searching requirement). I'm also (soon) using them in Cake as a fast way of tracking what objects are associated with what other objects.

The key space doesn't have to be addresses. It's possible we could even use memtables for free chunk binning, since large sizes are sparsely used. I need to do some more experiments to establish this.

The implementation comes in two parts:

  • A generic set of malloc hooks for glibc: these hooks are “generic” in that they're designed to be easily specialised for various conceivable kinds of instrumentation. They're not generic with respect to the allocator---sadly they're specific to glibc, but most mature allocators should have some similar mechanism. The key usefulness of these hooks is in factoring the various cases of the malloc API---specifically the complex behaviour of realloc, but also other annoyances including memalign and null frees---into an easy-to-use set of higher-level hooks. These are likely (but not guaranteed) to be a better match for whatever your instrumentation is doing. For example, defining a post_successful_alloc() function will hook all events that allocate a new heap block, whether they originated in a malloc(), a realloc() or a memalign(); there's a sketch of this just after the list.
  • A generic memtable library: this will appear soon! It's a set of hook definitions that maintain a memtable, and a lookup function.
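
For the glibc side, here is a sketch of how the low-level hooks might be folded into a single higher-level one. This is a guess at the shape of the thing, using the old (now-deprecated) __malloc_hook mechanism, rather than the actual code; only the malloc path is shown, and the realloc and memalign hooks would forward to the same function on their “new block” paths.

    #include <stdlib.h>
    #include <malloc.h>   /* declares __malloc_hook in older glibc */

    /* Instrumentation defines just this one function to see every freshly
     * allocated heap block, however it was obtained. */
    static void post_successful_alloc(void *block, size_t size, const void *caller);

    static void *(*old_malloc_hook)(size_t, const void *);

    static void *my_malloc_hook(size_t size, const void *caller)
    {
        void *ret;
        __malloc_hook = old_malloc_hook;   /* avoid recursing into ourselves */
        ret = malloc(size);
        old_malloc_hook = __malloc_hook;   /* malloc may have changed the hook */
        __malloc_hook = my_malloc_hook;
        if (ret) post_successful_alloc(ret, size, caller);
        return ret;
    }

    static void install_hooks(void)
    {
        old_malloc_hook = __malloc_hook;
        __malloc_hook = my_malloc_hook;
        /* __realloc_hook, __memalign_hook and __free_hook would be wrapped
         * similarly, calling post_successful_alloc() (or a free-side hook)
         * as appropriate. */
    }

    static void post_successful_alloc(void *block, size_t size, const void *caller)
    {
        /* e.g. insert the new block into a memtable here */
        (void) block; (void) size; (void) caller;
    }
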
Memtables are strictly faster than a hash table, at least for lookups, because they are basically a hash table without the hashing. At least for most applications of memtables, the table itself acts as an index for a bunch of linked lists---call them bins or buckets. Rather than mapping the key space onto a smaller hash space in order to keep the index small, we index directly by the key, and rely on the virtual memory trick to keep the index small. Since we can only save page-sized chunks of space, the key space really needs to be used in a clustered and/or very sparse fashion. Just one used key per page of index is enough to allocate the whole table in physical memory, which we don't want. So if your table has four-byte entries, say, uniform key usage should be a lot less than one per thousand possible keys---but clusters of many are okay, so long as they're thousands apart.
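
Concretely, insertion and lookup in such a structure might look something like the following (again illustrative code with made-up record and entry layouts, not the library's): each entry heads a linked list of records for keys falling in its slice of the key space, and an interior pointer is resolved by scanning backwards through earlier bins until a record covering the queried address turns up.

    #include <stdint.h>
    #include <stddef.h>

    #define KEY_GRANULARITY 4096UL

    struct record {
        void *addr;            /* start address of the object described */
        size_t size;
        struct record *next;   /* next record in the same bin */
    };
    struct entry { struct record *head; };

    /* Assumed to have been set up by an mmap() reservation like the one
     * sketched earlier. */
    static struct entry *table;

    static inline struct entry *bin_for(const void *addr)
    {
        return &table[(uintptr_t) addr / KEY_GRANULARITY];
    }

    void memtable_insert(struct record *rec)
    {
        struct entry *e = bin_for(rec->addr);
        rec->next = e->head;   /* push onto the bin's list */
        e->head = rec;
    }

    /* Find the record whose [addr, addr + size) range contains `query',
     * even if `query' is an interior pointer; give up after scanning
     * `max_lookback' bins backwards. */
    struct record *memtable_lookup(const void *query, unsigned max_lookback)
    {
        struct entry *e = bin_for(query);
        for (unsigned i = 0; i <= max_lookback; ++i, --e) {
            for (struct record *r = e->head; r; r = r->next) {
                if ((uintptr_t) r->addr <= (uintptr_t) query
                        && (uintptr_t) query < (uintptr_t) r->addr + r->size)
                    return r;
            }
            if (e == &table[0]) break;   /* don't scan off the front of the table */
        }
        return NULL;
    }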


Namespace problems

It's always been a theoretical problem with C that there is no namespacing. I'd often wondered how much of a practical problem this really was, with “nothing major” my tentative answer. I've finally run into my first bona-fide gotcha arising out of this problem. In short: wxWidgets and GLib both define a GSocket.

Annoying as this is, it wouldn't be in my top ten grumbles about the experience of programming in C. A far bigger related problem is versioning. This doesn't get cited as a weakness of C because it's also a weakness of most other programming languages. The very reason I ran into the namespace problem was that I had to compile wxWidgets 2.6, rather than using the 2.8 revision that's packaged for my distribution. Version mismatches can be seen as namespace collisions too. Instead of getting the version you want, the namespace has been populated with slightly different stuff that is, despite its close relationship to what you actually require, still incompatible, much the same as if the namespace were polluted with random third-party stuff.

Versioning issues could perhaps be brought more under the programmer's control. Most programming languages don't have an explicit notion of “version” when importing stuff. But when explicitly consuming some target API, you are always assuming at least something about its version. Having the programmer declare which version of a set of declarations they want to import would be straightforward. In C, it could even be done quite neatly with just the preprocessor---say, a #define __LIBFOO_REQUESTED_VERSION 4.2 before the relevant #include.

Of course, pessimistically refusing to link across nominal mismatches of version would be a bad solution. We want a more structural and, indeed, behavioural or “semantic” approach. With the C preprocessor approach I outlined, it becomes the header file author's responsibility to embed a test about which prior API versions the associated implementation is compatible with, most likely using a simple #if test. This responsibility is not unreasonable, I'd say---the developers are in the best place to say what has changed with a new revision. And since it's in a header file, if the maintainers are lazy, the client programmer can override it.
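
For concreteness, here is roughly how the two sides might look, sticking with the libfoo example. Versions are encoded as integers (4.2 as 402, say) so that #if arithmetic works, and the compatibility rule itself is of course invented for illustration.

    /* client.c: declare which API version this code was written against */
    #define __LIBFOO_REQUESTED_VERSION 402   /* i.e. 4.2 */
    #include <libfoo.h>

    /* libfoo.h: the header author records what this revision is compatible with */
    #define __LIBFOO_ACTUAL_VERSION 403      /* this header describes 4.3 */

    #ifndef __LIBFOO_REQUESTED_VERSION
    #define __LIBFOO_REQUESTED_VERSION __LIBFOO_ACTUAL_VERSION  /* default: assume latest */
    #endif

    /* suppose 4.3 remains compatible back to 4.0, but no further */
    #if __LIBFOO_REQUESTED_VERSION < 400 \
        || __LIBFOO_REQUESTED_VERSION > __LIBFOO_ACTUAL_VERSION
    #error "this libfoo does not provide the API version the client requested"
    #endif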

One shortcoming of this approach is that the client programmer might be too lazy to work out the earliest library version their code will work with, and will instead select whatever version they are developing with. This is safe, but prevents some valid compositions. On a different system with a slightly older version of the library, the header might conservatively conclude that it's not compatible with the client, even though it could work. Anyway, I don't worry about this too much. Lots of researchers have thought about versioning before, so there are probably some good solutions knocking around.

Back to the sockets example, it's perhaps unsurprising that the name collision occurred when linking two chunks of infrastructure code. Name collisions are most likely when two codebases abstract the same domain and so share the same natural-language vocabulary---sockets, in this case. This is much more likely to happen in infrastructure software (i.e. modelling system resources) than in application-level software (modelling circles or ellipses or airline reservations or health records and so on), simply because you're less likely to link multiple instances of the latter together. Whereas application-level code is at or near the top of the software dependency graph, the infrastructure stuff is lower down, so it is more likely to get sucked into a program through dependencies.

I was interested to note Nick Nethercote's recent blog entry about a problem with duplication (generally) and bloat (specifically) associated with multiple wrapper layers for system calls and other nonportable interfaces. He was talking about mmap(), but the socket abstraction is another example. I have some research proto-ideas that might help with this problem. Essentially I'm interested in recovering a finer-grained style of interface description from code, based on the idea of “relational interfaces”. You could then use this description to infer that two sets of functions had very similar behaviour, and factor out the duplication (with appropriate refactoring or adaptation tools).

This whole problem is another consequence of our fragile direct-interfacing, in-order methods for constructing software. If we had a more flexible way of constructing software, the problem wouldn't arise. Rather than slavishly building on predefined interfaces that are specific to one underlying component---like one mmap() abstraction layer, or one socket abstraction---we need smarter tools for specifying our requirements abstractly and finding customised ways of satisfying them using a range of “found” code. This is what my Onward! '09 proto-paper was ranting about. I guess it's good that I'm still ranting. Interface hiding is as good an idea as ever, and more work on it will happen, when I get time....


