Rambles around computer science

Diverting trains of thought, wasting precious time

Mon, 18 Oct 2021

ELF dynamic linking: a brief introduction

[I wrote this to help current or future students who might want to do projects involving dynamic linking. But it may be of interest more widely. There may well be an equivalent or better article out there; if anyone knows one, please let me know.]

In old-fashioned static linking, the executable file captures a complete starting “memory image” of the program being loaded. The kernel simply maps the binary into memory, creates an initial thread stack, dumps some useful information onto it (the auxiliary vector) followed by the program name and argument strings, then transfers control to the entry point identified in the ELF binary's header. So far, so straightforward. I'll assume readers are familiar with this. (If not, among other resources, you could read section 3 of the OOPSLA 2016 paper I co-authored.)

In dynamic linking, the executable no longer provides a complete image; rather, it is one of a collection of “dynamic shared objects”, or DSOs, that provide the starting memory image. (Normally, the executable itself is counted as a DSO, even though it may not be “shared”.)

The dynamic linker composes this image at load time, by loading both the executable and the (transitive closure of) depended-on library DSOs.

Dynamic linking is delegated by the kernel to user space: the kernel still only knows how to load static binaries, and the dynamic linker is a special static binary that knows how to load (and link) dynamic binaries. The kernel is extended just enough to split this case off: it must figure out, after loading an executable, that actually the dynamic linker is needed to complete the loading. The dynamic linker can be any binary nominated by the executable file, but usually the standard one (for me: /lib64/ld-linux-x86-64.so.2) is the only one on the system. In practice, the kernel will map both the executable and the dynamic linker into memory, so the dynamic linker only normally needs to load the libraries, not the executable. (Sometimes the dynamic linker can also be run as a program, in which case it has to load the executable too.)
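
(How does the executable nominate its dynamic linker? Via a program header of type INTERP, which points at a path string held in the file's .interp section. You can see it with readelf; for a typical dynamically linked hello it prints something like the following.)

$ readelf --program-headers ./hello | grep 'program interpreter'
      [Requesting program interpreter: /lib64/ld-linux-x86-64.so.2]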

DSOs have dependencies on other DSOs, in the form of NEEDED records. You can see these in the dynamic linking information dumped by readelf -d. For example, here is what I get from a simple hello C program, which needs only one library.

Dynamic section at offset 0x2df8 contains 26 entries:
  Tag        Type                         Name/Value
 0x0000000000000001 (NEEDED)             Shared library: [libc.so.6]
 0x000000000000000c (INIT)               0x1000
 0x000000000000000d (FINI)               0x11b4
 0x0000000000000019 (INIT_ARRAY)         0x3de8
 0x000000000000001b (INIT_ARRAYSZ)       8 (bytes)
 0x000000000000001a (FINI_ARRAY)         0x3df0
 0x000000000000001c (FINI_ARRAYSZ)       8 (bytes)
 0x000000006ffffef5 (GNU_HASH)           0x308
 0x0000000000000005 (STRTAB)             0x3d8
 0x0000000000000006 (SYMTAB)             0x330
 0x000000000000000a (STRSZ)              130 (bytes)
 0x000000000000000b (SYMENT)             24 (bytes)
 0x0000000000000015 (DEBUG)              0x0
 0x0000000000000003 (PLTGOT)             0x4000
 0x0000000000000002 (PLTRELSZ)           24 (bytes)
 0x0000000000000014 (PLTREL)             RELA
 0x0000000000000017 (JMPREL)             0x548
 0x0000000000000007 (RELA)               0x488
 0x0000000000000008 (RELASZ)             192 (bytes)
 0x0000000000000009 (RELAENT)            24 (bytes)
 0x000000006ffffffb (FLAGS_1)            Flags: PIE
 0x000000006ffffffe (VERNEED)            0x468
 0x000000006fffffff (VERNEEDNUM)         1
 0x000000006ffffff0 (VERSYM)             0x45a
 0x000000006ffffff9 (RELACOUNT)          3
 0x0000000000000000 (NULL)               0x0

These records give just the names of the depended-on DSOs (actually their soname—more below). Where in the filesystem these binaries are to be found depends on the linker search path, which is controlled by your LD_LIBRARY_PATH environment variable, by a configuration file, and by any RUNPATH records in the loaded file (also visible using the command above, although our hello does not contain any) and, transitively, those in any files loaded to satisfy a NEEDED dependency. The ldd command asks the dynamic linker to do a trial run and show where it located the dependencies (or failed to!).
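
For example, a run of ldd on our hello looks something like the following (paths and load addresses will of course vary from system to system and run to run).

$ ldd ./hello
        linux-vdso.so.1 (0x00007ffd...)
        libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f42...)
        /lib64/ld-linux-x86-64.so.2 (0x00007f42...)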

In memory, each DSO gets a range of the virtual address space, starting at some load address. So in the DSO's ELF binary on disk, all “addresses” are actually offsets relative to the load address. It used to be that executables always had a load address of 0, so the executable file's addresses were bona fide virtual addresses. However, nowadays executables are typically position-independent and get loaded at a non-zero address too.
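
From within a running process, one easy way to see load addresses is glibc's dl_iterate_phdr(): the dlpi_addr it reports for each object is exactly the quantity that gets added to the file's offsets. A minimal sketch:

#define _GNU_SOURCE
#include <link.h>
#include <stdio.h>

/* print each loaded object's load address and name */
static int print_one(struct dl_phdr_info *info, size_t size, void *data)
{
        printf("loaded at %p: %s\n", (void*) info->dlpi_addr,
                (info->dlpi_name && *info->dlpi_name) ? info->dlpi_name : "(main executable)");
        return 0; /* keep iterating */
}

int main(void)
{
        dl_iterate_phdr(print_one, NULL);
        return 0;
}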

(DSOs are mostly contiguous in memory. However, nothing prevents them from containing holes in between their mapped segments. In edge cases it is possible for holey DSOs to be loaded such that they overlap. This is rare, but I did have to go to considerable lengths to support it, or rather prevent it, in one of my runtime projects.)

After loading, there may be some load-time relocation required. This is performed by the dynamic linker and is very much like the relocation that happens during static linking, but is used to resolve inter-DSO references. (In some degenerate cases, it also resolves intra-DSO references that have been deferred until load time.) Many of the entries in the output shown above are telling the dynamic linker where to find the relocation and symbol information for this.

A “shared library” is a library DSO whose text segment does not need load-time relocation. In other words, it works fine as-is when loaded at any address. This flexibility to map the file simultaneously at many addresses, without the need to fix it up in memory after loading, is what allows memory to be shared between many processes under ELF dynamic linking. (Some other linking systems, including Windows DLLs, instead force the library to be placed at the same virtual address in all address spaces, hence the infamous process of “registering” DLLs.) Elimination of any need for load-time relocation is achieved by indirecting inter-DSO references that would otherwise be (say) direct call instructions, so that they occur instead via a couple of (non-shared) tables in the data segment: the Global Offset Table (GOT) and Procedure Linkage Table (PLT). Roughly, the latter is used for calls, the former for data access. Others have written about those (see Eli Bendersky's very nice in-depth article), so let me just say that the GOT is exceptionally poorly named: it is not global (there is one per DSO), and does not contain offsets (it contains pointers, at least after relocation). They should both arguably be called “vectors” rather than tables, because they don't have multiple columns.
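
(Schematically, and omitting addresses since this is just the shape of the code rather than a real disassembly, a lazy-binding PLT entry for some function foo looks roughly like this. The indirect jump goes through foo's GOT slot; until the first call, that slot points back at the following push, so control falls into the dynamic linker, which resolves foo and patches the slot so later calls go straight through.)

foo@plt:
        jmp    *foo_got_slot(%rip)    # indirect jump through foo's GOT entry
        push   $reloc_index_for_foo   # reached only before the first resolution
        jmp    plt0                   # shared stub that enters the dynamic linker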

Despite the “S” in “DSO”, there is no requirement that library DSOs satisfy this shareability property. Instead, ELF linkers usually complain if you try to create a DSO that does not (often via an inscrutable message ending with “recompile with -fPIC”). The dynamic linker doesn't care, however. If it has to fix up your text, it can do so, relying on copy-on-write to allocate memory as needed. Shared libraries are “shared” only because that copying doesn't happen.

Since libraries and executables are now deployed as separate files, they are subject to version skew. Libraries often try to offer backward compatibility, such that a newer library can still support an older executable (or, equally, support some old library declaring it as NEEDED). A vaguely standard symbol versioning mechanism is defined for ELF, to allow a library to offer older and newer versions of “the same” symbol and have references get bound to the right one, preserving binary compatibility. Alternatively, a binary-incompatible version of a shared library may be issued, in which case it's good practice to reflect that in the library name. There is a convention for recording version bumps in a number that forms part of the “soname”, or shared object name, hence “libc.so.6” in the above listing. This suggests six version-bumps have occurred in the GNU C library. In this way, many distinct incompatible versions of a library may be installed on the same system. Symbol versioning takes a bit more discipline and more build-time ceremony to set up, but it tends to be preferred over sonames, especially for libraries that are widely deployed. The GNU C library has not bumped its soname for many years.
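
Concretely, the soname machinery goes something like the following (a sketch, using a made-up libfoo). The soname is baked in at link time; the installed file carries a longer “real” version; and a symlink named after the soname, usually maintained by ldconfig, is what NEEDED records resolve to at load time. The plain libfoo.so symlink exists only for the static linker's benefit when building against the library.

$ gcc -shared -fPIC -Wl,-soname,libfoo.so.1 -o libfoo.so.1.0.0 foo.o
$ ln -s libfoo.so.1.0.0 libfoo.so.1      # the name NEEDED records will use
$ ln -s libfoo.so.1 libfoo.so            # used by 'cc ... -lfoo' at build time
$ readelf -d libfoo.so.1.0.0 | grep SONAME
 0x000000000000000e (SONAME)             Library soname: [libfoo.so.1]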

DSOs have identity at run time. The dynamic linker maintains a doubly linked list called the “link map”. It is possible to load new DSOs, such as plug-ins, during execution, via the dlopen() library function. This adds them to the link map, alongside all the libraries loaded earlier. Indeed the dynamic linker itself shows up in the link map as just another library, providing (among other things) the dlopen call (or at least its back end; on GNU systems there is a separate libdl as a front-end). It also provides a symbol that gives access to the link map itself (_r_debug). Over this, a breakpoint-based protocol allows debuggers to walk the link map, so they can find out about the libraries that are loaded. DSOs can also designate initialization and finalization routines to be run when they are loaded and unloaded.
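
For example, here is one way to walk the link map from inside a process on a GNU system (a sketch; on older glibc you may need to link with -ldl).

#define _GNU_SOURCE
#include <stdio.h>
#include <dlfcn.h>
#include <link.h>

int main(void)
{
        void *handle = dlopen(NULL, RTLD_NOW);  /* a handle on the whole program */
        struct link_map *l = NULL;
        if (!handle || dlinfo(handle, RTLD_DI_LINKMAP, &l) != 0) return 1;
        while (l->l_prev) l = l->l_prev;        /* rewind to the start of the list */
        for (; l; l = l->l_next)
                printf("%p\t%s\n", (void*) l->l_addr, l->l_name);
        return 0;
}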

That's enough for a brief introduction. To get more details about what's going on at run time, you could read my older post about ELF introspection.

A few years back, Simon Ser did some great work on an executable semantics for dynamic linking, as a summer internship of the REMS project. Since then I have been struggling to get time to resurrect and publish this work, and it no longer falls directly within my prioritised interests. But if anyone is interested in taking it up, do contact me... it is begging to be polished off and written up.

[/devel] permanent link contact

Thu, 14 Oct 2021

Tracing system calls in-process, using a chain loader

Some months back I wrote about a possible better alternative to LD_PRELOAD using chain loading. And I promised a follow-up with more details... so here it is. We'll walk through a little system call tracer built using this technique. Unlike strace and similar, it will do its tracing in-process, without using ptrace() or equivalents. I should add that it's a proof-of-concept only so isn't nearly as good as strace in practice; in particular it can't yet accurately print the arguments to most syscalls. (I have some work towards doing that generatively, using kernel DWARF; more later.)

System call tracing is a good demo of chain loading, for three reasons.

Firstly, it's basically impossible to do with LD_PRELOAD, which works at the higher level of the C library and dynamic linker, so doesn't have sight of system calls.

Secondly, it's close to what people very often want to do with LD_PRELOAD, which is to change the behaviour of particular system calls. No longer do we have to make do with overriding the C library wrappers; we can intercept the system calls themselves, catching certain cases that would not be visible at the C library level.

Thirdly, intercepting system calls specifically provides the foundation for a more general bootstrapping approach to arbitrary binary instrumentation of programs, as I hinted last time. If we can catch all the system calls which introduce new instructions to the address space, and prevent instructions from being introduced otherwise, we can ensure that literally all code is instrumented.

Some years ago, Patrick Lambein and I wrote a library that can patch and emulate most system calls on x86_64 Linux, by a really simple double-trap technique: overwrite the system call with ud2, x86's “always-illegal” instruction, then handle the resulting SIGILL. The handler can do the real syscall or whatever else it wants, before returning. This is not the fastest, but it's reliable and easy(-ish) to implement—at least to 90% of the way. So, for example, if we disassemble the C library's _exit function which once contained the following instructions

44 89 c0                mov    %r8d,%eax
0f 05                   syscall  
48 3d 00 f0 ff ff       cmp    $0xfffffffffffff000,%rax
76 e2                   jbe    0x7f173bf269c0 <_exit+32>

under our library this gets clobbered to instead be

44 89 c0                mov    %r8d,%eax
0f 0b                   ud2  
48 3d 00 f0 ff ff       cmp    $0xfffffffffffff000,%rax
76 e2                   jbe    0x7f173bf269c0 <_exit+32>

which will always cause a SIGILL. Our SIGILL handler scrapes up the syscall arguments into a structure and then passes this to whatever registered handler is installed. For example, a tracing handler, which prints some stuff and then resumes the system call, looks like this.

void print_pre_syscall(void *stream, struct generic_syscall *gsp, void *calling_addr, struct link_map *calling_object, void *ret)
{
        char namebuf[5];
        snprintf(namebuf, sizeof namebuf, "%d", gsp->syscall_number);
        fprintf((FILE *) stream, "== %d == > %p (%s+0x%x) %s(%p, %p, %p, %p, %p, %p)\n",
                getpid(),
                calling_addr,
                calling_object ? calling_object->l_name : "(unknown)",
                calling_object ? (unsigned) ((char*) calling_addr - (char*) calling_object->l_addr) : 0u,
                namebuf,
                /* assuming the six argument registers are saved in gsp->args[] */
                (void*) gsp->args[0], (void*) gsp->args[1], (void*) gsp->args[2],
                (void*) gsp->args[3], (void*) gsp->args[4], (void*) gsp->args[5]);
}
// what actually gets called from the SIGILL handler
void systrap_pre_handling(struct generic_syscall *gsp)
{
        void *calling_addr = generic_syscall_get_ip(gsp);
        struct link_map *calling_object = get_highest_loaded_object_below(calling_addr);
        print_pre_syscall(traces_out, gsp, calling_addr, calling_object, NULL);
}
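
(For a flavour of what the SIGILL handler itself has to do on x86-64 Linux, here is a sketch, not the real code: a SA_SIGINFO-style handler receives the saved register state through its ucontext argument, so it can pull out the syscall number and arguments, run the pre-handler, perform the real call, stuff the result back into the saved RAX and step the saved RIP over the two-byte ud2.)

#define _GNU_SOURCE
#include <signal.h>
#include <ucontext.h>
#include <unistd.h>

/* Sketch of the double-trap SIGILL handler; error handling and the kernel's
 * -errno convention (the libc syscall() wrapper sets errno instead) are glossed over. */
static void handle_sigill(int sig, siginfo_t *info, void *ucontext)
{
        ucontext_t *uc = (ucontext_t *) ucontext;
        greg_t *regs = uc->uc_mcontext.gregs;

        long num = regs[REG_RAX];
        long a1 = regs[REG_RDI], a2 = regs[REG_RSI], a3 = regs[REG_RDX],
             a4 = regs[REG_R10], a5 = regs[REG_R8],  a6 = regs[REG_R9];

        /* ... build a struct generic_syscall and call the registered pre-handler ... */

        long ret = syscall(num, a1, a2, a3, a4, a5, a6);  /* do the real thing */

        regs[REG_RAX] = ret;   /* make it look like the syscall returned normally */
        regs[REG_RIP] += 2;    /* skip the two-byte ud2 we planted over the syscall */
}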

To ensure that all loaded code gets instrumented appropriately with the ud2s, I've recently expanded this to use a bootstrapping technique: after initially applying our trapping technique to hook mmap(), mremap() and mprotect() calls, our hooks ensure that if the guest application is asking to create any further executable memory, we will instrument its contents before we make it executable.

But how can we get all this code into the process in the first place? What I did initially was use our old familiar LD_PRELOAD. We preload a library whose initializer stomps over the loaded text, clobbering syscalls into ud2 and installing the SIGILL handler. This is fine as far as it goes, but it misses the very beginning of execution. A tracer doing this cannot trace anything happening before the initializer runs, for example in other libraries' initializers or inside the dynamic linker. What we get is the following.

$ LD_PRELOAD=`pwd`/trace-syscalls.so /bin/true
== 26439 == > 0x7f6fb2a4f9d4 (/lib/x86_64-linux-gnu/libc.so.6+0xc99d4) exit_group(0, 0x3c, 0, 0x7fff766064b0, 0xe7, 0xffffffffffffff80)

It's neat that it looks like true only does one syscall, to exit_group(), as we might think it should. But if we try strace we find that in truth the process does a lot more than that one system call.

$ strace /bin/true
brk(0)                                  = 0x2447000
mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f6ba773c000
access("/etc/ld.so.preload", R_OK)      = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/usr/lib/x86_64-linux-gnu/libfakeroot/tls/x86_64/x86_64/libc.so.6", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
stat("/usr/lib/x86_64-linux-gnu/libfakeroot/tls/x86_64/x86_64", 0x7ffd49ca38e0) = -1 ENOENT (No such file or directory)
... snipped a *lot*! ...
arch_prctl(ARCH_SET_FS, 0x7f6ba7578740) = 0
mprotect(0x7f6ba7732000, 12288, PROT_READ) = 0
mprotect(0x605000, 4096, PROT_READ)     = 0
mprotect(0x7f6ba7765000, 4096, PROT_READ) = 0
exit_group(0)                           = ?
+++ exited with 0 +++

All these other, earlier system calls are happening during dynamic linking, or in general, before our constructor that clobbers the syscalls into ud2. (The strace manual page laments: “it is a pity that so much tracing clutter is produced by systems employing shared libraries”. My take is that it's better to know the truth!)

Let's repackage our code as a chain loader, as described last time. That way it will always run first, and because it's responsible for loading the dynamic linker, we can ensure that all its syscalls are trapped from the very start. This is a non-trivial delta to the donald-based code I mentioned in the last post. After we map the “real” dynamic linker, we also need to instrument its code. We do this by wrapping the enter() function from donald.

void __wrap_enter(void *entry_point)
{
        /* ... */
        trap_all_mappings(); // only ld.so is loaded right now
        /* ... */
}

What about code that the dynamic linker loads? Well, we use our bootstrapping approach to catch that. Whenever new code is loaded into the process, some system call is responsible for doing that. So as long as we catch that call, we can trap the system calls at load time. A bit of macro magic helps generate the wrappers, and some elided hackery is required to divine the ELF-level structure of the file being mapped.

#define mmap_args(v, sep) \
    v(void *, addr, 0) sep   v(size_t, length, 1) sep \
    v(int, prot, 2) sep      v(int, flags, 3) sep \
    v(int, fd, 4) sep        v(unsigned long, offset, 5)

/* ... and the body of the generated mmap wrapper goes roughly like this ... */
    if (!(prot & PROT_EXEC)) return mmap(addr, length, prot, flags, fd, offset);
    // else... always map writable, non-executable initially
    int temporary_prot = (prot | PROT_WRITE | PROT_READ) & ~PROT_EXEC;
    void *mapping = (void*) mmap(addr, length, temporary_prot, flags, fd, offset);
    if (mapping != MAP_FAILED)
    {
        /* ... */
        if (/* needs trapping */)
        {
            inferred_vaddr = (uintptr_t) mapping - matched->p_vaddr;
            trap_one_executable_region_given_shdrs(mapping, (void*) mapping + length,
                    filename, /* is_writable */ 1, /* is_readable */ 1,
                    shdrs, ehdr.e_shnum,
                    inferred_vaddr);
            // OK, done the trap so let's fix up the protection (always different)
            ret = mprotect(mapping, length, prot);
            if (ret != 0) goto fail;
            /* ... */
        }
        /* ... */
    }
    /* ... */

Making this work on Linux/x86 (32- and 64-bit) required some “interesting” code in a few cases, over and above the chain loader trick. The current version still has some unsupported features (sigaltstack() among them) but can support a lot of stuff, thanks to various gyrations that I won't detail here.

The good news is that things mostly work. In particular, I can do:

$ ./trace-syscalls-ld.so /bin/true

and get “full” output analogous to that of strace, instead of the old output that was missing many system calls.

What about correctness of all this? This bootstrapping approach only gets us 100% coverage of system calls on some assumptions, roughly the following.

Assumption 1: every system call the program performs is issued by an actual syscall instruction somewhere in mapped code, which we therefore get a chance to rewrite.

Assumption 2a: we always get a synchronous opportunity to rewrite code before it becomes executable.

Assumption 2b: the only syscall instructions executed are those overtly present in the instruction stream—nobody jumps into the middle of some other instruction whose bytes happen to encode a syscall.

Assumption 1 isn't true for gettimeofday() in Linux's vdso, which implements it as a read from kernel memory. No “interesting” system calls are implemented that way. Do we really just mean “system calls”, though? One could argue that fault-handling paths are in some ways like system calls; we can't trace entries and exits to/from these, though neither does strace. Signal delivery is also a bit like a return from a system call, and we don't trace that, in contrast to strace (which gets any event ptrace exposes, and that includes signals). We could trace some of it by installing our own handlers for all signals, though that would require a lot of care to preserve the program's original semantics... it starts to become a full-on “transparency” problem, of the kind DynamoRIO's authors have articulated.

For assumption 2a to be true, we must forbid mappings that are simultaneously writable and executable. Some programs, such as the just-in-time compilers of V8 or PyPy, still like to create executable memory and write into it. If we allow such mappings, we have no synchronous opportunity to modify code before it becomes executable. My best guess is to allow these programs to carry on with an emulation of writable-and-executable mappings, adaptively switching between really-writable and really-executable and handling the segmentation faults in user space. The overhead would be great if done naively, but some kind of adaptive hysteresis might handle it well enough (even maybe as simple as “flip after N faults”).
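
(To make that a little more concrete, the emulation could be shaped something like the following sketch. It is only the idea, not code from the tool; find_wx_region() and trap_syscalls_in_range() are hypothetical helpers, the former looking up our bookkeeping for a “pretend writable-and-executable” region, the latter planting ud2s as described earlier.)

#include <signal.h>
#include <sys/mman.h>
#include <stddef.h>

struct wx_region { void *base; size_t len; int really_writable; };
struct wx_region *find_wx_region(void *addr);           /* hypothetical helper */
void trap_syscalls_in_range(void *begin, void *end);    /* hypothetical helper */

static void handle_segv(int sig, siginfo_t *info, void *uc)
{
        struct wx_region *r = find_wx_region(info->si_addr);
        if (!r)  /* not one of ours: fall back to the default action */
        {
                signal(SIGSEGV, SIG_DFL);
                raise(SIGSEGV);
                return;
        }
        if (r->really_writable)
        {
                /* faulting fetch from a currently-writable region: re-trap, flip to X */
                trap_syscalls_in_range(r->base, (char*) r->base + r->len);
                mprotect(r->base, r->len, PROT_READ|PROT_EXEC);
                r->really_writable = 0;
        }
        else
        {
                /* faulting write to a currently-executable region: flip back to W */
                mprotect(r->base, r->len, PROT_READ|PROT_WRITE);
                r->really_writable = 1;
        }
}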

A more subtle version of the same problem is that we need to avoid distinct simultaneous W and X mappings of the same underlying object, such as a file that is MAP_SHARED. That would have the same effect as a writable-and-executable mapping. Such a check is easy in principle, but requires tracking the inode and device number behind every existing file mapping. (File descriptors are OK because we can fstat() them on demand.) Attempts to remap the same object should probably return -ETXTBSY. I've left that as an exercise for someone else, for the moment.
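
(The fstat() bit is simple enough; something like this sketch, called from the mmap wrapper, would record which file object backs a new mapping, for cross-checking against other mappings of the same device and inode.)

#include <sys/stat.h>

struct mapping_identity { dev_t dev; ino_t ino; };

static int identify_fd(int fd, struct mapping_identity *out)
{
        struct stat s;
        if (fstat(fd, &s) != 0) return -1;
        out->dev = s.st_dev;
        out->ino = s.st_ino;
        return 0;
}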

Assumption 2b brings the issue of overt instructions versus punny instructions. On x86 there is no alignment requirement for instructions, so we can jump into the middle of instructions and thereby do something quite different than intended, perhaps including system calls. For example, the innocuous instruction

 mov    $0x9090050f,%eax

assembles to

 b8 0f 05 90 90

but if we jump one byte into this byte sequence, we do a syscall followed by two nops. Preventing this is tricky. Perhaps we trust code not to do this. Perhaps we built the program using some heavyweight control-flow integrity mechanism (and can also check that dynamically loaded code satisfies its requirements). Perhaps we are on an alignment-preserving architecture (so, not x86). For x86, once we have a sufficiently general jump-based trampoline mechanism, perhaps itself making use of instruction punning, we can forcibly stub out any instruction or instruction pair that internally contains a system call byte sequence. Doing this is sufficient only if our trampoline itself is guaranteed not to include such instructions. Such a guarantee is quite hard when immediates are involved... we would have to forbid them to contain the bytes 0f 05, for example. The same is true for addresses or offsets, but these are easier under PC-relative addressing (we get to choose where to place the trampoline). Most likely we can rely on such immediates being so rare that if we simply clobber the instruction with a ud2 and emulate it with something like libx86emulate, nobody will notice any slowdown.

Erk. That was a deeper dive than I planned. But the take-home is that early experience suggests this is possible. We can intercept system calls using a chain loader, bootstrappingly, and so get control of the program's interactions across the system call interface. This lets us build all manner of tools rather like LD_PRELOAD, but directly at the system call level and with stronger coverage properties.

[/devel] permanent link contact

Wed, 13 Oct 2021

No more Dr Nice Guy

[I wrote this back in April, at a point when my time at Kent was drawing to a close and the overload factor had been high for a long time. My current situation at King's is radically different! Whether that will last is less clear.]

As a graph theorist might put it, my in-degree is too high. My time and (especially) head-space are scarce resources. Access to me needs to be limited. Ironically, it requires a certain amount of big-headedness to say this.

Big-headed it may be, but failing to attend to this has proven bad for my health. Every incoming request contributes to a sense of being under attack. Avoiding this is difficult when you're a nice person, working within a culture of niceness and an institution with financial incentives to be nice to (especially) students.

From now on, I have to make an effort to put up a protective veneer of non-niceness. Here I am writing some ground rules to and for myself. It will be tough to keep to them, but as I now know, it will be impossible to do my job if I don't.

[/highered] permanent link contact

Fri, 28 May 2021

Role again

Sorry for the pun. Yes, it's supposed to be about rolling the dice. I've recently(-ish) handed in my notice here at the University of Kent. In July I'll be starting a new job (the same job, but) at King's College London.

The reasons for the move are primarily personal. Of course, that doesn't mean they are unrelated to work... in academic jobs especially, work is personal.

The move is not a promotion. In terms of accumulating money and status, if anything it will probably slow me down. So, it's just as well that those are not what motivate me.

What does motivate me? In work terms, my strongest motivation is to make progress with my research. In career terms, it's to find a home, metaphorically speaking. I will leave behind at Kent many great colleagues, and an institution that I'll remain fond of on many levels. Still, ultimately I had to admit that Canterbury didn't feel like home, and neither did the institution.

It would be wrong not to regard that as a failure. Both institution and city have many home-like qualities. I have failed to turn them into an actual home. I've also failed to repay the confidence of the institution and the senior colleagues, who invested in me on the promise that I would do so. I don't feel great about all this. Being an engineer, occasionally a scientist, I suppose I shouldn't mind failing when it teaches me something. But I prefer to succeed... and although I've learned a lot, I'm not sure it amounts to enough for me to succeed next time.

What I do know is that for most of my time in the job I've been putting a cheery face on an unhappy existence. Even before the pandemic, my move to Kent had coincided with both work and life generally becoming a lot less fun and a lot more stressful. My research usually felt like it was barely moving forwards. That does likely owe partly to my incompetence, sloth and/or feebleness of constitution. However, although I was once slapped down for saying it, I'll say it again: the teaching load has been unquestionably high. That's not a function of the number of lectures per year so much as of less quantifiable factors: the huge amount of direct contact with students (often with few TAs to absorb this), the poor design of admin processes (academics are not removed from the loop to anything like the extent they could be), and a heavy reliance on coursework and the like. (There's more; I'll save the complete list for another time.) The kinds of work that give me satisfaction are not just research—they do also include giving good lectures and explaining stuff to students. But in all cases, the satisfying kinds of work have been a small part of what I do, and the ones that feel most squeezed by other things. In all this, the institution has more often felt like an adversary than an ally. I really hadn't been expecting to apply for jobs elsewhere—my tentatively planned “drastic life change” was something like “move to Whitstable”—but when the pandemic hit, my ties to the institution felt weak. The noises coming out of the central university ‘management’ left me feeling more alienated than ever. Suddenly, moving seemed like the right thing.

Will I enjoy things better at King's? Obviously I think there's a chance. It's difficult to feel certain though. While the University of Kent has been going through an especially rough patch lately, no UK university is in great shape. My impression is that the marketisation of HE has been particularly unkind to Kent—an institution born in the optimistic sixties, and firmly of a plucky-yet-small character. It's located in the corner of a wider region not short of powerful “competitors”. Historically, public policy gave institutions room to carve out their own niche, amid an environment of only gradual change. Since the nineties but especially in the last ten years, politicians have cast aside stability and artificially induced competition. To understand the present, you can do worse than look back twenty years and recognise the glacial forces in action. Kent's Research Strategy from 1999 tells the story of those times: despite working harder and doing better, government policy had already placed it on the side of the curve that is deemed to deserve less, in a misguided belief that induced struggle would lead to a better-functioning institution. It hasn't.

Staying with the history theme, I enjoyed reading Graham Martin's account of the University of Kent's first twenty-five years, “From Vision to Reality”. Those years included an especially high-minded first fifteen followed by an apparently strong showing in the next ten, even when university budgets were coming under a big squeeze under Thatcher. Someone still needs to write the next book, covering 1990 onwards, which I would love to read. Somehow, those thirty years went from pragmatic getting-by to a battle for survival. A (very) senior colleague from another department mentioned that when he had started as a lecturer, for undergraduate applicants Kent was “a viable second choice to Oxbridge”. Some of its sibling “plate-glass” institutions, such as York and Warwick, have consolidated that status... Kent definitely has not. I don't know when or whether a change occurred, and perhaps the memory is a little rose-tinted. (I'm curious about what happened around 1994 in the graph on page 11 of a little report I wrote, where the research indicators started going down... not clear it can be explained simply by sudden competition from ex-polytechnics. In any case the data is noisy.)

Another major factor for me has been that Canterbury, although charming, is a small place. I tend to like smallish places and was feeling pretty optimistic when I moved. In hindsight I should have foreseen that there is a world of difference between Cambridge and Canterbury. Not only is Canterbury half the size population-wise, it has less than half of the cultural activity. In fairness, that is mainly because Cambridge punches far above its population-weight on that. Canterbury is a fine place if you're a family type, and it's great for countryside, coastline and lazy weekend charm. For a thirtysomething single person who wants culture and social life, it's not the best. London is just far enough away to be an exertion. Back when concerts were a thing that happened, I lost count of the times I pencilled in my diary “maybe go to London to see X” on some weeknight, but just wasn't up for the trip. Reverse-commuting from London might be a better option, but that entails the double-whammy of London housing costs on a non-London salary. Big-city living doesn't appeal much to me in any case. (I've already moved back to Cambridge, and will become some kind of “part-time commuter” for my King's job.)

Ironically, the university itself is surely a place where its educated thirtysomethings could find community. But the idea that academic staff can expect to find not just “work” but also community in their jobs is increasingly alien. At the departmental level, we in “Computing” (I still hate that name) do very well at being a friendly and sociable bunch. But more widely in the university there are few footholds available. (To an extent, students at Kent also seem to get siloed by subject, which can't be good.) The relatively small numbers of PhD students and research staff at Kent also mean there isn't a “long tail” to the age demographic, which might organically mix together lecturer with postdoc with graduate student with undergraduate. Such an organic long tail is of course something Cambridge thrives on... in addition to a considerable institutional culture of community-building. (In Canterbury, as I guess at most “normal” institutions, institutionally provided booze does not pour forth to anywhere near the same extent.) In former times at Kent, I imagine that the Colleges would have provided something of these kinds of community, though they appear not to have done that for some decades now. In fact rumour has it that “they never worked”... but I've also seen evidence that they once worked a lot more than they do now. The closest I get to feeling part of a wider university community is in union meetings, where I am reminded that I do have peers in the wider university, but we're all exhausted by our jobs. Indeed I'm sure I could have made a better go of things socially in Canterbury if I'd had more energy to spare.

Despite all that, I am seeing a few green shoots at Kent. In the new School-in-Division structure, we have better availability of certain kinds of support staff. At the very top of the institution there is a potentially leaner and more academic-focused structure including several fresh faces. The Senate is once again (just about) a majority-academic body that is starting to rediscover its voice. The university is unfailingly keen to reassert itself as a still-“excellent” research university, even if ‘management’ knowledge of how to do so seems patchy. In Computing I am seeing (unscientifically) a small increase in the average quality of our undergraduate applicants. We may even have made headway against the chronic understaffing and duplication of teaching work that have overloaded us during the few years I've been around. That's all good.

However, in common with many institutions, I also see signs of the opposite: the still-expanding reach of an entrenched managerialism, poor judgement in treatment of staff and finances, an ever-increasing culture of top-down decision-making, the creep of corporate culture at the expense of the academic, and an increasing disconnect between ‘management’ and the academic staff who carry out the mission of the institution. Just as some ‘management’ want to get more serious about research, so the same people seem to view this simply as a way to make academics raise an ever-greater share of the funds behind their own jobs. All UK universities are subject to these or similar forces right now. My pessimistic take is that our universities are managing to preserve academic values only to the extent that they preserve academic self-governance and self-administration—that is to say, hardly at all. Even in Cambridge, where academics have both a culture of pushing back and some mechanisms for doing so, it felt to me much like the option of taking up a cudgel while the artillery rains down. It's unlikely it does more than fractionally slow the institution's progress in the sector-wide race to the bottom. That helps to preserve its advantage, but doesn't make for a well-functioning institution.

So, in my new job at King's, the most reasonable expectation is an institution with a somewhat different selection of faults. I have to roll the dice and hope that what turns up is a better fit for me personally. It's also a chance to think more strategically about what I want to do with my limited time on this planet. The past few years have been a blur. I've tried to keep too many plates spinning, with the result that all of them are teetering if they're not already on the floor. My health and self-confidence have suffered too. I'd been considering Kent the only roll of the academic dice that I was likely to get, so I'm feeling fortunate to get another one. I mustn't waste it by plodding on with “more of the same”.

I will have to let go of some of my research avenues, so that the others might actually lead somewhere. I need to focus on the ones where I'm most likely to make a difference to the world over the course of a longer stretch of my career. It takes courage to adopt a longer-term view. I will need to care less about covering my backside on short-term “impact” or “track record” issues. In caring less, I will likely trade away short-term productivity for the potential longer-term... so will probably close myself off from future rolls of the dice. I therefore need to make sure that my long-term plan can actually deliver.

I will need to be more ruthless in protecting time for the things that are important, and organising my so-called (work) life a bit more selfishly around what works for me. If I do this right, I will sadly be a less generous colleague who more often does a poor job of certain less important things. This brings an emotional cost; I enjoy doing a good job of things, even the unimportant ones, and am averse to disappointing people.

Another way to see that is that I need to choose the strategically better flavour of disappointment, and get used to it. I have repeatedly let my mind and schedule become overwhelmed by admin, teaching, marking, service, and making myself available to others. That's not a statement of how generous I am, but of how limited my capacity for those things really is. Recently it has become self-defeating, and I've been disappointing myself and others in what really matters—research very much included. It's been a downward spiral that has limited my ability to function in general. The lesson seems to be that in this job you need to declare bankruptcy early. Bad feeling is guaranteed in some form or other. I need to do what is in my own interests, and be hard-nosed about that.

“Nose of steel” is not one of my attributes. And more generally, although I consider myself a natural “classic” academic, I'm still not entirely convinced that modern academia really is where my vocation lies, or should lie. I am especially resistant to shaping myself into the mould of “fundraiser–manager”—increasingly the presumed template for “science” academics, in the eyes of universities and funders and governments. Although it might make sense in bench sciences, it is a poor fit both for the work I do and for me personally.

Despite all that, the academic role is emphatically not one thing, to the point that generalisations quickly get annoying. The regrettable “winners and losers” political set-up amplifies even more the extent to which the “same job” can be radically different for different people and situations. It also means that change is possible. It seems worth one more roll.

[/highered] permanent link contact

Wed, 03 Mar 2021

Career thoughts on academia, industry and points in between

In early 2018 I was at a crossroads in my career: academia, industrial research, or somehow go it alone? I was a postdoctoral researcher at the University of Cambridge, but feeling some pressures to move on. This post is a (somewhat edited) time capsule containing some notes-to-self I made about that decision.

What happened is “history”, of course: I became an academic at the University of Kent. However, you wouldn't necessarily have predicted that from reading these notes. I'm still not wedded to the academic path; most of the questions here are still live in my mind. What I am wedded to is my personal line of research; I'm not just going to “go and work for Google” (shudder). Being a computer scientist, the options and trade-offs I face are somewhat different from those in various other disciplines.

(I should qualify: here academia means mostly “UK academia”, which has some particular problems right now... but is, sadly, not too divergent from the rest of the world.)

I have not made that much effort to edit the notes, so they are sometimes staccato, sometimes repetitive, and don't reflect where my thinking has moved on (especially from experience at Kent; in hindsight they contain a lot of Cambridge privilege). They start here.

Industrial research obviously forms a spectrum: some has more of a development flavour, whereas in some extreme cases it resembles academic research. I would want to fall near the latter end, although not necessarily maximally so.

Problems with academia. There are many. Let me start somewhere, with the following.

Government interference, managerialism. I have become jaded and frustrated by these aspects of the academic sector. They are arguably worse than corporate nonsense in industrial research, because they are so needless. Endless political “initiatives” and fad-chasing, in academia, seem less excusable than when arising from a profit motive. See also “overheads” below.

Research councils are unreliable; it's better not to depend on them. However, as an academic one has the option of using them or not; industry funding is also an option, modulo perhaps high overheads.

Unlike most sciences, I don't need big funding from research councils. My skills are in demand; perhaps I could make enough money from consulting work? My corner of CS is one where a small group, or even a focused individual, can do great work; I don't need to pay for a huge lab or huge group.

I have always felt like an outsider in the academic world. I have a habit of rejecting established research programmes, and another habit of attacking cold problems. My work meets with bewilderment fairly often. Many of the people who seem to “get it” come from industry—thinking of my blog, Twitter and industry-talk audiences. Industry conferences are interesting—they make me feel sometimes a peer, sometimes a learner, sometimes an educator. It is an educational environment, but worlds away from undergraduate computer science; food for thought.

Skill set, part one. I see myself as both a creative individual and an individual creator. Unlike many [natural] “scientists” I do not see myself as a collaborative agent in the service of a grand research programme. To stick with academia seems to mean sticking with a system where this is frowned upon or misunderstood, partly from how my kind of CS research is different from “science” “research” in the main. (Of course creative direction doesn't preclude “collaborating”, but it has to be a collaboration on the basis of shared vision, not just nuts and bolts.)

Skill set, part two. Context-switching is not my forte. I don't know for sure, but I feel I could easily be crushed by even a moderate teaching or admin load.

Textbook teaching is a turn-off. In Cambridge I already spend a lot of teaching effort on material that I don't even know, that I've never needed to know, that is not where my expertise lies, and that has dubious educational value (never mind training value). In this instance I'm thinking mostly about the tedious way in which compilers are taught here [in Cambridge] and elsewhere. That perhaps comes down to the fact that my strength is not mathematics, yet in CS, at least in Cambridge, the educational material is desperate to make things mathematical wherever it can. I care first for education rather than training. But this is neither good training nor good education for many of the people with relevant skills/interests. (It is interesting that education may be tailored to a skill/interest profile without becoming “training”—or so I contend.)

Textbook rejoinder: I should take care to remember that the textbook stuff did teach me some things. As a beginning programmer, the mechanics of programming were fine, but I couldn't come up with how to solve certain problems: how to write a game or other reactive system (using an event loop), or how to do heuristic rather than brute-force search (I struggled with my “Countdown numbers game” program; I remember marvelling at the pathfinding abilities of later Sierra SCI games); or how to write a parser that worked systematically rather than by ad-hoc rewritings (I remember wondering how Sierra AGI games did this... noticing a theme here). Only by imbibing some textbook-ish wisdom (sometimes from lectures) could I solve these problems.

Academic, non-practical teaching of practical topics: I've wasted too much of my life staring at students' handwritten code on paper not being sure whether it works or not. A very innovative academic course might avoid this using automated testing. But the cost of redeveloping these from scratch is prohibitive if done on a per-institution or per-academic basis... once again, justifying MOOCs (more on MOOCs below). One could argue that a lectureship would allow me to fix that, by designing the courses differently. However, no doubt my room for variation would be limited, and it all comes out of the time budget.

Teaching repetition, teaching overstretch. Lecturers produce hastily-written slide decks over and over again. They throw away their predecessors' and start over, but throw in their own new fresh set of mistakes when they do so (thanks, Philip). As a supervisor I'm tired of reading hasty, sloppy, unclear lecture materials and of seeing older materials thrown away once a new lecturer starts. The set-up of departments asking lecturers to teach courses, which the lecturer then “owns”, encourages this. Might it be better if the institution owns the course? It's hard to say. Might it be better if lecturers own courses but maintain them whether or not the institution asks them to keep giving it? Seems optimistic. Continuity seems to be the important thing: not just of running the course, but in the creative vision behind it. The current system takes liberal opportunity to break continuity.

High overheads. The academic sector has high and increasing overheads on the funding it receives, not unrelated (call me cynical) to its growing managerialism. If someone in industry has an interest in funding some exploratory research with a university, paying 130% overheads (or whatever) can quickly erase their perceived value-for-money. It doesn't help that (Cambridge) institutional policies on overheads fail to distinguish big/rich companies from small/startup ones.

Availability of eager young helpers. This is where the academic world does well. You get access to some bright young things who can contribute prodigious efforts towards your cause for little or no money, because (sad but true) they're still in a mode of paying money for the privilege of learning. Still, to date I've yet to really make student projects work for my research; I've seen others do it (e.g. Hrutvik's CakeML project), but it seems better in Cambridge than elsewhere simply because strong students are in greater supply. Industrial research does well too in this regard, usually via established programmes of internships, but that works mostly with more advanced students and requires big money.

Libraries, journal subscriptions etc. Any form of going it alone, as some mix of consultant and “gentleman scientist”, would suffer from lacking these, unless I could also keep (some fraction of) a foot in an academic institution. Or maybe physical “local resident” access to an enlightened university's library is enough.

Future of academic CS, part one: does traditional academic learning have a healthy future in general? Does it have a healthy future in CS-style learning in particular? I'm being vague because I start to doubt the value of much of CS teaching—although I should preface that “in Cambridge”. In certain other institutions, it looks more vocational, which is potentially fine, although it's not really my calling and I don't like to see it masquerade as “education”. My working hypothesis is that textbook academic-style CS content can be delivered via MOOCs fairly effectively, so there's no point investing myself in this style of teaching unless I have the appetite to innovate in that space (I don't, yet).

Future of academic CS, part two: two other aspects of “CS education” are less easily dispensed with. Firstly, practical skills; they take practice. Secondly, deeper and longer-term perspectives, which in theory are what academics are good at. These are the sorts of things that (I hypothesise) make my writings and talks appealing to audiences like Strange Loop, Hacker News, etc.. These people mostly have experience under their belt and enjoy material that helps widen their viewpoint. It would be hard to teach undergraduates, meaningfully, the sort of content that I write in those articles. This could all be an artifact of how we have to teach CS “basics” at degree level. Perhaps (I'm guessing) in other subjects one can deliver a more educational degree because students know the basics and are ready to think about the deeper things.

Many things I wouldn't miss about leaving academia. Two are constant pressure to overextend myself, and constant dissatisfaction from doing a mediocre job. (These have “Matt Welsh resonance”.) Another is the expectation of poor work/life balance. I'm okay with the Collegiate Cambridge model of building your life around your work, i.e. deliberately eroding any separation. In fact I think that dedicated creative work benefits from this. But it doesn't mean that one should be trying to work unhealthy hours. Creativity and clear thought don't benefit from that. Modern academic careers are so pressured that it seems hard to avoid this.

The PhD mill, and my moral objection to it: there's a sort of expectation that as an academic I would acquire funding and “acquire” people, typically PhD students (and worse, RAs) to do “my work” (as opposed to just “their work, that I supervise”). Overall I'm not sure I can get behind this modus operandi. As a PhD student I did my own work. That only created trouble for me later, since it's not the normal thing, but still I feel no need to apologise for doing it. Meanwhile, fetishising lineage in research is part of what turns mostly-spent research programmes into self-perpetuating monsters, and promotes monoculture. Even the word “mentorship” makes me feel cynical.

There are some good things about academia. One reason for wanting a steady academic job has been stated to me as a “cushion for the lean years”. But there are other ways to build a cushion; how much cushion is really necessary? Do I expect any lean years? One should expect some periods that are lean in terms of grants/funding. But if I were a consultant, should I expect years that are lean in terms of clients? One would just have to put up with less interesting work, presumably.

The state within a state. Academia offers institutional support. Sometimes, purely public services (thinking public libraries) and/or wider enterprise (thinking co-working spaces etc) provide similar support. Is that enough? An intellectually stimulating environment is something academia can provide, but probably only the top institutions do really well.

R-e-s-p-e-c-t. Respectability of the academic path is another factor; but I reject that, in principle at least (though I'm not completely immune).

Growing a group: an ability to build a group around my agenda is probably a potential good thing about academia, despite the aforementioned distaste. A small group could perhaps be manageable without abstracting too much time away from activity that is nominally research. But is having 2–3 PhD students, plus teaching and admin, really productive and/or enjoyable relative to what I could do by myself (say, on research time derived from consulting income)?

Scaling to more than one person: academia lets one grow small and even medium-sized teams, although not without also becoming something of a manager. A small team is still many times better than a team of one. There is also some built-in direction: a PhD supervisor has some leadership-style influence. Of course I instinctively dislike that the moment it becomes “control”. In other contexts the same dynamic might conceivably fly (or not) for my work... I can think only of open-source as a team-building tactic. Getting an equivalent extent of team-building that way seems hard. Open source has its own problems: even fewer funding avenues, potential for pulling in different directions (but this happens in academia too), project politics, unreliable (and/or physically remote) contributors, unpredictable timeframes, mob rule of online forums, etc..

What do I think about US-style CS degrees and academia? They admit some amount of liberal arts-style breadth, and take longer to teach material [than in Cambridge], including (sometimes) proper practical classes. But their research story is probably still infected by many of the things I don't like: mathematical orthodoxy, more generally the tyranny of many incumbent research programmes that I don't believe in, the gameability of modern research culture, ditto for career incentives, and the faddish government/funding initiatives (is that less bad there than here? unclear).

Some more points for going it alone. I could pursue long-term research in perhaps a more focused, lower-stress working environment—but perhaps lonely and isolating if it's only me there... ideally want a balance. I could be free to tick some personal “living as I choose” boxes in doing so, regarding lifestyle, physical environment and working environment. After years in West Cambridge on a woefully mis-designed or undesigned site, I miss what makes the centre of Cambridge such a good space: space-efficient small-town planning. The University is moving further from this model, partly for understandable reasons, yet its managerial rather than academic leadership means it is failing to develop an acceptable alternative. I like the idea of working somewhere more rural, with easier access to nature. I could vote with my feet in favour of economic localism and other values that large organisations are painfully slow to catch on to.

Reading the page on "rat race" on Wikipedia, it fits academia well. I am reminded that although the academic world attracts a formidable density of very bright people, bright people need not be far-sighted or big-thinking people. In fact I suspect the correlation between these is fairly weak. We shouldn't suppose that adherence to “the system” is a consequence of intelligent analysis and rational optimisation; bright people can be remarkably strongly bound to social norms and mores, even when they have the intellectual capacity to question them.

An institution in mine own image? I am somewhat inspired by the Recurse Center. And I am a believer, in principle, in “if you don't see the institution you want, create it”. Creating “my own institution” really just means doing things in ways that work for me but also in a way that might provide “a home for other people”, i.e. to fulfil the moral duty of an elder. This has some appeal. If I did create a space for others, it would be intentionally small and definitely not mainstream, so it would be fine if it didn't appeal to many people. It would mainly be about enabling people to do research, as well as enabling myself; but maybe it'd have an educational angle too. I am a fan of Cambridge-esque college-style residential establishments, as a way both to build communities and to limit costs (but this may not fly without big/expensive perks, and needs updating anyway). I'd be pleased if such an institution could contribute to a low-wage-economy area (which would keep property costs down; thinking the antithesis of San Francisco) as long as it was not too far from civilisation. My educational interests are more in substantial life-long learning than in an undergraduate hothouse or “stamping machine”. Ditto research. How could it work?

[/highered] permanent link contact

Mon, 04 Jan 2021

Chain loading, not preloading: the dynamic linker as a virtualization vector

Suppose you want to run a pre-existing binary program but in some kind of instrumented or modified form. On ELF-based systems, people often do this with LD_PRELOAD: you preload a library that interferes with some C library calls or somehow tweaks the process state at startup (say to install a signal handler). The program itself is (hopefully) oblivious to what you've done, but its behaviour is modified accordingly. Some years back I saw a nice survey of applications in a presentation by Kevin Pulo; there may be other resources out there.

This post is about some alternative ELF-level trickery, that is sometimes a better option for doing this. Here it's shown on GNU/Linux, but any ELF platform should allow it.

If you've tried LD_PRELOAD, you'll know that there are lots of problems and gotchas with it. Your code has to deal with being initialized at an unpredictable time; the C library's binary interface evolves from version to version, so is prone to additions that sidestep your preloads; it doesn't work for programs that do direct system calls (not via the libc) or are statically linked. What I'm about to describe avoids many of these problems, by creating what I'll call an ELF chain loader. This loader must do any program instrumentation at the instruction level, rather than the linkage level as with LD_PRELOAD. So in a sense, chain loading is less powerful than preloading: it's working at a lower level, and all your tweaks must be somehow “installed” at startup, rather than as part of the “dialogue” of library calls. However, there's a pattern we can use to overcome that limitation, so in fact I'd argue the result is more powerful than LD_PRELOAD, although (for now) harder to use.

Suppose you have a command invocation

$ ./myprog

and want your tweaked command to be invocable something like

$ LD_PRELOAD=mymods.so ./myprog

or (nicer)

$ ./mymods ./myprog

In the latter, mymods is a chain loader binary. A chain loader is basically a stub dynamic linker. It doesn't actually do dynamic linking; it's invoked like a dynamic linker, but it just loads another dynamic linker—after doing whatever tweaks and fiddles you'd like it to do.

Dan Williams has written (on a page now sadly disappeared, but WayBack-available) about how to build your own dynamic linker or ld.so. Building a binary of the right form is the easy part. Building one that actually links and runs real programs is a big job, especially if you're using a complex C library (like glibc) or complex run-time features (like thread-local storage). But in outline, a dynamic linker basically looks like this.

  1. do bootstrap relocation
  2. initialize some stuff (link map, TLS, ...)
  3. locate and map libraries, transitively, resolving symbols, applying relocations, etc.
  4. call constructors
  5. jump to the main program's entry point

The chain loader that we'll build is basically just a stub, which will call onward to the real ld.so to handle most of the above stuff. So it is much simpler:

  1. do bootstrap relocation
  2. mess with the process however we like
  3. locate and map the next loader (e.g. the real dynamic linker)
  4. (optional) disguise the fact that we ever existed
  5. jump to the next loader's entry point

If we do it right, the next loader won't even know that the stub loader has run. We can think of a chain loader as a “virtualization vector”. It sets up a (somehow-) virtualized environment, in which the entire original process, from dynamic linking onwards, then proceeds to execute.

(The “disguise” part is interesting. It makes us almost indistinguishable from malware! Virtualization is about transparently subverting a pre-existing process's execution, and there's a large class of malware that also does this.)

A few years ago I wrote a skeleton of a dynamic linker, called donald. It can load only very simple statically linked programs—hardly a real dynamic linker! Nevertheless it's a good vehicle for understanding what goes on during process start-up, and is a basis for building our chain loader. Caveat: it's very quick-and-dirty; it doesn't always use robust coding patterns, and it uses some unportable hacks to keep the code simple. That makes it possible to show almost all of it in a single blog post! Following Dan Williams's recipe, the command we use to link donald is the following.

$ cc -std=gnu99 -g -fPIC -fno-stack-protector \
  -fuse-ld=bfd -Wl,-Bsymbolic -nostdlib -nostartfiles -shared \
  -o "donald.so" premain.o main.o entry.o load.o \
  -Wl,-Bstatic /path/to/musl/lib/libc.a -Wl,-Bsymbolic \
  -T donald.lds -Wl,-soname=ld-linux.so.2 

where we're using a nice simple C library implementation, namely musl, to provide the libc calls that we use internally. The donald.lds file is a lightly tweaked linker script that adds a symbol that'll help us do bootstrap relocation, as per Dan Williams's article. The work of donald is done in the C files premain.c, main.c, entry.c and load.c.

“Bootstrap relocation” refers to the solution to an essential problem: who links the dynamic linker? In old-fashioned static linking scenarios, binaries contain a ready-to-go memory image. The program is loaded at the address it was linked for, so all address bindings are known and references were fixed up when the binary was linked. By contrast, in dynamic linking scenarios, a library might be loaded at any address, meaning it still contains some relocation records describing fixups needed at load time. These are done by the dynamic linker. But what if you are the dynamic linker? The dynamic linker itself is just ld.so, a shared library; it, too, can be loaded anywhere. So who relocates the relocator? The answer: it relocates itself. The entry path of the dynamic linker consists of code specially crafted to run correctly regardless of where it is loaded. One of its first tasks is to locate the table of its own relocation records, relocate itself using them, and then continue execution in a “normal” environment.

On the x86-64 architecture, this is pretty straightforward because most data and code references are PC-relative. (On other architectures it's often much hairier, because it's harder to avoid using any absolute address references before you've completed bootstrap relocation.) Usually the entry path is written in per-arch assembly, precisely to accommodate the less straightforward architectures. In donald, which works only on x86-64, we have the option to write it directly “in C”. It's not really C, because we're bound by a very wacky contract: don't write any code that will cause the compiler to use non-relative addressing modes! The entry point of donald, called directly by the kernel, looks like the following.

/* The function prologue pushes rbp on entry, decrementing the stack
 * pointer by 8. Then it saves rsp into rbp. So by the time we see rbp, 
 * it holds the entry stack pointer *minus 8 bytes*. */
#define BP_TO_SP_FIXUP 0x8
void *rsp_on_entry HIDDEN;
/* This isn't the usual "main"; it's the raw entry point of the application.
 * We link with -nostartfiles. We then define our own main. */
int _start(void)
{
    /* gcc doesn't let us disable prologue/epilogue, so we have to fudge it.
     * We assume rsp is saved into rbp in the prologue. */
    register unsigned char *bp_after_main_prologue;
    __asm__ ("movq %%rbp, %0\n" : "=r"(bp_after_main_prologue));

    int argc;
    char **argv;
    rsp_on_entry = bp_after_main_prologue + BP_TO_SP_FIXUP;
    preinit(rsp_on_entry, &argc, &argv); // get us a sane environment

    printf("Hello from " DONALD_NAME "!\n");

    int ret = main(argc, argv);

    /* We're executing without startfile code, so returning 0 would not make sense. 
     * Calling exit() brings a dependency on fini_array stuff that we need to avoid
     * since it depends on the startup files. So just do the syscall directly. */
    syscall(SYS_exit, ret);
}


When the kernel jumps to the above code, the stack pointer is pointing at a bunch of useful information: the program arguments and so forth. But the C compiler will generate a prologue that moves it away from that. So the first thing we do is use a hard-coded hack to “undo” that, and get the “stack pointer on entry”, which we pass to our preinit function.

Using this pointer, preinit can navigate the initial stack to find all that data of interest. The data starts with argc; above that is argv (the vector of pointers to argument strings), above that envp (the same but for environment variables), then the auxiliary vector of records containing “stuff the kernel wanted to tell us” and finally the strings these argv and envp vectors point to. You can read more about auxiliary vectors here. Our preinit parses this chunk of memory, and extracts a couple of facts that we'll need to do our work: the page size and our own base address.
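donald's preinit isn't shown in full here, but a minimal sketch of the idea looks something like the following. The signature matches the call above; the global names are assumptions made for illustration, and the real code differs in details (for instance, in the invoked-as-a-program case the base address has to be derived from AT_PHDR rather than AT_BASE).

/* Sketch only: walk the SysV x86-64 initial stack layout: argc, then argv[]
 * and a NULL, then envp[] and a NULL, then auxv entries ending with AT_NULL. */
#include <link.h>     /* ElfW() and the auxv types */
#include <stdint.h>

ElfW(auxv_t) *p_auxv;          /* globals used by later snippets (assumed names) */
uintptr_t our_base_addr;
unsigned long page_size;

void preinit(void *rsp_on_entry, int *p_argc, char ***p_argv)
{
    uintptr_t *p = rsp_on_entry;
    *p_argc = (int) *p++;               /* argc */
    *p_argv = (char **) p;              /* argv[0..argc-1] */
    p += *p_argc + 1;                   /* skip argv and its NULL terminator */
    while (*p) ++p;                     /* skip envp... */
    ++p;                                /* ... and its NULL terminator */
    p_auxv = (ElfW(auxv_t) *) p;        /* the auxiliary vector starts here */
    for (ElfW(auxv_t) *a = p_auxv; a->a_type != AT_NULL; ++a)
    {
        if (a->a_type == AT_PAGESZ) page_size = a->a_un.a_val;
        /* AT_BASE is our base address when we run as the requested interpreter;
         * when run as a program we'd derive it from AT_PHDR instead (not shown). */
        if (a->a_type == AT_BASE) our_base_addr = a->a_un.a_val;
    }
}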

Bootstrap relocation uses the latter of these. We find our own dynamic linking information (symbol table and relocation table) using the _DYNAMIC symbol. But remember “this isn't C”: simply referring to _DYNAMIC may not work yet, because it's defined outside the current compilation unit (by the linker, in fact) so the compiler may emit a reference that needs relocation. By definition, the relocation hasn't been applied yet. In donald I used an unorthodox (and unportable) hack: taking the address of _DYNAMIC in our C code before we relocate ourselves gets us the unrelocated (file-relative) address, to which we manually add our own base address as snarfed from the auxiliary vector. (Better code would probably declare _DYNAMIC as file-local, using the ELF “hidden” or “internal” visibility, which would allow the linker to fix this up into rip-relative addressing.) Either way, we can get to the _DYNAMIC vector, which gets us to our symbol and relocation tables; then we walk the relocation table and apply each reloc.

static inline void __attribute__((always_inline)) bootstrap_relocate(unsigned char *at_base)
{
    /* We scan _DYNAMIC to get our own symbol table.
     * HACK: we manually relocate &_DYNAMIC
     * by our load address to get its actual address. */
    ElfW(Dyn) *p_dyn = (void*)(at_base + (uintptr_t) &_DYNAMIC);
    ElfW(Sym) *dynsym_start = NULL;
    unsigned long dynsym_nsyms = 0;
    ElfW(Rela) *rela_dyn_start = NULL;
    ElfW(Rela) *rela_plt_start = NULL;
    unsigned long rela_dyn_sz = 0;
    unsigned long rela_dyn_entsz = 0;
    unsigned long rela_dyn_nents = 0;
    unsigned long rela_plt_sz = 0;
    while (p_dyn->d_tag != DT_NULL)
    {
        if (p_dyn->d_tag == DT_SYMTAB) dynsym_start = (void*)(at_base + p_dyn->d_un.d_ptr);
        else if (p_dyn->d_tag == DT_SYMENT) dynsym_nsyms = p_dyn->d_un.d_val;
        else if (p_dyn->d_tag == DT_RELA) rela_dyn_start = (void *)(at_base + p_dyn->d_un.d_ptr);
        else if (p_dyn->d_tag == DT_RELASZ) rela_dyn_sz = p_dyn->d_un.d_val;
        else if (p_dyn->d_tag == DT_RELAENT) rela_dyn_entsz = p_dyn->d_un.d_val;
        else if (p_dyn->d_tag == DT_JMPREL) rela_plt_start = (void *)(at_base + p_dyn->d_un.d_ptr);
        else if (p_dyn->d_tag == DT_PLTRELSZ) rela_plt_sz = p_dyn->d_un.d_val;
        ++p_dyn;
    }
    if (rela_dyn_entsz > 0) rela_dyn_nents = rela_dyn_sz / rela_dyn_entsz;
    /* We loop over the relocs table and relocate what needs relocating. */
    ElfW(Rela) *p_rela = rela_dyn_start;
    for (int i = 0; i < rela_dyn_nents; ++i)
        do_one_rela(rela_dyn_start + i, at_base, dynsym_start);
    p_rela = rela_plt_start;
    /* Also do .rela.plt */
    for (int i = 0; i < (rela_plt_sz / sizeof (Elf64_Rela)); ++i)
        do_one_rela(rela_plt_start + i, at_base, dynsym_start);
}

I haven't shown the function that applies a single relocation record, do_one_rela, but it's pretty simple. In practice it only needs to know a few flavours of relocation.

static inline void __attribute__((always_inline)) 
do_one_rela(ElfW(Rela) *p_rela, unsigned char *at_base, ElfW(Sym) *p_dynsym)
{
#define SYMADDR(r_info) (p_dynsym[ELF64_R_SYM((r_info))].st_value)
    Elf64_Addr *reloc_addr = (Elf64_Addr *)(at_base + p_rela->r_offset);
    switch (ELF64_R_TYPE(p_rela->r_info))
    {
        case R_X86_64_RELATIVE: // no symbol addr, because we're RELATIVE
            *reloc_addr = (Elf64_Addr)(at_base + p_rela->r_addend);
            break;
        case R_X86_64_64:
            *reloc_addr = (Elf64_Addr)(at_base + SYMADDR(p_rela->r_info) + p_rela->r_addend);
            break;
        case R_X86_64_JUMP_SLOT:
        case R_X86_64_GLOB_DAT:
            *reloc_addr = (Elf64_Addr)(at_base + SYMADDR(p_rela->r_info));
            break;
        default:
            /* We can't report an error in any useful way here. */
            break;
    }
#undef SYMADDR
}

We're now getting ready to take on the main task of being a chain loader, which plain donald is not (at least not intentionally). Ideally we want to be invocable in two ways: as a bona fide dynamic linker requested directly by an executable, as we see here:

$ readelf -Wl /bin/blah

Elf file type is EXEC (Executable file)
Entry point 0x401432
There are 9 program headers, starting at offset 64

Program Headers:
  Type           Offset   VirtAddr           PhysAddr           FileSiz  MemSiz   Flg Align
  PHDR           0x000040 0x0000000000400040 0x0000000000400040 0x0001f8 0x0001f8 R E 0x8
  INTERP         0x000238 0x0000000000400238 0x0000000000400238 0x00001c 0x00001c R   0x1
      [Requesting program interpreter: /path/to/my/ld.so]

and as a command-line tool

$ /path/to/my/ld.so /bin/blah

These two cases work out slightly differently at run time. The first is the “normal” case, and the kernel has already mapped both the program itself and us the dynamic linker. In the latter, the kernel thinks it's loading us as a statically linked binary (curiously not caring that we're ET_DYN not ET_EXEC), and it's up to us to parse our arguments, identify the program we want to run, and load it. The contents of the auxiliary vector are slightly different in the two cases, and indeed that's how we'll tell the difference and behave accordingly.

int main(int argc, char **argv)
{
    // were we invoked by name, or as a .interp?
    // use AT_ENTRY to find out: it's _start if we were invoked as a program,
    // otherwise it's the program's _start
    int argv_program_ind;
    uintptr_t entry = (uintptr_t) &_start;
    _Bool we_are_the_program = 1;
    for (ElfW(auxv_t) *p = p_auxv; p->a_type; ++p)
    {
        switch (p->a_type)
        {
            case AT_ENTRY:
                if (p->a_un.a_val != (uintptr_t) &_start) we_are_the_program = 0;
                entry = p->a_un.a_val;
                break;
        }
    }
    fprintf(stderr, "We think we are%sthe program\n", we_are_the_program ? " " : " not ");

Since we're a chain loader, most of what we want to do is load the next loader and jump to it. To load the next loader, we'll need to memory-map some chunks of the file, by reading its ELF header and program headers. We can simply do this by file I/O.

    /* We always chain-load the ld.so and let it load the program. Let's read it. */
    const char ldso_path[] = "/lib64/ld-linux-x86-64.so.2"; // HACK: sysdep
    int ldso_fd = open(ldso_path, O_RDONLY);
    if (ldso_fd == -1) { die("could not open %s\n", ldso_path); }
    struct stat ldso_stat;
    int ret = fstat(ldso_fd, &ldso_stat);
    if (ret != 0) { die("could not stat %s\n", ldso_path); }

    // read the ELF header
    ssize_t nread;
    ElfW(Ehdr) ehdr; nread = read(ldso_fd, &ehdr, sizeof (ElfW(Ehdr)));
    if (nread != sizeof (ElfW(Ehdr))) die("could not read ELF header of %s\n", ldso_path);
    // check it's a file we can grok
    if (ehdr.e_ident[EI_MAG0] != 0x7f
        || ehdr.e_ident[EI_MAG1] != 'E'
        || ehdr.e_ident[EI_MAG2] != 'L'
        || ehdr.e_ident[EI_MAG3] != 'F'
        || ehdr.e_ident[EI_CLASS] != ELFCLASS64
        || ehdr.e_ident[EI_DATA] != ELFDATA2LSB
        || ehdr.e_ident[EI_VERSION] != EV_CURRENT
        || (ehdr.e_ident[EI_OSABI] != ELFOSABI_SYSV && ehdr.e_ident[EI_OSABI] != ELFOSABI_GNU)
        // || phdr->e_ident[EI_ABIVERSION] != /* what? */
        || ehdr.e_type != ET_DYN
        || ehdr.e_machine != EM_X86_64
        die("unsupported file: %s\n", ldso_path);
    off_t newloc = lseek(ldso_fd, ehdr.e_phoff, SEEK_SET);

    // process the PT_LOADs
    ElfW(Phdr) phdrs[ehdr.e_phnum];
    for (unsigned i = 0; i < ehdr.e_phnum; ++i)
    {
        off_t off = ehdr.e_phoff + i * ehdr.e_phentsize;
        newloc = lseek(ldso_fd, off, SEEK_SET);
        if (newloc != off) die("could not seek to program header %d in %s\n", i, ldso_path);
        size_t ntoread = MIN(sizeof phdrs[0], ehdr.e_phentsize);
        nread = read(ldso_fd, &phdrs[i], ntoread);
        if (nread != ntoread) die("could not read program header %d in %s\n", i, ldso_path);
    }

This is all fairly common to any dynamic linker. Now we've collected the PT_LOAD program headers, which tell us which parts of the file to map into memory. To rule out partial failures, we reserve enough memory to be sure we can map the loader in a big contiguous chunk, even though that might be split over many PT_LOADs.

    ElfW(Addr) max_vaddr = 0;
    for (unsigned i = 0; i < ehdr.e_phnum; ++i)
    {
        ElfW(Addr) max_vaddr_this_obj = phdrs[i].p_vaddr + phdrs[i].p_memsz;
        if (max_vaddr_this_obj > max_vaddr) max_vaddr = max_vaddr_this_obj;
    }
    uintptr_t base_addr_hint = 0x555555556000;
    void *base = mmap((void*) base_addr_hint, max_vaddr, PROT_NONE, MAP_PRIVATE,
        ldso_fd, 0);
    if (base == MAP_FAILED) die("could not map %s with PROT_NONE\n", ldso_path);
    uintptr_t base_addr = (uintptr_t) base;

Once we've done that, we have reserved space at some load address which could be arbitrary (but we'll use a hint of 0x555555556000)... all we need to do is turn each PT_LOAD header into the (usually) one corresponding mmap() invocation.

    uintptr_t phdrs_addr = 0;
    for (unsigned i = 0; i < ehdr.e_phnum; ++i)
    {
        if (phdrs[i].p_type == PT_LOAD)
        {
            _Bool read = (phdrs[i].p_flags & PF_R);
            _Bool write = (phdrs[i].p_flags & PF_W);
            _Bool exec = (phdrs[i].p_flags & PF_X);

            if (phdrs[i].p_offset < ehdr.e_phoff
                    && phdrs[i].p_filesz >= ehdr.e_phoff + (ehdr.e_phnum * ehdr.e_phentsize))
            {
                phdrs_addr = base_addr + phdrs[i].p_vaddr + (ehdr.e_phoff - phdrs[i].p_offset);
            }
            ret = load_one_phdr(base_addr, ldso_fd, phdrs[i].p_vaddr,
                phdrs[i].p_offset, phdrs[i].p_memsz, phdrs[i].p_filesz, read, write, exec);
            switch (ret)
            {
                case 2: die("file %s has bad PT_LOAD filesz/memsz (phdr index %d)\n", 
                        ldso_path, i);
                case 1: die("could not create mapping for PT_LOAD phdr index %d\n", i);
                case 0: break;
                default:
                    die("BUG: mysterious error in load_one_phdr() for PT_LOAD phdr index %d\n", i);
            }
        }
    }
Now we've mapped the program we want to run, i.e. the next loader. In a real dynamic linker, our work would only just be beginning: we've mapped the next program, but still have to walk its depended-on libraries and load those (transitively), apply relocations in all the above, and so on. But in our case, we're most of the way there. All we have to do is do whatever tinkering is demanded by our use case, then transfer control to the next loader, using the entry point in its ELF headers. The next loader is not just any old program, however. To cover our tracks and ensure it can find the auxiliary vector just as we did, we also return the stack pointer to where it was when we started.

void __attribute__((noreturn)) enter(void *entry_point)
{
    fprintf(stderr, DONALD_NAME ": jumping to system ld.so entry point %p with rsp %p\n",
        (void*) entry_point, rsp_on_entry);
    __asm__ volatile ("movq %0, %%rsp\n"
          "xorq %%rbp, %%rbp\n" /* clear rbp to avoid confusing stack walkers */
          "jmpq *%1\n" : : "m"(rsp_on_entry), "r"(entry_point));
}

In fact, we need to cover our tracks a bit better. If we're being invoked as a command, not as a requested dynamic linker, then we want to make the auxiliary vector look as if the next loader is the one that was invoked as a command too. So we do a pass over the vector to fix it up. One such fixup is required in the opposite (“requested”) case too.

    for (ElfW(auxv_t) *p = p_auxv; p->a_type; ++p)
    {
        switch (p->a_type)
        {
            case AT_ENTRY:
                if (we_are_the_program) p->a_un.a_val = entry_point;
                break;
            case AT_PHDR:
                if (we_are_the_program) p->a_un.a_val = phdrs_addr;
                break;
            case AT_PHENT:
                if (we_are_the_program) p->a_un.a_val = ehdr.e_phentsize;
                break;
            case AT_PHNUM:
                if (we_are_the_program) p->a_un.a_val = ehdr.e_phnum;
                break;
            case AT_BASE:
                if (!we_are_the_program) p->a_un.a_val = base_addr;
                break;
            case AT_EXECFN:
                if (we_are_the_program) p->a_un.a_val = (uintptr_t) &argv[0];
                break;
            default: break;
        }
    }

Note that in the “requested” case, since the AT_BASE vector entry is set to the loader's load address, it needs to be fixed up to that of the next loader. If we're invoked as a command, it'll be absent or 0 and can stay that way. The other cases are ones that only need fixing up if we're run as a command, as the vector needs to reflect the next loader's structure not our own (and not the program's, which it only reflects if we're run as a requested loader).

There's one more bit of track-covering we can do, and in fact need to, if we're “requested” rather than invoked. One interesting quirk of what we have so far is that the next loader, although its name is /lib64/ld-linux-x86-64.so.2, will think it's called /path/to/my/ld.so, and will create a link map entry accordingly. This will confuse debuggers, e.g. because they will look at the wrong file's symbols. It took me a while to figure out where the dynamic linker gets its own name, but it's obvious in hindsight: from the .interp string in the mapped program binary! (If we were invoked, it more obviously comes from argv.) So to avoid confusion, we want to overwrite the .interp string before we call the next loader. That takes some jiggery-pokery at link time, but we're already doing some of that to be an alternative requested linker—i.e. to link with --dynamic-linker=/path/to/my/ld.so. Linking in a small additional file assembled from

#if defined(__linux__) && defined(__ELF__)
.section .note.GNU-stack,"",%progbits
#endif
.section .interp, "aw"

is enough to make the .interp writable, and can also be used to add more padding if the stock path (which we'll be writing) might be longer than our requested path.

There's another wrinkle with this. Since the default linker script will place .interp among stuff that is otherwise read-only, this extra linked file will also have the side-effect of making writable the program headers, various notes and other stuff appearing in the segment. I'm not yet decided on the best way to deal with this. We could mprotect() this segment after we're finished with it, if we're sure we're the only reason why it should be writable. We could skip the above and simply use a pair of mprotect() calls to do the write, at least if we don't need to add padding. We could try to move .interp into the RELRO (read-only after relocation) part of the executable. Or perhaps we could arrange that our ld.so has two .interp-like chunks, one for its own name and one for the delegated ld.so, make them page-aligned, and do a remap rather than a write. This would waste some disk space in the binary, but avoid wasting memory in systems where many processes are loaded this way—a common trade-off in dynamic linking.
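Whichever way we arrange writability, the overwrite itself is simple: find the program's PT_INTERP segment (for example via AT_PHDR and AT_PHNUM from the auxiliary vector) and write the stock path over our own. The following is a sketch only, not donald's actual code, and all the parameter names are illustrative assumptions.

#include <link.h>
#include <stdint.h>
#include <string.h>

void fix_interp(ElfW(Phdr) *prog_phdrs, unsigned prog_phnum, uintptr_t prog_base,
    const char *stock_path)
{
    for (unsigned i = 0; i < prog_phnum; ++i)
    {
        if (prog_phdrs[i].p_type == PT_INTERP)
        {
            char *interp = (char *)(prog_base + prog_phdrs[i].p_vaddr);
            /* Only overwrite if the stock path fits in the existing space
             * (hence the padding trick mentioned above). */
            if (strlen(stock_path) + 1 <= prog_phdrs[i].p_filesz)
                strcpy(interp, stock_path);
            return;
        }
    }
}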

We could go further with track-covering, such as attempting to unmap ourselves as we hand over to the next loader. That would involve more sneaky malware-like tricks, and is overkill for most purposes.

So why is all this better than LD_PRELOAD? I can think of several reasons.

Why's it not so good?

The “but...” refers to a possible “big-hammer” way around the setuid problem: make our loader setuid root, albeit dropping root privileges super-early. That sounds crazy, but I believe it'd be an interesting demo for software verification. It only needs to be root just long enough to open the target binary and test for which UID to switch down to (either the invoking user or the file owner). The test could be done in a very short code path making probably just two system calls: one to open a file descriptor on the loaded file, and another to fstat() it. On many architectures it could be done before bootstrap relocation. Can we prove that code correct? Doing so is feasible if we have trustworthy formal specifications of instruction sets, system calls and (if needed) relocation. Not coincidentally, I've been involved in some research on these; there is more to do.

In a future post I'll demo a very simple syscall tracer built using this combination of chain-loading and bootstrapping instrumentation. It's basically an in-process strace. System call instrumentation is usually what people want when they use LD_PRELOAD, and are forced to settle for libc instrumentation instead.

Code is available (though definitely not out-of-the-box demoable; maybe by the next post) in my donald repository, and a bit more (mostly to be explained next time) in my liballocs (see allocsld) and libsystrap.

[/devel] permanent link contact

Thu, 23 Jul 2020

Building a simple toolchain extension, the subversive way

I recently ran into a nasty problem, arguably a bug, in GCC. I decided to write a little tool to help me detect when I'm triggering this bug. This post is about how I did that, using some simple but handy helper scripts I've been growing for a while.

The bug is to do with “constructor” functions in GNU C (not to be confused with C++ constructors!). I'll quote from my bug report, slightly edited.

I was surprised to find that my __attribute__((constructor(101))) on a function was
not giving it priority over another constructor whose attribute used a higher number.
It turns out this was owing to an additional prototype that did not mention the
function as being a constructor.

This caused a really subtle initialization bug in my code, and was tough to track
down. At the least, I would expect a warning when the information is discarded.

$ cat test.c
// comment these for correct behaviour, i.e. 1 then 2
static void do_init1(void);
static void do_init2(void);

static void do_init2(void) __attribute__((constructor(102)));
static void do_init2(void)
{
        printf("Init 2!\n");
}

static void do_init1(void) __attribute__((constructor(101)));
static void do_init1(void)
{
        printf("Init 1!\n");
}

int main(void)
{
        return 0;
}

$ cc -save-temps    test.c   -o test
$ ./test
Init 2!
Init 1!
$ grep -B2 init_array test.s
        .size   do_init2, .-do_init2
        .section        .init_array,"aw"
        .size   do_init1, .-do_init1
        .section        .init_array

The attribute's priority argument is being discarded. It should be visible in the
section name (".init_array.0101" etc.). If the extra prototypes are removed, the
programmer-intended behaviour is restored.

I can see how this would happen. Normally, constructor functions are not called explicitly, so don't need to be prototyped. In my liballocs project, however, sometimes my code can be called during very early initialization of the program. This is mostly because it preloads a wrapper for malloc(), which is called by the dynamic linker before it even gets around to running constructors. To deal with these incoming calls at such an early time, the library can be initialized on demand, earlier than the usual initialization sequence. To allow this, several constructors are also prototyped functions that can be called explicitly across module boundaries.

Although I could do a one-time pass over my code to fix the problem, this wouldn't prevent the same problem recurring in the future. Getting a diagnostic into GCC would be a long, slow process, and wouldn't help if I switched to another compiler that takes some other questionable interpretation of multiply prototyped constructors. So instead, I thought I'd write a simple and fairly standalone tool that can detect the problem during build. In short, I want to check that if a priority is given in a prototype, then that priority is reflected in the assembly code; if not, an error is raised.

My toolsub repository, although in somewhat rough-and-ready shape, provides a collection of scripts that let us interfere with the compilation and linking toolchain in useful ways. Most of the tools it provides take mechanisms provided by existing interfaces, such as the GCC driver's -wrapper option for supplying a wrapper script, but make them easier to use. It's tricky to use -wrapper because you have to parse various command lines, and some of these are compiler-internal commands like cc1 rather than the public cc, cpp and so on.

Firstly I want to do a pass over the preprocessed C code, outputting a summary of any constructors that are declared with a priority. This priority information is what the compiler is discarding; I want to catch when this happens. To fish out these priorities, I wrote a short OCaml program using CIL. This post isn't about CIL, but you can see that tool here; it's about 80 lines of commented OCaml. (To build it, you'll need to use my branch of CIL, because it uses one API feature that I haven't yet contributed back to mainline.)

This tool calculates what I'll call a “generous” summary of the constructor priorities declared in the file—if you gave a priority on any prototype, it will always include a priority in its output. (Currently it doesn't help much with the case where multiple distinct priorities are used. There is also a bug about that.) What GCC does is the “mean” version: unless you're consistent, it will silently drop the priority; this is the problem we want to catch.

We could just report an error if we see inconsistent prototypes, which is the case that provokes the bad behaviour in GCC. However, I wanted to be a bit more “ground truth” than that: can we test what the compiler's output actually says about the constructor's priority? To do that, we need to know how the constructor priority mechanism works, and indeed about how the constructor mechanism works at all. It's roughly as follows.

For each constructor, the compiler generates assembly code to output a pointer to that function in the .init_array section. At run time, initialization code will walk this section and call each pointed-to function in order (this is done by the dynamic linker, if you're linking dynamically). To give constructors a priority, the compiler tweaks the name of this section so that it is something like .init_array.0123, for a constructor with priority 123. It is then the linker's job to sort these into order. The linker script rule that generates the final .init_array section is something like this.

  .init_array     :  { *(SORT_BY_NAME(.init_array.*)) *(.init_array) }

The real script is actually a bit messier than that, but the point is that for init-array sections named with a suffix, we link them in sorted order of their name, so lower-numbered sections will go first. Sections named without a suffix come last.
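Concretely, when the priority survives, the compiler's assembly output should place the pointer to do_init1 in a priority-suffixed section—something like the following (the exact zero-padding and directives vary across compiler versions; compare the suffix-less .section lines in the grep output above).

        .section        .init_array.0101,"aw"
        .align 8
        .quad   do_init1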

To check that the compiler didn't discard the constructor priority, we need to find out what section name is used in the assembly code. That's pretty easy if we can run our tool immediately after the compilation step proper, i.e. the step that inputs preprocessed source and outputs assembly code. Doing that is pretty easy with the wrapper feature of toolsub. The following shell script fragment does the work; dumpconstr is the OCaml tool I mentioned, which gives the “generous” summary.

# Our check works by observing a single cc1 invocation...
# already-preprocessed source goes in, and assembly comes out.
# We cross-check each. We pass the cc1 preprocessed input
# to a CIL tool that dumps the with-priority constructors.
# Then we check that the assembly output contains the expected
# incantations in each case. If it doesn't, we flag a warning.
my_cc1 () {
    # drop the literal "cc1" argument
    shift
    # run the compiler
    "$@" || return $?
    # Now do our check... our caller has told us what infiles and outfile are
    for f in "${infiles[@]}"; do
        case "$f" in
            (*.i|*.ii)
                # use the CIL tool to find out what priorities are at source level
                constrs="$( "$(dirname "$0")"/dumpconstr "$f" )"
                # for each one, check the assembly agrees
                while read func constr prio; do
                    if [[ -n "$prio" ]]; then
                        # the symbol should be referenced in a section
                        # named ".init_array.0*%04d"
                        # where %04d is the priority, zero-padded
                        # (regex reconstructed here: capture any numeric suffix of
                        # an .init_array section whose chunk mentions $func)
                        regex="^\.init_array(\.0*([0-9]+))?.*${func}"
                        section="$( sed 's/^[[:blank:]]*\.section[[:blank:]]*/\f/g' "$outfile" | \
                            tr '\n' '\v' | tr '\f' '\n' | \
                            sed -rn "/$regex/ p" )"
                        actual_prio="$( echo "$section" | sed -nr "/${regex}.*/ {s//\2/;p}" )"
                        if ! [ ${actual_prio:-0} -eq $prio ]; then
                            echo "Error: inconsistent use of constructor attributes on function $func" 1>&2
                            exit 1
                        fi
                    fi
                done <<<"$constrs"
                # we process at most one input file
                break
            ;;
            (*) # skip files that don't look like preprocessed C/C++
            ;;
        esac
    done
}
# declare our hook function to the wrapper logic
CC1WRAP=my_cc1

# delegate to the generic wrapper -- we've set WRAPPER so it won't re-source the funcs
. ${TOOLSUB}/wrapper/bin/wrapper

It works by defining a shell function, here called my_cc1, which is declared to the main wrapper logic via the CC1WRAP variable. The main wrapper calls this whenever the compiler wants to compile one file to assembly code, after preprocessing. The shell function's arguments are literally the command line that the compiler wants to run, and we run exactly that command. But immediately after that, we also run our OCaml tool on the input file (preprocessed C), then scrape the relevant section names from the output file (assembly code), and check the priorities match.

The tool works! I can drop it into the Makefile of the project I'm using it in, by adding

CFLAGS += -no-integrated-cpp -wrapper /path/to/wrapper

and hey presto, my compilation job complains that my priorities are getting dropped, forcing me to fix the bug. I'm providing the compiler a wrapper script, but most of the wrapper logic is generic stuff provided by toolsub; I only had to write the shell function shown above, and the OCaml program I mentioned, to implement my tool.

So, toolsub's core wrapper logic is pretty useful, but there are at least a couple of things I'd like to fix with it. One is that wrappers should not be dealing in terms of the cc1 command, which is a compiler-internal thing. It'd be better if the core wrapper could translate such commands into “documented” interfaces like the compiler's top-level cc command line. It already does this for preprocessing, translating from a cc1 to a cpp command line (er, mostly). For the assembler it's also handled, trivially since the driver runs this directly. It's not yet done for linking, nor for the compiler proper... the shell function above still gets given a cc1 command, albeit assisted by some other logic that has pre-extracted the input and output filenames. That logic itself remains a bit rough.

The other thing I could fix is that arguably, this wrapper logic should not be using shell scripts. They're neither fast nor pleasant. But actually... it's hard to beat them for simple things... I hate how clunky subprocesses quickly become in Python, for example. In the above example, we call out to an OCaml tool to do the analysis of C source. Scraping the assembler file using regex hackery doesn't seem so bad, since the assembler is line-oriented.

My goal with toolsub is to have it become an umbrella project for small utilities, like the wrapper being used here, that help people modify their toolchain's behaviour in ways that are easy to “drop in” to existing builds. Think “prototyping and personal use”, not production; research prototypes are my particular motivation. The scripts are there to help avoid getting bogged down in compiler-specific things, such as compiler internals or details of the command line. Another of the tools in there is cilpp which again uses CIL, this time packaged to enable CIL passes to be invoked as “preprocessor plugins” (using something like -Wp,-fplugin,/path/to/cil/pass.cmxs). Injecting these as arguments is easier than using an alternative driver like cilly.

A final tool I want to mention is cccppp, unfortunately just a proof-of-concept for now, written in the very fun five days I spent at the Recurse Center in April last year. It was motivated by the lack of anything comparable to CIL for C++. I want to prototype tools that instrument C++, but I don't want my prototypes to get caught up in the fast churn of LLVM and Clang internals. So the idea is to provide a simple, fairly stable tool that can “virtualize” C++ code by lifting all the built-in operators, like pointer dereference or array subscripting, into templates that the tool invoker controls. By default, these templates just do the normal thing. But, for example, to instrument array accesses, the invoker can supply a custom definition for the built-in array access operator (as a template specialization); I'm using this to provide bounds checking. The key idea is to work using only source-level rewrites of the C++ code... a relatively stable feature of the Clang tooling API. And it's a source-to-source pass, so as long as the tool can parse your code, you can run any compiler on what comes out. Expressing these source-to-source rewrites using Clang's API is, however, pretty nasty. It still needs more work before it can work on large programs; I'm hoping it won't need loads more.

What's the lesson here? One could still argue that I should just get this problem fixed in the compilers I care about, rather than writing hacky wrapper scripts. But I tend to think the opposite: during development, every systems codebase should be compiled via a hacky wrapper script! It's a good way to add custom sanity checks for properties that your project cares about, even if most others don't. It suits properties that are quick to check, are reasonably syntactic or local to specific code (at least to specific files or functions), justify a “fail fast” error report (e.g. because it's critical to correctness), and therefore are better expressed as a static check than in a test case. Link-time invariants are another thing that liballocs could use some checking of—though sadly I haven't implemented wrapper support for the linker yet. Some other properties, like naming conventions or checking for ABI stability, arguably fit this category, depending on how strict you want to be. Anyway, we'll see if I end up using this facility more often.

After subverting your toolchain, you may also want to subvert your runtime. I'll write more about that in future posts.

[/devel] permanent link contact

Tue, 26 May 2020

Mission and marketing in computer science degrees

At lunch with colleagues some months ago (remember those days?), I provocatively suggested that our course offering would be improved if we eliminated marketing terms from the names and specifications of our taught programmes and modules. Depending on exactly what counts as a marketing term, this might mean doing away with “cybersecurity”, “Internet of things”, “cloud computing”, “big data” (thankfully not currently used here at Kent) and ideally also “artificial intelligence” (once respectable, but no longer). To replace them, there are perfectly good time-honoured and serious phrases for the same areas of study: “computer security”, “distributed computing”, “embedded software development”, “statistical inference” and even “machine learning” (usually what “AI” means nowadays, and while faintly marketing-y, still more descriptive of that).

Although my suggestion was rather tongue-in-cheek, there is an important issue lurking not far behind it. Serious study on any technical topic, even in primarily “applied” areas like CS, is very different from product development. It's not about “learning the latest technologies”. An academic degree should not be a “workplace simulation” or an apprenticeship; it's about developing skills that are more underlying, more general and more timeless. We need to give students the message that university study is about building those skills. We need to give ourselves, as academics, the same message. If we don't, and instead fool ourselves that our responsibility is something about “teaching what will help the students get a job”, then as an institution we're not fulfilling the very function of a research university, and as individual academics we're undermining the long-term justification for our own jobs.

The paradox is that here at the School of Computing, even our own industrial panel is reported to emphasise, consistently, that they want us to teach principles rather than specific technologies. Certainly our degrees do contain a decent amount of principled content, even if there is room for debate on what fraction of our students come away really understanding those principles. Despite that, currently I'd say we don't give the “timeless skills” message to our students very much. But we do give out the “get a job” message all the time. We do this by endorsing marketing-speak in what we offer, but also by other messages we send far more subtly, pervasively and latently, in what we teach and how we teach it. We award good marks to relatively insubstantial final-year projects whose core is to implement CRUD functionality mimicking commercial web sites or apps, provided they show due allegiance to the hallowed engineering methods we teach. We teach “software engineering” by attempting to rote-instil good practice, but backed by relatively little critical or analytical content, and taught at a stage where we are inevitably asking the students to take this wisdom on trust. (That is undergoing overhaul for next year, but it's unclear whether the new version will change this. If not, I will be partly to blame, as a co-author of the module specification.) We abolished as superfluous a module on “people and computing”, when at least its title captures what ought to be the single biggest topic in the whole discipline. We teach very little that conveys the historical, cultural or political aspects of technology. We continue to downgrade our “unpopular” theoretical content below even a basic minimum level (and I'm no theory zealot). And we let our pedagogy be compromised by the unwisdom of student “feedback”, such as, in response to claims that a module was not “up-to-date”, de-emphasising principles in favour of technologies—a move which ultimately leaves students uninformed and confused, since it necessarily blurs the distinction between what is essential and what is accident.

Of course, in all this we are pushed along by external pressures. One message that circulates loud and clear among colleagues, at staff meetings and elsewhere, is that the market matters: bums on seats are our financial bottom line. This follows through: we find ourselves specifying new modules according to our perceptions of marketability. Let me say that again: we're guided not by actual marketability, but by our perceptions of marketability. Maybe my colleagues are more informed than I am, but I don't think many would disagree that we have limited basis for these perceptions. We allow our fuzzy mental caricatures of “the job market”, “industry”, “applicants” and hence “what will sell” to determine what and how we teach.

These fuzzy beliefs are why, for example, I have to respond periodically to the question of “why teach systems, when hardly any of them will get a job writing operating systems?”. Meanwhile, teaching “AI” left, right and centre is not questioned at all. If we really believed in “principles” and degree skills as “long-term” we would be more interested in breadth and balance. Instead we have collectively imbibed the message “AI is big; AI is the future” and regurgitated it into our curriculum development. I've nothing against the longstanding field of artificial intelligence, and I fully expect us to teach from it. But this eagerness to exploit its current trendiness is a symptom of something far more toxic than pragmatic compromise. Putting trend-chasing at the heart of our endeavour, not just on the surface of our marketing materials, undermines the social purpose of a research-intensive university—supposedly an institution for the long-term.

The prevailing line is that this is a financial necessity. That might even be the case. But to me, as a relative newcomer, that seems to be an assumption rather than an established fact. Nominally, the selection of our curriculum still rests with us, the academics, as it should. But the current pressure to be “market-driven” reduces us to headless chickens: base reflexes fill in for a lack of informed thought. Having absorbed a belief that an unapologetically “long-term” degree course is “unmarketable”, we knee-jerkingly propose to do what we perceive to be marketable. Where's the basis for believing any of this?

In my more optimistic moments I believe that for the good of everyone—our students, our academics, and our fortunes as a credible research university—there are things we can and should do to change this. Although I'm not a marketing expert, it seems that what I propose is not about ignoring “the market”, but rather about more seriously “doing marketing”, as distinct from mere publicity. One message we could plausibly try to get across is something like the following. Yes, we will teach you all about hot topics like AI, cybersecurity or whatever else the current trends favour. But that's not all we'll do, and we're offering an education that will last. What we give you is substantial, durable and transferrable.

Trying harder for an approach that emphasises the transferrable over the short-term vocational would be a huge challenge. It is the exact opposite of following the line of least resistance. It would take creativity, skill and effort. Still, it's what we academics mostly believe to be the right thing. I've no reason to believe these efforts would not be forthcoming. What is really stopping us?

The real problem seems to be risk. Proposing any change in this direction brings worried-sounding replies. Trying such a revamp would be a risk—how else would we discover what is marketable than by trying something different? If our intake takes a hit as a result of such an experiment (which would necessarily take several years to run), our political and financial position within the university would suffer. This is why having a university sector that is market-driven and financially straitened is so toxic. Doing the very job your institution was set up for has been turned into a radical feat of “innovation” whose risk is too great to take on. The very purpose of the institution has been redefined as impossible.

Nevertheless, I believe we can do more than we are doing. More cleanly separating our taught programmes' specifications (what we set for ourselves) from their marketing pitches (for applicants) might be one small, simple but possibly significant measure. It would help us make a mental distinction between our mission and our marketing strategy. One way to re-spin that mission might be explicitly to contrast the flawed presumption of “preparing students for industry”; with a more forward-thinking “preparing students to prepare a better industry”. This matters especially in our field because much of the software industry is, regrettably, built on froth and fads. If we want the software industry to be better, we need to turn out students who know how to look for the substance underneath the froth; who have a questioning mindset; who aspire to be more than just hired guns (to borrow the words of Christopher Alexander). That's a tall order, and perhaps many of our students aren't especially inclined to ask these questions... but if universities don't make it clear that substance exists beneath the froth, what hope do we have?

We also need to remember that boom easily turns to bust. “Computer science” is currently attractive to students because software is perceived as a boom industry. In this sense, courting the market seems to work in our favour. But again, this is happening because our applicants think “job” when they should think “career”. We do nothing to correct this. “Cybersecurity? Artificial intelligence? Internet of things? Walk this way.” Our university is collapsing financially and compulsory redundancies are on the horizon in some subjects. But at the School of Computing, complacency rules because “we're doing well” and “this doesn't affect us [much, directly]”. Maybe not—this time. But one only has to go back to around 2008–09, when there was last a threat of compulsory redundancies among computer science academics here at the University of Kent. Something called a “credit crunch” was happening around then. Only a few years previously there had been something called the “dot-com bubble” which burst in 2000–01, seeing many institutions' CS applications take a sudden dip, one which did not recover for about ten years.

These market lurches did nothing to diminish the importance of what we teach. Yet somehow they had a keen effect on the capacity of our institutions to support that teaching. That is a sign of the unquestionably appalling stewardship of our institutions in recent times, and sadly, in the ten years since, the grip of marketisation has only got stronger. But we cannot entirely blame external factors for our choice to emphasise froth, nor can we rely on what for us are “favourable market conditions” remaining so—and this ties back to the same problem. When an industry is based on froth, spooked investors are all it takes to bring the finances crashing down. The same goes for a computer science department and spooked applicants. What goes around comes around. If we believe that what we teach does not suddenly become irrelevant the moment “the market” takes its next lurch, we owe it not just to ourselves, or our colleagues, but to society at large, to doubt the wisdom of “the market” in deciding what passes for an education. Can't we at least try to make our case, rather than chasing what we think the consumers think they want?

[/teaching] permanent link contact

Wed, 11 Mar 2020

Fund institutions, not projects

[This post follows a previous post discussing changes to UK government research funding, which was itself a follow-up to my earlier “Postdoc myths” piece.]

In my last post I finished by mentioning Alan Kay's favoured dictum that we should “fund people not projects”, and that this has reached the attention of Dominic Cummings in his plans to create an ARPA-like research agency for the UK. In fact the dictum is itself borrowed from J.C.R. Licklider, an early and influential divisional head at ARPA, widely credited as a progenitor of the Internet. I also noted that the point of this dictum is easily misunderstood. Here I'll discuss how this is so, and what I think is a better way to capture its intention in the case of government-funded university-based research. In short: “fund institutions, not projects”.

Consider this well-meaning article which claims to be advocating such an approach (using the dictum in the title of the piece!). It totally misunderstands the idea. It seems to think it's about the question of which ‘faculty members’ should receive the ‘grant money’. Not coincidentally, the post's ideas are feeble—stuck in the paradigm of seeking relatively centralised ways to assess the merits of individuals. Tweaking these criteria is not what Kay or Licklider were talking about. Rather, they were critiquing the very notion of project grants, and consequently the very idea of nominated “leaders” following pre-approved programmes of work directing a “team” underneath. “Funding people” does not mean “funding the empire of professor X”, via grants naming X as PI! Even the article's mooted “fund everybody” assumes a fixed prior notion of who is eligible—“faculty” in American lingo. This inherently fails to address the postdoc issues I discussed in my previous posts. The very notion of “postdoc on a project” is antithetical to Kay's (or Lick's) suggestion. For them it is simply the wrong basis on which to pay people to do research (and remember that a postdoc is by definition an employed, qualified researcher—not a trainee).

My re-statement of the idea, focused on universities rather than (as Kay tends to) industrial labs, is that we should fund institutions, not projects. In other words, devolve the decision-making: universities can hire people “on merit” as they usually aspire to doing, but without a preordained project in mind. This answers a common rejoinder, of “who decides who ‘gets funded’?”. The short answer is: institutions do. They are used to making such decisions: how do you decide which postdoc to hire, or which lecturer [a.k.a. Assistant Professor]? Even in the postdoc case, we like to think that research merit is a major factor. So our answer remains mostly the same, but project-specific criteria are explicitly removed from hiring, and project-specific direction is explicitly removed from the job that even a relatively junior (but post-PhD) person is hired to do. “Fit to the institution” is still a valid criterion of course. Let the institutions attract people who want to make a career there. If the ongoing projects are any good, they'll contribute to them; otherwise, or additionally, they'll come up with their own, and more generally contribute to the “problem-finding”, whose importance Kay also often speaks of. Problem-finding is ruled out if you employ people on preordained problems.

This brings me to my next point: it is far better to spread funds out among institutions, and devolve selection, than to run relatively centralised selection exercises like fellowship schemes. The “fund people” line often encounters an attempted rejoinder, amounting to “fellowships exist”. Some people ask “isn't that funding people? And we already ‘do it’, so maybe we just need to publicise fellowships more?”. That is privilege talking. Of course, research council-funded fellowships do exist, and yes, they are “funding people”. But they are the exception not the norm, and are set up to be so. They are the “prestige case”, and are highly competitive. (And they are, anyway, awarded on the basis of a project proposal!) The vast majority of money paying for the employment of early-career researchers is not funding them on a fellowship basis; it's on someone else's grant, meaning a project someone else proposed. The extreme competition for fellowships—a phenomenon caused by policy, not nature, as I covered in previous posts—means only fellowship applications that are “fully baked” (to borrow the words of Martin Sadler from the aforementioned NCSC RIs' conference) have a chance of being funded. Only those applicants who have received substantial patronage and/or prior funding are likely to have the resources to produce a fellowship proposal that both is and appears fully baked, and get it through the narrow review funnel. The effect is inherently conservative, and again antithetical to the idea that “funding people” is how research at large is carried out.

(More generally, people are often oblivious to their privilege. The people who speak most loudly in favour of fellowships tend to be the people who've received them. That's good for them, and very often these people are great at what they do. But as is often the case with privilege, many are slow to recognise how structural factors have acted in their favour. Sadler's point was that inevitably, polish and patronage become decisive elements in many cases. The way the money is split conspires to ensure that however good the pool of eligible researchers, only a slim fraction will be funded in this manner.)

A slightly more subtle phenomenon is that under a system of funding institutions, many more people will “get funded” in their own right since it inherently involves spreading the money out more widely, building a much wider and flatter structure rather than a “fat pyramid”. (That is rather assuming institutions don't find new, internal ways to subjugate people to projects; but I don't believe our universities have yet become so unenlightened that they would do so.) The goal is not to fund “a team under lead researcher X”; it's to fund more potentially-lead researchers and fewer subordinate ones. I say “potential” because the choice of whether to lead or become a non-leading collaborative partner rests with the researcher.

Fellowships' extreme selection practices, like long proposals, postal review and panels, are far less useful in such a context. Similarly, once institutions are free to hire people as they usually do, by job application—but with more such jobs!—we eliminate a certain fraction of the (hugely effortful) grant applications made by academics, since more work will be achievable with the institution's (increased) block funding. There is nothing infeasible about this; it is exactly the way UK university research funding worked until the 1970s. The total number of research-active roles may well work out about the same; that's an orthogonal issue, in that supposing we hold the budget fixed, the pay distribution could stay exactly the same or could change, as could the salary distribution. Even if the staffing level goes down (i.e. average pay goes up!), I'm confident that the effective research capacity would be much greater, since any shrinkage would be offset by eliminated costs: grant application effort, but also the wastage induced by postdoc-style person/project “compromises”, projectwise fragmentation, personnel churn and personal upheaval (“move to another city”) that I've written about previously.

Note also that funding people and institutions in this way does not mean “make everybody permanent”. That misunderstanding arises from the same myth I wrote about earlier: the opposite of “postdoc” really is not “permanent position”. It's potentially fine for early-career research appointments to be fixed-term—if the term is long enough and if the process for renewal or progression is sufficiently lightweight (i.e. definitely not “9 months' funding left; start applying for fellowships!”). Five years seems a sensible minimum for undertaking serious work while living an episode of one's life... and not coincidentally, is what established early-career researchers used to be offered in Cambridge. Going further, in fact, there is an argument that late-career appointments in research roles should also remain conditional on actually being research-productive. An oft-noted flexibility in the current system is that institutions can move academics “sideways”, into teaching and/or admin, when they're no longer research-productive. Increasing institution-centric funding would not diminish that option; it can only increase it, since greater funds would be pooled at institution level.

One more objection that might arise is: are institutions wise enough to spend this money well? My answer is “yes, for now” and again it's because the decision-making is inevitably devolved from the centre. Although many of our universities are run appallingly badly by central administration, at the departmental level academic merit often is still recognised and does still count for something. Of course our means of assessing this are not perfect, and I get frustrated when colleagues resort to “counting papers” rather than weighing contributions. Patronage is sometimes a factor too. But at least in my limited experience, most colleagues still look for mostly the right things.

Finally, it's interesting that Cummings takes inspiration from high-profile “breakthroughs” such as the moon landings, the Internet, and no doubt other things like human genomics. I'd like to sound a note of scepticism that much of the research we really want is going to take this form. In an age of technological plenty, it is wrong to assume that what we “should” work on, in the sense of research that will improve people's lives, takes the form of identifiable “breakthroughs”—and certainly not ones in “new areas” pre-selected by government, whether they be quantum computing, “AI”, or the next fixation of government technocrats. The disconnect between apparent technical progress and improving ordinary people's lives has long been present. (On the subject of moon landings, Gil Scott-Heron's “Whitey on the Moon” comes to mind.) But this seems destined to become even more pronounced. While true life-improving “breakthroughs” do seem more plausible in the biomedical technologies, I still have an overarching feeling of scepticism—perhaps traceable to Ivan Illich's critique of late-C.20th medicine as primarily enabling survival in an unhealthy society. In general we can learn much from the writings of Illich, E.F. Schumacher and others who have questioned the axioms of “development” and its economics. I'm not a trained philosopher or economist, so if you know other work either in the spirit of these, or critiquing them, I'd love to hear your recommendations. In my actual area of training, I've already been developing my case that advances in software are not clearly helping humanity... but I'll save that topic for another time.

[/highered] permanent link contact

Wed, 26 Feb 2020

Postdoc follow-ups

[I'm on strike again at the moment, just as when I wrote my last higher-ed piece, to which this is a follow-up.]

My last higher-ed piece, about postdoc myths, was read rather more widely than I expected. (Thanks for reading!) That has left me with a few things to clear up, and a few follow-up thoughts which I didn't get on to last time.

Firstly, let me qualify: my take on postdoccing is more than a little UK-centric, and certainly doesn't generalise in all possible directions. However, I do believe it generalises to many places outside the UK, perhaps in non-obvious ways. The most contentious question of generality (at least in the Hacker News discussion) was whether postdocs “formally exist”. I gathered that many US institutions offer roles like “Postdoctoral Scholar”, for example. But my point was more about how the regulations of institutions and of funders haven't adapted. Job titles are at best a weak indicator of this, and to see jobs advertised as “postdoctoral X” is not enough to infer that there is any recognised status of “postdoc” in the institution or the wider academy, beyond “paid lackey”. Even in the UK, we see jobs advertised, including at the University of Cambridge, with titles like “Postdoctoral Research Associate”. That doesn't mean the institution has any official position of “postdoctoral” anything; it doesn't. The word is simply added for illustration; it is formally meaningless. Such employees' academic standing has been more accurately summarised as “people who do not exist” (to borrow a phrase from Anthony Edwards's remarks on the history of such positions at Cambridge). The high-level point is that institutions' and funders' processes are not designed around present career structures—where one might spend an unbounded number of years as a qualified researcher of considerable independent potential but not holding a “full” “academic” “position”, however that might be recognisable locally. Advertised job titles are not a good guide to reality.

For the same reason, it's wrong to suppose what's happening is “higher supply leading to lower price”. I've been talking about a degradation of the offering—early-career research jobs being offered on shorter contracts with fewer rights and less institutional status—and it's appealing to suppose this degradation is the result of “universities extracting more value” from the labour pool. But that is factually wrong. Neither pay nor status is re-negotiated on the basis of changing supply. Pay scales are hard to change; university regulations are even harder. To redefine positions at lower pay or lower status is a political act; someone has to pull the trigger on it. That isn't what has happened. Equally, in those cases where we would expect upward pressure we also don't see upward changes: universities and academics often find it difficult to hire postdocs with certain skills they want, but that rarely creates any action to improve pay and status (beyond a regulation-limited amount of salary-bumping), because the relevant political change is mostly beyond the means of the academics who are hiring. A key example is that many institutions face a chronic difficulty in hiring research software engineers. As far as I know, this hasn't driven many universities to reform their regulations. Instead, they have shown a boundless capacity simply to limp along with the problem uncorrected. For the same reason, there's no reason to believe downward pressure actually has much effect in cases of oversupply.

So if it is not a case of rational decision-making by universities in the face of increased supply, what is causing the body of underpaid under-statused researchers to get larger? In the UK at least, the answer is simple: it's the government, stupid. What we've seen is that the relative occupancy of pre-existing pay and status levels has been changing. That change arises not from the dynamic between universities and the labour market, but from that between universities and government. It's not supply and demand; it's poorly chosen public policy, formulated by ministers and civil servants who (as far as I can tell) don't understand research. What does change far more easily than pay-scales and regulations is budgets—what government controls. Hence the degradation is arising indirectly, not via the labour-market mechanism but by external changes to the distribution of money between streams, and hence of people among the distinct scales and regulations that those pots feed. In short: for a given level of spending, we are funding relatively more postdocs and relatively fewer “full” academic staff. Note, as I argued last time, it would be wrong to equate the latter with “permanent” positions (or even with “teaching” positions). Note also, as I'll return to, the problem is emphatically not one of “not enough money”.

Career-wise, what were once stopgap arrangements—“spend a couple of years on this contract before a ‘proper’ academic role comes around”—have, creepingly, become the norm. Longstanding regulations and arrangements for “contract research staff” are applied increasingly far beyond their originally conceived uses, to an ever-larger and more ill-fitting body of staff, and over longer durations for each individual. But from the universities' point of view this is a case of boiling frogs, not rational agents. Meanwhile, I don't believe government is doing this deliberately; they're just asleep at the wheel. In fact, they don't realise they have the wheel. The (limited) evidence I've seen, such as government's response to the damning 2002 report of the House of Commons Science & Technology Committee and more recently then-minister Chris Skidmore's confused remarks on the subject (Hansard; tweet with video), is that government imagines it has no role in this, and it's all the universities' doing. But universities' hands are largely tied by how money is delivered. Of course, top-end institutions including Cambridge are culpable for their complacency in failing to challenge government.

Those two streams are “core funding” and “project funding”, which together make up what in the UK is known as the “dual support” system. I have a draft of a working paper on this subject, which I wrote as an assignment for a module in my PGCHE. I am hoping to expand it into something publishable; comments are very welcome, but be aware it is very much a draft at present. It is, necessarily, very UK-specific. It argues that the changes have come about indirectly, as unintended consequences of well-intentioned (but misguided) policies going back at least as far as 1981 and the attempt to “protect science” from wider public spending cuts. Later changes, to do with funding capital costs (“sustainability”) and with fairness and transparency of funding allocation (“selectivity”), have exacerbated the problem. The foul icing on the horrid cake is a lurking confounding variable—how much core funding is de facto being used to match-fund project grants that are under-costed.

This latter effect is subtle, and is the aspect most in need of further research. Although the headline data is clear that the block/project split has flipped from 60:40 to 40:60 between 1978 and 2018, the reality is almost certainly more drastic than that, because more of the block grant is used as “match” or “top-up” support for the increasing volume of projects that are funded at below full economic cost. My lone data point so far (detailed in the draft article) is that in Cambridge, nearly all of the block research funding is being spent on subsidising project funding, i.e. on allowing it to continue being costed below the full economic rate. That's something my future research must dig into, along with a cohort-tracking study of the pre-92 universities to separate out the effects of debinarification in the early 1990s. To make clear statements about career progression, it'll also be necessary to make corrections for rank inflation: early indications are that it's now easier to get to [full] Professor, but no easier to get to lecturer [a.k.a. Assistant Professor], with consequences for how spending is distributed. Figuring out how much this generalises beyond Cambridge is another goal; my article does include some study of Kent, but so far it's less conclusive. If anyone knows another UK pre-92 university that publishes (or makes available to researchers) good-quality data about its staffing and spending over the past decades, please let me know.

The final thing to remember is that real-terms government spending on research has gone up considerably. Therefore, it's doubly unforgivable that career structures are in such a mess. When people like Sir Leszek Borysiewicz say “we don't have money to create better positions”, they're either ignorant or lying. The scarcity is entirely artificial, created by how the increased spending has gone disproportionately on project funding. This is both directly harmful (projects in themselves are a poor basis both for research outcomes and for careers), and indirectly harmful (projects, being under-costed, soak up additional block funding).

To sound a note of optimism, there are multiple ongoing shake-ups of UK government research funding. One is the mooted creation of an ARPA-like agency. Another is the “rebalancing to the regions”, which suggests a brake on various institutionwise preferential attachment effects (discussed in my previous post) that have harmed career structures under the project-dominated funding regime. Both of these shake-ups are being driven by Dominic Cummings—a dislikeable figure to put it mildly, but one whose influence may yet do good in this space. At the recent Research Institutes' Conference organised by the National Cyber Security Centre, the panel session involved three videos, one of which featured Cummings quoting Alan Kay's dictum that we should “fund people not projects”. I think Kay is exactly right, but it's interesting how often his words are misunderstood, and it's unclear whether Cummings has understood them. In a later post I'll continue this discussion with some notes on how this can go wrong.

[/highered] permanent link contact

Mon, 02 Dec 2019

Postdoc myths

[I'm on strike at the moment, largely in solidarity with my more precariously employed colleagues, whether hourly-paid or fixed-term or never-endingly “at risk of redundancy”. So it seemed a good time finally to finish and publish this post. I wrote most of it during the final couple of years of my seven as a postdoc, which ended in 2018.]

Lots of things are said, written and believed about postdoctoral researchers that are simply not true. This matters because real policies, initiatives, attitudes and actions are shaped by what people believe—true or otherwise. In this post, I'll tackle a depressingly long list of such myths. (I'm trying to keep this post snappy, but the flip side is that I have left out examples in many cases. For some, I also have hard data. So let me know if you'd like more specifics on anything.)

Myth: postdocs formally exist. In almost all universities I know, formally there is no such thing as a postdoc. In research councils' view of the world, it's the same: there are no postdocs, only “Research Assistants” and “academic staff (lecturer or equivalent)”. This matters because when practice on the ground no longer matches the ontologies on paper, systems become prone to poor outcomes and to abuse.

Myth: postdocs are homogeneous. Generalisations and stereotypes abound, both in writing about postdocs and in commonly held beliefs. This is unfortunate because postdocs are a highly heterogeneous bunch. Lumping them all together encourages wrong stereotypes. When these stereotypes hold sway with funders, institutions and departments, misguided policies result.

Myth: postdocs are all aspiring academics (lecturers). Clearly, some are. But there are many skill sets required in a healthy research environment. If you agree with “required”, then it follows that there should be a career path for all of them. Although there should be, currently there isn't. Once upon a time, the University of Cambridge did embrace this idea and had a variety of offices which had real status within the university, with titles including Senior Technical Officer and Computer Officer, as well as the research-oriented Senior Assistant in Research and Assistant Director of Research. These practices are mostly forgotten, and these posts replaced with lower-status unestablished positions: on the academic side, “Research Associate” is increasingly a catch-all, while on the other, technical officers are far fewer and computer officers are no longer the “peers of academics” that they once were.

Myth: postdoctoral work is “study” or “training”. It isn't; it's actually doing the actual research. I had to grit my teeth when applying for some funding that mentioned “postdoctoral students” in its particulars. Meanwhile, at a publication venue I plan to submit to, conflict-of-interest rules mentioning “advisor/advisee” seem to think there is a “postdoc” version of that. There isn't. At any career stage, we have people we turn to for advice, and people we work with. But by definition, someone with a PhD is a qualified researcher, not a student.

Myth: postdocs are “intermediate” between graduate students and more senior positions like research fellows and academics. The phrase “postdocs and PhD students” abounds. But in a university with many postdocs, the population of Research Associates is on average older and has more research experience than the holders of many flavours of early-career research fellowship. That's not surprising when the latter positions come with time limits (e.g. years since PhD) whereas the former don't. People can be postdoccing well into their thirties, forties and sometimes beyond. The “overgrown graduate students” caricature is wrong, disrespectful and leads to wrong-headed policies. (For a game of bingo, try this New York Times article from a few years ago.) According to University of Cambridge data current on 30th November 2017, of university-payrolled Research Associates and similar, over 40% had more than three years' service in the role, and around 10% of the total had over ten years of service. These numbers are underestimates of post-PhD research experience because they exclude postdoctoral experience at other institutions, and because the “and similar” positions include some of the aforementioned research fellowships which lower the average.

Myth: postdocs are on a journey to “research independence” (but are not there yet). This line is popular with funders, who don't seem to realise that their cause and effect are backwards. “Independence” is in practice a statement of one's funding status, not one's stage of personal development. As the mix of funding, in the UK and elsewhere, has moved further and further in favour of project-based grants, and away from institutional funding, hey presto! We have fewer “independent” researchers—on paper, but not in reality. In the UK, suppressing “independent” status is also a useful tactic for gaming the REF, as long as postdocs always co-author with their PIs. (If they don't, their publications are curiously lost into the REF-ether.) Again, the “paper ontologies” are a poor caricature of reality.

Myth: the opposite of “postdoc position” is “permanent position”. This comes up time and time again, but is completely false, at least in the UK. In all institutions I know of, academics (i.e. “lecturers or equivalent”, to borrow an RCUK phrase) may be appointed on limited tenure. They remain first-class citizens for the purposes I've been describing. Yet the justification for depriving postdocs of any given right or privilege is usually “they're not permanent” (a line often pulled out on-the-hoof, rather than reflecting any real rule). In fact, many postdocs are permanent, legally speaking, thanks to the 1999 EU Directive on Fixed-Term Work. Even those who aren't have a legal right not to be treated less favourably. Sneaky tricks skirting or infringing the edges of these laws are among the many ruses used by universities to keep their research staff dangling.

Myth: postdocs are itinerant, unlikely to be at the University in a few years' time, and/or are otherwise “not committed” to the university. To the extent that this is true, it is circular: funders' policies, and institutions' interpretations of them, base themselves on the assumption that postdocs will move on, and proceed to help make that assumption true. In Cambridge earlier this year, a certain fly-sheet had the temerity to claim that Research Associates did not deserve representation because they had not shown “commitment” to the institution (and that the university was not at all complicit in the underlying funding changes that had precipitated the growth in postdoc numbers; no, certainly not). An academic need not be “committed” to an institution beyond their contractual notice period. But a postdoc who spends years at an institution that can only offer them dribs and drabs of extension is showing a very strong commitment to that institution indeed.

Myth: postdocs are provided for by their PIs, so do not need representation, recognition or autonomy. There is a widespread strange belief that a postdoc's PI will “speak for them” and reliably look out for their interests. Again, this came up in certain governance debates in Cambridge. It is obviously false; a PI is only ever a partial ally, and can just as easily be an adversary. Yet these debates threw up bogus arguments hilariously reminiscent of those opposed to female suffrage—exclaiming in outrage, “they will just vote the same way as their husband!” and in another breath, equally outraged, “they might vote a different way than their husband!”. (Yes, this was a real argument of the day.)

Myth: increase in postdoc numbers is somehow a “natural” phenomenon. It's easy to encounter the belief that some sort of bumper PhD harvest, no doubt caused by a mixture of climate change and modern agriculture, has led to there being “too many people chasing too few positions”, and that is why so many people are employed on exploitative terms. This is an appealing folk theory, but it simply does not explain what is happening. Positions are not created by nature; they are created by money, spent according to policies. Suppose there are many qualified candidates competing for a fixed number of jobs. Of course, the more candidates there are, the more competition for the jobs. But it doesn't follow that the jobs' terms will become increasingly exploitative, such as being offered on shorter terms, with less pay, lower status and fewer benefits. That requires a separate decision to degrade the offering. That is exactly what's happened in early-career research, by turning the knob in favour of creating only these lesser positions. Why so? It's the same story as in the wider economy these past decades: maximising the exploitable workforce, concentrating capital (research funds) among relatively few incumbents. Anyone who tries to explain it purely as supply and demand, or even “nature”, is either ignorant or is trying to sell you something. (Perhaps they're Liz Elvidge, “Head of Postdoc Development” at Imperial College London, who has a book to sell? I haven't read it, but based on her NPM 2017 performance, I assume it's peddling nonsense like this.)

Myth: postdocs are pro-postdoc. Nobody in their right mind actually wants to be a postdoc per se, with all that that implies. People mostly become postdocs because they want to do research. If there were fewer postdoc positions but overall a better path for research careers, few current postdocs would oppose it.

Myth: “nothing can be done”, or “there isn't enough money”. This is the academic version of Theresa May's “magic money tree” line, and is equally nonsensical. The issue here is not the amount of money, but how the money is spent. Policy knobs are very obviously available, but are being turned only in the wrong directions. This is a failure at the top, since that's where the knobs are. All this is outwith the control of research councils, who (despite their many gaffes) just allocate the budget they're given in the ways they know how. The blame lies with central government. In 2002, the House of Commons Science & Technology Committee produced an astonishing report which is entirely damning of the status quo and skewers the problems of short-term research positions. Government's response was a case of fingers-in-ears. Sadly the report dates itself by its naive optimism that the EU Directives I mentioned above would help; we now know that they can be bypassed. In the 17 years since, we've had no action, beyond creation of another worthless pile of paper.

Myth: postdocs just want to stay in the same job or city forever, but that's clearly unsustainable. It's particularly easy to encounter this belief in Cambridge. But the number of postdocs in Cambridge is a function of money, not of wishes. What's really unsustainable is piling all the research money into a small number of ever-fatter institutions, on terms that permit only junior and short-term appointments. These institutions gain a large workforce skewed towards the relatively young and exploitable. Later these workers face a vexed choice: either be spat out to make room for the next lot of eager young things, or (if you're lucky) project-hop to another exploitative job in the same city or institution. In contrast with typical PhD-age life stages, postdocs are generally old enough to have put down roots, or to want to. Special as Cambridge is, it is nonsense to credit it with what is primarily a desire for stability in one's personal life. Funnelling the bulk of research money to a select few institutions, and primarily on a project basis, is the underlying mistake.

Myth: institutions are doing what they can to support postdocs. In fact the big institutions are heavily invested in suppressing postdocs' career development. Our “leading” universities are the prime foot-draggers and non-movers in this game, and it's not hard to see why: their senior academics profit from cheap highly-skilled labour serving their research empires. Government will only change tack if academics speak to it, but those with the controlling voice have a vested interest. Of course, these already-established research agendas are not necessarily the ones most deserving of support. And even worse, project-based funding bakes in huge inefficiencies which harm outcomes.

Myth: increase in postdoc numbers is essential to creating an agile, global workforce of the future. This sort of neoliberal nonsense is popular among administrators who buy into the usual wrong assumptions of elasticity and fungibility of people—in short, treating people like a commodity. But on the ground, it's clear that this is a poor model of how research works. Thinly sliced short-term project-based funding not only creates poor-quality jobs, making for unhappy people, but also gives poor research outcomes. Despite the (mythical) “bumper PhD harvest”, we academics tend to bemoan how hard it is to find “a good postdoc” to work on our Highly Specific Project X, starting at Highly Specific Start Date D. With those constraints, that's hardly surprising. So begin the compromises. Many postdoc appointments are major compromises on both sides. Sometimes it even works out. But the failure modes are many: people don't fit the project or the PI; they become unhappy; they jump ship or jump career. Then someone new gets hired on an even shorter contract! No doubt the leaving postdoc also spent a good chunk of their work time applying for other stuff. As a result of this churn, much time is lost and much research goes unfinished or unwritten-up; this is the “academic backlog” mentioned at length by Dorothy Bishop here and in this talk. Many small grants also push absurdly detailed administrative work onto PIs. All in all, it's an insane way to spend our research budgets.

Given all these problems, what should we be doing? I believe we need a substantial transfer back to core funding and away from research councils. In the UK, a little-known story of the last 40 years has been a series of transfers from core funding to project grants. Where forty years ago it was 60:40 in core funding's favour, now the balance is reversed, and the “match funding” demands of 80% FEC make the effective core funding level far lower. My investigations suggest that this has not been driven by policy (with one notable exception), so much as it has occurred through accidental drift (over roughly three further distinct periods). To fully reverse this, and provide true core funding, we must eliminate the de facto use of core funds as match funding for projects. Projects must be costed at 100% FEC, even if that reduces the fundable volume. In any case, that fundable volume should be reduced! The balance must be made up by increased core funds that institutions can use to hire and retain researchers on better terms, not simply to top up their projects. I'll write more about these issues in a future post.
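(To make the match-funding arithmetic concrete, here is a rough back-of-the-envelope sketch, written as a few lines of Python. It assumes the headline 40:60 core:project split and a flat 80% FEC recovery rate on projects, both mentioned above; the resulting “roughly 25:75” figure is purely illustrative, not a measured number.)

  # Illustrative sketch only: headline core:project split, with an assumed flat 80% FEC rate
  core, project = 40, 60
  fec_rate = 0.8
  full_cost = project / fec_rate     # the projects actually cost 75 units to deliver
  match = full_cost - project        # so 15 units of "match" must be found from core funds
  free_core = core - match           # leaving only 25 units of genuinely unencumbered core funding
  print(f"effective split: roughly {free_core:.0f}:{full_cost:.0f}")   # ~25:75, not the headline 40:60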

I fear another “accident” is brewing. Among UK academics disgruntled by REF, it's increasingly popular to say that we should just allocate the whole budget to research councils. No matter how flawed the REF, wholesale transfer to research councils would be a terrible mistake. REF has problems, but the dominance of project grants creates far more problems. It is the over-use of projects, with their thin slicing, and not limited overall spending, that has weakened career structures. In any case, academic time spent writing and reviewing low-success-rate grant proposals dwarfs that spent on REF. The only way to provide sensible careers is an institution-oriented and actively redistributive model making proper use of core funding. It does follow that REF or its successor must leave behind its current rhetoric on “excellence” and emphasis on scores (metrics), since these also exacerbate preferential attachment. Instead it must actively spread out core research funds according to a broad assessment of an institution's capacity for quality research, including for potential and planned growth. Research funding should be a question of steady nurture and wise investment, not a laissez-faire market for powerful rentiers to battle over. The UK is fortunate, in that there is (for now) no shortage of research talent wanting to work here. It will only flourish if we can provide not just jobs, but careers.

[/highered] permanent link contact

Fri, 23 Aug 2019

Research travel, climate change, and why we must educate our institutions

Like many academics, I travel quite a lot. I'd rather do so in a way that is environmentally sustainable. Most conferences I travel to are sponsored by ACM SIGPLAN, which in recent years has developed an ad-hoc committee on climate issues.

It's great that SIGPLAN has taken this issue seriously. I do agree that carbon offsetting, SIGPLAN's current recommended action, can be worth doing. The following is my greatly simplified summary of the committee's argument for recommending this action. (For the actual argument, which is made far more carefully, see the preliminary report.)

  1. Research benefits from high-quality communication, which entails travel, e.g. to conferences. [This is clearly true, for now at least.]
  2. Travel “very often involves flying”. [My quibble in this post: let's not assume this is a variable over which we have no control. In practice, many of us have some choice at least some of the time.]
  3. Flying is dependent on burning fossil fuels, therefore on emitting carbon.
  4. Therefore, as a short-term action it is feasible only to offset carbon emissions, not to reduce them.

To me this comes across as a North American perspective, and overlooks the transport situation in Europe. Within Europe it is largely possible to avoid flying. Here are some conference trips I've made by train (and sometimes ferry) in the past few years. In each case I was starting in south-east England, either from Cambridge or Canterbury. For clarity: many of these were not SIGPLAN-sponsored events, but I don't think that detracts from my point.

Hopefully the above shows that a lot of travel within Europe can be done without flying. So what stops people? My best guesses are the following.

The first step to changing this is to recognise that we live in a “flying first” culture. The power of unfamiliarity and inertia, both individual and institutional, is considerable.

The second step is to challenge it. This means asking for our institutions' support, and questioning policies that appear unsupportive. My previous institution's travel guidance sanctions Eurostar trains only for journeys “to France”, apparently not aware that they also run direct to Brussels. No doubt minds would be blown if I were to describe how by changing trains, you can reach even more countries. I wrote to the page's maintainers in March 2018, but despite a friendly-seeming response, there has been no change as of August 2019. Perhaps I'm optimistic, but I believe that institutions will learn we keep telling them. Flying in Europe mostly isn't necessary. The last time I chose to fly within Europe was in September 2012. Since then, I've done it exactly five further times (one way), Aside from the case of industrial action noted above, these have been where the choice was effectively made by some institution or other: if I'd taken the train I likely wouldn't have been reimbursed.

The third step is to accept taking a short-term hit in time and money. Even if our time overheads are somewhat increased in real terms, we should see sustainable choices as an intrinsic good, and choose them to the extent we personally can (which will vary; it's easier if you have your own travel budget). Not only do academics have a responsibility to society; we are also unusual in the extent to which we can “afford” additional travel time, and perhaps even benefit from it—given extra time spent on trains or ferries, we might even get more work done! I admit that the above record of relatively little flying has only been achievable because I've sometimes ponied up some extra cash myself—but this shouldn't be necessary.

A fourth step is more political, but is necessary to make that hit “short term”: demand progress! Solidarity around sustainable transport is sorely needed. Even in Europe, loss of political favour has been causing the offering to go backwards. Cross-border train and ferry services are overall becoming more sparse—especially night trains, as I noted above. Sometimes, faster daytime trains are a suitable replacement, but often this is not the case. “Voting with your feet” is the most basic and important form of this, but other things, like supporting campaign groups or writing to your representatives, can make a difference. It's the sort of thing most of us could do more of—myself definitely included. The reasons for the cutbacks I mentioned are complex, but are in large part regulatory and therefore political. (Some useful articles about the DB cutbacks are here and here. Despite this, there are reasons to be cheerful: see here and here, where there is a very nice map.)

Coming back to SIGPLAN: I think there is room to develop practices that encourage SIGPLAN members to use more sustainable travel options where they exist. A positive framing is necessary, which will take some careful thought. I believe a financial incentive is also necessary—but straight-up differentiated registration fees might be hard to administer, and perhaps unpopular.

As one possible positive framing, perhaps SIGPLAN could develop a pool of funds, analogous to PAC, but used to pay top-up contributions for those who choose to travel sustainably and cannot otherwise fund the difference. One way this might work is that after a conference, those who submit evidence that their travel was sustainable would earn a small rebate on their registration fee. This would be offered partly as a token redeemable against future registrations (benefitting the institution paying for the travel) and partly as a deposit into the pooled contribution pot I mentioned. For the latter, the individual or institution receives no money but gets some kind of “points certificate” (to help gamify things). I note the mooted ACM-directed carbon pricing recently proposed in a SIGPLAN blog post, which would also generate some pooled funds. I'm not yet sure whether it should be the same pool; perhaps not.

In October I will be travelling to SPLASH in Athens. From the UK this is one of the more challenging European destinations for surface travel. But it is certainly feasible and I'll certainly be doing it! The core of the journey is an overnight train from Paris to Milan, a train to Bari or perhaps Brindisi, and a ferry (again overnight) from there to Patras. Let me know if you're interested in joining me or learning more.

[/research] permanent link contact

Wed, 15 May 2019

Research careers in UK Universities: questions few are asking, part one

(Those who follow me on Twitter may have a sense of déjà vu about this, but I thought it worth elevating to blog level. I wrote most of it over a year ago... must get better at timely blogging.)

Back in September 2017 I attended the UK's inaugural National Postdoc Meeting organised by the Postdocs of Cambridge (PdOC) Society. We were fortunate to receive a flying visit from Borys, a.k.a. Professor Leszek Borysiewicz, at that time the Vice-Chancellor of the University of Cambridge. This was fortunate in that it is always illuminating to hear the thoughts of those in a position of academic leadership. It was also unfortunate, in that what he had to say showed a distinct lack of leadership. Alarm bells sounded when he piped up from the floor during the questions in the previous session, which was about career development opportunities. His contribution struck me as astonishingly ill-considered; I would later tweet-summarise his most clueless moments, of which the first two were during this question session. Freed from character limits, here the wording is elaborated slightly.

  1. “Postdocs shouldn't be allowed to hold grants because then the institution must commit to employ them for the duration; that's too much commitment.” (tweet)
  2. “This cannot be solved by making grants more portable. This would do harm by removing incentives from smaller Universities—whose grants could be poached by competitors.” (tweet)

What was most alarming was the self-assurance with which he conveyed these completely bogus arguments. He continued the theme during his own slot, with another clanger.

  3. “We can't avoid putting research staff on short-term contracts, because we lack the funds to create more academic positions.” (tweet)

It's difficult to overstate how wrong all this is, and doubly so as an outlook for a supposed “leader”. My responses, again more-or-less as tweeted, were as follows.

To (1) (tweet): why wouldn't institutions be prepared to commit to that employment, if the grant-winning postdoc brings in their own salary, overheads, and maybe others' salaries too? (I'm aware of the alleged <100% FEC issue; is that the reason?)

To (2) (tweet): this naturally equilibrates: grant success makes an institution more attractive; some grant-winners do stay, so expect a net gain in research power. The institutional incentive is still very much there.

Point (3) is (tweet) the classic canard. “We can't create more academic jobs” is neither here nor there. Improving researchers' lot is not about funding more positions; it's about how a fixed research budget is spent. Current research politics say, implicitly, that we should choose to employ many people, cheaply and precariously. This maximises the raw number of research staff, but keeps turnover high and skews towards the young and inexperienced. Is this really optimal? It seems hard to believe. But it is quite well explained by the incentives faced by established PIs.

What can we do about all this? The first step is clearly to challenge these bogus arguments. Sadly, the world of higher education is full of oft-repeated lines that talk about policy as if it were a law of nature. The “FEC issue” my tweet alluded to is one such line: that “universities lose money on research, since grants bring in less than the full economic cost”. Although of course grants do not always pay full FEC as costed, it is a nonsense to say we “lose money”. Universities are funded from the public purse precisely in order to host research activities (among others). So the worst that can be said is that each research undertaking consumes some of that block funding, and must be fairly accounted internally. To say “consuming funding” is fine, but to frame it as “losing money” is to overlook the very mission of a university. There is some argument that the FEC issue is ultimately a construction of accounting; one that (I cynically surmise) is convenient to policymakers and administrators because it keeps academics subordinate.

(I have witnessed FEC arguments deployed by administrations to create the impression “you should be grateful” or “we're doing you a favour”, then used as a devious pretext for cutting the support offered to researchers—increasing the pressure to raise even more funds. That specific issue was around the costs of “training and personal development” which, it had been argued, should be costed on grants now that ring-fenced Roberts money for training courses and the like was no longer available. Of course, such courses had been offered before Roberts money existed, and in any case would merit consumption of general funds since they obviously support the mission of the University. Even worse, creating pressure to raise external funds for such things is hiding a giant double standard: that administrative functions of the university rarely face equal challenges to justify their own costs. What grants are the administration applying for, to pay for their own staff's training and development courses? That's obviously an absurd idea, but my point is that value-for-money on overheads is already highly dubious. FEC arguments allow the administration to vent budgetary pressure back out onto academics—“raise more funds”—instead of creating internal pressures. Given the patchy quality of “training courses” and “career development” sessions I've attended, such pressure could hardly fail to be constructive. But instead, we're expected to raise funds for yet more of the same.)

Let's get back to Borys. There is a general pattern that those seeking to protect the status quo—whether owing to vested interests or plain unimaginative conservatism—often deploy “fear, uncertainty and doubt” tactics. They make a vague argument as to why an alternative “would never work”. I have always found this mindset particularly jarring when encountered among researchers, whose very job is exploring unproven ideas. But it is exactly the tactic Borys was deploying in his second point. To me it appears completely implausible that grant portability would take away incentives from institutions. Poaching is already rife (thanks to the REF, which remains unreformed on this point). But even the biggest institutions are not indefinitely elastic. Maybe indeed a certain fraction of grantholders would be poached, but that is a second-order effect and is likely far outweighed by the first-order benefits of increased research income. Meanwhile, it's true that growing inequality among institutions is a problem, but measures to help postdocs receive grants would work to lessen this, not worsen it. That's because current grant-awarding policies contribute massively to the “rich get richer” phenomenon, owing partly to the weight placed on track record. Spreading grant money out further down the career ladder necessarily means putting greater weight on other factors (or perhaps greater intentional use of randomness) which will favour smaller institutions. All this is also presuming a lot about the overall system of project-based grants, which, as I'll note, is far from untouchable.

Borys painted the issue as one of funding more academic positions. That is not the issue at all. The real issue is this: how should we spend the amount we currently spend? It's rhetorically convenient to take Borys's approach, painting this issue as “postdocs making demands”—for more money, or more academic jobs, or the moon on a stick. Then it can easily be dismissed. Most of our “leaders”, like Borys, are invested in the status quo. This isn't a tirade against Borys: as Vice-Chancellors go I think he was not too bad. But even a “good” V-C was happy either knowingly to advance a bogus argument to protect that status quo, or more charitably, to attend a policy-oriented meeting of postdocs without engaging his brain sufficiently on the issues of what better policies might be.

It's not just Borys. Lack of leadership is a sector-wide problem. At the same National Postdoc Meeting, one panel included a certain Liz Elvidge, apparently “Head of Postdoc Development” at Imperial College London. She claimed that the dire situation is “the nature of the beast”. But it is not nature! It is nurture. It is a consequence of policy; the policies could easily be different. Of course, policies won't change if the people with influence hold opinions like these. It is a perversity typical of the present state of affairs that a reputable institution would create a “postdoc development” role whose de-facto job is to further entrench a system that is actively hostile to such development.

(The notion “professional development” is routinely used as a fig leaf. Institutions love to proclaim their commitment to professional development, to cover for their inaction on the core policy issues. We saw this yet again recently, in science minister Chris Skidmore's speech about “Securing the research talent of tomorrow”. It begins to acknowledge the structural problem, but rather than facing up to it, pivots straight to patching symptoms—mental health and wellbeing, “career development” initiatives—and generating mountains of worthless paper, namely the Concordat. The Concordat is full of worthy-sounding requests, but it is doomed to be ineffective as long as there is no change to the funding regime. There's no recognition from Skidmore that how the money is spent is not only the cause of the problem, but is something government directly controls.)

Today's incumbents—the ageing distinguished professors who now have the strongest academic voice in shaping future policy—mostly grew up under a different system. Where was the mandate for changing it? I'm still looking—this is a topic I'm investigating as part of my part-time PGCHE studies here at Kent (yes, I'm a student again)—but I suspect the story is one of creeping change without any meaningful mandate. Our funding landscape today is dominated by project grants, rather than personal or institutional grants. But it need not be so, and was far less so when our current senior citizens were starting out. For example, although it is rarely remarked, in Cambridge, until the 1980s it was common for academics to start their career on five-year appointments as “Senior Assistant in Research” or “Assistant Director of Research”; or as “Assistant Lecturer”. These posts have disappeared, for reasons owing directly and indirectly to funding policy changes. I will leave a full exposition of that to a future post in this series, but this 2004 report published by HEPI is a good starting reference. The story in brief is that the split of public research funding between Funding Councils (core funding) and Research Councils (project grants) has shifted substantially over the last forty years—but not on the basis of any clear mandate or strategic recommendation that I've yet managed to find. I'll report back with more detailed findings in a future post.

[/highered] permanent link contact

Fri, 11 Jan 2019

Mildly profane meta-advice for beginning PhD students

A while back I chanced upon yet another “advice for PhD students” article, which cajoled me into finally writing my own. I should mention that I don't really like this sort of article; as a side-effect of this cognitive dissonance, the text below will be somewhat profane.

(Just as many such articles contain unacknowledged US-centrism, so mine contains some UK-centrism. I hope you can deal with it.)

Figure out how to make yourself work effectively. If you're relatively young when you start, a surprisingly large fraction of your PhD could go by before you get there. For me, eating properly and sleeping properly are the hardest things to get right, and (of course) a lot flows from them. It might sound silly, but I was in my third year before I mostly cracked this. (Ten years later, I still struggle from time to time.)

Read. Knowing the prior work means really knowing it; not just the last ten years' worth or the high-profile stuff. To merit the PhD degree, I'd argue it's actually more important to be an expert in your field than it is to advance human knowledge. (Some institutions' regulations even reflect this. Not that that's relevant... in practice, your examiners will be guided far more by the folklore of their particular research culture than by any regulations. There's more about examination below.)

Learn how to give good talks. The easiest way is to give talks, and get frank feedback. Also, attend talks and pay attention to which ones work. (Flip side: even after learning this skill, not every talk you give will be good. Don't beat yourself up.)

Attend talks. Cultivate your breadth as well as your depth. This can be easier or more difficult, depending on the particular academic environment you're in. I was lucky on this one.

Don't let anyone tell you “how research must be done” or “what good research looks like”. There are a huge number of different styles of research, and most people know only a few. The more prescriptive someone's take, the more narrow-minded that person and the more loudly your bullshit detector should be sounding. Example: once I was advised that good work should generate “graphs that go up and to the right”. If we only had the kind of research that does generate such graphs, the world would be a poorer place, no matter how good that work. To this day, there are few graphs in my work (and those that do exist don't look like that).

Find your research community (or communities). This relates to the previous point: even people claiming to be doing similar research and/or have similar goals might actually have very different attitudes and methods. Communities have their own vibes and personalities even beyond those issues. It took me a few years, and a couple of wrong attempts, to find the community for me. I was glad when I did. No community is perfect, of course, and I still feel like an alien from time to time.

Don't let the bastards grind you down. Chances are, if your work is at all interesting, some people will want to shit on it. More generally, the research world attracts a lot of extreme personality types (including narcissists, bullies, hypercompetitive arseholes, manipulative gaslighters, etc.). Learn how to see them for what they are—even the ones who are well-regarded famous people.

Get access to good feedback. It doesn't have to be your supervisor (who, realistically, you might well have chosen based on limited information and/or without really knowing what you were doing). It doesn't have to be one person; it's probably better if it isn't. Despite good intentions all round, I found that my supervisor's feedback wasn't very useful (I have no hard feelings). But I found that through my surrounding research group (groups, in fact) I could crowdsource a lot of extremely valuable feedback.

Know how to disregard unhelpful feedback. Some people won't have anything constructive or insightful to say, but will still be happy to pick imaginary holes, make you feel small and/or make themselves feel clever. Don't let them sap your energy.

Learn how to defend against probing and/or hostile questions. In a talk situation, or perhaps a viva/defence situation, you will get these from time to time. It's rarely a fair fight; often, the hardest questions to answer are the most sloppily formulated or the ones founded on wrong assumptions that remain implicit. There are some tactics for handling these. The simplest is batting them back for clarification or re-statement. Another is offering back the question that you think was intended (and, hopefully, one that actually would have been a sane question). The more expert technique is on-the-hoof picking-apart of exactly what those wrong assumptions were. Some of my most awkward moments as a young researcher were when my lack of skill at this was exposed. So, the other point is not to sweat it when this happens.

Don't be afraid to talk to people whose work you admire—even by sending cold e-mail. Even “famous” people are mostly approachable and are happy to hear if you like their work and have something interesting to say. (And if they're not, then it's their problem not yours.) I was far too timid about this. You never know; it might lead to something. And even if not, there is value in just saying hi. I held back too much.

Socialise your work, and yourself. Get yourself invited to give talks at other universities or research labs. Talk at workshops, give demos or posters; take the opportunities. Although it's possible to overdo this—getting some time for actual work is important too!—it's easy to underdo it, especially if you're shy or not a natural self-publicist. (Sad but true: the more people know you personally, the more citations your work will attract, especially earlier on.)

When communicating with others, learn how to focus your thoughts, and tailor them to the audience. This is partly about having not just one but many “elevator pitches” (or, as we Brits say, “lift pitches”). But it's more than that. One of the things I used to get wrong when talking to people, especially by e-mail, was to let my message degenerate into a brain dump. There's always more you can say... but often, less is more. Select the thoughts or points that you're most interested/excited about, filtered by their relevance to, or expected resonance with, the recipient. Have faith that the other stuff will come out over time if it's fruitful. There will be a lot of implicit cues, in how you're writing or talking, hinting to the recipient that there is more where that came from.

Learn when administrative rules are not really rules. Even the best institutions will have over-narrow admin-defined ideas about how a research student's journey should go. They might assign you to one research group even though your interests span several. They might refuse to fund your travel to a particular event that's considered out-of-scope. Towards the end of your PhD, they might neglect to provide extension funding, or not connect you with the help you need in applying for your next funding. In all cases, the rules often appear inflexible when you read them, but they are often fluid in practice, so be bold enough to ask the question anyway. (One example: I was told I had only a three-year studentship that could not be extended. But I asked for extension money anyway. The rules didn't allow extending me, but they did allow retroactively increasing my stipend, effectively paying me additional back-pay; this got me an extra 4.5 months of funding.) Usually it helps to be on good terms with the relevant administrators—go and talk to them in person, be unassuming, try to understand where they're coming from and what the real constraints are. It can also help to have a supporting professorial type (again, need not be your supervisor) who will be your advocate and help you poke the administrative gears.

Know how your PhD will be examined, and do what you can to influence and optimise for that. Under the UK system, in practice it is your supervisor who will choose your examiners, but you may have some influence over that choice. Sadly, “examination-friendly research” and “good research” are overlapping but non-equal sets. This is particularly dangerous in the UK system where the examiners are all-powerful, and doubly so if your supervisor's social capital is low enough that the examiners do not see fit to moderate themselves on that account. It's pragmatic to see your thesis as a thing to be examined, rather than a thing you're writing for yourself (even though the latter may make you feel fuzzier and more motivated). This also comes back to the “many styles of research” point: if an examiner isn't used to work of your particular style, you will get a rougher ride than you deserve. Make sure the choice of your examiners is made with due consideration of this; it shouldn't simply be whoever your supervisor randomly bumped into and asked on a whim.

Be sceptical about advice. There's a lot of it about, given freely and often smugly by people who took a single path. This is generally well-meaning, but the people concerned don't necessarily know much about the other paths. Successful people often don't truly know why they were successful. I received huge amounts of bad advice during my PhD, which has made me a cynic about advice-giving and mentorship to this day. (Can you tell?)

Clearly I'm a hypocrite too, otherwise I wouldn't have written this. My so-called career path, though not without its good points, has definitely not been one to emulate. Sadly, I can't blame all of its failings on following other people's bad advice. Still, I hope this article has been different enough that it makes a useful counterpoint to advice in the usual vein.

[/research] permanent link contact

Wed, 17 Jan 2018

How to be a Scrutineer (and a better one than I managed to be)

From December 2015 until October 2017 I served on the University of Cambridge's Board of Scrutiny. This is a high-level governance body that is “the University's chief internal mechanism for ensuring transparency and accountability in all aspects of University operations”. The role of the Board is to “examine the way in which the University is run and to comment on this to the University's governing body, the Regent House”, taking its lead principally (but not necessarily exclusively) from the Annual Reports of the Council and the General Board, and from reports concerning accounts and allocations. My service on the Board was hugely valuable experience to me personally. I hope it was also, in some smaller way, of value to the University.

One of the most difficult things was not so much discharging the responsibilities as figuring out how best to do that in the first place. So I've written the following retrospective notes on how, in my opinion, one might serve on the Board effectively. They are more about reach than grasp. Doing all of them perfectly is impossible. Doing even most of them pretty well is extremely difficult. But it's what I would aspire to if I were to re-join the Board.

Serving on the Board is a combination of learning, doing good (according to the University's mission and hopefully one's own principles) and doing the necessary: putting in service that others in the University will benefit from. When I joined, I rather underestimated the significance of the latter. Doing a service to fellow Regents is the most constructive spirit in which to operate. As a non-obvious example, one of the pragmatic roles of the Board, even in a well-functioning University democracy (alas, currently lacking), is to communicate to the University community things that the administration can't say, whether for legitimate political reasons or for regrettable lack of spine (often the reality is somewhere in the middle). One example is criticising the national government. Another is admitting the University's errors in PR-sensitive arenas (e.g. the North-West Cambridge debacle).

Understand the formal structure of the University. I spent many hours studying various web pages before I began to get my head around this. The “Officers Number” issue of the Reporter, published every Lent Term (currently: this one), is perhaps overkill but useful. The governance web site has a gentler introduction. The Statutes and Ordinances are worth consulting and getting to know your way around (roughly). One of the difficult things for me has been to remember which stuff is in Statutes, which in Special Ordinances and which in Ordinances... it's easy to be scratching your head about why you can't find something you saw previously, until you realise that it's in one of the others.

Understand both the Board's formal remit (as specified in Statute A chapter VII and Ordinances Chapter 1 section 6) and (separately) its de-facto role—including differing opinions on what the latter should be. This has been controversial since the early days of the Board. Perhaps the best reading material is the Remarks at the Discussion of the Fifth Report, which showcase the whole spectrum of opinions. Gordon Johnson had previously (at the Discussion of the Fourth Report) espoused the view that the Board should not write a report, simply lead Discussions. Among his concerns was that the University's “officers... devote a tremendous amount of time and effort to meeting the Board, providing it with information, and answering its questions”. At the next year's discussion, then-Chair Michael Potter countered that claim, after which, Johnson laid into the Board's newest Report quite ferociously, reiterating his views that its approach is wrong. Richard Stibbs countered again with the view of the Board as “interested amateurs”, in both senses of the latter word. This is a view I personally find to be realistic, again in two senses. None of this is very conclusive, but it makes clear the spectrum of opinion and strength of feeling.

Understand the Board's history. Johnson was part of the Wass Syndicate on whose recommendations the Board was created. In the aforementioned Discussion, Johnson is the most forthright contributor; he is not necessarily correct. I personally found it very odd that he considered it the Board's responsibility “to explain more in layman's terms the background to what was being proposed [in the Annual Reports of the Council, General Board and of Allocations from the Chest] and the reasonableness or otherwise of [their] content”. I see no reason why these reports cannot and should not be made readable to the lay-Regent within themselves, including their own background. And of course, the reasonableness of their content has very much been the subject matter of every Board's Report. In the present era, where these Reports contain more spin than ever, it becomes ever more important to highlight their omissions rather than explain their merits. The CAPSA debacle is often cited as one of the Board's major successes to date, since the Board was partly responsible for instigating the Shattock and Finkelstein reports. (I must admit I have yet to understand by what mechanism the Board did this.) This Discussion of March 2003 is good too, covering post-CAPSA attempts at governance reforms which were rather mishandled.

Obviously, read the Reporter every week. Slightly less obviously, read the Council's minutes every month. Talk to people who have been through the political mill. But also, talk to the rank-and-file, such as colleagues in other departments. Talk to as many different kinds of staff and student as you can. I did not do wonderfully at this. One thing in that spirit I did, for a period, was to work from the Combination Room one morning a week; this was partly for focus, and partly in the hope I would overhear information about what was concerning the rank-and-file admin staff at the Old Schools. I'm not sure I overheard very much of interest, but I did get a sense of what I could (somewhat blandly) call the “different culture” that operates in the Old Schools and UAS worlds compared to elsewhere in the University.

Obtain minutes. Where minutes are not made available (whether publicly, only to the University, or only to the Regent House), consider finding out why not. For example, in August 2017 I noticed that the Council's Resource Management Committee appeared to have a practice of publishing its minutes online, but also hadn't archived any since 2012 in its online archive. Was this because the committee did not meet? That seemed unlikely. Had it changed its policy on publishing minutes? There was no indication. I queried this with the Registrary's Office; for four months I received no reply, until I re-sent my query and (not coincidentally) spoke at a Discussion (not on anything related, I might add). I then received a reply saying roughly “we can't do anything about committees choosing not to publish their minutes”. But, coincidentally I'm sure, a big pile of RMC minutes did appear in the archive at around the same time. If there's a moral to this, it's that perseverance and a certain amount of pig-headedness are necessary to help ensure diligence is applied where needed.

It's not just minutes. Other widely-circulated documents can be telling about what is going on centrally. The University's Risk Register conveys a lot about current priorities, and is available online (internally only). The annual Planning Round guidance is also telling, and usually appears online too. (In general, “following the money” is illuminating.) There's probably others that I'm overlooking here.

Make the Reporter and collected minutes searchable, in a way that works for you. In my case this means saving them on my laptop and converting them to plain text. Automating this is not too hard, because the web pages have a fixed structure. (Ask me for my Makefile!)

Know how to juggle hats. When interacting with University staff to whom it might matter, be clear whether you're acting for the Board, or personally, or on behalf of some other group.

Keep at arm's length from the Council, but have your sources. The Board's main sources of information on Council business are of course the Proctors, but they often have reasons to be discreet. A lot of University business can easily pass you by unless you go out of your way to hear it.

Protect the Board's neutrality and perceived neutrality. Avoid appearances of conflicts of interest, even if there is no conflict. For example, in hindsight, it was a mistake of mine to nominate a candidate for election to Council while I was serving on the Board. Even though there is something of a revolving door between a subset of the Board and a subset of Council, it is best to avoid being so nonchalant about this as to plant doubts in people's minds. (As an example of something less blatant: one of my nominators for election to the Board later joined Council part-way through my term. If anybody had been paying sufficient attention, this might have been perceived as a conflict of interest, but the chances are much lower. I hasten to add there never has been any conflict on that account; the question is one of appearances.)

Develop a mental model of Council's operation, and check it with actual Council members. One thing that I learnt only quite late was the complex dynamic between the Registrary's Office and Council. The Registrary's Office has huge power over the University, since it does much of the “real work” behind business that is ostensibly done by the Council. For example, as I understand it, responses to Discussion Remarks are drafted by the Registrary's Office, as are parts of the Annual Reports of the Council and General Board. The influence of that office is huge. On certain occasions there is tension between members of Council and that Office; usually the “official story” communicated to the University, such as in the Reporter, reflects the Registrary's Office's side of any such contended matter.

Understand the HE landscape beyond the institution. Initiatives in the Old Schools often have their origins in sector-wide movements, either directly provoked by government policy or indirectly emerging from ongoing cultural shifts. A good example is the “100% FEC” issue: the claim that the University loses money on research. If I had read this article in the Times Higher Ed, or these slides, I might have pressed certain people harder when this line came out of their mouths.

Understand personal dynamics, and compensate. Like any committee, not everyone on the Board will be equally engaged or equally effective at any given time (and for various good reasons, too). These need not be in proportion to the loudness of their voice or their apparent authority in the meeting room. Compensating for this is part of the Chair's job; but everyone can help.

See membership of the Board as both an individual and a collective matter. The Board will never be of one mind. Signing off on the same Report is the basic standard of consensus. Beyond that you should not expect necessarily to have the same priorities or same position as your fellow Board members. At the risk of displeasing the Chairs I have served under, who did not hold with this policy, I believe individual Board members should be free to exercise their right to see documents—as long as this is with a valid concern in mind. I don't see why any member of the Board could not be trusted to use this privilege reasonably. Seeing this privilege as only a collective Board matter seems to me to close off the opportunity for individuals to follow their noses. I must admit it is not clear to me to what extent there was a statutory intention to empower Board members as individuals rather than just collectively. But at least, I perceive that the individual dimension is recognised in the case of Council.

Don't be collectively timid. During my time, I felt we were too timid on certain occasions. As one example of a potential timidity that Council and/or the Registrary's Office may play to: the Board should never accept redacted documents, unless the redacted content is “irrelevant” (the latter being the only statutory grounds for withholding information from the Board without the VC's written permission).

Don't be ashamed to have issues you care about. Obviously there are some relevance criteria. Actions of the Board should be in the interests of the University, within the Board's remit, and allocate attention proportionate to the issue's importance. But if you believe something is important by those criteria, it probably is. Don't be afraid to raise it even if other Board members appear uninterested. Put differently: being a diverse cross-section of the Regent House, different Board members will care about different issues; the intention of having a twelve-member board is to obtain a cross-section. Conversely, this means allowing time to other Board members' concerns even if they don't resonate with one's own experience.

Develop a habit of contributing to the Board's ongoing recruitment and succession. The Board is continually struggling to recruit. Many people seem not to care about whether their University is governed democratically and accountably. Many do care, but haven't yet realised that the Board might be a good way to act on that. Every Board member should take care to notice anyone who seems to be an Interested Sort of Individual. Find them and encourage them to serve. Note that filling the spaces is not enough; the Board will only be in good health when elections are contested. Having a choice of membership and some effective choice of officers (Chair and Secretary) are important. (I write that as a Secretary who felt somewhat roped into it, on the grounds that “nobody else will do it”. In hindsight I think it was a positive experience, but it was unwelcome at the time, and I don't think I'm being uncharitable to myself in saying I was not the ideal candidate.)

Don't believe that everybody knows what they're doing, or that you do. When I started on the Board, things were done a certain way; it had its benefits but it also had several drawbacks. In my second year on the Board we changed a lot of things, generally to the Board's benefit. I suspect several logical continuations of those changes will come about in the next year. If I'd known earlier how much flexibility there was in how the Board conducts its duties, I might have helped push some of those changes along sooner.

Doubt people, charitably. This extends to oneself, one's fellow Board members and also to the many senior figures whose work the Board scrutinizes—who are grappling with very difficult and demanding jobs. However formidable they are as individuals, they will invariably be grappling imperfectly and may even be happy to admit as much. Among its many other modes of operation, the Board has the potential to help people out, by nudging the gears into better alignment. (That is not to say that raising the red flag or kicking backsides are never appropriate modes of operation....)

Be cooperative where possible, but don't be a pushover. One of the trickiest and most controversial aspects of the Board's operation is the question of how much it should be a savage bulldog, and how much it should be a conciliatory force. At different times, both can be necessary; good taste is required. This is a difficult balancing act.

Believe in the Board's importance, so that others will. I was quite surprised at my first Board meetings to detect hints of nerves on the part of some interviewees—people who were very senior figures in the University. But also, in those early days, we had one instance of a senior figure “shutting us out” owing to accumulated bad relations. The Board's power may be limited to giving voice to issues and arguments, but that power is still considerable. There is much latent capability for doing good, but this is slippery; harnessing that capability effectively is difficult. If the Board acts and acquits itself with fitting diligence and seriousness, including a sense of realism about what options are available to Council, it can be taken seriously by others. It stops being effective when its demands are impossible, or when it seeks to direct attention where it is ill spent. In an era when both Council and ordinary Regents are ever more disempowered by central managerialism, a strong Board is essential to elevate the University's supposedly democratic business from mere theatre to the actual rule of good sense. During my time, the Board was an effective body only intermittently. Hopefully it can be made more consistently effective; the University needs it more than ever.

[/highered] permanent link contact

Tue, 09 Jan 2018

Undergraduate admissions in “computer science”: a plea

Dear Bogdan,

Thank you for this update on our undergraduate admissions testing, and very well done on what must have been a mammoth effort. Seeing the CSAT happening at the CL, I could see it did succeed in creating a friendly environment, and I was impressed by the organisation effort which had clearly been necessary to put on such a smooth operation over a fairly long period.

However, let me also sound a note of alarm.

Reading the CSAT web pages, it is (still) not even mentioned that the test is about mathematics. Rather, it is left implicit—as if it didn't even occur to anyone that it might be necessary to mention or justify this. The Frequently Asked Questions page repeatedly uses phrases like “at A-level”, “at school” without even seeing fit to mention that school mathematics is what you're talking about. Your starting assumption has apparently been that mathematics skills are 100% of what we are looking for.

I believe this is completely wrong, and it really worries me that you are apparently not thinking about it at all. Computing needs a diversity of skills and viewpoints. It is a horrible mistake to assume that it is simply a question of mathematics. I've personally known many computer scientists whose interest in (natural) languages, philosophy, history, biology and other sciences has been far stronger than their interest in mathematics. Meanwhile, “systems thinking” is a large part of many areas of computer science and requires analytical skills in human domains as well as mathematical ones. Why can't we admit people based on their excellence in those areas?

I am also incredibly worried about the withdrawal of the 50% option, which will further narrow the horizons of our students. As we admit a larger cohort of students, we should be acquiring the necessary support staff to scale up our offering's diversity. We should not be perversely restricting it on the grounds that this is the only way to cope without any extra staff. Our admissions and intake are booming; we should be coming out with a better offering than ever. You might counter that few students have chosen the 50% option this year, but that only points again to the narrowness of our selection criteria. (I realise that this decision was not yours, but am raising it because I'm sure you have more oversight of this than I do.)

Cambridge will not suffer directly from these mistakes, because we will always admit smart individuals and turn them out even smarter. But our discipline will suffer, in practice and in research. Consequently, society will suffer. I need hardly remind you of the huge power that technologists have acquired in recent years, and the cluelessness with which many wield it. This is one of the huge social problems of our time, and Universities are far from blameless in it. Technology shapes our society. Study of the historical, human, psychological, biological and other topics all provide necessary and complementary viewpoints. It is a recipe for disaster to make computer science a discipline of mathematical thinking alone.

I hope (albeit without optimism) that this will make some sense to you. I should mention that I have also posted this e-mail on my blog. Yours in desperation,


[/teaching] permanent link contact

Wed, 18 Oct 2017

Some were meant for post-hoc reflections

It was great to get so much positive reaction to my essay “Some were meant for C”, so thanks to everyone concerned. In fact, thanks to everyone who's read it, even if you didn't like it! There were lots of things the essay didn't cover, and a few which it did cover, but only in a way that perhaps wasn't very obvious. And there were a few common patterns running through various reactions. All these seemed worth addressing directly in a few paragraphs, which I'll hopefully keep brief.

What about C++? Firstly, it's true that C++ offers the same “communicativity” benefits as C. We all know C++ was designed, and continues to be designed, to support systems programming (among other flavours). So, the reason I focused on C is not that C is unique in this benefit; but simply that C is the more paradigmatic example. (Of what? Well, of systems programming languages in the “unsafe school”... see also the next paragraph.) C++, on the other hand, is more things to more people, so would require a much broader and more discursive treatment than my essay had space for. For the record, I use C++ quite a lot myself. When I use it, it is (I hope) because it is the better tool for a task. These tend to be algorithmically complex tasks, and tend not to be systems-programming ones. I wouldn't say I enjoy using C++ as much as I do using plain C, but that is largely a consequence of the tasks concerned (I hate algorithmically hard coding!) and not the languages. More generally, I think it's uncontroversial that the huge complexity of the C++ language is a source of both strengths and weaknesses.

What about Rust? If C is a paradigmatic example of systems programming languages in the “unsafe” old school, then Rust is surely a paradigmatic example of a new school. Or is it? Actually, the benefits of Rust are pretty much orthogonal to everything I talked about in the essay, although I agree that this fact is non-obvious. Rust's selling point is its static checking, which I certainly agree is neat. This particularly suits critical systems programming applications that want a high degree of static assurance while avoiding run-time baggage. However, this static checking comes with a similar “must oversee the whole world” requirement as do managed languages—albeit overseeing it at compile time, rather than at run time. When using Rust for systems programming of the communicative kind that my essay talks about, one is dealing with foreign objects not defined within the Rust language; therefore one is necessarily using what I call “Unsafe Rust”, i.e. the subtly different sublanguage one gets inside unsafe blocks—at least within some “interface layer” in the code. (Of course one can build abstractions that are checkably safe assuming correctness of the unsafe interface layer.) In this respect, Unsafe Rust has the same properties as C: the dynamic safety that I talk about in the later sections of the essay would be equally useful, and equally possible, within Unsafe Rust. (That would make it Dynamically Safe Yet Not Statically Safe Rust—a catchy name indeed.)

Are you proposing a “safe subset” of C? No. The whole point is that no subsetting is required. That is a consequence of how C is specified; to assume that C “specifies” its own unsafety is a confusion, albeit a common and understandable one, between specification and implementation. If you want to get super-technical about the C language specification, you could push me to “almost no subsetting”, since there are some tiny corner details that might need to be tweaked to get the best implementability and efficiency trade-offs. One key area I'm thinking about is the rules to do with effective types, which I'd rather had a tiny bit more ceremony attached. The same goes for unions. Still, these changes would be working at the same scale as the present language standard's evolutionary process; the essence of the C language, and the deliberately underspecified style in which it is defined, are amenable to a fairly efficient, dynamically safe implementation.

So you think C is wonderful, right? Not at all! C has plenty of flaws and I'm more than happy to point them out (er, some other time). I don't think anybody out there believes C could not be substantially improved if we were to redesign it with several decades' hindsight. But my point is rather the converse: that we should recognise the non-obvious ways in which we might get these redesigns and replacements wrong, and recognise why people continue to use C despite today's candidate replacements. Among these reasons, the one I chose to expound on was communicativity, because it is among the most non-obvious (to me, of those I know). There are other reasons why C is an effective tool in ways that certain other languages aren't—one respondent helpfully pointed out that the preprocessor allows certain cases of portable code, for example, which is a very true and important observation. Overall, any attempted replacement must preserve (or ideally improve on!) all these strengths. And of course, it is worth re-emphasising a million times: this is not simply a matter of performance.

What about... (some other language)? Although I'm always interested in and (usually) sympathetic to new language designs, one of the main messages of the essay is (or was supposed to be) that we give too much attention to languages. So, asking me about languages has been a somewhat frustrating pattern in the responses I've got. It's understandable to some degree. As humans, we like to use labels to partition the world in front of us (into species, tribes, taxonomies, etc.). Languages are a readily available label, and the differences among them are very visible. But sometimes this desire to label and partition leads us astray; the point about how we confuse C with implementations of C, on which the essay labours at some length, is a key instance of this. So forgive me if asking me about this or that language makes me a bit grumpy, even if that language is really nice.

Aren't language wars a waste of precious time and energy? Yes! I've already hinted at my distaste for language wars in the previous paragraph. Still, I suspect this reaction came from people who read the essay's title but not its content.

[/research] permanent link contact

Sun, 17 Sep 2017

Project suggestions (for Part II and MPhil students)

I have several project ideas to suggest this year. I could be available to supervise any of them, for the right student (and not too many students!).

A caveat: most of these are researchy project ideas. Part II students should read the pink book carefully, and also read my advice about Part II projects from a student's perspective, including the potential dangers of getting too researchy. Nowadays as a researcher, I feel a strong incentive to suggest projects that verge on the over-challenging or overly research-led. We should discuss carefully any project idea before writing the proposal, to ensure it's within your interests and (projected) ability.

See also my rather cryptic and occasionally out-of-date list of other standing suggestions.

An observable OCaml

This is about an OCaml-to-C compiler that allows the generated code to profit from DWARF-based debugging support of C compilers, while still providing a coherent OCaml view of the debugged code. A very nice job of this project was done last year by Cheng Sun, but there is plenty of room for variation; please ask me.

A fuzz-tester over API traces

This is renamed from “DWARF-based”. I've done some more thinking about this project since the linked write-up a few years ago, so let me expand. Most fuzzers work by perturbing the data given to a particular piece of code. This project is about fuzzing library interfaces by perturbing the trace of calls. This may include fuzzing data, but is not limited to it.
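
To make the distinction concrete, here is a minimal toy sketch (my own illustration, not part of the original write-up): the “library under test” is a tiny stack, and the fuzzer's raw material is the trace of calls made against it, not just the argument values.

#include <assert.h>
#include <stddef.h>

/* a toy "library under test": a tiny fixed-capacity stack */
struct stack { int items[8]; int top; };
static void push(struct stack *s, int v) { assert(s->top < 8); s->items[s->top++] = v; }
static int  pop (struct stack *s)        { assert(s->top > 0); return s->items[--s->top]; }

/* a trace is just data, so it can be perturbed before being replayed */
enum op { PUSH, POP };
struct step { enum op op; int arg; };

static void replay(const struct step *trace, size_t n)
{
    struct stack s = { .top = 0 };
    for (size_t i = 0; i < n; ++i)
        if (trace[i].op == PUSH) push(&s, trace[i].arg); else pop(&s);
}

int main(void)
{
    struct step trace[] = { { PUSH, 1 }, { PUSH, 2 }, { POP, 0 }, { POP, 0 } };
    /* mutations such as reordering, deleting or duplicating steps reach states
       that mutating argument data alone never would (e.g. duplicating a POP
       trips the underflow assertion) */
    replay(trace, sizeof trace / sizeof trace[0]);
    return 0;
}

A real tool would derive the callable interface from somewhere (DWARF, in the original framing), generate and mutate traces rather than hard-coding them, and watch for crashes or assertion failures on replay.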

A random test case generator for linkers

(This is probably more of an MPhil project than a Part II project, though it could be attempted as either.)

Tools like Csmith have found considerable use in testing C compilers. A similar tool could be useful for linkers, which continue to be buggily extended and reimplemented, e.g. in the LLVM project's LLD linker. This project would be focused on ELF-based linking on Unix platforms.

Linking is rather simpler than compilation in some respects, but exhibits a large number of possible feature interactions. For example, how do symbol visibilities interact with weak definitions?
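
To give a flavour of that interaction, here is a hand-written example of the kind of test input the generator would need to emit (my illustration; the symbol names are made up). Linking the two modules into one output raises exactly the question above: which definition of counter does the output bind to, and does the hidden visibility on helper change what is exported or preemptible?

/* module1.c */
__attribute__((weak)) int counter = 1;                          /* weak definition */
__attribute__((visibility("hidden"))) int helper(void) { return counter; }

/* module2.c */
int counter = 2;        /* strong definition: expected to override the weak one */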

One challenge is to generate only link jobs that “should” work. For example, the job should usually (if linking an executable) contain no non-weak undefined symbols that lack a matching defined symbol.

Another challenge is to generate code for which some test of behavioural correctness of the output program is not only possible, but also usefully exercises the linker's work in some way: we should obviously generate code that involves cross-module referencing, but it's not clear what else it should do.

A third challenge is to achieve a bias towards “interesting” intersections of features, to improve the probability of finding bugs.

Useful outcomes from a randomized test case would include crashing one or both linkers, or producing divergent behaviours between two existing linkers. Evaluation could be done in terms of coverage of the feature space, and also in bugs found (including known bugs, e.g. in older versions). A starting point would be to extract a corpus of bugs from the change history of LLD.

A testable specification of C and C++ ABI details

Compilers use documented ABI specifications, such as the System V ABI, its x86-64-specific adjunct, and the Itanium C++ ABI, to achieve compatibility at link time and run time. These fix standardised details for representing data, calling code, interfacing with the operating system and achieving link-time effects (such as uniquing of inlineable function instantiations, dynamically interposable bindings, etc.). Given a particular C or C++ source file, what linker symbols should we find in the output binary (.o file), with what attributes and satisfying what properties? What data layouts should be used for various data types? ABIs answer these questions, at least in theory.

Since ABIs are specified only informally and partially, and are not rigorously tested, divergence in these details currently causes compatibility problems between compilers, even if they are intended to implement the same ABIs. This project would involve writing a formal specification of (some of) these details, partly from ABI documentation and partly by observing the output of real compilers, then using differential testing to identify divergences and (perhaps) report bugs in real compilers. This includes such issues as how particular source language features map to particular linker-level features, across different code models and when built for static or dynamic linking. Some experimental work will be necessary particularly in the area of extended/non-standard C features (packed structures, bitfields, ...) and C++ (vtables, typeinfo, ...), with differential testing against gcc and LLVM toolchains.
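
As a taste of the inputs involved, here is the sort of small translation unit (my example, not from any existing test suite) for which the specification would have to predict both layout and symbol-table facts: the size, alignment and bitfield allocation of struct s, the effect of packing on struct p, and which of the definitions come out as local, global or weak symbols, in which sections and with what sizes.

struct s
{
    char c;
    unsigned flags : 3;        /* bitfield allocation is an ABI-level decision */
    unsigned wide  : 29;
    long l;                    /* padding and alignment fixed by the psABI */
} an_s;

struct __attribute__((packed)) p   /* packed structs: a common extension to pin down */
{
    char c;
    long l;
} a_p;

static int file_local = 1;                /* expect a local symbol */
int global_datum = 2;                     /* expect a global object symbol */
__attribute__((weak)) int weak_datum;     /* expect a weak symbol */
int get(void) { return global_datum + file_local + weak_datum; }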

A lightweight user-space virtualisation system using the dynamic linker

The dynamic linker (ld.so) is a critical component of commodity Unix platforms. It is used to interpret almost all binaries on the system, and contains the earliest-run user-space instructions. It is therefore a potential interface for virtualisation: a modified dynamic linker can provide unmodified application code with the illusion of an extended kernel interface and/or differing instruction semantics. As with hypervisors, this can be done with either trap-and-emulate or binary-rewriting techniques, or both. This project concerns building a usable dynamic linker which virtualises the user-space environment on a mainstream Linux or FreeBSD platform and adds one or more “added value” services or tooling features. The choice of added value is somewhat free, but a simple suggestion is a memory profiler, similar to Valgrind's massif but offering much lower run-time overhead. Another might be a filesystem extension something like flcow or fakeroot.

Either way, the project will involve implementing an extended version of some existing dynamic linker. To be useful, this should include (1) self-propagation support, so that child processes are executed under the new dynamic linker; (2) instrumentation of the necessary instructions or system calls to deliver the added value; and (3) an implementation of the added value itself. The starting point will include, should the student find it useful, a basic proof-of-concept tarball for extending the GNU C library's dynamic linker on Linux. This project will demonstrate understanding of operating systems, assembly-level programming, compilers, virtualisation, and instrumentation-based programming tools. There are lots of possible extensions, including instruction emulation, system call interposition (particularly challenging in multithreaded contexts), supporting statically linked binaries, supporting setuid binaries, adding simple instrumentation primitives (e.g. call relinking, GOT/PLT redirection), and integrating heavyweight instrumentation systems (by plumbing in a system like Valgrind, DynamoRIO or FastBT).
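
As a sketch of what point (1) might involve (one possible approach only, with a made-up loader path; the proof-of-concept tarball may do this differently): interpose execve() and re-route it through the extended loader, invoking the loader as a program in the usual way.

#define _GNU_SOURCE
#include <dlfcn.h>
#include <stdlib.h>
#include <string.h>

#define MY_LDSO "/opt/myld/ld.so"    /* hypothetical path to the extended loader */

typedef int execve_fn(const char *, char *const [], char *const []);

int execve(const char *path, char *const argv[], char *const envp[])
{
    execve_fn *real_execve = (execve_fn *) dlsym(RTLD_NEXT, "execve");
    size_t n = 0;
    while (argv[n]) ++n;                          /* count existing arguments */
    char **newargv = calloc(n + 3, sizeof *newargv);
    newargv[0] = MY_LDSO;                         /* run the loader as a program... */
    newargv[1] = (char *) path;                   /* ...telling it what to load */
    memcpy(&newargv[2], &argv[1], n * sizeof *argv);   /* copy remaining args and the NULL */
    return real_execve(MY_LDSO, newargv, envp);
}

A real implementation would need to cover the whole exec family (including calls made internally by libc), which is one reason syscall-level interposition is also on the menu.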

A testable formal specification for dynamic linking

(This one is probably more of an MPhil project.)

Dynamic linking exhibits subtle variations both across different Unix platforms (such as GNU/Linux, FreeBSD, OpenBSD etc.), and across different dynamic linkers available even within just the GNU/Linux platform (such as musl, uClibc and glibc).

This project will use a combination of documentation and experimental methods to produce a precise specification (yet “loose”, i.e. accommodating allowable variation) of what happens when a dynamic linker loads an executable and its constituent libraries.

Given an input binary, the output is a memory image in the address space containing the dynamic linker. A host of details matter to this transformation: symbol visibility, relocation semantics, program header semantics, and so on. We currently have a specification (in Lem) of some of these, which has been used as a testable specification for static linking. This project will most likely benefit by extending these, albeit in the context of a new executable specification simulating the dynamic linker rather than the (static) “batch” compile-time linker. Another useful component would be a ptrace()-based tool for testing against the memory image actually produced by a real dynamic linker.
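
For the ptrace() side, a minimal starting point might look like the following (my sketch, assuming Linux): run the target under trace, catch the stop that follows the execve(), and dump /proc/<pid>/maps. A real checker would continue further (for example to a breakpoint on the loader's r_debug/_dl_debug_state hook) before comparing the observed map against the image the specification predicts.

#include <stdio.h>
#include <unistd.h>
#include <signal.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/wait.h>

int main(int argc, char **argv)
{
    if (argc < 2) { fprintf(stderr, "usage: %s CMD [ARGS...]\n", argv[0]); return 1; }
    pid_t child = fork();
    if (child == 0)
    {
        ptrace(PTRACE_TRACEME, 0, NULL, NULL);  /* ask to be stopped at the exec */
        execvp(argv[1], &argv[1]);
        perror("execvp");
        _exit(1);
    }
    int status;
    waitpid(child, &status, 0);                 /* child is now stopped, post-execve */
    char path[64];
    snprintf(path, sizeof path, "/proc/%d/maps", (int) child);
    FILE *maps = fopen(path, "r");
    char line[1024];
    while (maps && fgets(line, sizeof line, maps)) fputs(line, stdout);
    if (maps) fclose(maps);
    kill(child, SIGKILL);                       /* we only wanted the snapshot */
    return 0;
}

At this first stop only the executable, the vdso and the dynamic linker itself are mapped; the interesting checking happens once ld.so has done its work.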

Initially one would probably factor out the filesystem aspects, i.e. assume that the output of ldd is known and ignore features such as preloading and auditing that can affect this. These could then be added in later as extensions. One would probably also want to work with preexisting binaries; specifying and checking the (static) production of dynamically-linkable binaries could again be a useful optional extra, for which we have some initial work.

A machine-readable, testable system call specification usable by instrumentation-based tools

Many programming tools, including debuggers and tracers and also, especially, tools built with instrumentation frameworks such as Valgrind or DynamoRIO, require knowledge of the system call interface. Currently these use hand-maintained knowledge of syscalls which is often incomplete and/or incorrect. Pooling and automating these specifications would improve quality and completeness, reducing the costs of keeping tools up-to-date. Candidate tools/frameworks include qemu, DynamoRIO, Valgrind, gdb, strace, and our own ppcmem/rmem.

A system call's register footprint (arguments in/out, clobbers) is defined by the kernel ABI. Most system calls also have a memory footprint, which is much harder to describe. Some tooling for describing and testing the Linux system call interface, and a very partial specification of calls' read/written memory footprints, can be supplied as a starting point.
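
To illustrate the kind of fact such a specification states, here is one possible (entirely made-up) encoding of a few calls' memory footprints; the real starting-point tooling mentioned above has its own, richer format.

enum arg_kind { ARG_VALUE, ARG_IN_BUF, ARG_OUT_BUF, ARG_IN_STRING };

struct syscall_footprint
{
    int nr;                    /* syscall number (x86-64 numbering) */
    const char *name;
    int nargs;
    enum arg_kind kind[6];     /* how each argument is interpreted */
    int len_from_arg[6];       /* index of the argument giving the buffer length, or -1 */
};

static const struct syscall_footprint footprints[] = {
    /* read(fd, buf, count): the kernel writes up to 'count' bytes at 'buf' */
    { 0, "read",  3, { ARG_VALUE, ARG_OUT_BUF, ARG_VALUE },  { -1, 2, -1 } },
    /* write(fd, buf, count): the kernel reads 'count' bytes from 'buf' */
    { 1, "write", 3, { ARG_VALUE, ARG_IN_BUF,  ARG_VALUE },  { -1, 2, -1 } },
    /* open(pathname, flags, ...): reads a NUL-terminated string at 'pathname' */
    { 2, "open",  2, { ARG_IN_STRING, ARG_VALUE },            { -1, -1 } },
};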

Other useful information includes argument types (mostly done already), possible error codes and their necessary and/or sufficient preconditions, the latter perhaps initially limited to argument well-formedness but extending later to details of the existing memory map, file descriptor state, resource limits, signal dispositions, and so on. (Of course error conditions also depend on filesystem state, which is rather complex by itself, and subject to interesting race conditions; a thorough job of this is not feasible within one project, but some cases could be addressed.)

One extension of memory footprints concerns the well-definedness of the contents of memory written and/or read—for example Valgrind's Memcheck would ideally consume a specification of output (written) definedness as a function of input (read) definedness.

System calls are “downcalls”; the specification could also consider system “upcalls”, mostly meaning kernel-generated signals, the conditions on which they might occur and the properties expected to be true of the process state at those times.

This project would ideally pick a couple of tools and a couple of architectures (including AArch64) and produce a testable specification usable by both tools on both architectures as a replacement for their current ad-hoc approaches, including specification of those aspects of the interface that matter to those tools. We assume the host tool/framework already has mechanisms for recognising when a system call is occurring, and what state the process memory/registers are in. We also assume the tool has a means to request system calls (typically emulated by issuing host system calls, perhaps doing a verbatim or modified version of the guest code's system call, e.g. to do guest/host address translation).

Maintaining testability of the overall spec is useful, e.g. using SystemTap or DTrace to check actual system state against what the specification says should happen. Again a proof-of-concept for memory footprints will be provided for extension.

[/research] permanent link contact

Tue, 14 Feb 2017

Custom ELF program headers—what, why and how

A couple of times recently I've found myself wanting to create ELF files with custom program headers. This post explains what that means, why you might want it, why there's a practical difficulty in doing so using a standard linker, and my slightly hacky way of solving it.

Program headers are metadata within ELF files. They describe a contiguous chunk of the file's address space. There are several different kinds of program header, of which the most critical is LOAD, signifying that the relevant part of the file should be loaded (or mapped) into memory, forming a segment within the program's memory image. Besides LOAD, other program headers are used to attach particular meanings to specific ranges within the binary. For example, the DYNAMIC program header tells the loader where the dynamic-linking metadata resides. The INTERP program header identifies to the (in-kernel) loader a string naming the program interpreter (the dynamic linker, which will actually do the loading proper). Some vendor extensions exist for adding features within the loading process, such as GNU_RELRO (which records a range that can be made read-only after relocation, bringing security benefits) or GNU_STACK (which does not identify an associated range, but exists as a flag to tell the loader whether the stack needs to be made executable). If you run readelf -l on some ELF binary, you'll see a dump of its program headers. On my machine I see something like this.

$ readelf -Wl /bin/true
Elf file type is EXEC (Executable file)
Entry point 0x401432
There are 9 program headers, starting at offset 64

Program Headers:
  Type           Offset   VirtAddr           PhysAddr           FileSiz  MemSiz   Flg Align
  PHDR           0x000040 0x0000000000400040 0x0000000000400040 0x0001f8 0x0001f8 R E 0x8
  INTERP         0x000238 0x0000000000400238 0x0000000000400238 0x00001c 0x00001c R   0x1
      [Requesting program interpreter: /lib64/ld-linux-x86-64.so.2]
  LOAD           0x000000 0x0000000000400000 0x0000000000400000 0x0058b4 0x0058b4 R E 0x200000
  LOAD           0x005e10 0x0000000000605e10 0x0000000000605e10 0x000404 0x0005f0 RW  0x200000
  DYNAMIC        0x005e28 0x0000000000605e28 0x0000000000605e28 0x0001d0 0x0001d0 RW  0x8
  NOTE           0x000254 0x0000000000400254 0x0000000000400254 0x000044 0x000044 R   0x4
  GNU_EH_FRAME   0x004b4c 0x0000000000404b4c 0x0000000000404b4c 0x00023c 0x00023c R   0x4
  GNU_STACK      0x000000 0x0000000000000000 0x0000000000000000 0x000000 0x000000 RW  0x10
  GNU_RELRO      0x005e10 0x0000000000605e10 0x0000000000605e10 0x0001f0 0x0001f0 R   0x1

Two reasons for wanting to add custom program headers are to add custom features to the loading process, or to make new uses of features it already provides. My first use case was an instance of the latter: to create a heap area at load time, as an additional segment, rather than the usual way of making explicit syscalls during execution. I was building ELF files for the ppcmem emulated execution environment. This doesn't implement any system calls, but does implement an ELF loader. By adding an extra LOAD program header providing an allocation arena, I could write a malloc() that does not require system calls—its heap arena is mapped at start-up by the ELF loader inside ppcmem. (Although I could have allocated space using a big char array, or by hacking the linker script to leave a big gap, a separate segment is logically cleaner, and a disjoint address range for heap allocations is helpful.)
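
The malloc() itself can then be a trivial bump allocator over the pre-mapped arena. Roughly (a sketch; the __heap_start and __heap_end symbols stand for whatever the linker script actually defines to delimit the extra segment):

#include <stddef.h>

extern char __heap_start[], __heap_end[];    /* assumed defined by the linker script */
static char *next = __heap_start;

void *malloc(size_t size)
{
    size = (size + 15u) & ~(size_t) 15;      /* keep 16-byte alignment */
    if (size > (size_t) (__heap_end - next)) return NULL;
    void *p = next;
    next += size;
    return p;
}

void free(void *p) { (void) p; /* a bump allocator never reclaims */ }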

The second use case for custom program headers comes from liballocs. When passed through the liballocs toolchain, a binary also generates extra metadata binaries which can optionally be loaded at run time to supply extra metadata similar to debugging information. Currently these files live outside the main binary, in a separate /usr/lib/allocsites hierarchy, which is a pain for deployment: if you move or copy an executable, you must move or copy its metadata as an extra step. I'd like to bundle the metadata into the binary somehow, and record its presence using a custom program header.

The way to specify the program headers is in the linker script. The problem with creating custom program headers is that a GNU-style linker's script language requires you to specify the entire set of program headers if you specify any program header at all. You can't just add one extra. Worse, the default set of headers is not even made explicit anywhere. The PHDRS command is what specifies the program headers, but the default linker script (try ld --verbose to see it) doesn't use it. Rather, it falls back on the linker's hard-coded default behaviour, which is to create all the familiar headers that we see above.

If we write our own PHDRS command we can create all the program headers we want—but we have to create them all ourselves, meaning manually creating all the ones that the linker would otherwise create for us. The set of headers created by hard-coded logic in the linker has grown over the years, to include things such as the GNU_STACK and GNU_RELRO headers I mentioned earlier. Re-creating this “standard” set of program headers is therefore quite a detailed platform-specific task, and is subject to change as new features are added; a burden we don't want to take on.
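
For reference, a hand-written PHDRS block looks roughly like the following (heavily abridged, and with illustrative names). The sting is in the tail: once you write it, every output section in SECTIONS needs assigning to its headers, and every “standard” header the linker would have created implicitly now has to be listed by hand.

PHDRS
{
  headers PT_PHDR PHDRS ;
  interp  PT_INTERP ;
  text    PT_LOAD FILEHDR PHDRS ;
  data    PT_LOAD ;
  dynamic PT_DYNAMIC ;
  extra   PT_LOAD ;                 /* the one header we actually wanted to add */
}
SECTIONS
{
  .interp  : { *(.interp) } :text :interp
  .text    : { *(.text .text.*) } :text
  .data    : { *(.data .data.*) } :data
  .dynamic : { *(.dynamic) } :data :dynamic
  .extra   : { *(.extra) } :extra
  /* ...and so on, for every other output section... */
}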

So, I came up with a way of getting the effect we want: “please create this extra program header, in addition to all the normal ones”. It is not pretty and not perfect, but it works as follows. Run the linker once, without any custom program header directives. Remember what program headers the linker generated, and what sections it assigned to them. Dump the linker script used during that link, and annotate it so that each output section is labelled with its program header assignment. Then, apply a diff to the linker script adding only the program header we care about. Finally, re-link with the diffed script.

The obvious downside is that we link twice. We're also a bit fragile to the organisation of the default linker script. Still, I'm prepared to pay these prices. I've implemented this approach with a pleasingly nasty (but small) pile of shell scripts, make rules and m4 macros. It now works pretty well for me. Here's a brief tour.

First, we have some makerules to dump the default linker script.

default-dynexe.lds: $(shell which ld)
    $(LD) --verbose 2>&1 |\
      sed -e '/^=========/,/^=========/!d;/^=========/d'\
          -e 's/\. = .* + SIZEOF_HEADERS;/& _begin = . - SIZEOF_HEADERS;/'\
      > "$@"

(Thanks to Dan Williams for the sed script, which I took from his page about creating a custom dynamic loader, which is well worth a read.)

Next, we have some makerules to turn such a script into an m4'ised macro-script. Essentially, this macroisation adds two “hooks” into the script: one to include a file describing the “normal” program headers and insert any new ones; and one to wrap each output section in a macro invocation that will decorate it with its corresponding program header names. This macroisation rewrite is easily done with regular expressions.

%-dynexe.lds.phdrify.m4: %-dynexe.lds
    cat "$<" | sed 's#SECTIONS#PHDRS\n{\n\tinclude(phdrs_dynexe.inc)\n\tinclude($(srcdir)/phdrs_extra.inc)\n}\nSECTIONS#' | \
    tr '\n' '\f' | sed -r \
    's@(\.[-a-zA-Z0-9\._]+)[[:space:]\f]+(([^\{]*[[:space:]\f]*)?\{[^\}]*\})@expand_phdr([\1], [\2])@g' | \
    tr '\f' '\n' > "$@"

Next, our phdrified linker script contains output section blocks that used to look like this

  .rodata         : { *(.rodata .rodata.* .gnu.linkonce.r.*) }

now wrapped in macro invocations, like this.

  expand_phdr([.rodata], [: { *(.rodata .rodata.* .gnu.linkonce.r.*) }])

Once we have a macroised linker script, we need to m4-process it with some macro definitions that turn it back into the script we want. We need two kinds of definition for this. The first is a list of the program headers in the file as it was originally linked. The easiest way is to use readelf to dump the existing ones. As a wart: since (on my machine at least) the linker doesn't understand the full range of symbolic (named) values for the program header's type field, we use the C preprocessor to macro-expand the symbolic names down to hex literals (as defined in elf.h), which it does understand. The second kind of definition is going to expand each output section to explicitly mention which program headers it belongs to. Note that a section can be in more than one program header. I used a bash script with associative arrays, again operating over readelf output, which handily outputs the mapping from segments to section lists. The script inverts this mapping, into a section-to-segment-list mapping. This generates a single m4 macro whose body is a giant if-else chain testing the argument section name against all the sections in the file, and yielding the list of program headers that the section is in. The first kind of definition, meanwhile, comes out as a chunk of m4-ified text that describes the program headers in the original link, looking something like this.

phdr0 6;
phdr1 3;
phdr2 1;
phdr3 1;
phdr4 2;
phdr5 4;
phdr6 0x6474e550;
phdr7 0x6474e551;
phdr8 0x6474e552;

After all that, we then simply m4-include a second file describing the custom ones, and use m4 to expand out the program header assignments. Our output section from before has now become

  .rodata : { *(.rodata .rodata.* .gnu.linkonce.r.*) } :phdr2

i.e. recording that the read-only data is in the program header named phdr2 (the first LOAD in the dump we saw early on; this is the text segment, which also holds read-only data).

When we re-link with this script, we can add whatever extra linker inputs we want to go into our custom segment, and also apply a diff to the linker script. A simple diff looks like this, where the stuff in sections whose names begin with .insert will end up in a new segment.

--- default-dynexe.lds  2016-09-14 23:12:31.000000000 +0100
+++ insert-dynexe.lds   2016-09-14 23:13:46.000000000 +0100
@@ -187,6 +187,9 @@
   . = ALIGN(64 / 8);
   _end = .; PROVIDE (end = .);
   . = DATA_SEGMENT_END (.);
+  /* srk: new segment! */
+  .insert  : ALIGN(0x1000) { *(SORT_BY_NAME(.insert*)) } :insert
   /* Stabs debugging sections.  */
   .stab          0 : { *(.stab) }
   .stabstr       0 : { *(.stabstr) }

The .insert output section will also be aligned to a 4kB boundary, and be placed in the program header called “insert” by the script. If the input section is allocatable, it will also go into a LOAD header. By varying this patch, you can put whatever section you like in your program header, aligned however you like.

There are some warts in all this. One is that I found increasing the number of program headers would confuse the GNU linker into placing some of its implicitly generated sections, .gnu.build-id and .note.ABI-tag, at a silly offset in the file, and that would mess up the address assignment, by causing the file header no longer to be mapped at virtual address 0. To hack around this, I inserted a one-line sed command which explicitly places the .note.ABI-tag section. Another wart was that the PHDR header was no longer generated properly: it was coming out with an address of 0, not of the executable's start address (0x400000). This turned out to be because no LOAD header was including the region of the file containing the file and program headers themselves. To hack around this, we add a little more rewriting magic to our shell scripts, to add the PHDR and FILEHDR linker attributes to any program header entry with type PT_PHDR, and also to the text segment's LOAD program header (usually the first), then updating the latter's file offset and memory addresses so that the segment actually spans that part of the file.

I mentioned that I wanted to use custom program headers to embed one ELF file (a metadata object) inside another. How can we do this? Firstly, we embed the metadata ELF file into the enclosing ELF file and mark it with a custom program header. Secondly, at run time, when we want to load the metadata, we locate the program header and get a file descriptor pointing at the embedded content. Then we need a dynamic linker that can dlopen() from a file descriptor; we can hack the existing GNU dynamic linker for this fairly easily (not shown, but I've done it; ask me). Finally though, we need to handle the fact that file offsets in the embedded content will not match file offsets on this file descriptor! So we need to translate all the ELF file offsets in the embedded file by a fixed amount. File offsets live in the ELF header, program headers and section headers. (They might also live in other sections, but I'm not aware of any.) I wrote a simple C program to add a fixed offset to these. There's also some guesswork needed to figure out what that offset should be, since the linker will be in control of the in-file placement of the extra output section. This could be made reliable using one more link pass (bringing the total to three).
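
The offset-shifting part is only a few lines. Roughly (a sketch of the idea, not the actual tool; real code would check the ELF class, endianness and bounds first):

#include <elf.h>
#include <stdint.h>

/* add 'delta' to every file offset stored in a 64-bit ELF image at 'buf' */
void shift_file_offsets(unsigned char *buf, uint64_t delta)
{
    Elf64_Ehdr *ehdr = (Elf64_Ehdr *) buf;
    Elf64_Phdr *phdr = (Elf64_Phdr *) (buf + ehdr->e_phoff);
    Elf64_Shdr *shdr = (Elf64_Shdr *) (buf + ehdr->e_shoff);

    for (unsigned i = 0; i < ehdr->e_phnum; ++i) phdr[i].p_offset  += delta;
    for (unsigned i = 0; i < ehdr->e_shnum; ++i) shdr[i].sh_offset += delta;

    ehdr->e_phoff += delta;    /* the header's own offsets move too */
    ehdr->e_shoff += delta;
}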

There's a GitHub repository containing these recipes and examples, here. It's all a bit GNU-specific and underdocumented for now.

All this effort to achieve a fairly simple task: package one ELF object inside another so that the “inner” object is loadable when running the “outer” object. What's another way of looking at this problem? We can think of ELF files as describing an initial state of the program (or, in the case of a library, a fragment of state). What we'd like them to be able to do is specify that this state includes a given file of a particular name and contents. In other words, we'd like them to describe not only aspects of the memory state, but also aspects of the filesystem state. Linkers already do something a bit like this, namely static linking: objects that might not be on the destination filesystem can be bundled into the file, albeit losing their identity as files. Thinking of binaries as “partial initial program states” in this way, there's a clear relationship between ELF executables, Docker images and other collections of memory(-mappable) objects.

Our “embedded ELF file” requirement would be trivial if segments were nameable as files in the first place. This is of course a design feature of Multics, as was dynamic linking. Although the latter has now been picked up by Unix, files and segments have still not been fully re-unified. Maybe it's time? Dynamic linking is one of two key mechanisms, along with filesystems in userspace (whether this or this) that allow us to prototype extensions to Unix without pushing our changes into the kernel. There are usually performance, security or correctness reasons why such extensions are better off ultimately residing at least partly in the kernel. But making a playground entirely in user space is valuable for research and experimentation, and is something I'm working up to. The “embed part of a filesystem inside your binary” idea is one example of the sort of feature that we want to be easily prototyped in such a system. The core features we need (serving the content, and then binding it into the process's namespace) are already provided by the OS, so it's a matter of joining them together.

[/devel] permanent link contact

Mon, 30 Jan 2017

Debugging with the natives, part 2

Some aeons ago, I wrote a piece giving a rough outline of how a native debugger like gdb works, and promised a follow-up that looks at the same pile of techniques in a more principled way. I can't really excuse the delay, but if anyone's still listening, here goes.

Source-level debugging of native code is supported on (I make it) four different levels in a complete hardware–software stack.

We can split these four into two pairs. The first two implement machine-level primitives and their “upward mapping” (virtualisation) into the operating system's process abstraction. The second two implement the source-level view that the programmer usually prefers, again mapping upwards from binary- to source-level.

Let's take the machine-level pair to begin with. From the operating system's point of view, all debugging support is designed around the following principles. (Perhaps I shouldn't say “designed” since in reality they “grew”— and only became principles later.)

Some surprisingly strong properties result from this design. Firstly, debugging can be done from a remote process, perhaps on a separate machine from the debuggee, perhaps even a machine of a different architecture. Secondly, debugging can be done post-mortem. Thirdly, the same infrastructure works for many source languages—albeit trivially so far, since we've only seen how to get an assembly-level view. There are some contrasts here with most language virtual machines (think JVM): these implement debugging using in-VM debug servers. These can work across the network, but don't support post-mortem debugging, and typically bake in concepts from source languages.

That's enough about the machine-level part. To go from machine- or assembly-level debugging to source-level debugging, we need help from the compiler. This is designed around the following principles.

Let's do a more concrete run-through of how it works. So far I've been fairly generic, but let's fix on GNU/Linux as our modern Unix—though all ELF-based systems are pretty similar—and a familiar architecture (x86-64) and specific metadata format (DWARF).

When I compile a program with -g it means “please generate metadata”. First, let's try without.

$ cc -o hello hello.c
$ readelf -S hello | grep debug  # no output! no debugging sections

You can still debug this program, at the assembly level, because the OS debugging mechanisms remain available. It's as if the compiler-generated assembly is code that you wrote manually by yourself. You can set breakpoints, watchpoints, single step, and so on.

$ gdb -q --args ./hello
Reading symbols from ./hello...(no debugging symbols found)...done.
(gdb) break main
Breakpoint 1 at 0x40052a
(gdb) run
Starting program: /tmp/hello 

Breakpoint 1, 0x000000000040052a in main ()
(gdb) disas
Dump of assembler code for function main:
   0x0000000000400526 <+0>:     push   %rbp
   0x0000000000400527 <+1>:     mov    %rsp,%rbp
=> 0x000000000040052a <+4>:     mov    $0x4005c4,%edi
   0x000000000040052f <+9>:     callq  0x400400 <puts@plt>
   0x0000000000400534 <+14>:    mov    $0x0,%eax
   0x0000000000400539 <+19>:    pop    %rbp
   0x000000000040053a <+20>:    retq 

Now let's compile with debug information.

$ cc -g -o hello hello.c
$ readelf -S hello | grep debug
  [28] .debug_aranges    PROGBITS         0000000000000000  00001081
  [29] .debug_info       PROGBITS         0000000000000000  000010b1
  [30] .debug_abbrev     PROGBITS         0000000000000000  00001142
  [31] .debug_line       PROGBITS         0000000000000000  00001186
  [32] .debug_frame      PROGBITS         0000000000000000  000011c8
  [33] .debug_str        PROGBITS         0000000000000000  00001210

What's in these debug sections? There are three main kinds of information. Firstly, there are files and line numbers (.debug_line). These encode a mapping from object code addresses to source coordinates (file, line, column). You can dump it fairly readably, as follows.

$ readelf -wL hello 
Decoded dump of debug contents of section .debug_line:

CU: hello.c:
File name                            Line number    Starting address
hello.c                                        4            0x400526
hello.c                                        5            0x40052a
hello.c                                        6            0x400534
hello.c                                        7            0x400539

Secondly, there is frame information (often this comes out in a section called .eh_frame so I cheated a bit above; to get exactly the above with a current gcc, you should add the -fno-dwarf2-cfi-asm switch). This tells the debugger how to walk the stack. What does walking the stack mean? Roughly it means getting a sequence of stack pointers paired with their program counter values, for each call site that is active on the callchain. The old stack pointer and program counter are always saved somewhere, otherwise we couldn't return from the call. To walk the stack we start from the current “live” register file given to us by ptrace(), which holds the “end” stack pointer and program counter. The DWARF then describes how to “rewind” these register values, and/or any other registers whose job is to record the callchain (rbp on x86-64; other callee-saves are often included too) back to their state at the previous call site in the chain. The description of this unwinding is logically a table, which you can see below for the main function. The cells are expressions describing how to compute the caller's value for a register, here the rbp value (frame pointer) and also the caller's program counter, i.e. the return address (given as ra). Both computations are factored into two steps. Firstly we calculate a “canonical frame address” from the current frame's register values (see the CFA column): it's a fixed offset from rsp or rbp, and is actually a fixed address on the stack, but the expression changes from instruction to instruction as the stack pointer gets adjusted. Secondly we obtain the saved values we want by reading from the stack at fixed offsets from that address (c-8 means 8 bytes down). This factoring helps compactness, because the CFA-relative offsets don't change when the stack pointer moves; only the CFA column needs to describe that. However, although “stored at some offset from the CFA” covers a lot of cases, sometimes more complex computations are required, which usually appear as DWARF bytecode expressions.

$ readelf -wF hello
00000088 000000000000002c 0000001c FDE cie=00000070 pc=0000000000400526..000000000040053b
   LOC           CFA      rbp   ra    
0000000000400526 rsp+8    u     c-8 
0000000000400527 rsp+16   c-16  c-8 
000000000040052a rbp+16   c-16  c-8 
000000000040053a rsp+8    c-16  c-8 
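
To make that concrete, here's a hypothetical sketch (not taken from any real debugger) of applying the row for pc 0x40052a above. Here read_word() stands for whatever memory-read primitive the debugger has (ptrace(), a core file, ...), and regs is the register set it already holds; both are assumptions for illustration.

/* Unwinding one step using the row for 0x40052a (sketch).
 * read_word() is an assumed helper that reads one word of debuggee memory. */
uintptr_t cfa       = regs.rbp + 16;        /* CFA column: rbp+16 */
uintptr_t caller_pc = read_word(cfa - 8);   /* ra column: c-8, i.e. the saved return address */
uintptr_t caller_bp = read_word(cfa - 16);  /* rbp column: c-16 */
uintptr_t caller_sp = cfa;                  /* by convention the CFA is the caller's rsp at the call site */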

The .debug_info section is the biggie. It describes the structural detail of the source program along both source and binary dimensions. It has a list of source files, but also a list of compilation units. The latter is where most of the structure is. It describes functions/methods, data types, and all the language-implementation decisions that the compiler took when generating binary code: how data types are laid out, which registers or stack slots hold each local variable over its lifetime, and so on. Although not shown much in the simple example below, addresses of program variables are described in a Turing-powerful stack machine language which is essentially a bytecode; the DW_OP_call_frame_cfa below is one operation, which simply says “push the address of the frame base, as recorded by the frame info”. The tree-like structure of the information also describes detailed static structure of code, including function inlining, the in-memory locations corresponding to particular lexical blocks in the code, and so on. (It's worth asking whether DWARF info might usefully bundle the source code itself. I've never seen this done, but it would make a lot of sense to me.)

$ readelf -wi hello
Contents of the .debug_info section:

  Compilation Unit @ offset 0x0:
   Length:        0x8d (32-bit)
   Version:       4
   Abbrev Offset: 0x0
   Pointer Size:  8
 <0><b>: Abbrev Number: 1 (DW_TAG_compile_unit)
    <c>   DW_AT_producer    : (indirect string, offset: 0x2f): GNU C 4.9.2 -mtune=generic -march=x86-64 -g -fno-dwarf2-cfi-asm -fstack-protector-strong
    <10>   DW_AT_language    : 1        (ANSI C)
    <11>   DW_AT_name        : (indirect string, offset: 0x88): hello.c
    <15>   DW_AT_comp_dir    : (indirect string, offset: 0xb5): /tmp
    <19>   DW_AT_low_pc      : 0x400526
    <21>   DW_AT_high_pc     : 0x15
    <29>   DW_AT_stmt_list   : 0x0
 <1><2d>: Abbrev Number: 2 (DW_TAG_base_type)
    <2e>   DW_AT_byte_size   : 8
    <2f>   DW_AT_encoding    : 7        (unsigned)
    <30>   DW_AT_name        : (indirect string, offset: 0x0): long unsigned int
 <1><34>: Abbrev Number: 2 (DW_TAG_base_type)
    <35>   DW_AT_byte_size   : 1
    <36>   DW_AT_encoding    : 8        (unsigned char)
    <37>   DW_AT_name        : (indirect string, offset: 0xa2): unsigned char
 <1><3b>: Abbrev Number: 2 (DW_TAG_base_type)
    <3c>   DW_AT_byte_size   : 2
    <3d>   DW_AT_encoding    : 7        (unsigned)
    <3e>   DW_AT_name        : (indirect string, offset: 0x12): short unsigned int
 <1><42>: Abbrev Number: 2 (DW_TAG_base_type)
    <43>   DW_AT_byte_size   : 4
    <44>   DW_AT_encoding    : 7        (unsigned)
    <45>   DW_AT_name        : (indirect string, offset: 0x5): unsigned int
 <1><49>: Abbrev Number: 2 (DW_TAG_base_type)
    <4a>   DW_AT_byte_size   : 1
    <4b>   DW_AT_encoding    : 6        (signed char)
    <4c>   DW_AT_name        : (indirect string, offset: 0xa4): signed char
 <1><50>: Abbrev Number: 2 (DW_TAG_base_type)
    <51>   DW_AT_byte_size   : 2
    <52>   DW_AT_encoding    : 5        (signed)
    <53>   DW_AT_name        : (indirect string, offset: 0x25): short int
 <1><57>: Abbrev Number: 3 (DW_TAG_base_type)
    <58>   DW_AT_byte_size   : 4
    <59>   DW_AT_encoding    : 5        (signed)
    <5a>   DW_AT_name        : int
 <1><5e>: Abbrev Number: 2 (DW_TAG_base_type)
    <5f>   DW_AT_byte_size   : 8
    <60>   DW_AT_encoding    : 5        (signed)
    <61>   DW_AT_name        : (indirect string, offset: 0xb0): long int
 <1><65>: Abbrev Number: 2 (DW_TAG_base_type)
    <66>   DW_AT_byte_size   : 8
    <67>   DW_AT_encoding    : 7        (unsigned)
    <68>   DW_AT_name        : (indirect string, offset: 0xb9): sizetype
 <1><6c>: Abbrev Number: 2 (DW_TAG_base_type)
    <6d>   DW_AT_byte_size   : 1
    <6e>   DW_AT_encoding    : 6        (signed char)
    <6f>   DW_AT_name        : (indirect string, offset: 0xab): char
 <1><73>: Abbrev Number: 4 (DW_TAG_subprogram)
    <74>   DW_AT_external    : 1
    <74>   DW_AT_name        : (indirect string, offset: 0xc2): main
    <78>   DW_AT_decl_file   : 1
    <79>   DW_AT_decl_line   : 3
    <7a>   DW_AT_prototyped  : 1
    <7a>   DW_AT_type        : <0x57>
    <7e>   DW_AT_low_pc      : 0x400526
    <86>   DW_AT_high_pc     : 0x15
    <8e>   DW_AT_frame_base  : 1 byte block: 9c         (DW_OP_call_frame_cfa)
    <90>   DW_AT_GNU_all_tail_call_sites: 1
 <1><90>: Abbrev Number: 0

That's it for the tour. Let me finish with some reflections on what's good and bad about this way of doing things.

Descriptive debugging: the good

Why do debugging this convoluted, long-winded way, with all this metadata, instead of the apparently simpler VM way of doing things? In VMs, debug servers are integrated into the runtime, and offer a fixed, high-level command interface. This allows the VM's compile-time implementation decisions to stay hidden. There is no need to tell the debugger how storage is laid out, where local variables live, and so on, because the command interface is pitched at a higher level of abstraction; those details remain internal to the VM. This is convenient for the VM implementer, since generation of debugging information is onerous to implement. But it also requires cooperation from the debuggee, couples debuggers to a fixed command language or wire protocol, and presents strictly less information to the developer. While VM debuggers are designed around the abstraction boundaries of the source languages, metadata-based debugging actively enables descending through these boundaries. It is sometimes very useful for debugging tools to expose implementation details in this way. The most obvious case is when faced with a compiler or VM bug; the user would like to “shift down” to the lower level to inspect the assembly code or VM state. At other times, there are performance bugs that the developer has a hunch are about cache or paging effects; being able to see the raw addresses and raw memory contents can help here, even when the program is running on a VM.

Being highly descriptive, debugging metadata documents a large number of implementation decisions taken by the compiler, so is useful not only to debuggers but also to profilers, language runtimes (C++ exception handling is usually built on DWARF frame information), other dynamic analysis tools such as the Valgrind family, and so on.

Debugging optimised code (without deoptimising)

Debugging metadata must describe optimised code. By contrast, VM debug servers typically arrange that debug-server operations only need to deal with unoptimised stack frames and at most simply-generated code (e.g. from a template-based JIT). Confusingly, even the “full-speed debugging” feature of HotSpot uses dynamic deoptimisation to get back to unoptimised code—the earlier approach was to run the whole program under the interpreter whenever you wanted a debuggable execution. In general, a debuggable VM instance must either refrain from optimisation, or know how to dynamically undo that optimisation when a debugger is attached. So, dynamic deoptimisation is not exactly “full speed”—unlike with native debuggers, execution still slows down significantly when a debugger is attached. Having the VM implement debug operations only over unoptimised code is a restriction that helps make the debug server simple, at some cost in debug-time performance.

The flip side is that VM debugging is pretty good at precisely maintaining the source-level abstraction (modulo VM bugs), without complicating the task of implementing optimisations. Meanwhile, in Unix-land, the debugging experience remains best-effort and only as good as the compiler-generated metadata, which is sometimes wrong or incomplete following complex transformations. When optimisation and debuggability are in tension, debuggability usually takes the hit, so a smooth debugging experience still sometimes relies on a separate unoptimised “debug build”. Tail call optimisations are a classic debugging-hindering optimisation, since they rely on eliding stack frames, meaning the debugger cannot know how many recursive calls are logically on the (source-level) stack. Instruction scheduling is another: the order of operations in the executed code need not match source program order, and this can make for debug-time confusion.
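
As a tiny, made-up illustration of the tail-call case: compiled with something like gcc -O2, the call from intermediate() to helper() below typically becomes a plain jmp, so a breakpoint inside helper() shows a backtrace in which intermediate()'s frame has vanished. (The noinline attributes are only there to stop inlining from hiding the effect.)

__attribute__((noinline)) int helper(int x)       { return x * 2; }
__attribute__((noinline)) int intermediate(int x) { return helper(x + 1); } /* tail call: emitted as a jmp */
int main(void) { return intermediate(20); }  /* a backtrace in helper() shows main() calling helper() "directly" */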

The control problem

Some other properties are incidental but tend to be true of current Unix-style debugging. Debugger support for exercising the target language's control features (exceptions, threads, allocations, ...) is uneven, because this can't be described purely as metadata; there needs to be some interaction with the host runtime. DWARF and similar debugging information is good at the “statics” of describing how to decode the program state, but not good at describing the protocols of interaction with a language runtime, necessary for performing operations such as spawning a thread, allocating an object or throwing an exception. These tend to be difficult unless they happen to be cleanly exposed as entry points in the language runtime. In practice debuggers usually achieve these things by having magic knowledge of particular implementations.

At least one semi-portable interface has emerged with the aim of encapsulating run-time control operations for debuggers' benefit. I'm thinking of libthread_db, best described by Joe Damato's excellent article. Unfortunately it's an abomination, because it violates the principle that implementation details are described by architecture-independent metadata. An odd but cleaner and more consistent alternative would be to bundle snippets of DWARF bytecode for doing these runtime interactions—perhaps in the debug information of a language runtime, either simply calling into the runtime (for cleanly abstracted operations) or doing something more complex. But that is only a technical possibility; there are no proposals or working demos of that as far as I'm aware (maybe I'll make one). This might sound wacky, but if you know about the early history of Java, in Oak and the Green Project, you'll see a certain uncanny similarity in these ideas.

Levels of abstraction

Debugging at multiple levels of abstraction is a neat facility of Unix-style debugging, but is also a difficult power to control. It can be useful to switch down to the assembly-level view, or to switch languages, but this capability doesn't generalise to the case where many abstraction layers are built up within the same language (think C++). The debugger will let you “continue to the next source line”, but it doesn't know how to keep you at the same abstraction level. If the next source line is deep inside a library implementing something fairly basic like a smart pointer, it will skip only as far as that line, whereas you probably wanted to stay roughly at the same level of abstraction, or perhaps within the same codebase. Things get particularly bad when there is a lot of inlining (again with C++). The traditional “step over” and “step into” are hints at this need, but are too crude.

Doing better is currently beyond the debugger's ken, but this problem could be solved: perhaps by bringing in the knowledge of libraries and source file trees that the debugger already has, or perhaps most simply by allowing programmers to manually mark out the boundaries between layers. This could be a simple partitioning over source files and binary objects, or could be something more complex, perhaps sensitive to calling context or argument values (consider the case of the same library used from two places in the same application). “Next”-style operations could then be defined in terms of these layers. I'd love to see this, although the details would take a lot of working out.

To be continued?

There's plenty of research to be done on more debuggable language implementations. This research area doesn't seem to get the attention it deserves. One problem is that debugging is usually an afterthought for researchers, even though it is essential for programmers. Another problem is that not many PL researchers have sight of the design space; they're familiar with either the Unix-style approach or the VM-style one. I hope that in the future we can figure out how to get the best of both worlds.

[/devel] permanent link contact

Wed, 21 Sep 2016

Project suggestion: run-time type information in the FreeBSD kernel

It's past time I wrote up some Part II project suggestions. This post is about a new one. See also one from last year about a debuggable implementation of OCaml, also my suggestions from 2014, another from 2014 and other standing suggestions.

My liballocs project is a library and toolchain extension that allows tracking of run-time allocation types in large C programs, precisely and efficiently. On top of this, the libcrunch system uses this information to do run-time type checking, such as checking that a pointer cast matches the type of the actual pointed-to object. The idea is to make C coding a lot more easily debuggable, by providing clean failure messages at run time, rather like ClassCastException or ArrayIndexOutOfBoundsException in Java. There are various other applications in debugging and program comprehension, since liballocs effectively adds an oracle capable of answering the question “what is on the end of this pointer?”.

Both of these systems have been developed for user-space code but could usefully be applied in the kernel. The toolchain extensions, which gather type information from binaries and analyse allocation sites in C source code, should work with kernel code mostly unchanged. However, the liballocs runtime is written against user-level interfaces and cannot be used as-is.

There are two obvious approaches to creating a liballocs runtime suitable for the kernel.

One is simply to modify the existing liballocs runtime to support kernel-side interfaces. This should yield a liballocs.a library that can be linked into the kernel, adding code to observe the kernel's allocations and maintain type metadata on them as it executes. The main differences will be how the runtime allocates memory, and how it locates and loads type information. Likely, the latter will be trivial (the type information can simply be linked in to the kernel), except perhaps for loadable modules (they must be created with care, so that they contain any additional type information). The former is more complex, since the current runtime makes extensive use of Linux's unreserved virtual memory (via MAP_NORESERVE) which is not available as such in the FreeBSD kernel, although equivalent alternatives are likely possible with some hacking.
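
(For a flavour of the issue: the user-space runtime reserves huge, sparse, lazily-committed regions with something like the following Linux-specific sketch. The size and the function name here are illustrative only, not taken from liballocs, and there is no direct in-kernel equivalent.)

#include <sys/mman.h>

void *reserve_index_region(void)
{
    /* Illustrative only: a huge virtual reservation, committed lazily as it is touched. */
    return mmap(NULL, (size_t) 1 << 38 /* 256 GiB of address space */,
                PROT_READ | PROT_WRITE,
                MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
}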

The other option is to use DTrace to write a non-invasive “monitor” process, started early during boot, that captures equivalent information to the liballocs runtime. One could then implement a client library which mimics the liballocs API and supports the same kinds of query. This approach is likely simpler to get started, and allows for a completely external, non-invasive client that can be used without recompiling the kernel. However, it is less reliable because it is likely to miss certain kernel allocation events, such as those happening early during kernel initialisation, or those that DTrace drops because of buffer contention. These will lead to incomplete type information. It is also unlikely to be able to answer queries coming from the kernel itself, so is useful for debugging and comprehension tasks (queried by a user-space program) but not for run-time type checking.

Both versions will require careful understanding of the FreeBSD kernel's allocation routines, for which some familiarisation time must be allowed. The DTrace option would be a good vehicle for exploring this, after which an executive decision could be made about whether to switch to the full in-kernel option.

[/research] permanent link contact

Thu, 25 Feb 2016

Debugging with the natives, part 1

By popular request, I'm going to run a couple of articles explaining how native Unix debuggers like gdb or lldb work. In this one, we'll take a quick cross-section of what's going on. I'll say “gdb” but I could be talking about any similar debugger.

Starting things up

When you start a program in a debugger like gdb, it does a variation of the usual fork()/exec() trick, in which it sets itself up as a tracer of the child process before it exec()s the program you want to run. Alternatively, the same API can attach to a process already running. Either way, the effect is that any signals received by the “tracee” process will be delivered first to the debugger, and any system calls made by the tracee will also generate a signal. These signals can be inspected, discarded or modified as the debugger wishes, using the ptrace() API. In effect, the process (tracee) and the debugger (tracer) execute by mutual hand-off, suspending and resuming each other.
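
In sketch form (Linux-flavoured, error handling omitted; the function name is made up), the start-up dance looks roughly like this.

#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

pid_t start_tracee(char **argv)
{
    pid_t child = fork();
    if (child == 0)
    {
        ptrace(PTRACE_TRACEME, 0, NULL, NULL);  /* ask to be traced by our parent */
        execvp(argv[0], argv);                  /* on success, the child stops with SIGTRAP */
        _exit(127);                             /* only reached if exec failed */
    }
    int status;
    waitpid(child, &status, 0);                 /* the debugger wakes up here at the exec stop */
    /* ... now set breakpoints, then PTRACE_CONT ... */
    return child;
}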

Walking the stack

Using the same API, gdb can access the raw memory image of the debugged program, and its register file. (Note that when gdb is active, the program is suspended, so its registers are saved in memory.) It can use this to walk the stack and print a backtrace: the register file gives us the program counter and stack pointer. If we have a nice simple stack which saves frame pointers, it simply walks the chain in memory. To print symbolic function names, it can use either the symbol table (if we must) or the compiler-generated debugging information (better; more on that in a moment).
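
A minimal sketch of such a walk, assuming x86-64 Linux, code compiled to keep frame pointers, and a debuggee already stopped under ptrace():

#include <stdio.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/user.h>

void walk_stack(pid_t pid)
{
    struct user_regs_struct regs;
    ptrace(PTRACE_GETREGS, pid, NULL, &regs);
    unsigned long pc = regs.rip, fp = regs.rbp;
    while (fp != 0)
    {
        printf("pc = %#lx\n", pc);
        pc = ptrace(PTRACE_PEEKDATA, pid, (void *)(fp + 8), NULL); /* saved return address */
        fp = ptrace(PTRACE_PEEKDATA, pid, (void *) fp, NULL);      /* caller's saved rbp */
    }
}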

Reading local state

To print a local variable, the debugger uses the same compiler-generated debugging information to look up where that variable is currently located. This lookup is keyed on the program's current program counter, to handle the way that locals get moved around between registers and the stack from instruction to instruction. The result of the lookup might be a memory location, a register number, or even, sometimes, something composite like “first 8 bytes in register X, rest in memory location Y”.

Setting breakpoints

To set a breakpoint on a particular source line, gdb uses a different part of the debugging information which encodes a mapping between binary locations (program counter values) and source file/line/column coordinates. A small subset of program counter values are specially identified as “beginning of statement” instructions; these are normally the ones the user wants to break on. Sometimes, hardware debug registers can be used to implement breakpoints (up to some maximum); the CPU generates a trap on reaching the program counter value stored in the register. More classically, the “soft” method is for the debugger to overwrite the instructions with trap instructions, saving the old instruction. This trap will cause a signal to be generated when the program executes it.
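
A sketch of the “soft” method on x86-64 Linux, where 0xcc is the int3 opcode; the function name and interface are made up for illustration.

#include <sys/ptrace.h>
#include <sys/types.h>

long set_soft_breakpoint(pid_t pid, unsigned long addr)
{
    long orig = ptrace(PTRACE_PEEKTEXT, pid, (void *) addr, NULL);  /* save the original word */
    long patched = (orig & ~0xffL) | 0xcc;                          /* int3 in the low byte */
    ptrace(PTRACE_POKETEXT, pid, (void *) addr, (void *) patched);
    return orig;                                                    /* keep this to undo the patch later */
}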

Resuming from breakpoints

To resume from a software breakpoint, a bit of juggling is required. Recall that we overwrote the original instruction. We can replace it, then ask the OS to single-step the program (using the hardware's single-step mode), then re-set the trap. This is racy in multithreaded programs when the other threads aren't stopped. So instead a good debugger will emulate the instruction itself! This means using an internal interpreter for the target instruction set, backed by the memory and register file access that it already has via ptrace().
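
For completeness, here is a sketch of the classic (racy) un-patch, single-step, re-patch sequence on x86-64 Linux; saved_word is whatever the hypothetical set_soft_breakpoint() above returned.

#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/user.h>
#include <sys/wait.h>

void resume_from_breakpoint(pid_t pid, unsigned long addr, long saved_word)
{
    struct user_regs_struct regs;
    int status;
    ptrace(PTRACE_GETREGS, pid, NULL, &regs);
    regs.rip = addr;                                                  /* back up over the trap byte */
    ptrace(PTRACE_SETREGS, pid, NULL, &regs);
    ptrace(PTRACE_POKETEXT, pid, (void *) addr, (void *) saved_word); /* restore the real instruction */
    ptrace(PTRACE_SINGLESTEP, pid, NULL, NULL);
    waitpid(pid, &status, 0);                                         /* wait for the single-step stop */
    ptrace(PTRACE_POKETEXT, pid, (void *) addr,
           (void *) ((saved_word & ~0xffL) | 0xcc));                  /* re-arm the breakpoint */
    ptrace(PTRACE_CONT, pid, NULL, NULL);
}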

Conditional breakpoints and expression evaluation

If we have a conditional breakpoint, we also need to be able to evaluate expressions when we take the trap (and silently resume if the expression is false). For this, the debugger has a parser and interpreter for each source language it supports. These interpreters are very different from an ordinary interpreter: the expression's environment comes from the binary contents of the debugged program's image, rather than being a structure the interpreter itself maintains. (The interpreter keeps a small amount of local state, such as for subexpression results, but it's mostly in the debuggee.)

Calling functions

To evaluate expressions containing function calls, we need to be able to run code in the debuggee. To do so, we craft a special frame on the debuggee's stack, a safe distance away from its actual top-of-stack. Having evaluated the arguments as usual, we put them in the relevant registers (saving what was there), tweak the saved stack pointer to point to this crafted state, set the instruction pointer to the callee, and set the debuggee going again. The function runs as normal, and on return it uses an on-stack return address that the debugger supplied. This points at a breakpoint-like instruction that raises a signal, returning control to the debugger. This then cleans up the stack as if nothing happened, and resets the registers accordingly. (Disclaimer: I haven't actually dug through gdb's code here, so I might have one or two things subtly wrong. At least, the above is how I'd implement it. Let me know if you know better.)

It goes on...

There's lots more, but if you understand the above, you've probably got the gist.

All that seems like a big bag of tricks. What are the principles at work in this design? Although the system has “grown” rather than being “designed”, and has a generous share of quirks and unstandardised corners, there are some surprisingly strong principles lurking in it. I've written a little about these in my paper at Onward! last year. Next time, I'll cover briefly some of the same observations, and a few others, hopefully thereby re-explaining how debuggers work, a bit more systematically.

[/devel] permanent link contact

Wed, 30 Sep 2015

Project suggestion: an observable OCaml, using liballocs

It's the season to suggest student projects, and this post is about one which follows up a series of blog posts (1, 2, 3, 4). Feel free to see also my suggestions from last year, another from last year and other standing suggestions.

This post is about a suggested project which will produce a working implementation of the OCaml programming language, using an existing front-end but a very different back-end.

Unlike the mainstream implementation, this implementation will be optimized for debuggability, or more generally “observability”. The compilation strategy will be to translate to C, hence obtaining the debuggability (and, to some as-yet-unknown extent, the performance) of the C compilation toolchain.

This means at least three key differences. Firstly, all locally-bound names will map to C local variables. Secondly, OCaml's lexical closures will be implemented in a highly compatible way, perhaps using the technique of Breuel (which is already implemented in the GNU C compiler, but with stack allocation, whose lifetime semantics are not appropriate for OCaml). Thirdly, all allocations will have associated type metadata at run time, using liballocs. This will allow parametric polymorphism to be “seen through” by a debugger—so it can tell you, for example, that some stack frame of a function whose type is 'a -> 'a is actually activated with 'a as int, say. (This may or may not be the right formulation of “seeing through”—stronger and weaker versions are possible; ask me!)
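
(For a flavour of the closure technique, here's a hypothetical GNU C example; note how GCC materialises the function pointer via a trampoline on the enclosing frame's stack, which is exactly the lifetime problem mentioned above.)

/* GNU C nested function: add_k closes over k. The pointer passed to apply_twice()
 * points at a trampoline on example()'s stack frame, so it must not outlive
 * example() -- hence the need for a different allocation strategy in OCaml. */
int apply_twice(int (*f)(int), int x) { return f(f(x)); }

int example(int k)
{
    int add_k(int y) { return y + k; }
    return apply_twice(add_k, 1);
}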

Key ingredients are the following:

Optional “extension”-style challenges might include the following:

Evaluation of the system is mostly via performance—the goal would be to get within a reasonable factor of the OCaml native-code compiler. We can also do some experiments to measure observability of our system. One simple experiment might interrupt both versions of a program at randomly chosen call sites and count how much of the local state we could recover (number of frames in the backtrace, number of locals in those frames; perhaps even doing a diff on their values). Call sites are a good choice because all but one frame of the stack is always stopped at one, and instrumenting the native code so that we can do lock-step breakpointing, by counting call invocations, should not be too difficult (though it might mean disabling inlining, which does affect what is being measured). Usually there'll be no contest, since the original OCaml native compiler doesn't let you observe locally bound values at all. There is, however, a development branch which does, and comparing with the bytecode debugger (ocamldebug) is also a good bet.

[/research] permanent link contact

Wed, 16 Sep 2015

Partially evaluating a bytecode interpreter using C++ templates

I should start this by saying that I'm definitely not a C++ metaprogramming genius. People have done outrageously clever and diverting things with C++ templates, and I am not about to outdo anyone. Nevertheless, this might be interesting.

I was facing the task of taking code in a simple but obscure stack machine language—the stack machine from DWARF, if you couldn't guess—and turning it into vaguely efficient machine code. My thinking was spurred on by systems like Truffle, or PyPy's tracing JIT, which produce compiled code from only source code and an interpreter for the source language. Unlike those systems, I didn't need dynamic compilation—offline is fine. But also, I wanted a system with easy support for C and C++ intrinsics, zero runtime baggage, and something I could prototype in a day or so. It turned out that writing the interpreter in C++ templates, and using the C++ compiler as a partial evaluator, was reasonably successful at all these. Here's how it works.

To start with, I simplified right down from DWARF to an analogous but much simpler stack machine where all operands are passed on the stack. We model instructions using simply an enumeration. This is not essential, or even desirable, except for exposition.

/* instruction -- a simple enumeration */
enum instr
{
    PLUS, /* add top two stack slots and push result */
    BZ,   /* pop val and dest off stack; if val is zero, branch to dest */
    PEEK, /* pop an address off the stack and push its contents */
    DUP2  /* clone the top two elements of the stack */
};

[I've edited this slightly from my initial version of this post, to add the DUP2 instruction. This allows me to include a slightly more meaningful example program—thanks to Ross Bencina for prompting that.]

The four instructions are representative of the classes of operation I care about: arithmetic, control, C intrinsic and stack manipulation. They're not a very complete machine, but they suffice to illustrate things. To witness the power of this somewhat-operational stack machine, a simple program which combines all four opcodes is as follows.

typedef program< PLUS, program< DUP2, program< BZ, program< PEEK, void > > > > prog;

This program adds two numbers taken as input, and if they sum to something non-zero, it interprets that as an address and peeks it, returning what's stored there. Otherwise it returns zero. Note that since we only have a computed branch instruction, not a direct branch, the branch target exists as data in the initial stack. From top to bottom, our initial stack must be [arg1, arg2, 4], where arg1 and arg2 are the two input values, and 4 is the program counter to branch to (the end of the program). We don't need to store the literal zero in this case (I'll let you figure out why).

So, clearly, a program is a sequence of instructions. We model it in the C++ compiler as a template-level sequence, with the opcode as one template argument and the program “tail”, i.e. the following instructions, as another template argument.

/* program -- a sequence of opcodes */
template <unsigned Op, class Next>
struct program
{
    static const long val = Op;
    typedef Next tail;
};

In effect, this gives us a “linked list” structure but at the template level. We can write some simple helper functions over this kind of structure, using a functional style in which template specialization is how we do case-splitting and template instantiation is how we do recursion.

template <class Seq> 
struct sequence_length
{
    static const long value = 1 + sequence_length<typename Seq::tail>::value;
};

template <>
struct sequence_length<void>
{
    static const long value = 0;
};

/* Utility for getting the tail of a type-level sequence. */
template <typename Seq>
struct tail_of
{
    typedef typename Seq::tail type;
};
template <>
struct tail_of<void>
{
    typedef void type;
};

/* Utility for getting the nth of a sequence, then the rest;
 * -- it's like "let s = drop n Seq in (hd s, tl s)" */
template <unsigned n, class Seq> // n > 0 case
struct nth
{
    static const long val = nth<n-1, typename tail_of<Seq>::type>::val;
    typedef typename tail_of< nth<n-1, typename tail_of<Seq>::type> >::type tail;
};
/* termination case: n == 0 */
template <class Seq>
struct nth<0, Seq>
{
    static const long val = Seq::val;
    typedef typename tail_of<Seq>::type tail;
};

The main use of sequences is to model the program itself, a sequence of instructions. We also use them to model the program counter, encoding it simply as the “tail” of the program, i.e. the subsequence of instructions from the current position to the end of the program. (This won't stop us from jumping backwards, because we'll always pass around the whole program too. I'll come on to jumps a bit later.)

What's our program state? It's a program counter and an operand stack. Since, in general, our stack will not be statically known, we have to model it at the value level, passing it around as “real” arguments at run time rather than template arguments. (I did create a version where a parallel family of template specializations could be used to encode statically known stacks. That's neat because you can get the C++ compiler to partially evaluate the stack-machine program for known input. But I'm skipping that extra complexity here.)

To actually represent the stack at run time we'll use a plain old array of long. I'll show you that in a second. Before I do, let's start to piece together what our evaluator looks like. It's a template that takes arguments standing for the whole program and the program tail, i.e. program counter. The whole program is always statically known, so is a template argument. There are finitely many possible values of the program counter, so we can make the compiler enumerate them; this means the program tail can be a template argument too. (This works nicely with how we do branches; I'll come on to that.)

/* States in a computation */
template <
    /* Program */ typename Program,
    /* ProgramTail */ typename ProgramTail    /* i.e. a suffix of the Program, 
                                                 beginning with the next instruction to run. */
>
struct state
{
    static long eval(/* some args... */)
    {
        /* figure out our next state and delegate to it... */
    }
};

We also want each state to support an eval() method which executes the program's continuation from that state. Most of the time this will work by calculating the next state and delegating to its eval() method. For states that are “end states”, like when we hit the end of the program, we'll instead return something—the value on the top of the stack. We can handle this case using another specialization (which we'll define later).

A simple opcode's eval() method looks something like the following, which implements PLUS. We pass the stack as an argument to the eval() method, in the form of two pointers, bottom and top.

template <
    /* Program */ typename WholeProgram,
    /* ProgramTail */ typename ProgramTail
>
struct state< WholeProgram, program<PLUS, ProgramTail> >
{
    static long eval(long *bottom, long *top)
    {
        top[-1] = top[0] + top[-1];
        return state<WholeProgram, ProgramTail >::eval(bottom, top - 1);
    }
};

Here we destructively update the stack, decrementing the top. Note that the whole interpreter works in continuation-passing style, and the current stack is always passed around explicitly and linearly—all our eval() calls are tail calls. Currently we malloc() the stack eagerly to a large-ish size, and abort on overflow. But linearity means we could even start small and then realloc() it if it overflowed. A neater trick would be to make it an actual stack that is lazily allocated by the OS, using mmap()—this would give us gradual allocation without cluttering our instruction stream with overflow checks or realloc() calls (but I haven't implemented this yet).

That's the basic idea of our evaluator: each opcode can be implemented by a template specialization which gets elaborated differently according to its position in the program. What about branches? We can handle branches by a single specialization for the case where the PC is not known statically. As before, we deal with the not-statically-known value by accepting it as a method (value) argument rather than a template (type) argument. The simplest and fastest way to do this is to code up a jump table. For this to work, we also have to bound the size of programs.


struct run_time_pc {};

template <
    /* Program */ typename WholeProgram
>
struct state< WholeProgram, run_time_pc >
{
    static long eval(long *bottom, long *top, long pc)
    {
        switch (pc)
        {
#define CASE(n) \
case n: return state< \
        WholeProgram, \
        program< \
            /* We build the tail starting from the nth, using the definitions in `nth<>'. */ \
            nth<(n), WholeProgram>::val, \
            typename nth<(n), WholeProgram>::tail \
        > \
    >::eval(bottom, top);
            /* ... CASE(0), CASE(1), and so on, up to MAX_PROGRAM_SIZE ... */
            default:
                assert(pc >= MAX_PROGRAM_SIZE); 
                warnx("branched out of bounds");
                return -1; /* placeholder error value */
        }
    }
};

Now, to implement a branch opcode, all we need to do is have it call this run_time_pc specialization's eval(), which takes an extra argument: the destination program counter, passed as a value. Passing it as a value is what allows the jumping instruction to actually decide the destination at run time. At worst, the C++ compiler compiles this switch statement down to a jump table which can jump to the code implementing any of the program's instructions. If this happens, the compiler must enumerate all these destinations up-front in order to build the jump table, hence why we had to bound the program size. At best, though, if we're doing a direct branch, the branch target will be statically known in the caller of this eval() template, allowing the C++ compiler to optimise away the dead switch cases, replacing the jump table with a direct branch. (In our current stack machine, all branch destinations are encoded on the stack, so are potentially data-dependent, meaning we don't hit this direct-branch case. But in the actual DWARF stack machine, we can encode direct branch targets into the instruction, so we will hit this case.)

Since each entry in this jump table jumps to a specific part of the program, we immediately transition back to having a statically known PC value. Our implementation of branches is completely self-contained within this one mechanism.

Actually, make that two mechanisms, because I want to add an alternative. We can avoid hackily bounding the size of the program, if we don't mind rewriting our jump table as a chain of if–else branches. We can use templates to generate this chain of branches, for any length. To do this, we parameterise our run_time_pc type on the upper bound of what program counter value we might be jumping to. Initially this is the length of the program; recursive elaborations' if–else tests whittle this down to zero, the lowest-numbered place we could jump to. In this way, the compiler effectively builds our switch statement (as an if–else chain) for us. As before, if it's a direct branch, all the tests will get elaborated away. If it's not, whether we get a nice jump table depends on how clever the compiler is. For me, sadly g++ seems not to generate a jump table, even at -O3, but perhaps it can be coaxed into it somehow (let me know if you know!).

template <unsigned UpperBound> 
struct run_time_pc {};

template <
    /* Program */ typename WholeProgram,
                  unsigned MaxPc
>
struct state< WholeProgram, run_time_pc<MaxPc> >
{
    /* This is fast only if the C++ compiler can turn if--else chains back into a jump table. */
    static long eval_if_else(long *bottom, long *top, long pc)
    {
        if (pc == MaxPc) return state<
                WholeProgram,
                program<
                    /* We build the tail starting from the nth, using the definitions in nth<>. */
                    nth<MaxPc, WholeProgram>::val,
                    typename nth<MaxPc, WholeProgram>::tail
                >
        >::eval(bottom, top);
        else return state<WholeProgram, run_time_pc<MaxPc - 1> >::eval_if_else(bottom, top, pc);
    }
};
template <
    /* Program */ typename WholeProgram
>
struct state< WholeProgram, run_time_pc<0> >
{
    static long eval_if_else(long *bottom, long *top, long pc)
    {
        if (pc == 0) return state<
                WholeProgram,
                program<
                    /* ... */
                    nth<0, WholeProgram>::val,
                    typename nth<0, WholeProgram>::tail
                >
        >::eval(bottom, top);
        else abort();
    }
};

I promised to show how we deal with the end of the program. We just specialize for the case where the program tail is void, i.e. the empty sequence.

/* This specialization handles "end of program". */
template <
    /* Program */ typename Program
>
struct state<Program, void>
{
    static long eval(long *bottom, long *top)
    {
        if (top <= bottom) abort();
        return top[0];
    }
};

There are a few more fiddly things to figure out to make it all compile, but I've shown you all the interesting stuff. A working snapshot of the implementation is here. It compiled only with g++, not clang++, the last time I tried; fixes are welcome. It implements everything I've mentioned. In the near future I'll turn it into an actual DWARF compiler, and it will appear in liballocs. You might ask: why? In short, compiled DWARF expressions are hopefully a way to achieve faster introspection on complex memory layouts. The liballocs metadata toolchain already “compiles” much DWARF content down to an efficient in-memory representation. By compiling not only manifest DWARF structure but also computational DWARF expressions, we can efficiently introspect on structures whose layout can only be described precisely via a computation—think variable-length arrays, hash tables, and so on. Eventually I also want to support “compiled frame information”, which I hope will allow faster stack walking.

[/research] permanent link contact

Thu, 16 Jul 2015

ELF introspection, robustly and portably

I often find myself writing code that does introspection at the ELF level. This is for things like looking up symbols, walking the link map, figuring out which shared object a particular memory address belongs to, which memory regions are executable, and so on. These are fairly common things to want to do in ptrace()- or LD_PRELOAD-based tools, or binary interpreter-like (Valgrind-style) ones. This post is about how to do these things robustly from within a process, where “robustly” brings some particular restrictions.

What introspection?

What does our address space look like? To spare myself the pain of drawing diagrams, we can ask Linux to dump a somewhat-pictorial view of a simple cat process by doing cat /proc/self/maps. I have reversed the order of the lines so that higher addresses go higher up, and tweaked the formatting slightly.

          address range           flgs  offset   dev   inode                  pathname or other name   
ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0                  [vsyscall]

    7fff301fe000-    7fff30200000 r-xp 00000000 00:00 0                  [vdso]

    7fff301a5000-    7fff301c7000 rw-p 00000000 00:00 0                  [stack]

    7f4a574bc000-    7f4a574bd000 rw-p 00000000 00:00 0 
    7f4a574bb000-    7f4a574bc000 rw-p 00023000 08:01 889388             /lib/x86_64-linux-gnu/ld-2.19.so
    7f4a574ba000-    7f4a574bb000 r--p 00022000 08:01 889388             /lib/x86_64-linux-gnu/ld-2.19.so
    7f4a57493000-    7f4a574ba000 rw-p 00000000 00:00 0 

    7f4a57298000-    7f4a572bb000 r-xp 00000000 08:01 889388             /lib/x86_64-linux-gnu/ld-2.19.so
    7f4a57293000-    7f4a57298000 rw-p 00000000 00:00 0 
    7f4a57291000-    7f4a57293000 rw-p 001bf000 08:01 889403             /lib/x86_64-linux-gnu/libc-2.19.so
    7f4a5728d000-    7f4a57291000 r--p 001bb000 08:01 889403             /lib/x86_64-linux-gnu/libc-2.19.so
    7f4a5708e000-    7f4a5728d000 ---p 001bc000 08:01 889403             /lib/x86_64-linux-gnu/libc-2.19.so
    7f4a56ed2000-    7f4a5708e000 r-xp 00000000 08:01 889403             /lib/x86_64-linux-gnu/libc-2.19.so
    7f4a56c09000-    7f4a56ed2000 r--p 00000000 08:05 2750795            /usr/lib/locale/locale-archive
        00838000-        00859000 rw-p 00000000 00:00 0                  [heap]
        0060c000-        0060d000 rw-p 0000c000 08:01 286233             /bin/cat
        0060b000-        0060c000 r--p 0000b000 08:01 286233             /bin/cat
        00400000-        0040c000 r-xp 00000000 08:01 286233             /bin/cat

This picture is a refinement of the usual OS textbook diagram of a Unix address space. As usual, we have the executable's text segment (bottom), read-only and then writable data (next two up), heap (next up), a stack (much higher) and the kernel region (up top). (Actually the kernel region is not shown, except for one small part, the vsyscall page, which is the only user-accessible area.) The biggest difference is that we have not only the executable but some libraries—primarily the C library (since cat doesn't use others), but also the dynamic linker (itself a library) and the mysterious vdso. All these are ELF objects. In general, a large subset of a process's structure is actually defined by the collection of loaded ELF objects and a little per-process structure also defined by ELF. Our problem is now: how can we acquire a view on this ELF-level structure, from within the program itself at run time, efficiently and robustly?

(The above listing also shows a few other flavours of mapping that won't concern us here: a memory-mapped file /usr/lib/locale/locale-archive, and two anonymous mappings, beginning at 0x7f4a57493000 and 0x7f4a57293000, which are almost certainly heap areas—we can have many heaps, not just the one that grows out of the executable's data segment. And of course, although we see only one mapping labelled [stack], in a large multithreaded process there will be many such.)

What's ELF?

The basics of ELF linking were first defined in System V Release 4 in 1988, and eventually adopted by non-commercial Unices starting in the mid-1990s (1995 for Linux, with kernel 1.2; 1998 for FreeBSD, with version 3; NetBSD and OpenBSD in 2001 and 2003 respectively). The “classic” picture of a loadable ELF file (direct from the spec) looks like the following.

Execution View
ELF Header
Program header table
Segment 1
Segment 2
Segment 3
Section header table

But actually, that omits some of the most interesting stuff! Let's try again.

Execution View
ELF Header
Program header table
Dynamic linking metadata
Dynamic symbol table
Dynamic relocation table
Segment 1
Segment 2
Segment 3
Section header table
Symbol table

Dynamic symbols, program headers and the miscellaneous dynamic linking metadata are what we're interested in introspecting on. The POSIX libdl API offers one interface for doing introspection at the ELF level, at least for looking up symbols by name. Non-portable extensions, like SunOS/Solaris's dladdr() and the later-added dladdr1(), together with GNU's dlvsym() and dl_iterate_phdr(), get us a bit further, walking program headers and looking up symbols by address or version. We get further still if we allow ourselves OS-specific features, like Linux's helpful maps file in the /proc filesystem that generated the dump above, together with the auxv file or its associated library call getauxval(). If we're happy to parse the ELF file itself, by doing file I/O, we can get at anything we like.
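
For instance, dl_iterate_phdr() will happily enumerate the loaded objects and their program headers, roughly as in the sketch below; but it is a GNU/BSD extension, and a library call of exactly the kind we are about to rule out.

#define _GNU_SOURCE
#include <link.h>
#include <stdio.h>

static int print_one(struct dl_phdr_info *info, size_t sz, void *arg)
{
    printf("%s\tloaded at %p, %u program headers\n",
           (info->dlpi_name && *info->dlpi_name) ? info->dlpi_name : "(executable)",
           (void *) info->dlpi_addr, (unsigned) info->dlpi_phnum);
    return 0; /* keep iterating */
}
/* ... somewhere in main(): dl_iterate_phdr(print_one, NULL); ... */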

What's “robustly”?

There are problems with all of the preceding approaches. Portability is lacking from many—only the POSIX APIs and ELF file I/O are widely portable. Robustness is lacking from all of them, in the sense that it may not be safe or reliable to use them from “sensitive contexts” where ELF introspection may be required. The main kind of “sensitive context” is one where making system calls or library calls would cause unwanted reentrancy. For example, if we want access to metadata from instrumentation logic that runs in the middle of a malloc() call, calling out to a library which might itself call malloc() is a bad idea. If we're instrumenting the code paths that make system calls (ask me how!), calling out to the OS might also be a bad idea. These unintended reentrancies can easily cause infinite loops, deadlock or state corruption. There are also security reasons why system calls and library calls are less reliable than working entirely within the process: the filesystem might have changed since the filenames in our maps or link map were recorded, for example.

So, how much ELF introspection can we do without the help of system calls, library calls or non-portable code? The answer surprised me: quite a lot, but not quite everything. There is some low-hanging fruit, then some that we can just about reach (provided a couple of assumptions which, although not strictly portable, probably hold for most platforms), then some that absolutely no portable method can let us get (without resorting to file I/O).

Let's state our problem slightly more precisely. Our address space includes a selection of objects (the executable and all loaded libraries), mapped at particular addresses. We'd first like to enumerate these mappings, giving us a look-up from load addresses to the object's originating binary file (pathname)—more-or-less the structure we saw in the /proc/self/maps listing earlier. Furthermore, each object defines some symbols. We'd like to be able to resolve symbols in each object, as well as any other useful metadata, like their soname, the libraries they depend on, the locations of support structures like the procedure linkage table (PLT), and the relocations performed when they were loaded. This is what we saw in the elaborated ELF file picture. The symbols-and-addresses stuff is the most important; the relocation and miscellaneous information isn't useful as often, but we'd like to have it anyway. Finally, there's some useful metadata that the operating system has passed down about the process, in an ELF-defined structure called the auxiliary vector. It turns out that we'll use this to get at some of the other metadata, but some other parts of it are useful in their own right—they tell us the process ID, page size, and various properties of the hardware.

The link map

To enumerate the mapped objects, we need access to a structure called the link map. The portable interface for walking the link map is the mysterious r_debug protocol. This is surprisingly murky in its origins, but it probably comes from SunOS; it was definitely included in System V release 4 (the only solid reference I know is the relevant Programmer's Guide). It's actually not specified in most ABIs, although some seem to do so. The r_debug protocol defines a structure containing pointers to the link map. From an executable, it's easy to find this structure: we usually have a DT_DEBUG entry in PT_DYNAMIC, and the dynamic linker updates it to point to an r_debug structure it allocates. This contains a pair of pointers to the link map, represented as a doubly-linked list.

What about from a shared library? Getting a pointer to the r_debug is trickier here, because shared libraries don't usually have a DT_DEBUG entry. This is a chicken-and-egg problem: we need the executable's r_debug to get to other objects, but if we're not starting in the executable, how do we get there? So here comes our first somewhat-non-portable assumption: that a global symbol _r_debug is available, and marks the r_debug structure in the dynamic linker. This symbol is commonly present, but I don't believe it's standard—POSIX generally specifies programmer-facing APIs, not linker-level symbols. (As we'll see shortly, another method for getting at it is available, but it is also non-portable and imposes some other requirements. Anyway, by combining these two options, we have a pretty good chance that we can find the link map. So, let's continue on the assumption that we have the link map.)
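
Concretely, the walk looks something like the sketch below, where the non-portable part is exactly the assumption that <link.h> declares _r_debug (as glibc's does). The printf() is just for demonstration; in the sensitive contexts discussed above you obviously wouldn't call it.

#include <link.h>
#include <stdio.h>

void walk_link_map(void)
{
    for (struct link_map *l = _r_debug.r_map; l; l = l->l_next)
    {
        /* l_name is the object's path ("" usually means the executable itself);
         * l_addr is the base address (load offset) it got. */
        printf("%s\tat base %p\n",
               (l->l_name && *l->l_name) ? l->l_name : "(executable)",
               (void *) l->l_addr);
    }
}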


Once we have the link map, we'd like to do symbol lookups in each object. For this, we need (at least) the dynamic symbol tables, which we can find via the PT_DYNAMIC segments of each object. (We probably also want the symbol hash tables; the method is similar.) The link map helpfully keeps a pointer to this specific segment (rather than to the overall program header table). The segment's content consists of key-value pairs, and it's the pair whose key is DT_SYMTAB that points to the dynamic symbol table. (On most ABIs, a short cut to get a pointer to the containing object's PT_DYNAMIC segment is to make a link-time reference to the symbol _DYNAMIC. But this only lets us get the containing object's PT_DYNAMIC, whereas walking the link map gets us all of them.)

One caveat about symbol lookups is that only dynamic symbols are looked up. The other symbols, if there are any (not needed for dynamic linking, but useful for debugging), live in a different table, .symtab, which needn't be mapped at run time. Depending on how your object was linked, perhaps most symbols became dynamic symbols (--export-dynamic) or perhaps only a minimal selection did. If you've ever tried to look up stat() in the GNU C library, using dlsym(), and been surprised not to find it, this is why. The stat() function is one of a few that are always linked statically, even when the rest of the C library is linked dynamically. In general, not all code or data in our dynamic objects is actually described by dynamic symbols. I'll return to this shortly.
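
In code, that scan over the key-value pairs looks roughly like the following sketch (the link map's l_ld field points straight at the PT_DYNAMIC contents); the function name is made up, and note that on some platforms d_ptr values are relocated to absolute addresses at load time, while on others you must add the object's base address yourself.

#include <link.h>
#include <stddef.h>

ElfW(Sym) *dynsym_of(struct link_map *l)
{
    ElfW(Sym) *symtab = NULL;
    for (ElfW(Dyn) *d = l->l_ld; d->d_tag != DT_NULL; ++d)
    {
        if (d->d_tag == DT_SYMTAB) symtab = (ElfW(Sym) *) d->d_un.d_ptr;
        /* ... similarly DT_STRTAB, DT_HASH / DT_GNU_HASH, DT_SYMENT, ... */
    }
    return symtab; /* may need l->l_addr added on platforms that don't relocate d_ptr */
}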

The auxiliary vector

The link map gave us a list of loaded objects and their base addresses, but it doesn't tell us where each individual segment is mapped. This is where we begin our voyage into the not-really-portable. In the case of the executable, we can get the program headers via another useful metadata structure: the auxiliary vector. This is created by the operating system to supply various information of use to startup code (typically the bowels of the C library) and the dynamic linker (which, if one is used, also mutates the metadata to make it look as though the executable was loaded directly). (If we failed to get the link map using the previous _r_debug symbol method, we can also use the auxiliary vector to get to the executable's DT_DEBUG, and hence to the link map.) Here's a very nice article all about auxiliary vectors.

The auxiliary vector lives high on the initial thread's stack, a fixed offset above the stack pointer when the program's first instruction receives control. If we could walk the main thread's stack right to the top, we could find the auxiliary vector easily. However, nowadays, since we lack frame pointers, that means making a library call to something like libunwind, and that might allocate memory. Even then, it's not guaranteed that we can walk the stack all the way to the top, since unwind information might be patchy up there. So we'll need a different approach.

I devised a nasty but ultimately pleasing hack for this. When we say “auxiliary vector” we really mean specifically a list of key-value pairs containing the useful metadata I mentioned. But there's also a bunch of other things up there, in a large contiguous structure (see the helpful article for a diagram): the environment strings (a blob of characters), argument strings (ditto), and vectors of pointers into these blobs (our initial environ and argv). Instead of walking the stack, can we get hold of a pointer into one of these blobs, and then parse the surrounding memory to find the start of the key-value pairs? It's very likely that we can, since we can expect to find a handy global variable, environ, that points into them.

Now, it's important to note that this isn't completely robust. The auxiliary vector only records the initial environment. A program can modify the current environment strings to its heart's content by using C library calls like putenv() and setenv(). These will mutate and, if necessary, reallocate the vector pointed at by environ. However, unless a program deletes its entire initial environment, and assuming the C library doesn't eagerly copy things out of it, that vector should always contain one or more pointers into the initial environment blob. So, we want to walk environ and look for such a pointer.

The next problem: how will we recognise such a pointer when we see it? For this, we need to fix some more very-likely-true assumptions. Firstly, we assume either that the initial stack is the highest user mapping in the process, or that if something is higher, it's a shared object (perhaps Linux's vdso) which we can identify in the link map. (Put differently: what's not allowed is for there to be a memory-mapped file above the initial user stack, other than a shared object. This doesn't seem too restrictive; we could even imagine an ABI-level guarantee from the kernel that it will never create such mappings.) This assumption gives us an upper bound to compare our environment pointers against. What about a lower bound? For that, we assume that the caller can supply us one, in the form of a pointer higher than any other stack-allocated string in environ. The address of a local variable in main() could be one such pointer. In fact, any address on the initial stack will probably be good enough, since putting a stack-allocated string in your environment would be a very quirky thing to do. Anyway, we suppose that our caller, the client code which wants to get hold of the auxiliary vector, can give us such a pointer.

Now we're ready to go. We scan *environ for pointers to strings within these bounds. Then we navigate downwards in memory to find the end of the auxiliary vector, then keep going to find its beginning, after which we'll see the terminator of the argument or environment pointer vector.

Once we have the auxiliary vector, we have the executable's program headers, from which it's easy to get at most other things we need, again using the link map to access other objects.
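Spelled out, the program headers are described by three auxv entries. The snippet below just continues the sketch above, taking auxv to be the pointer it returned; on glibc, where it's available, getauxval() will answer the same questions without any of this effort.

/* continuing the sketch: pull the executable's program header table
 * out of the auxiliary vector we just located */
ElfW(Phdr) *phdrs = NULL;
unsigned long phnum = 0, phent = 0;
for (ElfW(auxv_t) *a = auxv; a->a_type != AT_NULL; ++a)
{
    switch (a->a_type)
    {
        case AT_PHDR:  phdrs = (ElfW(Phdr) *) a->a_un.a_val; break;  /* where they're mapped */
        case AT_PHNUM: phnum = a->a_un.a_val; break;                 /* how many */
        case AT_PHENT: phent = a->a_un.a_val; break;                 /* size of each entry */
    }
}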

Program headers

One challenge remains for us: getting at shared objects' program headers. The problem here is that these headers needn't, in principle, be found anywhere in memory. (The ELF format allows a file to require that its program headers are mapped, using PT_PHDR, but it's not required and its use is uncommon.) Typically a dynamic linker will always keep each object's program headers in memory somewhere, but it's not obliged to do so. Even if it does, that memory might be anywhere. The GNU dynamic linker sometimes stores just a malloc'd copy of the data, perhaps originally read through a file descriptor rather than a memory mapping. Either way, the pointer it maintains to the header data lives in private state (in extra non-public fields in the link map structure); there's no portable access to it. So, to get shared objects' program headers there's no avoiding the use of some platform-specific (and loader-specific) code to fish it out.

The major limitation of doing without program headers is that we don't have a complete map of the address space. The link map lets us enumerate objects and their start addresses, and we can then use the symbol tables to do a “forward lookup” from symbols to addresses. We can invert the symbol tables to give us a partial backward lookup from addresses to symbols. But we can't do quite as complete a backward lookup as the dladdr() function can. If you give dladdr() an address that falls within an object's mappings but isn't covered by an exported symbol, it can still tell you which object it's in. Only the program headers contain the information necessary for this. Another thing we can't figure out is which memory access permissions a segment was mapped with—again, that means looking at the relevant program header.
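To illustrate what we're missing, here is the ordinary libdl call doing that fuller backward lookup (nothing to do with relf.h; a small usage sketch of the standard interface):

#define _GNU_SOURCE   /* dladdr() is an extension on glibc */
#include <dlfcn.h>
#include <stdio.h>

/* dladdr() can name the containing object even for an address no exported
 * symbol covers, because the dynamic linker can consult the object's load
 * segments; with only the link map and symbol tables, we can answer the
 * symbol-level questions but not this segment-level one. */
void whereis(const void *addr)
{
    Dl_info info;
    if (dladdr(addr, &info) && info.dli_fname)
    {
        printf("%p lies within %s (base %p)", addr, info.dli_fname, info.dli_fbase);
        if (info.dli_sname) printf(", under symbol %s", info.dli_sname);
        printf("\n");
    }
}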

Maximising ELF introspectability

Armed with these observations, we could imagine dynamically rewriting the binaries on our system slightly to optimise for ELF introspectability. Firstly, we would insert a PT_PHDR, and define a symbol (maybe _PHDR, roughly analogous to _DYNAMIC) to help us find it. Secondly, going back to the restriction that only dynamic symbols were available for lookup, we could export all .symtab entries as “hidden” dynamic symbols. The obvious objection to this is that it will slow down linking. However, symbol lookup by name happens via the hash table (DT_HASH or GNU_HASH). I haven't checked all the details yet, but it appears that adding them to the symbol table needn't necessarily mean including them in the hash table. There has to be a hash-chain “next pointer” per symbol, but it's permitted not to hook that entry into any bucket. Since link-time lookups occur via the hash table, leaving the hidden symbols out of the hash table seems to defeat the objection. They'd be present for a “deep” introspector to find, by scanning the symbol table. But they wouldn't interfere with the symbol lookup process, nor with code using dlsym().
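To see why that works, here is roughly what a by-name lookup does over the classic SysV DT_HASH table (GNU_HASH adds a Bloom filter and a different encoding, but the bucket-and-chain idea is the same). A symbol whose index is never reached from any bucket is simply never compared against, however many such entries the symbol table carries. This is a sketch rather than any particular loader's code.

#include <string.h>
#include <stdint.h>
#include <link.h>

/* the standard ELF hash function, as specified for DT_HASH */
static uint32_t elf_hash(const char *s)
{
    uint32_t h = 0, g;
    while (*s)
    {
        h = (h << 4) + (unsigned char) *s++;
        if ((g = h & 0xf0000000)) h ^= g >> 24;
        h &= ~g;
    }
    return h;
}

/* sketch of a DT_HASH lookup: "hash" points at the table's words
 * (nbucket, nchain, buckets[], chains[]); symtab and strtab are the
 * object's dynamic symbol and string tables */
static ElfW(Sym) *hash_lookup(const uint32_t *hash, ElfW(Sym) *symtab,
        const char *strtab, const char *name)
{
    uint32_t nbucket = hash[0];
    const uint32_t *bucket = &hash[2];
    const uint32_t *chain = &bucket[nbucket];
    for (uint32_t i = bucket[elf_hash(name) % nbucket]; i != STN_UNDEF; i = chain[i])
    {
        if (0 == strcmp(strtab + symtab[i].st_name, name)) return &symtab[i];
    }
    return NULL;   /* symbols not hooked into any chain are never visited */
}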


If you'd like to use this introspection goodness, I've coded it all up in a single header, relf.h, which is fairly self-contained and should be self-explanatory. Please let me know if you find it useful!

[/devel] permanent link contact

Tue, 30 Jun 2015

A position on licensing research software

I have a habit of putting code on my github page without any statement of license. Sometimes this is oversight or laziness, but sometimes it's entirely deliberate. It's not because I don't want to open-source or otherwise license my code; I do. But, as this post covers in some depth, the Devil is in the details.

In particular, I believe that software whose purpose is scientific or researchy—to exist, to prove a concept, to be extended by other researchers perhaps, but not to acquire users—can be better off without an explicit license, at least while it remains relatively immature and has only a small number of contributors. In short, this is because there's little practical loss from not licensing at this stage, whereas by forcing a choice of license, it's possible to choose wrongly.

For some reason, this belief, at least when compressed into 140-character snippets, attracted quite a lot of opprobrium. So allow me to explain my justification and respond to the criticisms. Before I go further, I should add that I'm definitely not pretending to be an expert on licenses, law, tech transfer, aggregate properties of big companies' IP policies, or anything else I'm writing about. Everything you're about to read is informed by limited personal experience and anecdote. In fact that's how I got here: I'm very aware that licensing is something I don't fully understand. For me, that's a strong motivation for avoiding making premature commitments that I might later regret. You have the opportunity of enlightening me, and might even change my position—but first please understand what my position is and why I hold it.

So, here goes: I claim that early-stage scientific proof-of-concept software is often better off without an explicit license. Naysayers (some hypothetical, but mostly real) claim the following.

“But then people can't legally use your software?” They can, but they should ask first.

“But people rarely bother to ask.” That's fine; this software is not aiming at getting lots of users.

“But you might miss out on users.” I don't want users.

“But you might miss out on collaborators.” I doubt it. Collaborators have to communicate, not just consume. They get in touch. Or, more likely, I meet them through other channels than their browsing my repositories online.

“But people aren't allowed to look at it without a license.” No, that's not how copyright works—quite the opposite. The code is published, so you're allowed to look.

“But company policies mean employees won't be allowed look at your code, even though it's allowed by law.” I've worked at two large IP-paranoid companies, and in both cases employees would happily exercise discretion about whether to look at code lacking a permissive license. (This was true even if official guidance left no room for their discretion.) Dealing with third-party code was also a well-established part of corporate culture. Among other things, this makes these developers much more likely to ask for permission than others. During my three months at one such company, in a very junior position, I was encouraged to make enquiries not only about licensing third-party software that appeared useful, but also about perhaps contracting the developer to support our effort. This surprised me at the time. But being a corporate entity dealing in non-permissively-licensed IP makes you more amenable to dealing with other people's IP in whatever way is appropriate. As I'll cover shortly, the well-known copyright paranoia (including what I'll call “GPL terror”) arises because of cases where the license on offer doesn't suit the prospecting company, and where there's good cause to fear litigation.

“But if they look at your code, they risk creating a derived work, so could get sued by your employer!” Companies know the pragmatics of IP better than this. If I were talking about Big Company Labs's github page, I'd agree that explicit licensing would be highly desirable. Big companies tend to be litigious. In all cases, the question is: who is liable to get sued by whom? But we're not talking about that. The risk to a company from simply looking at my code (emphasis important) is minuscule. The risk to me and to the public good, from premature commitment to a given license, is non-trivial, because it risks closing off opportunities later.

“Why not GPL your code?” I'm not against doing so in principle. In fact, even though I'm told it's falling out of popular favour, the GPL is one of my favourite licenses. But if I GPL'd my code, many more companies' employees would be scared off. Many big companies' legal departments have spawned corporate policies which foster a terror of GPL'd code. Partly this terror is justified: much GPL'd code is copyrighted by large competing companies, who don't want to relicense it and do want to sue infringers. However, the terror is not justified in all cases. I'm not a competing company, I am amenable to relicensing, and I'm not likely to sue infringers anyway. So companies have no reason to be terrified of looking at my code. Unfortunately, many company procedures are written in a blanket style, with rules like “don't look at GPL'd code!” that are not sensitive to these nuances.

“How is not putting any license possibly better? It confers strictly fewer rights!” I know it's perverse. But since these “don't look at GPL” policies exist, the absence of any GPL notice would actually make it more defensible, according to these policies, for employees to look at my code in any depth. The real problem here is that, as far as my anecdotes can see, many corporate legal people don't even understand the basic properties of the domain. Software doesn't “have” a license; it is licensed and can be licensed differently to different people. I sometimes say that licensing is “bilateral”, in an attempt to convey this multi-party “edge, not vertex” nature. Being GPL'd doesn't exclude other licensing possibilities. Sadly, corporate legal types tend to act as though it does. They're not the only ones. Some of my peers appear keen to (as I'll call it) “licensify” all scientific software. They seem to be acting on similar misunderstandings. I've even seen it written that one should not use the CRAPL because it allegedly makes it “impossible” to re-use the software. If you actually read and understand the CRAPL, it does no such thing (notwithstanding that as a license it is basically pointless). Even if the CRAPL were restrictive, the kind of software it's intended for—scientific software, as in my case—would almost always be available for relicensing anyway.

“Why not say “GPL, or other licenses by negotiation”?” It's not a terrible idea, but again, lack of nuance among corporate legal departments makes me believe that this will scare people off unnecessarily. I could be wrong about that.

“Why not MIT- or BSD-license your code?” I'm not against doing so in principle. But as a blanket policy, I worry that it'd weaken my ability to successfully do tech transfer later. In particular, the option of dual-licensing code is much less useful if you have already gone with a permissive license. This might weaken the position of any company set up to do the tech transfer. As I see it, this is the key difference between BSD and GPL from a researcher's point of view. In particular, note that pursuing tech transfer can be in the public interest (and bigger competitors are likely to be less public-spirited than a small start-up founded by scientists). I've nothing against permissive licenses in general; they just feel like too great a commitment to make in the early stages of a project. I happily release non-research software under these licenses (or even public domain) from time to time, as you can see. And I'll happily contribute under these licenses to other projects that have an established tech transfer model. The key points are that my early-stage projects don't yet have this, that there's more than one way to do it, and that licenses make a difference to what options are available.

“So you might later close off your research outputs for your own commercial gain?” Not at all. I'm dedicated to using my research outputs to create the greatest possible public social good. The question is how to do that. Whether BSD-style (permissive) or GPL-style (copyleft) licensing creates the most social good is still a matter of debate. It almost certainly varies case by case. Meanwhile, two other social goods need to be considered: being able to do more research, and actually getting research results transferred into practice. The commercial model can be useful in both of those regards, even though, ideologically speaking, it's not the model I prefer. It would be foolish to jeopardise these opportunities at the moment I publish my code, especially since nowadays the norm is to publish code early (no later than publishing the paper). Very often, at that stage it won't yet be clear how best to harness the code for social good.

“Aren't you inviting people to take the unreasonable risk of using your code without a license, which is breaking the law?” No. Anyone concerned with staying within the law is welcome to contact me about licensing terms. Anyone worried that I might sue them is welcome to do the same. This is no hardship: anyone interested in running or building on my experimental team-of-one research code almost certainly needs to contact me anyway—if not to build it, then to do anything useful with it. There's also an important separate point, which I'll return to shortly: scientific endeavour is built on personal contact and trust relationships, far more than either commercial or open-source software are built on those things. Even in the case of people who don't bother contacting me, the “reasonableness” of the alleged risk needs to be assessed in the context of the scientific culture.

“But in practice, if everyone followed your approach, won't people end up breaking the law more often?” Perhaps so. Most people break the law just going about their everyday business. Few of my countryfolk over the age of 30 have not transgressed our laws in countless ways: copying a tape, keeping a recording of a TV programme for longer than 28 days, wheeling a bike along a public footpath, borrowing their neighbour's ladder without asking, backing up a shop-bought DVD to their hard disk, and so on. Fortunately, the process of law enforcement involves discretion at the point of prosecution. This discretion is kept in check by social dynamics which, while mostly outside the legal system, are very real. If my neighbour prosecuted me for borrowing his ladder, it'd render future reciprocal altruism considerably less likely. Suing good causes creates bad press, regardless of whether it can legally succeed. The law is overapproximate out of necessity, so that the genuinely problematic infringers can be pursued, not so that absolutely all contravening activity can be shut down. It presents the option to prosecute, not the imperative to do so. That's not an argument for egregious overapproximation in our laws. Obviously, legal overapproximateness can be vulnerable to abuse, and should be minimised by a process of continuous refinement. But aspiring towards eventually reaching a perfectly precise law is a hopeless ideal. The law is a practical tool, not a formal system—it is both ambiguous and inconsistent. The letter of the law needs to be viewed in the context of the legal process—one carried out with the frailty and imprecision that come from human weakness, and the malleability that comes from human discretion. In short, we are doomed to living with a background level of lawbreaking, and it's in our interest to live this way. With licensing, the right question to ask is: how can our use of licenses affect scientific research, and how should it? This means thinking about process.

“But don't big companies have good reason to be conservative? Won't this harm your relationship with them?” Yes, they do, but no, I don't believe so. Conservative policies are justified in a corporate context. For big companies in competitive markets, the stakes are high, and the capital cost of pursuing legal cases is affordable. If an equally large competitor had such a case, it could turn into a very costly court battle (regardless of who wins). That's why big companies are paranoid about IP: they're afraid of competitors. But they're not afraid of small guys, like researchers or universities. If I had a credible case that a big company were infringing my copyright, the company would very likely settle with me out of court. In any case, as I covered earlier, if they want to use my software, they will want to negotiate terms with me anyway.

“Isn't your attitude socially irresponsible?” I believe I have a stronger social conscience than most people I meet. Everything I've just said is a position I hold in the interest of maximising public good, whether or not my beliefs are correct or effective. So no, I don't believe I'm socially irresponsible at all. Here are some things that I believe are socially irresponsible: to allow corporate capture of publicly funded research outputs; to make licensing commitments of publicly funded work without very careful consideration; to needlessly jeopardise the future potential to exploit publicly-funded work to public-good ends; to allow the practices of big-company legal departments to get in the way of scientific endeavour.

“Licences are good, m'kay?” This seems to be some people's position, but it's not an argument.

“All sensible software projects have licenses!” Software that seeks users is one thing. I've already covered why I'm talking about something else.

“Why are you resisting doing the normal, “responsible” thing?” My situation isn't normal. In this abnormal situation, the normal thing is not necessarily the most responsible thing.

“GPL is not free, troll troll...” I'm trying not to get into that. But I do suspect that this whole debate is really GPL versus BSD in disguise. If you don't believe in copyleft, then the only open-source licenses you're likely to use are BSD or MIT, and there's not much useful flexibility to be gained by deferring the decision. So you can just license all your code straight away. BSD in effect chooses a fixed attitude to commercialisation (“anything is fine; what goes around comes around”) whereas GPL tends to create a fork in the road. I want to hang on to my option on that fork, until I know that doing away with it is the right thing to do. The fact that the permissive-versus-copyleft debate never dies is a powerful argument against coming down firmly on either side. It's better to take decisions in context, case by case. That's exactly the position I'm defending. Note that Bruce Montague's article advocating the BSD license model for tech transfer is right if you're talking about big projects—it's harder to commercialise GPL'd code than BSD'd code, and deliberately so—but completely wrong for most small research projects written by a small group of collaborators. In those cases, it's much easier to relicense the lot, in which case your options for transfer include a dual-license arrangement in the manner of Qt or MySQL. This still permits GPLing the ongoing development (or not), while retaining the option to license it for commercial use. It says a lot about how subtle the issues are that a simple tweak of context can more-or-less invert the up- and down-sides of the two licenses. (I actually find Montague's article quite fuzzy, even FUDdy, about the details—it only really says that BSD is “simpler”, which any old Einstein should be quick to challenge.)

“I once had licensing trouble, bad enough to require lawyers. Therefore, clearly explicit licensing is essential!” This is a complete non-sequitur. I don't doubt how problematic these things can get. But it doesn't mean that any blanket policy will improve things. If IP lawyers have come into a scene where science is supposed to be happening, it means something has already gone wrong. The culture clash is between science—where trust, good faith and cooperation are routinely assumed—and corporate licensing, where you can't trust anyone beyond what their contract or license binds them to. In short, this happens when big companies get involved and impose inappropriate licensing terms. These companies are where you should be directing your opprobrium—not at me.

To properly explain the latter point, I need to ditch the question-and-answer format (which was already becoming tedious, I'm sure). The argument came up in reference to the Richards benchmark. Let me summarise my understanding of it (corrections welcome). Martin Richards wrote and distributed some benchmarking code. It was distributed as a tarball, in the scientific spirit, with no license stated. It turned out (later) that he was happy to release it into the public domain. Trust, goodwill, reputation, responsibility, ethicality are currency among scientists. Martin's release occurred within that culture. My github repositories do likewise.

Some time later, Mario Wolczko produced a Java version of the Richards benchmark, while working at Sun Labs. Working for a big company makes a person more IP-fastidious than average. Mario very conscientiously sought permission from Martin (perhaps unnecessarily—I'm not convinced he was actually making a derived work, but perhaps he was; no matter). He was of course granted Martin's blessing. He circulated his new code, including to Martin. Martin even began distributing it in his tarball.

The catch was that Mario's employer was a large company, and would only allow the code to be released under some restrictive licensing terms. You can still read them. They come in two parts; the first page is statutory US export regulations while the second page contains Sun's own licensing terms. These include the following restriction.

“Licensee may not rent, lease, loan, sell, or distribute the Software, or any part of the Software.”

This is a severe restriction. It's especially inappropriate for a work of this kind—a benchmark. Benchmarking is a concern that cuts across research areas: it's not just people working on a particular technique or problem that want benchmarks. Anyone working on any aspect of a language implementation might want to apply some “standard” benchmarks. A benchmark is more useful the more things it has been applied to. That's why we often like to group them into suites and circulate them widely. This license is effectively incompatible with that use.

I'm sure Mario tried his best against the might of Sun legal. It's telling that his current web page advises the “caveat emptor” approach: make your own mind up about how likely you are to be sued, and act accordingly. I'd agree it's not good at all that we are being forced to take this approach around a big, litigious company (as Sun's new owner Oracle certainly is). But it seems bizarre to blame this on Martin's original sans-license release. The blame is entirely on Sun. The problem is not a lack of license statement from the little guy; it's an inappropriate license imposed by the big guy. The right response is to decline to use Sun's code. (That said, it's worth noting that Martin's redistribution of the Sun code, although technically a breach of the license terms, has not resulted in Sun or Oracle suing him. So perhaps they, too, understand the value of scientific culture and have no interest in disrupting it.)

To put all this more succinctly, the onus to license doesn't weigh equally on all parties. For artifacts coming out of big players, there's good reason to demand licenses—not just any license, but fit-for-purpose licenses—precisely because big players can afford to sue. By contrast, even the University of Cambridge is a comparatively little player. It is vanishingly unlikely to sue anyone pursuing public-good scientific or research agendas, not only because it's expensive to do so, but because the reputational damage would be huge. And even if you think this particular university is a big player, that's immaterial, because it doesn't own my research work's copyright anyway; I do.

I, like Martin, work within the scientific culture as a non-litigious “little guy”. It should go without saying that redistribution of my software towards scientific ends will always attract my blessing, not lawsuits—even if people don't ask first (mildly rude as that may be). Meanwhile, for commercial use, I'd like to defer fixing the exact terms until there's serious interest—so that I have some idea of how it's going to be used, and can choose terms appropriately. The pragmatics of “no license specified” and “restrictive license specified” are wildly different, even though a hard-headed legal interpretation makes the former a degenerate case of the latter.

The push towards “licensification” of scientific software actually works against this culture, since it erodes the norm that says that permission will be granted for doing reasonable things for scientific purposes. That's yet another reason why I resist this way of working. Licensification is also viral. If you believe strongly that all released artifacts should “have” “a license”, then you have to follow this to its transitive closure: to release anything that builds on anything, you must obtain a precise statement of license from all derived-from works. This is justified for commercially valuable software, but I don't believe it's the way science needs to work. And if you share my belief that premature commitment can do more harm than good, then we agree that science ought not to work that way.

You might complain that this “scientific culture” I've been talking about is nebulous. I'd agree. But capacity and inclination to sue are reasonable proxies. Call me naive, but if you really worry about the legality of doing science with artifacts released publicly by a reputable scientist working for a university, either you worry too much or the world has gone very wrong. On the other hand, it is reasonable to worry about artifacts released by big companies, because the culture they operate in is very different. That's why we need to pressure them into offering reasonable terms for the things they release, and decline to use their artifacts if they don't. That's what conspicuously failed to happen in the Sun/Richards case.

As more and more scientific research involves code, licensing of code among scientists is going to become much more of an issue. Misunderstandings are already rife: a quick bit of web searching will easily turn up things like this and this. We need to have a frank discussion about what the real issues are, and an open acknowledgement of their subtlety. I believe that blanket licensification is not the answer. It forces a decision to be taken too early, when too little information and expertise are available. I also fear that it will encourage a culture of litigiousness and bureaucracy that is neither desirable nor affordable within science.

As one tentative suggestion for how to make the law work better for us, I note that US copyright law's “fair use” allowance already covers educational purposes. Ideally it should also cover public-good non-commercial scientific purposes. In fact, this already is a factor in assessing fair use. But it's ill-defined, and I'm not aware of any useful case law on that front (are you?). If we want something to work towards, this kind of fair use law is what it should be—not the up-front licensification of every published scientific artifact.

Just as a coda: the above is what I currently believe, based on limited study and my very limited awareness of evidence, case law and the like. Although that belief has withstood a certain exposure to others telling me how wrong I am, it's not a conviction I'm strongly attached to. I went to the effort of writing the above not because I'm convinced I'm right, but because I don't like being branded “socially irresponsible”. Challenge my argument if you like, but don't doubt that I treat this issue with due care and attention. I'm still waiting for the counter-argument that will make me change my position. You might well have it. If you do, please let me know.

[/research] permanent link contact

Tue, 26 May 2015

Polymorphism and observability

[In an earlier post, I talked about debugging, and more generally “observability”, in ML-family languages. Later, I also clarified what I think “polymorphism” most usefully means. This post explores the less obvious relationships between polymorphism and debugging.]

When I talk to language-minded people about debugging code in ML-family languages, they tend to think that the difficulties are something to do with “the type system” and “polymorphism”. This is only half right. Polymorphism does complicate debugging, but it does so in every language, even ones with no “type system” of any kind. (As a sanity check, ask: does BCPL support polymorphism? The answer is clearly yes, at least according to my earlier definition.)

The axe I'm grinding in this series of posts is that “polymorphism” or “type-mumble” are no excuse for a lack of decent observability in ML-like languages. There are no major obstacles to implementing a nicely debuggable ML, and certainly none regarding polymorphism. Intuitively, this makes sense if we remind ourselves that polymorphism is to do with abstraction, whereas observation is done on the concrete: we're observing a concrete program state, laid out in front of us. (Of course, there are some unfortunate decisions taken by existing implementations of ML-like languages, that make retrofitting debuggability more difficult than it might be. That's a very different problem!)

A similar viewpoint explains why other kinds of fancy types present no obstacle. Even using fancy features of OCaml or Haskell, like GADTs or type classes, our program state still boils down to a big pile of atoms, sums, products and functions. The innovations in “type systems” have been in reasoning, specifically about safety. This is a question of dynamics: checking at compile time that invalid structures will never be constructed at run time. Observability isn't concerned with dynamics; it's about looking at the present, not the future. All we want to do is decode a static snapshot of the program. (Here I'm excluding the debugger feature of “altered execution”, i.e. enacting debug-time side-effects on the debugged process. How to do this safely is an interesting question, but I'm not going to dwell on it here.)

Can we expose more clearly why polymorphism isn't a problem? As I covered last time, “polymorphism” is a fancy word for deferring specialisation. Specialisation can be done in a compiler or at run time. At run time, specialisation means execution: execution is a specialisation process that culminates in the program's result. We can also think of this process as “decision-taking” in response to input.

Polymorphism during execution, witnessed in program snapshots, is very different from polymorphism in source programs. In source programs, the whole program is oriented around an unknown future: programs describe dependency on an “input”, supplied later. By contrast, observing a program state at run time is about decoding a present, not a future. Moreover, to help us do that decoding, we can exploit all the decisions that have been taken so far, i.e. all the specialisation that has occurred, during both compilation and execution, to reach the present state. Some of this specialisation can be called “monomorphisation”, because it has taken generic code and applied it in a specific context.

As before, I'll focus on OCaml. The OCaml compiler turns polymorphic source-level functions into generic run-time function objects (instruction sequences). Similarly, for polymorphic data types in source code, the compiler selects a size and layout, independent of any type variables that might be parameterising the definition. As we would expect, this is achieved using indirection: the fixed size and layout ensure that locally, storage can always be allocated and accessed generically. The specialised part of the data—for example, the payload of a list node—is indirected away, using pointers. If we were talking about C, these pointers would be pointers to void. OCaml's source language of data types lets us be more precise about these, by introducing type variables. But that's a meta-level bonus that helps us reason about dynamics. A snapshot of an OCaml program still reveals it as consisting of allocated objects and pointers between them, just like in C.
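In C terms, that uniform layout is just the familiar generic-container idiom: one node shape regardless of payload, with the payload indirected away. A throwaway illustration (nothing from the OCaml runtime):

/* one layout serves every payload; the specialised part is behind a pointer */
struct node
{
    void *payload;       /* could point to anything; nothing here records what */
    struct node *next;
};
/* the same allocation and traversal code works whatever the payload is */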

Viewed as source code, a program consists largely of functions and data types. But during execution, we have other things too: activations of functions and instances of data types. It's usually these that we want to inspect when debugging. For example, a backtrace is a list of function activations. The heap is a collection of values—instances of some type. (It also contains closures, which are a bit of both; I'll ignore them, for simplicity, but you can ask me at the end.)

Here is the key observation about polymorphism at run time. Whenever a polymorphic function is activated, or when a polymorphic data type is instantiated, some instantiation of its type parameters is morally decided. “Morally” means that we could define an oracle, using the semantics of the language, that tells us how they are being instantiated. For example, it could tell us that at some given allocation site, we're creating a list of int rather than a list of 'a (whereas the latter is all the source can tell us). Exactly what this means, however, is slightly subtle.

One subtlety is that the code doing the allocation doesn't necessarily know what this instantiation is. That code might itself be generic! So maybe we're building an int list list out of some int lists. The code doing this might only know it's building an 'a list list, but our oracle would still tell us that the allocation “morally” has the more precise type int list. Another subtlety is that, of course, there's no guarantee at the implementation level that our runtime is helpfully defining any such oracle for us. Nor need the compiler have provided us any output that would help us implement the oracle. In the case of OCaml, these are definitely not true, and that's precisely why it's difficult to add debugging support to the current OCaml toolchain.

Another subtlety is that the “instantiation” does not necessarily yield something free from type variables. Although our int list list example got rid of all the variables, in some other cases we might find the best we can do is to instantiate 'a with 'b -> 'c, say. But this turns out not to stop us from observing anything we might logically be able to observe. I'll return to this shortly.

One way to make OCaml debuggable might be to directly implement this oracle, by maintaining extra state at run time. Whenever we call a polymorphic function or instantiate a polymorphic data type, we could stash information somewhere that explicitly records how the type parameters are being instantiated. Something quite similar was done a while back in the HashCaml project. Unfortunately, it's a fairly invasive change to the compiler. It's likely to meet resistance via a performance argument; you can think of this as the “frame pointer” debate, but for type information. Pushing around extra information creates a bit more pressure on registers and memory, so typically shaves a few percent off performance. In return, we make observability massively more straightforward. Apparently opinions differ on whether this is a good trade. All I'll say is that if frame pointers are good enough for the Linux kernel, they're good enough for me.

Instead of tracking allocation types up-front, one could do a deeper analysis to recover the same information on demand. If we assume we have a backtrace, that gives us a powerful chunk of context: it describes a nest of function activations. The top-level activation (which would be main(), if OCaml had main()) is always monomorphic, so we should be able to figure out all the subsequent instantiations all the way down the stack. Or, we can flip that around: starting from a given activation, we should be able to figure out any type variable instantiations by looking some distance up the stack, and in the worst case, all the way to the top. Currently this is what my OCaml-implementing colleagues prefer; they expect it can work by looking no more than a few frames up the stack in the common case. The logic involved is basically the same as that of the compile-time type checker—which now needs to be replicated in the debugger and/or the language runtime. That's an annoying chunk of replicated stuff, which I find distasteful. Also, this inference might be expensive—fine for an interactive debugger, but poor for other applications of run-time type information (like serialization routines or a tracing tool, say). The advantage is that it requires fewer changes to the compiler.

A third option would be to relax our aim of recovering source-level types. In practice, we don't necessarily care, at debug time, that we're looking at an int list. It might be enough to look at each list node individually, seeing that it's a Cons, and then, separately, discover that each Cons points to an int. In this way we've avoided “typing” at the same granularity that the OCaml language does typing, but we've still recovered a somehow “typed” view of the program (i.e. one interpreted in terms of source-level data types). Put differently, source-level types like list encode information about a whole structure, spanning multiple allocations. Perhaps all we really need is a piecewise, per-allocation view. Currently, OCaml's tagged-pointer implementation ensures that at word granularity, we can distinguish integers from pointers. That's not enough, because we can't, say, distinguish the first variant of type T from the first variant of type U, nor from the integer 0: all are encoded as a zero word. But if we add local tracking of ADT variants and a few other things, that might be enough for observability purposes, and would be less invasive than a full HashCaml-style solution. I find this promising, although I'm still working through the consequences.
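Concretely, here is the word-granularity view as seen from C, using the macros the OCaml runtime itself provides in caml/mlvalues.h. This is only a sketch to show what a single value word does and doesn't tell us; the tag names are the standard ones from that header.

#include <caml/mlvalues.h>

/* what we can and can't decode from one OCaml value word */
const char *describe(value v)
{
    if (Is_long(v))
    {
        /* an immediate: the int 0, or Nil, or the first constant constructor
         * of any other type; all of these are the same word */
        return "immediate (int or constant constructor)";
    }
    /* otherwise, a pointer to a heap block: its header's tag tells us the
     * block's shape, but not which source-level type it belongs to */
    switch (Tag_val(v))
    {
        case Double_tag:  return "boxed float";
        case String_tag:  return "string";
        case Closure_tag: return "closure";
        default:          return "structured block (tag = constructor index)";
    }
}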

Suppose we stick with our oracle-based approach, tracking a source-level type for each allocated value. There seems to be a complication. I mentioned that type parameters are decided at instantiation points, but also that we might only be deciding that 'a becomes 'b -> 'c, say—we're not fully monomorphising them. This makes sense, and just reflects the nature of functions. Suppose we have a list of functions. A perfectly valid such list might contain the list head function hd. That's a generic function of type 'a list -> 'a. When we instantiate our 'a list to one that can hold this function, we've specialised type parameter 'a to 'b list -> 'b. Our list is still polymorphic: we haven't got down to a monomorphic type. Does that mean we're lacking the ability to observe something in our program state? The answer is a resounding “no”! I mentioned that when debugging, we're looking at the present and not the future. The polymorphism in hd encodes the unknown future: we don't yet know what types of arguments the functions in the list will be applied to (it hasn't happened yet!). So, these polymorphic-at-run-time values in our heap represent the residual genericity in delayed computations, i.e. in functions. Functions encode things our program hasn't done yet, but might. They don't present an obstacle to decoding the current program state. In practice, any function has a name, even if it's a fake one generated by the compiler from the source code coordinates of a lambda. If we're in a debugger, getting the name of that function (or those coordinates) is comfortably good enough.

There's a final apparent complication. What about the empty list, or the null pointer? These seem to be polymorphic values. But unlike functions or data types, they're not going to get specialised further by activation or instantiation. The simplistic answer is that these values are okay because they're degenerate cases. It's not a practical loss of observability at run time if we can't answer the woodchuck-esque question of “what type of non-empty list would this empty list be if it wasn't empty?”. A more subtle answer is that these values aren't really polymorphic at all. If we think of how we would define the data type 'a list, we see that the Nil constructor, viewed in isolation, isn't polymorphic—it doesn't use 'a. In the context of this constructor, 'a is a “don't care” or ignored argument. An unparameterised constructor is only vacuously polymorphic: its meaning doesn't actually depend on the parameter. (This view, which sees constructors as somewhat independent of the type definition that encloses them, is one which OCaml's polymorphic variants directly build on.)

Finally, I alluded to some parallels with C. Just as the pointers which allow generic layouts for ML data types are equivalent to void pointers, so we have a similar problem when debugging C code: what's on the end of a void*? If I'm looking at a generic linked list node in my debugger, say, the debugger won't let me follow the pointer to the payload data. For that, we would need some run-time service that can look up some metadata about arbitrary memory locations and tell us what's stored there. Java-style VMs solve this problem using object headers. Clearly we don't have this in C; we need some extra infrastructure to answer these questions. I've been working on it: it's called liballocs. By dynamically tracking allocations in our running program, and using some carefully crafted associative data structures, we can build up a fast mapping from arbitrary pointers to metadata about the pointed-to allocation.

In fact the reason I got interested in this topic was that I wanted to make liballocs understand allocations made by OCaml programs. One of the complications liballocs has to deal with is polymorphic allocation sites. These sometimes occur in C code. For example, we might malloc() an array of generic void*, say, but actually use it to hold some specific kind of pointer. Genericity like this is occasional in C, and commonplace in ML. But there's no fundamental difference: code can be generic regardless of whether our source language includes a type language for describing that genericity. Genericity itself is what makes debugging tricky, because it indirects away concrete details in a way that some implementations (both C and OCaml) make hard to recover at run time. The presence of a fancy type system isn't the problem.

[/research] permanent link contact

Fri, 27 Feb 2015

Talking about polymorphism

[In my last post, I talked about debugging, and more generally “observability”, with reference to ML-family languages. This post is continuing the theme.]

Before I go into what's different about polymorphism between compile time and run time, I need to get some terminology straight. What is polymorphism anyway?

People traditionally talk about “ad-hoc polymorphism” and “parametric polymorphism”. These terms go back a long way—to Strachey in the mid-1960s. They're unhelpful today, for two reasons. Firstly, and most trivially, some flavours of apparently non-parametric polymorphism are nevertheless rather systematic, so ideally shouldn't be called “ad-hoc”. Type classes in functional languages and method overriding in object-oriented languages both illustrate this.

Secondly, calling polymorphism “parametric” confuses people into thinking that if code doesn't have type parameters, it's not parametrically polymorphic. But, to me at least, this kind of polymorphism is not about how code is presented; it's a deeper property, about how code is expressed. It's easy to find code expressed in apparently polymorphic ways in all languages—even ones where there is no additional metalanguage of types to describe that polymorphism. Some C code is chock-full of such polymorphism—take Linux, for example. In such code there are no type parameters, only void. Similarly, one of the main strengths of dynamic languages is that code often comes out with some degree of polymorphism by default. If you write a sort function, say, you tend to get something that can operate on any kind of ordered sequence.

So although “polymorphism” sounds fancy, it's really just talking about delaying specialisation. If we write code in a way that avoids commitment to details that would specialise it to one kind of data or another, we're writing polymorphic code. Fancy type systems are a means of reasoning about polymorphism, hence enabling static safety guarantees about polymorphic code. But they're not what enables polymorphism itself.

(I am probably departing from Strachey at this point, although it's not clear. His presentation doesn't directly address the question of whether polymorphism is really polymorphism without a metalanguage to describe it. For the purposes of my subsequent writings, I declare that it is, and that the metalanguage is a separate concern. Even if you disagree, it won't affect the substance of anything that I'm about to say; only which choice of words is an agreeable way to say it.)

Expressing code in a way that defers specialisation invariably means introducing some kind of indirection. Here I mean “indirection” in a very general sense. Run-time indirection via pointers is certainly one very familiar way of introducing polymorphism: a list node having a void* payload has avoided specialisation, whereas one with int hasn't. But there can also be compile-time indirection, such as between an identifier and its binding. And then there is also run-time indirection across not only storage, but also computation. (The latter is the trick which, when applied to its fullest extent, gives us objects in the sense of Cook.)

Somewhere along the line I have stopped using the qualifier “parametric”. In fact, the view I've just expounded—that polymorphism is the deferral of specialisation—covers ad-hoc polymorphism too, if we look at it the right way. When we say a source language supports “ad-hoc polymorphism”, it means that it lets us group together collections of specialised definitions. Such a collection of definitions, viewed as a unit, starts to look like a single, unspecialised definition—in other words, a polymorphic definition. Again, indirection is what enables this polymorphism. If I write the token “+” in a language with overloading or type classes—Haskell or C++, it doesn't matter—it denotes the whole collection of defined addition operators. The code thereby avoids directly selecting any particular one. Typically, the selection is done indirectly. Usually it happens in the compiler, since the language defines some rules for selecting which one will actually be used. But it might happen very late in compilation since these rules might allow dependencies on the code's context of use (think how “+” is resolved in a C++ template definition, for example). And all this could equally well happen at run time; multimethods allow us to do this for binary operators like plus, while the usual single dispatch implements precisely the restricted unary case: deferring specialisation, on (only) the leftmost argument, until the execution of each individual call.

If polymorphism is this incredibly general thing, we need some way of getting down to specifics—like the question I started with, about what “polymorphism” is present during execution of compiled OCaml programs on a particular OCaml runtime. One thing that helps is to qualify any mention of polymorphism with a particular time: perhaps compile time, perhaps run time. At any given time, we can describe a polymorphic entity (perhaps a definition in source code, perhaps an object at run time) as somewhere on a spectrum between strong “specialisation” and strong absence of specialisation, which we can happily call “genericity”. (I say “strong” not “full” because the logical extremities of this spectrum tend to be degenerate, like “the output of the program for given input”—the most specialised state!—and “not written any code yet”—the most generic possible state.)

This “one time” restriction forces us to distinguish a source-level view from a run-time view. It's clear that this distinction matters. Polymorphism in source code can sometimes be implemented by specialisation before execution. C++ templates are the classic example: I can define a generic template, but the compiler will implement it by generating lots of specialised instances. Meanwhile, certain whole-program optimisations in ML compilers, like MLton, do effectively the same. This kind of compile-time specialisation is sometimes called “whole-program monomorphisation”. Just as “polymorphism” is a fancy word for the deferral of specialisation, “monomorphisation” is a fancy word for applying some kind of specialisation.

Polymorphism only defers specialisation; it doesn't avoid it. As I'll talk about more in the next post, all polymorphism eventually ends up being specialised away somehow. What varies is where and when this specialisation occurs. It might be done early, in a compiler, or late, during execution, by “elaborating” a particular trace of machine instructions in a particular machine context—i.e. by actually running the generic code! In the latter case, each execution is “specialised” in the sense that it runs in a different machine context and so produces a different outcome. Either way, execution is a gradual process of specialisation, entailing the gradual removal of polymorphism.

Phew. Perhaps that was a lot of belabouring the point. But it certainly helped me clarify (to myself) in what ways “polymorphism” might be present in ML code at run time. I'll revisit this precise question in the next post.

[/research] permanent link contact

Fri, 20 Feb 2015

Putting observability first

[Update: this article has now been translated into Russian! Thanks, Vlad.]

Last week I had a fascinating conversation with Mark Shinwell and Stephen Dolan, two colleagues who know the OCaml runtime inside-out. We were talking about ways to make compiled OCaml code observable: making it support interactive debugging and various other dynamic analyses (notably memory profiling).

It turns out that although this is not impossible, it is very difficult. It's fair to say that the OCaml runtime has not been designed with observability particularly high on the wishlist. It's far from unique in this: ML-family languages are usually not implemented with observability in mind. In fact you could argue that ML is explicitly designed for non-observability. The orthodoxy of the day held that tagged storage was the naive implementation technique, and erasing tags at run time was a good thing—mainly since it enabled a modest performance gain. Moreover, this meant that the provable erasability of tags came to be considered a valuable mathematical result, about language designs in general and ML in particular. The lingering side-effect is that this erasure also removes the ability to decode program data straightforwardly from its in-memory encoding. In the absence of some other mechanism—which mainline OCaml doesn't currently have—this greatly compromises observability.

ML is not alone in this. Although I'm not so familiar with it, I'll wager that Haskell, in GHC, suffers similar observability limitations—or perhaps worse, since it does more extensive compile-time transformations, and these cannot easily be “seen through” at run time. This makes it even harder to explain the state of an executing program, seen at run time, in terms of the source code. Also, I'm not singling out functional languages. Java has serious flaws in the observability mechanisms exposed by its virtual machine, as I once wrote a short paper about.

C, as usual, is an interesting case. On the surface, it appears to be designed for little or no observability. It certainly doesn't generate much run-time metadata. This omission could be seen as an ML-like erasure of implicit tags. Yet this is not an accurate representation. Pragmatic working programmers, like Ritchie and Thompson, know well the value of interactive debugging. The first edition of Unix included db, an assembly-level interactive debugger mostly used on coredumps. By the eighth edition two debuggers, adb and sdb, were documented in section 1 of the manual, the latter supporting source-level debugging, while a third debugger (pi; built on Killian's procfs) was bundled in the extras which would become Plan 9. More modern debuggers have kept pace (more-or-less!) with more advanced compiler optimisations, using complex metadata formats like DWARF to describe how to see through them. (This isn't bulletproof, and has its own problems, but works remarkably well for C.) The result is that C code has been consistently highly observable—albeit not without forty years' continuous toil to co-evolve the necessary language- and system-level infrastructure.

This co-evolution is interesting too. The mechanisms for achieving observability in C code lie outside the language. Coredumps and symbol tables are system mechanisms, not language mechanisms. Observability in Unix has been part of the design, down in the fabric of the system; it is not an afterthought. An opinionated system designer (okay, I'll step forward) might claim that observability is too important to leave to language designers. There are, however, some gaps to plug and some mindsets to bridge in order to apply Unix-style debugging infrastructure to ML-like languages. In another post in the near future, I'll dig a bit deeper into OCaml, by considering polymorphism (and why it's not quite the problem it seems to be).

[/research] permanent link contact

Mon, 02 Feb 2015

Thoughts on peer review

My PLDI submission was rejected. I'm not too sad about this, since the reviews were basically encouraging, and on balance I agree that the paper can still be improved and was arguably not (quite) ready yet.

However, despite certain innovations, namely two-phase review and rebuttal, the conference review process is as creaky as ever. This is rather dispiriting, and the longer I spend doing computer science research, the more bizarre our system seems. I filled in the anonymous post-conference survey, but since my comments apply more widely than to a single instance of a single conference, I feel like forgoing my anonymity and re-posting them here (lightly edited). I hope they don't have too much of a sour-grapesy flavour—at least, that's not really how I'm feeling.

Something I didn't elaborate in my survey response: what would this “progressive model” be? It might involve reserving a smaller part of the programme for short presentations of recent journal publications, and a larger part of the programme for short presentations of high-quality in-progress work. This would be work at a substantially more advanced stage than a typical workshop paper—possibly reasonably mature already, but not yet accepted for publication in a journal. Initially, it seems important to ring-fence the slots for the latter kind of presentation, to avoid having already-published work always trump the not-yet-published stuff. Eventually, in a well-functioning community there would not be much call for the former kind of slot, since most journal publications would describe things that had been presented at previous conferences.

[/research] permanent link contact

Tue, 16 Dec 2014

For and against languages

A colleague recently summarised my research's position, not unkindly and partly in jest, as “against languages”. On one level there's some truth in that. I decry language-specific toolchains, runtimes, debuggers and package managers; I scoff at claimed benefits of being “100%” “pure” <your language here>; I am endlessly sceptical of anyone trumpeting one language too loudly. In my own work, I am preoccupied with language-agnostic runtimes, linkers and debuggers, cross-language composition and language-agnostic program analyses.

But that doesn't mean I dislike languages—far from it! Language innovations deliver their highest net value when we're free to pick the right one for the job, and when it's possible to adopt new languages without suffering massive pain. This also means it must be easy to migrate away from a language. It's only by humbly accepting any one language's inevitable imperfection that we can use each language to its best advantage, and can accommodate progress in languages generally.

Not all code is alike, and the world doesn't change overnight. We should therefore expect code in multiple languages to coexist. The alternative is getting stuck with the “current” and never making it into the “new”. We need to think of coexistence of languages not as a one-time transitional measure on our way to the perfect language, but as inevitable, ongoing reality. Even if we consider a particular language to be “legacy”, “outdated” or “broken”, we shouldn't make dealing with that language a second-class use case. If we are to have progress, we must give these scenarios first-class consideration. (This is why I lambast the status quo of “foreign function interfaces”, and in fact the very idea of them.)

So it's not that I'm against languages. I'm interested in making language innovations as useful as possible. I'd say that makes me more strongly in favour of languages than many people.

[/research] permanent link contact

Mon, 24 Nov 2014

I hate systems research... sort of

[I wrote this article on 6th August 2008, when I was a second-year PhD student, but then didn't have the guts to post it. Looking back, it's not so controversial, although it was obviously shaped by my (relatively narrow) experiences as a PhD student in a particular corner of the Networks and Operating Systems group here in Cambridge (a group I am very happy to be associated with, I should add).]

David Byrne, a well-known proponent of musics from around the world, once wrote an article entitled “I hate world music”. He hadn't experienced a sudden change in his taste, but was referring to the label of “world music”. By labelling all music that doesn't conform to a certain narrow set of Western-heritage styles as “world music”, it can be conveniently marginalised and dismissed as a unit.

So too in computer science research: labels are used to divide and marginalise. This time, it's the inside marginalising the outside. A recent EuroSys call for papers invited papers on “systems aspects of...” a long list of topics. Despite my having worked in a systems research group for nearly three years, nobody has ever told me what “systems aspects” are. Yet most people who'd describe themselves as “systems researchers” seem to wear that badge with some degree of pride. Until someone can tell me what it means and why it's such a good thing, I won't be doing the same.

What defines systems researchers? Is it that they “build stuff”? This is perhaps necessary but hardly sufficient. Is it that they meddle with operating systems, networking stacks or other infrastructure software? Perhaps, but it seems to cover a lot more ground than that to which the label is commonly applied. My preferred take is that “systems” is simply a label, of historical origins, which now identifies a research community—a community which, like all social networks, is defined only by the links among its members and the continuity of that relation's evolution over time. One thing that has continued to surprise and frustrate me about the world of research is that despite huge overlaps of interest, it is divided much more than it is united: divided into largely separate, non-cooperating communities. Many systems researchers would be horrified (thanks to a snobbery I will come back to) if you claimed that their work overlapped with that of the software engineering community. Yet having attended in the last year both EuroSys and ESEC/FSE, I estimate that at least 40% of the papers I saw at each of those conferences could easily have been presented at the other without any sense of incongruity whatsoever.

Last week at the Semantics Lunch, Mike Hicks gave an excellent talk about CMod, the module system for C developed at the University of Maryland. Unlike other Semantics Lunches, the advertisement was cc'd to my research group's list. I get the e-mails anyway because I'm subscribed to other lists, but the obvious inference is that somebody assumed systems researchers will be interested in any languages research that concerns C, but not in any other. That impression is certainly not fair, but, like most exaggerations, it does stem from a grain of truth.

One way of getting your tools- or languages-oriented paper published in the systems community is to show how it applies to “systems code”. This almost always means one of three things: performance-critical code, resource-constrained code, or code written in C. These are all pretty general requirements, and fully relevant to the widest imaginable “software engineering” community. Regarding the last of the three requirements, of course there are many great reasons for creating tools that work with C, given the astronomically huge volume of existing software that is written in it. Working with C is certainly a priority for my own work. However, everyone knows that C is a language with many gigantic problems, and ones that have been known for decades. Cynically, I would claim that part of the reason that C-language developments, both code and tools, continue to flourish within the “systems community” is an endemic disdain for programming language advances, or, more generally, for any research community that isn't “systems”.

This separatism comes in more than one form. There is a certain air of what I call “macho bullshit”, where, among other things, pride is attached to writing unnecessarily nasty code in unnecessarily low-level languages. Now, I have seen some exceptionally clean and well-engineered C programs, particularly ones coming out of my own research group, so this isn't any sort of blanket statement. I've also seen, again locally, many systems researchers who are heavily into language advances. But the other side of the coin is there too: the disdain and the separatism are definitely things I've observed around here, and I don't like them. They're joined by an attitude to criticism: there appears to be an unwelcome degree of sniping and dismissiveness, found in reviews and programme committees, together with a bizarre culture of unnecessary controversy, as typically found in workshops with “Hot” in their title. It seems there are too many systems researchers prepared to put forward any criticism of others' work, however invalid, and to pronounce any self-aggrandising position, however obtusely and misleadingly argued. This is, in a word, unprofessional, and it undermines the very goals of research. I don't have enough experience to know either how widespread these attitudes are, or how much more systems research suffers from them relative to other fields.

[/research] permanent link contact

Wed, 19 Nov 2014

How to write vaguely acceptable makefiles

It's not hard to write a clear, orderly makefile if you respect it as a programming job in its own right, and adopt some principles. I rarely see such principles written down. Here's what I've gravitated towards as I've learnt to use make more effectively.

[/devel] permanent link contact

Tue, 07 Oct 2014

Seven deadly sins of talking about “types”

[Update: this article has been translated into Japanese!]

My essay “In Search of Types” attempts to be a dispassionate review of some of the different concepts, purposes and attitudes surrounding the word “type” in programming. I imagine that my passions are still rather thinly veiled in places. But in this post I'm going to come out with a more unabashedly ranty take. Certain statements and attitudes really get up my nose. I was recently exposed to a few of these at Strange Loop (a great conference, I should add). So I've written down a list of “deadly sins” committed by many people (mis-)speaking about “types”.

I should add that the theme here is rhetoric. What gets up my nose is when people don't make honest, transparent arguments. Their conclusions needn't be wrong. I program in OCaml a reasonable amount, and it's plain that I get a lot of value from the type checker. But advocates of type systems often try to sell them as a miracle cure, without acknowledging their limitations and undesirable side effects. Can we please move the debate past propaganda and blanket statements?

A contributing factor is that we often struggle to disentangle the many distinct concepts that lurk under the word “type”. My essay tackles this entanglement more directly, although hopefully the following rants will make some useful distinctions.

Not distinguishing abstraction from checking

This is my number-one beef. Almost all languages offer data abstraction. Some languages proceed to build a static checking system on top of this. But please, please, don't confuse the two.

We see this equivocation used time and time again to make an entirely specious justification of one language or another. We see it used to talk up the benefits of type systems by hijacking the benefits of data abstraction, as if they were the same thing.

The discussion of “stringly-typed programming” (sic) at Strange Loop, both in the Snively/Laucher talk and in Chris Morgan's Rust-flavoured talk, was doing exactly this. Yes, it's true that HTTP headers (to borrow Morgan's example) can and should (usually) be abstracted beyond plain strings. But failing to do so is not an omission of compile-time type checking. It's an omission of data abstraction. Time and time again, we see people advancing the bogus argument that if you make the latter omission, you must have made the former. This is simply wrong. Type checkers are one way to gain assurance about software in advance of running it. Wonderful as they can be, they're not the only one. Please don't confuse the means with the ends.
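
To illustrate the distinction, here is a minimal C sketch of my own devising (not taken from either talk): a header treated as an abstract datum rather than a raw string. The benefit comes from funnelling construction through a validating interface; whether any compile-time checker polices that boundary is a separate question, and a dynamically checked language can enforce the same boundary at run time.

/* A header as an abstract datum, not a raw string (illustrative sketch).
 * All construction goes through header_parse(), which validates; clients
 * never poke at the representation directly. The value lies in this
 * abstraction boundary, independently of how (or whether) it is checked
 * statically. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

struct http_header { char *name; char *value; };   /* clients treat this as opaque */

struct http_header *header_parse(const char *line)
{
    const char *colon = strchr(line, ':');
    if (!colon || colon == line) return NULL;       /* malformed: reject at the boundary */
    struct http_header *h = malloc(sizeof *h);
    if (!h) return NULL;
    size_t name_len = (size_t)(colon - line);
    h->name = malloc(name_len + 1);
    memcpy(h->name, line, name_len);
    h->name[name_len] = '\0';
    while (*++colon == ' ') ;                       /* skip spaces after the colon */
    h->value = malloc(strlen(colon) + 1);
    strcpy(h->value, colon);
    return h;
}

int main(void)
{
    struct http_header *h = header_parse("Content-Length: 42");
    if (h) printf("name=%s value=%s\n", h->name, h->value);
    return 0;
}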

Pretending that syntactic overhead is the issue

The Snively/Laucher talk included a comment that people like dynamic languages because their syntax is terse. This might be true. But it made me uncomfortable, because it seems to be insinuating a different statement: that people dislike type systems only because of the syntactic overhead of annotations. This is false! Type systems don't just make you write annotations; they force you to structure your code around type-checkability. This is inevitable, since type checking is by definition a specific kind of syntactic reasoning.

To suggest that any distaste for types comes from annotation overhead is a way of dismissing it as a superficial issue. But it's not superficial. It's about the deep consequences type checking has on how code must be structured. It's also (perhaps ironically) about polymorphism. In a typed language, only polymorphism which can be proved correct is admissible. In an untyped language, arbitrarily complex polymorphism is expressible. Rich Hickey's transducers talk gave a nice example of how patterns of polymorphism which humans find comprehensible can easily become extremely hard to capture precisely in a logic. Usually they are captured only overapproximately, which, in turn, yields an inexpressive proof system. The end result is that some manifestly correct code does not type-check.

Patronising those outside the faith

If somebody stays away from typed languages, it doesn't mean they're stupid, or lack education, or are afraid of maths. There are entirely valid practical reasons for staying away. Please drop the condescending tone.

Presenting type-level programming as a good thing

No sane person can argue that we want to bifurcate a programming language into distinct base-level and type-level fragments. Gilad Bracha calls this the “shadow worlds” problem: to abstract over constructs at level 0, we need a whole new set of level-1 constructs, and so on. This is an antipattern. It's a failure of compositionality. That doesn't mean it can't be justified for pragmatic reasons. ML's module system is the way it is because nobody has figured out the maths to do it any better under the various design constraints of the system (which of course include a proof of soundness). But please, stop pretending that type-level programming is a desirable thing. It's a kludge, and hopefully a temporary one. Personally, I want to write all my specifications in more-or-less the same language I use to write ordinary code. It's the tool's job to prove correctness or, if it must, tell me that it failed. (It should also do so using a straightforward explanation, ideally featuring a counterexample or stripped-down unprovable proposition. I don't know any type checker that does this, currently.)

Fetishising Curry-Howard

Curry-Howard is an interesting mathematical equivalence, but it does not advance any argument about how to write software effectively.

Equivocating around “type-safety”

Here is a phrase which is used to mean many different things. Old-fashioned “type safety” is what I call “Saraswat-style”, after the famous “Java is not type-safe” memo. (Of course, it wasn't invented by Saraswat—it was the “standard” meaning of the phrase at the time.) It's a valuable property. But it's actually about memory, not data types: it's definable in terms of a language featuring only a “word” data type (using only minor tweaks of wording from Saraswat's). It's also nothing to do with static checking—“type safety” holds in all “dynamically typed” languages. The reason it's such a useful property is that it can be implemented without sacrificing a lot, unless you want to program close to the machine. Although many implementations happen to use syntactic compile-time reasoning, a.k.a. type checking, to lessen the run-time checking overheads, this is an implementation detail.

Saraswat-style type safety is usually a very good thing. But it's a far cry from proving near-arbitrary correctness properties of code. It's popular to confuse the issue, by saying that if you don't use type systems you don't have “type safety”. This wilful mixing of separate ideas is cashing in on the relatively uncontroversial good value of Saraswat-style safety, and using it to paint “type systems” as a similar no-brainer. Proof has a cost, and the appropriate expenditure depends on the task. It's far from a no-brainer.

Omitting the inconvenient truths

If everyone would just use a modern language with a fancy type system, all our correctness problems would be over, right? If you believe this, you've drunk far too much Kool-Aid. Yet, apparently our profession is full of willing cult members. Let me say it simply. Type systems cannot be a one-stop solution for specification and verification. They are limited by definition. They reason only syntactically, and they specify only at the granularity of expressions. They're still useful, but let's be realistic.

Try checking realistic reachability or liveness properties with a type checker, and you will not succeed. In a formal sense, this can be done, and has been, but the resulting system has little in common with type checking as practitioners would recognise it. The authors note that the ostensibly type-style specifications it yields are “bulky” and often “not suitable for human reasoning”. This is hardly surprising, since they are working against the grain of the program's existing syntactic decomposition. To specify and check liveness or reachability, we need to represent programs as some kind of flow graph, where flows are very likely to span multiple functions and multiple modules. It's not an expressionwise view of code, and it's also not the syntax in which most people want to code.

We need to grow up and accept this. Once we've accepted it, what do we do differently? We need a model that lets us assimilate gradual improvements in automatic reasoning, like better SMT solvers, without requiring us to re-type (pun intentional) our code. We need to be able to step up the strength of our specifications without switching language all the time. We need to integrate typed and untyped code, and between distinct languages too.

These ideas are slowly gaining some traction, such as in gradual typing, but have a long way to go. In particular, accommodating multiple languages has some far-reaching consequences for our infrastructure. So far, we implement languages in a particular way: relying on a single in-compiler type checker to establish whole-program invariants. If we embrace multiple languages, we can't build things this way any more. Instead, it's necessary to tackle invariant enforcement not within a per-language compile-time service but within a more generic composition-time service. (Hint: it's called a linker. In short, the linker-like service must do a mix of proof and instrumentation, in a language-neutral way, guided by descriptive information output by the compiler. If you've seen me talk recently, this will probably sound familiar. Ask me for more!)

Coding without proving everything we'd like to prove is an activity which will continue. Meanwhile, late binding, and hence limited applicability of truly ahead-of-time reasoning, is often a requirement, not an implementation decision. Eliminating these can bring benefits, but also can eliminate value. Weighing up costs and benefits, in a way that optimises overall value for a given product, is the right way to think about all these issues. It's a long way from the dogma, rhetoric and blind faith that currently dominate our discourse.


Now that I've ranted my heart out, here's a coda that attempts to be constructive. As I mentioned, part of the problem is that we equivocate when talking about “types”. Instead of saying “type” or “type system”, I encourage anyone to check whether anything from the following alternative vocabulary will communicate the intended meaning. Deciding which term is appropriate is a good mental exercise for figuring out what meaning is actually intended.

If you can think of others, do let me know!

[/research] permanent link contact

Mon, 06 Oct 2014

Project extra

I just thought of another nice project idea, so here it is.

A generic fuzz-tester, using DWARF

Fuzz-testing is a technique for randomised software testing. The software under test is run with randomly modified inputs, starting from existing test inputs. This covers more code paths, hence potentially finding more bugs, than using only human-crafted test inputs.
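
To make the basic technique concrete, here is a minimal mutation-based fuzzing loop in C. It is purely illustrative (real fuzzers are far more sophisticated): it flips a few random bits in an existing test input, runs the target on the mutated input, and reports runs in which the target dies on a signal.

/* A minimal mutation-based fuzzer sketch (illustrative only, not a serious tool):
 * take a seed input, flip a few random bits, run the target on the mutated file,
 * and report runs in which the target is killed by a signal (e.g. SIGSEGV). */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/wait.h>

int main(int argc, char **argv)
{
    if (argc != 3) { fprintf(stderr, "usage: %s <target-prog> <seed-file>\n", argv[0]); return 1; }
    FILE *f = fopen(argv[2], "rb");
    if (!f) { perror("fopen"); return 1; }
    fseek(f, 0, SEEK_END); long len = ftell(f); rewind(f);
    if (len <= 0) { fprintf(stderr, "empty seed\n"); return 1; }
    unsigned char *seed = malloc(len), *buf = malloc(len);
    if (fread(seed, 1, (size_t)len, f) != (size_t)len) { perror("fread"); return 1; }
    fclose(f);
    srand(time(NULL));
    for (int iter = 0; iter < 1000; ++iter)
    {
        memcpy(buf, seed, len);
        for (int i = 0; i < 4; ++i) buf[rand() % len] ^= 1u << (rand() % 8);  /* flip a few bits */
        FILE *out = fopen("fuzz-input.tmp", "wb");
        fwrite(buf, 1, len, out); fclose(out);
        pid_t pid = fork();
        if (pid == 0) { execlp(argv[1], argv[1], "fuzz-input.tmp", (char *)NULL); _exit(127); }
        int status; waitpid(pid, &status, 0);
        if (WIFSIGNALED(status))
            printf("iteration %d: target killed by signal %d\n", iter, WTERMSIG(status));
    }
    return 0;
}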

Typically, fuzzers are built around particular input domains. For example, we might write one fuzzer for C compilers which generates randomised C source files. We might write another fuzzer for X11 applications which randomly modifies the packets sent over the wire. (In fact the original fuzzer, xjig, did exactly this.) We might write yet another fuzzer for a particular library API, like John Regehr describes here. This works... but can we build a more powerful, more general fuzzing system than these per-domain solutions?

This project is about building a general tool for the latter scenario: fuzzing of library APIs. We want to be able to fuzz any library whose API we have a description of. For us, this will mean APIs described by compiler-generated DWARF debugging information. There are several technical steps to this. Firstly, we need the ability to observe and (optionally) capture API traces, typically from running an existing test suite. This can be done using something reasonably off-the-shelf like ltrace, although we might want to make some modifications to suit our purposes. Secondly, we need to perturb these traces somehow to generate randomised versions. These can be randomised both in terms of the calls made and the arguments passed; the project would need to investigate various randomisation strategies. Thirdly, we then execute these randomised traces and attempt to detect any errors that occur—perhaps assertion failures reported by the program itself, memory errors reported by Memcheck (of Valgrind fame), or type errors reported by libcrunch.

For evaluation, we would like to measure how much we improve test coverage, relative to the existing test suites. We could perhaps also compare this improvement against the improvement obtained by API-specific fuzzers, like Regehr's, hoping to do almost as well (or perhaps even better!).

One problem with fuzzing is false positives. It might be that our randomised changes yield traces which exercise the API in ways that aren't supposed to work. Normally we'd want only to generate traces in which the client stays in-spec, while perhaps leading the library itself to go out-of-spec (that's what a bug is!). (In security-flavoured fuzzing, the exposed attack surface is what's important, not the documented interface per se, but the same principle applies.) Such specifications are rarely written down precisely! So, extension-wise, an obvious direction is to refine our model of APIs to allow for more semantic constraints. To pick a trivial example, if we were testing the C library's malloc implementation, one constraint is that we shouldn't double-free a chunk. That's a little subtle—it's okay to free the same chunk of memory a second time, iff it was issued by malloc a second time! There is a lot of scope for investigating this kind of constraint, and, in general, for producing more sophisticated semantic models of APIs. There is a wealth of research about “API usage patterns”, but often using only a naive finite-state formalism that struggles to capture whole-trace properties (like the malloc one I just described). We could also investigate using Daikon, or some similar invariant generator, for inferring such invariants from collections of harvested traces.
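
To make that malloc() constraint concrete, here is a small example of my own devising, showing the trace shapes a mutation might produce:

/* Illustrating the double-free constraint on randomised traces (example of my
 * own devising). Freeing the same address twice is fine only if malloc() issued
 * that address a second time in between. */
#include <stdlib.h>

int main(void)
{
    void *p = malloc(16);
    free(p);              /* ok */
    void *q = malloc(16); /* may well return the same address as p */
    free(q);              /* ok, even if q == p: the chunk was re-issued */

    void *r = malloc(16);
    free(r);
    /* free(r); */        /* NOT ok: a mutated trace that replays this call is
                             out-of-spec for the client, so any failure it
                             provokes would be a false positive */
    return 0;
}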

[/research] permanent link contact

Mon, 29 Sep 2014

Progress by distillation

A theme of Joe Armstrong's Strange Loop keynote was that creating stuff doesn't necessarily make progress overall. Proliferation reduces value. Entropy is our enemy. He mentioned the bewildering array of build systems, packaging systems, unidentifiable system software (I never did catch what “grunt” is) and, more generally, copies of the same or similar code and data. Building a “compressor” was his likeably vague solution.

My talk arguably continued this theme. I was talking about proliferation of dynamic languages—in contrast to the original Smalltalk vision, which didn't envisage the host of broadly similar dynamic languages which now coexist alongside it.

An underrated way to make progress is by distillation: recognising repetition, and integrating and unifying existing systems where possible. We don't have to do so perfectly or completely. It doesn't have to be a Grand Unifying Design. We can work opportunistically. But we need both the right mindset and the right infrastructure. Tools for creating are not enough. Tools for integrating and distilling are necessary.

I seem to have written about this before. It's nice to see the same ideas cropping up in a keynote!

[/research] permanent link contact

Project ideas 2014–2015

I maintain a brief list of ideas for student projects which I'm always willing to supervise. Here I'll elaborate on some that I'm particularly keen on this year. As always: these are just a selection. I'm very interested in having students discuss their own ideas with me, or suggesting variations on what I have proposed. Please contact me if any of these ideas (or those on the linked page) is of interest. Most of these ideas could make Bachelor's or Master's projects, with appropriate tailoring which we would discuss.

Program slicing at debug time

Program slicing is a powerful program transformation. It seeks to create a minimal program that explains some strict subset of a program's state at a particular point in an execution, throwing away code that did not influence that subset. For example, suppose I'm debugging the following (very stupid) program.

y = 12;
z = y * foo();
x = y + 1;

I might see, when debugging, that after the third line, variable x has the value 42. I want to know how it got that value. A program slicer could show me a minimised version of my program that eliminated any statements that did not contribute to the value of x. Here, that's the middle statement (probably! that's if foo() did not change y somehow).

Although Mark Weiser's original paper emphasises the fact that programmers do program slicing in their heads when debugging, automated slicing still hasn't entered the usual repertoire of features that programmers find in their debuggers. This project will implement and evaluate a slicer that is designed to be usable from a debugger. The evaluation is likely to be a size comparison with an existing slicer—smaller slices are usually better. We can also measure the speed of the slicing algorithm, since to be usable interactively in a debugger it shouldn't be too slow.

A simple slicer is not too much work. There are also many directions for extensions. Precise slicing is undecidable in general, so all approaches are approximate. Better approximation through cleverer program analysis is the obvious avenue. Another is amorphous slicing, which explores semantics-preserving changes to the program's syntactic structure, in order to produce a still smaller program. A third is to use knowledge of the program state itself as an extra constraint on slicing—not just the position we want to slice at, but also the values of program variables. Slicing can be seen as a kind of backwards program execution, with a user-customisable degree of concretisation introduced. The “most amorphous”, “most concrete” slice is simply the relevant subset of the (branch-free) trace of program execution (if it's unique; else the union of all possible traces leading to that state). This might not be as useful as a slice preserving slightly more of the program's execution.

There is also an incidental problem that will need treating somehow. It is that most debugging infrastructure doesn't directly let you reconstruct the original source code that you're debugging (at least in the cases of native Unix-style debugging and Java debugging). Instead it enumerates the source files that went into each compilation unit, together with the line number ranges of each, and provides a shallow mapping of program counter values onto these files and lines. In Java this is very nearly enough; for C, it's less good because the preprocessing step is not explicitly described. So, there might be some insight to be had about how source files ought to be represented within debugging information. One extension might implement such an improvement, by judicious hacking on a compiler (most likely LLVM) to modify and/or supplement the debugging information that it currently outputs.

A garbage collector using liballocs

My work on liballocs is developing a run-time library that dynamically tracks allocations and their data types across (potentially) a whole process. This means you can ask it what's on the end of a pointer, and it can usually tell you—even for pointers to stack or heap memory, interior pointers, and so on. It can be thought of as implementing a reflection API, as a platform for building dynamic program analyses, and as a platform for implementing dynamic languages (see my talk!).

This project is about implementing a mark-and-sweep garbage collector on top of liballocs. This can be thought of as a replacement for traditional conservative collectors like the Boehm collector. If liballocs's view of the process is completely accurate, and no pointers are computed in ways that we didn't anticipate, then we obtain a precise collector. To deal with unusual address computations or tricky encodings of pointers, we might nevertheless need to build in some conservatism.

In any case, the liballocs API allows a collector to be more “federated” in its design. Traditional collectors are owned by a single virtual machine (VM) which makes strong assumptions about how objects are laid out, what addresses are issued, what parts of the process can point into “their” heap (the “roots” of tracing), and so on. With liballocs, a collector can make a reasonable attempt at tracing native data structures directly, without knowing in advance how they are laid out—the library returns metadata describing the layouts. This allows tracking reachability over paths that stray outside a single region and then reach back in—for example, from the collected heap into the malloc() heap and back.
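
To give a flavour of what this might look like, here is a sketch of a mark phase written against a made-up metadata interface. The function and type names (get_alloc_info, type_pointer_offsets, test_and_set_mark) are placeholders of my own invention, not the real liballocs API; the point is the shape of the algorithm.

/* Sketch of the mark phase of a metadata-driven collector. The metadata calls
 * below are PLACEHOLDERS standing in for whatever query interface the real
 * library provides. */
#include <stddef.h>
#include <stdbool.h>

struct type_info;   /* opaque layout descriptor (placeholder) */
struct alloc_info { void *base; size_t size; const struct type_info *type; };

extern bool   get_alloc_info(const void *ptr, struct alloc_info *out);    /* placeholder */
extern size_t type_pointer_offsets(const struct type_info *t, size_t size,
                                   size_t *offsets, size_t max);          /* placeholder */
extern bool   test_and_set_mark(void *base);   /* placeholder: per-allocation mark bit */

static void mark_from(const void *ptr)
{
    struct alloc_info info;
    if (!get_alloc_info(ptr, &info)) return;     /* not an allocation we know about */
    if (test_and_set_mark(info.base)) return;    /* already visited */

    /* Ask the metadata where the pointer-typed fields live, then recurse.
     * This is what lets tracing stray into, say, malloc()'d structures whose
     * layout the collector itself knows nothing about. */
    size_t offsets[64];
    size_t n = type_pointer_offsets(info.type, info.size, offsets, 64);
    for (size_t i = 0; i < n; ++i)
    {
        void *target = *(void **)((char *)info.base + offsets[i]);
        if (target) mark_from(target);
    }
}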

This “federated” design is potentially (i.e. in future work!) an enabler of reasonably seamless programming in a mix of languages, or when mixing manual with automatic storage management. For example, interfaces like JNI force the programmer to explicitly signal to the collector when native code takes a reference to a Java object, by creating a “global reference” (which is managed manually). With liballocs, the collector can simply trace the native data structures too.

Although liballocs can usually tell you what data type is instantiated on the end of a pointer, there are a few exceptions to this (such as on-stack temporaries), and this isn't quite all the information we need. For collectors, it matters whether storage is initialised (hence meaningful) or not, and whether a value is live or dead. It also matters what funky address computations a program might do (all the way up to extreme cases like XORed pointers!). A large part of this project will mean finding some kind of solution to these, as well as gathering an understanding of when they do and don't matter in practice. Most of these issues have “big hammer” solutions which work fairly well but are slow (like replacing malloc() with calloc() to avoid uninitialised heap data) and also more clever solutions (like skipping this if we can prove that an allocation site always initialises the whole heap block before the next safepoint). So there is plenty of scope to pick extensions to the core project.

Obvious comparisons for evaluation are the Boehm GC, and also perhaps the garbage collectors of the Go or D languages. We can compare in both speed and precision (or, inversely, conservatism). The Boehm GC is well known for having observable conservatism, keeping objects alive when they are not reachable, because of integers that alias pointers. We would expect to do better than Boehm for precision, while retaining comparable performance (within a factor of 2 for total memory management time overhead, say). An MPhil project would set more ambitious goals and use more advanced garbage collection algorithms (e.g. a generational and/or compacting collector).

A bounds checker in libcrunch

My work on libcrunch uses liballocs to implement run-time type checking for C (and other languages, like C++ and Fortran, in due course). The basic idea is that whenever a pointer cast occurs, we check that the cast-to type “matches” what's on the end of the pointer (for a sufficiently refined notion of match). One weakness of libcrunch is that it assumes the program is memory-safe, both spatially and temporally. Put differently, it will catch type errors caused by bad cast logic (analogous to a ClassCastException in Java) but not those occurring as a consequence of buffer overflows (“spatial” errors), nor use-after-free or use-before-initialize behaviours (“temporal” errors). This project will implement spatial correctness checking that integrates with libcrunch.

Unlike conventional spatial bounds checkers, like SoftBound or ASan, we have type information (from liballocs) so can do better in various ways. Most significantly, there is no need for per-pointer metadata; per-object type metadata is enough (this holds for C; ask me for the detailed reason why!). This means we don't need to move metadata around as we move pointers around, so we should usually see some performance wins. It also doesn't matter if we haven't instrumented all code; as long as we observe allocations, we will have the necessary metadata. However, there are some drawbacks: querying pointer metadata will involve a short search, via the containing object's metadata, rather than a direct-mapped lookup as in SoftBound. The main target of the project should therefore be to go roughly as fast as SoftBound. Lower memory overhead and lower virtual address space usage are also likely benefits of this approach; these can be measured.
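
As a flavour of the approach, here is a sketch of the check that instrumentation might insert at a pointer-arithmetic site. It is written against a placeholder metadata query (lookup_object_bounds is my own invented name, not the real libcrunch or liballocs interface): no per-pointer bounds are carried around; instead the containing allocation is looked up when a derived pointer is computed.

/* Sketch of a per-object-metadata bounds check (placeholder API, not real
 * libcrunch code). When pointer arithmetic happens, look up the allocation
 * containing the source pointer and check that the derived pointer stays
 * within it. */
#include <stddef.h>
#include <stdbool.h>
#include <stdlib.h>

/* placeholder: find the allocation containing ptr, if we have metadata for it */
extern bool lookup_object_bounds(const void *ptr, const void **base, size_t *size);

static void *checked_add(void *derived_from, ptrdiff_t delta_bytes)
{
    const void *base; size_t size;
    char *result = (char *)derived_from + delta_bytes;
    if (lookup_object_bounds(derived_from, &base, &size))
    {
        /* allow one-past-the-end, as C does */
        if (result < (const char *)base || result > (const char *)base + size)
            abort();   /* or report and keep going, in a "soft" mode */
    }
    /* if we have no metadata (uninstrumented allocator, say), let it pass */
    return result;
}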

System call interposition, specification and emulation

To understand the precise semantics of a user-level binary program running atop a commodity operating system, we must understand not only the instructions it executes but also the system calls it makes. Currently, we are making progress on formally specifying instruction set architectures, but the system call interface remains surprisingly murky. What system calls may a program make? What, precisely, are the valid arguments to each system call, and what do they mean? Given some arguments, what memory might a given system call touch? What other effects might it have on the program state?

This project will build a toolkit for observing, specifying, intercepting and emulating system calls as made by real, large programs. As a starting point, we have a basic (rather limited) preexisting infrastructure for trapping system calls in an x86-64 Linux program, and a sketch of a domain-specific language (not set in stone) for describing a system call interface. The project will produce a usable end-to-end system for intercepting, specifying and emulating system calls. In a basic usage, we would pass the calls through to the underlying operating system and observe them (e.g. using SystemTap or DTrace) to check the accuracy of our model. In a more advanced usage we would instead be running the program in an emulator, and would update the emulator's state to reflect the effect of the system call. The intention is to have this work for various pairings of OS kernel (Linux, FreeBSD) and instruction set architecture (Intel, Power, MIPS, ARM), although at least initially, some pairings are more important than others.
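
For flavour, here is roughly how a child's system calls can be trapped on x86-64 Linux using ptrace. This is a standard technique shown purely for illustration, and rather cruder than the trapping infrastructure mentioned above (for one thing, it doesn't distinguish syscall stops from signal stops).

/* Minimal syscall tracer for x86-64 Linux: runs a command and prints the
 * syscall number at each syscall entry. Illustrative only. */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/ptrace.h>
#include <sys/user.h>
#include <sys/wait.h>

int main(int argc, char **argv)
{
    if (argc < 2) { fprintf(stderr, "usage: %s cmd [args...]\n", argv[0]); return 1; }
    pid_t child = fork();
    if (child == 0)
    {
        ptrace(PTRACE_TRACEME, 0, NULL, NULL);   /* ask to be traced */
        execvp(argv[1], &argv[1]);
        perror("execvp"); _exit(127);
    }
    int status;
    waitpid(child, &status, 0);                  /* child stops at exec */
    int entering = 1;
    while (1)
    {
        ptrace(PTRACE_SYSCALL, child, NULL, NULL);  /* run to the next syscall boundary */
        waitpid(child, &status, 0);
        if (WIFEXITED(status) || WIFSIGNALED(status)) break;
        if (entering)
        {
            struct user_regs_struct regs;
            ptrace(PTRACE_GETREGS, child, NULL, &regs);
            printf("syscall %llu\n", regs.orig_rax);  /* args live in rdi, rsi, rdx, ... */
        }
        entering = !entering;                    /* stops alternate entry/exit */
    }
    return 0;
}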

This project is very researchy, and has both practical and theoretical aspects. It's motivated by the research we're doing in the REMS project.

A linker, mostly functionally

Much like compilers, linkers have a crucial influence on the meaning of the eventual binary program. Current linkers are big piles of imperative code, with little explicit specification. This project concerns building either a static or dynamic linker for ELF binaries in a mostly functional style, in a way that will yield a clear specification of the various parts (relocation, address assignment, section concatenation or merging, etc.). Ideally the linker would be written as an “executable specification”, likely in the Lem language, making explicit the points where nondeterministic choice is taken within the written specification (as far as it exists).

Either a static or dynamic linker can be attempted; in the dynamic case, we have a very basic skeleton already. It won't be feasible to implement a fully-featured linker, but we will carve out some subset depending on interests.

An optimising linker

Traditionally, linkers do not understand the instructions in a program. Instead, they blindly patch binary code in predefined ways, such as “add n bytes to the four-byte signed integer at this address” (which happens to be the relative-address field inside a call instruction, say).
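
Concretely, applying such a relocation really is just arithmetic on bytes. A simplified sketch for a PC-relative 32-bit reference might look like the following (my own rendering, not any particular linker's code).

/* Applying an R_X86_64_PC32-style relocation: patch a four-byte signed field so
 * that it holds the symbol's address relative to the patched location. No
 * instruction decoding is involved; the linker just does blind byte patching. */
#include <stdint.h>
#include <string.h>

void apply_pc32_reloc(unsigned char *image,     /* base of the loaded/linked image */
                      uint64_t reloc_offset,    /* where the 4-byte field lives */
                      uint64_t symbol_addr,     /* resolved address of the symbol */
                      int64_t addend)           /* typically -4 for a call operand */
{
    uint64_t place = (uint64_t)(uintptr_t)image + reloc_offset;
    int32_t value = (int32_t)(symbol_addr + addend - place);   /* S + A - P */
    memcpy(image + reloc_offset, &value, sizeof value);
}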

However, a linker is potentially a very good place to do optimisation, because interprocedural flows are necessarily visible to it. A dynamic linker is potentially a very good place to do dynamic optimisation, because it can observe code being loaded and unloaded, hence can perform optimisations speculatively yet safely. So, there is a case that linkers should understand instruction streams.

Current toolchain support for link-time optimisation (LTO), in both gcc and llvm, is limited by working on the toolchain's intermediate representation. This IR must somehow be propagated into the input binaries—the binaries must be “compiled for” link-time optimisation. An alternative approach is to bite the bullet and teach the linker about instruction streams, so that it can disassemble and re-optimise the instructions directly, likely using debugging information as a source of type information where this is helpful.

Some other interesting applications of link-time instruction stream rewriting include whole-program instrumentation (e.g. to intercept certain procedure calls, system calls, or certain memory accesses, such as a generational garbage collector's write barrier), reference widening (to overcome the complexity of code models) and speculative dynamic optimisation (e.g. to do “devirtualisation” of indirect call instructions). One of these could perhaps be addressed as an extension.

The chief evaluation will be on performance improvements. There are also benefits in terms of binary size, relative to traditional LTO toolchains, which can be measured too.

[Update: I added another project suggestion in a separate post.]

[/research] permanent link contact

Wed, 02 Jul 2014

Why isn't verification standard practice?

Yesterday, Steve Crocker gave a very stimulating talk sharing thoughts—and also soliciting thoughts—about why verification isn't standard practice. He began with an anecdote about last year's BIND vulnerability (CVE-2013-2266) that could allow a single malicious packet to crash a DNS server. The problem is a simple bug in a regular expression library. Malicious clients can exploit the bug to make the process allocate huge amounts of memory, and hence kill it. How can we justify the fact that such simple bugs get into deployed code? Why can't we verify the absence of at least these simple bugs in anything that we deploy?

There were many useful contributions from the audience. It took me a while to collect my thoughts, but here are some personal responses that weren't discussed at the time.

Coding versus deployment

The BIND problem is primarily one of deploying shoddy software, and only secondarily one of programming shoddy software. Even with the best possible tools, making something robust will always take time and effort. Our problem is that we don't have good ways by which to make an informed decision about whether something is good enough (and if not, what needs fixing). This is re-stating a theme of the discussion: that our problem is at best only partly a technical one. I do believe that technical solutions and cultural solutions must go hand-in-hand: changing the technology can change the culture.

Proof as an activity much like programming

Considerable progress has been made in proof assistants like Coq and Isabelle, where proof starts to look like a programming task. In particular, the machine helps us do it and helps us check it. This is a really useful advance. Programmers are already used to satisfying a proof procedure, if they use any language with a type checker. But that doesn't mean we need to restrict ourselves to designing all our proof systems to be type checkers or things like them! I'll elaborate on this shortly.

One size doesn't fit all

I believe it's too limiting to expect a single compile-time analysis to tackle all proving and proof-checking workloads. If we write some immature code that's doing something subtle, we might expect machine proof (or disproof) to take some time. As it gets more mature, we can add more annotations and generally structure the code better, to help make the proof fast. We can't just forbid ourselves from writing or executing immature code. Of course if we consider the BIND scenario, meaning the case of mature, production software, then the deployed code should be sufficiently mature that a fast compile-time analysis is capable of producing and/or checking the proof. But we need tools that let us progress code across the maturity spectrum, not just demand a fixed level of maturity.

Language-independent analyses

One of the reasons that we get hung up on language so easily is that static analysis systems like to have a source-level input language. There's no reason why they need to, though. I'd argue that we programming researchers have never been very good at recognising the need to accommodate diversity of programming languages in most of the infrastructure we design. This can and should change. (It's something I'm working on, intermittently.) One approach I think makes sense is “binary-level analysis, source-level properties”. We can annotate source code and push those annotations through into binaries, and check that they hold at the binary level. Binaries are what is deployed, after all, and as I've argued already, deployment is the point where the assurances are most needed. It also defuses complaints about the ambiguity of source languages. Binaries are a lot less ambiguous (and we're working on improving that too). While source-level correctness for all deployment environments, i.e. “correctness portability”, is a nice property, it's a harder problem and not always important. We should walk before we run.

Simple and not-so-simple bugs

Is it enough to consider only the simple bugs? During the talk, Jon Crowcroft rightly put it like this (I'm severely paraphrasing): rewriting everything in a better language would bring us to the next level of problems, and these would be more interesting than most of the ones we currently face, but (by implication) such problems will exist. It's not clear to me that a similarly nightmarish scenario could not occur in BIND owing to some much more subtle bug than a buffer overflow (or similar). If what I just said happens not to be true for BIND (again, following Jon's comments that DNS and other core internet services are simple), it's probably true for some other critical system (like the web).

The static-to-dynamic continuum

Some discussion of assertions during the talk was interesting. One audience member commented that some software crashes more than it would if it didn't contain assertions, either because the assertions are wrong or are not sufficiently critical properties to justify termination of the program. I find it hard to blame the assertions for this. It's true that if we must tolerate a risk that assertions might fail in production code, “log and [attempt to] continue” is a marginally better strategy than “abort”. But Steve Crocker countered that for critical, deployed software, assertions should be proved to hold—a position I agree with. That's a moral “should”; in practice we have to be very clever about how we enforce this. For example, we wouldn't want to unwittingly encourage programmers to delete assertions that they couldn't prove to hold. More importantly, to repeat my last paragraph, we need to allow developers to progress a codebase from immature to mature states. We might only deploy it when it's mature, but for development, we need to have something which admits run-time failures, yet is still executable. This is another reason why I believe a fixed compile-time analysis isn't the right approach, and that we should instead choose the level of assurance we want at deployment time. The tool that establishes (or denies) this assurance might also be a compiler, but needn't be.

Assertion languages, not annotation languages

Even though we don't currently prove assertions to hold, I'd argue that assertions are a great success because they elicit specification from programmers. Most programmers use them, and they elicit a lot of subtle properties which might otherwise not get explicitly written down. Improving the expressiveness of assertions is an approach that I'm particularly keen on. The TESLA work of Jon Anderson, Robert Watson and colleagues, on temporal assertions, is a great example. I have some work brewing about expressing type correctness using assertions.

The neat thing about assertions is that they are expressed in the programming language, and easily understood by any programmer. I believe that building specification and verification infrastructure out of programmer-friendly constructs, like assertions, is a necessary step to making them standard practice. We should avoid forcing programmers to use formalisms that don't map very clearly to the program. Again, TESLA is an example of this. Reportedly, programmers find it easy to write TESLA automata. They might not find the same thing easy to write in LTL or CTL*, even though those formalisms might be more friendly to somebody who writes verification algorithms. So, when I hear mention of “annotation languages” as a separate thing that needs to be bolted on to a source language, my reaction is to ask whether these annotations can be expressed in the source language itself, augmented by only a minimal and programmer-understandable set of new primitives. In my book, such a property then becomes an assertion rather than an annotation.

The performance culture

This is a bit of a departure but I can't help throwing it in. It's generally considered that infrastructure that slows programs down more than a few percent is not acceptable for deployment use. This is in spite of the relative plenty of computing power and the relatively huge expense of simple bugs like buffer overflows. Suppose we reach a point where all the “basic” bugs are routinely proved absent via verification. There might be some less basic properties which we can only check dynamically at nontrivial cost. Do we leave the checks in, or not? I'd argue that the performance nuts are usually getting the calculations wrong. The cost of extra servers pales next to the cost of downtime, security compromises and the like, and also next to the cost of debugging subtle failures that are not caught cleanly. Unfortunately, saying that the slowdown is too great is still a popular way to reject research papers. The commonly accepted criteria about “suitability for deployment” are finger-in-the-air stuff.

The I/O bottleneck

This was the least obvious of my thoughts, but is for me one of the most important. Let's consider the Heartbleed bug, which is triggered when a malicious client supplies a bogus length field and causes the server to overrun its buffer. This particular overrun would have been prevented by a memory-safe language, because the buffer copy would be bounds-checked. But why do people write code at the level of bytes and buffers anyway? Despite what some people write, “type safety” is not quite the issue here. Bytes and buffers are fundamentally semantically impoverished things, and yet we find ourselves writing this kind of code in every language, no matter how “type safe”, and no matter how expressive its data abstraction features might be. Coding at the level of bytes and buffers is always error-prone, even if one or other kind of error is ruled out by one or other kind of check mandated by a given language.

The reason we do it is because of I/O. I/O happens on bytes and buffers, because that's what the libraries expose. In turn, they blame it on the operating system: that's what Unix exposes. Once upon a time there was a lot of work on persistent programming languages, but they have not taken off. I suspect it's because they require too much buy-in. We don't want our data to be an opaque box that is managed only by a language runtime. What we need instead is to add specification to the byte-streams we already have. If we produce or consume byte-streams, we should specify their format, both syntactically and in terms of some basic semantic properties. If we're OpenSSL, we'd specify that the first two bytes of the heartbeat message are a length field, and that it should be no greater than the remaining length of the buffer. This can already be expressed in languages like Datascript. We might be able to come up with something even better (ask me!) but in any case, we also now need to push this kind of specification facility deeper into our libraries, language runtimes and operating systems. Once we've done so, it's trivial to insert bounds checks. We can even check subobject bounds inside the payload, i.e. referential integrity within the payload itself. In general, once we know the meaning of the bytes, we can actually treat I/O payload data abstractly. By contrast, even a “type safe” language currently only achieves abstraction by copying the data into data structures that it manages. Doing so unavoidably requires interpreting those bytes, and doing so unsafely—any checks that are done are entirely down to the programmer.
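
To sketch what such a specification buys, here is the kind of check that falls out once the length field's meaning is declared. This is a simplified rendering of the heartbeat case, of my own devising; it is not OpenSSL's actual code or exact wire format.

/* Simplified sketch of a heartbeat-style message whose length field is specified
 * to bound the payload. Once the meaning of the bytes is declared, the bounds
 * check is trivial to insert mechanically. */
#include <stdint.h>
#include <stddef.h>
#include <stdbool.h>

struct heartbeat { uint16_t payload_len; const unsigned char *payload; };

/* parse buf[0..buf_len), refusing any message whose declared payload length
 * exceeds what the buffer actually contains */
bool parse_heartbeat(const unsigned char *buf, size_t buf_len, struct heartbeat *out)
{
    if (buf_len < 2) return false;
    uint16_t declared = (uint16_t)((buf[0] << 8) | buf[1]);   /* big-endian length field */
    if ((size_t)declared > buf_len - 2) return false;         /* the check Heartbleed lacked */
    out->payload_len = declared;
    out->payload = buf + 2;
    return true;
}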

It's no coincidence that I've advocated this before. I'm working on it—after I work on all the other things that come before it in my list. If somebody wants to join me, that would be great....

[/research] permanent link contact

Tue, 01 Jul 2014

Drop-in debugging

If you're debugging a problem that occurs in some process deep in a process tree, getting a debugger onto that process can be hard. I recently wrote a script called gdbrun that can help with this. The idea is that for any command cmd args, you can do gdbrun cmd args and it will run the program under a debugger, while externally behaving just like the original program. The exception is if the program doesn't terminate properly: in that case the debugger sticks around for you to interact with. (If you need a more complex condition than “doesn't terminate properly”, you can code that into the script too.)

This isn't ideal—say, if you want to debug only one invocation of that binary among many. But it works in many cases. (I'm also working on a solution for the harder case, but one thing at a time.) The script is deceptively simple. Sadly it needs GNU bash, but for any non-GNU people it should be easy enough to port.


#!/bin/bash
# gdbrun: run the given command under gdb, behaving externally like the command itself

# keep our original stdin/stdout/stderr on spare descriptors, for the inferior to use
exec 7<&0
exec 8>&1
exec 9>&2

quote () {
    sed "s/\([\`\\\\\\\"\\\$]\)/\\\\\\1/g"
}

# the program to run is the first argument; the rest become its arguments
exe="$1"
shift
exit_status_file="$( mktemp )"   # scratch file for collecting the inferior's exit status

argstring=""
ctr=0
while true; do
    if [[ -n "$1" ]]; then
        # don't use 'echo' to generate input for quote, otherwise "-e" will be disappear'd
        argstring="${argstring:+${argstring} }\"$( cat <<<"$1" | quote )\""
    fi
    ctr=$(( $ctr + 1 ))
    shift || break
done

# can add -hold here to help debugging
xterm -e gdb \
    --eval-command "file $exe" \
    --eval-command "run $argstring </dev/fd/7 >/dev/fd/8 2>/dev/fd/9" \
    --eval-command "set logging file ${exit_status_file}" \
    --eval-command "set logging on" \
    --eval-command "print \$_exitcode" \
    --eval-command "quit \$_exitcode" </dev/null

inferior_status=$(cat ${exit_status_file} | sed 's/.* = //' )
exit ${inferior_status:-0}

What was hard? Firstly, it took some head-bashing to realise that sharing a terminal between the debugger and inferior wasn't going to work. That's why the script uses xterm—popping up terminal windows for an instant is a minor annoyance, but tolerable. I can't say for certain that a shared-terminal solution isn't possible, but the usual trick of stty -tostop wasn't enough to stop child processes hanging from contention for the terminal, so I gave up.

The second hard thing, and the hardest, was realising that I could thread duplicated file descriptors right down to the inferior, and open them with /dev/fd/7 and friends. This was non-obvious to poor old me. I first tried a solution with named pipes and cat processes forwarding data from/to the gdbrun script's streams to the inferior's streams. After much gnashing of teeth, this turned out to be unworkable. The reason is that pipes are asynchronous channels, meaning they're buffered. A consequence is that even if the inferior doesn't read from its input, some input data will be buffered in the pipe when it gets opened (actually it's opened by the shell that launches the inferior, using gdb's run <infile syntax), and then discarded when the inferior dies. So if you have a command like echo blah |(gdbrun true; cat) your blah will get gobbled into oblivion. The moral of the tale is that introducing cats and pipes is a recipe for mischief.

[/devel] permanent link contact

Fri, 13 Jun 2014

Linking and loading: what's incidental?

Not long ago I gave a couple of talks about linking, loading and debugging here in Cambridge. It's a notoriously grungy area of our infrastructure, and a recurring question people asked me after the talks was as follows: what's the incidental complexity and what's the intrinsic complexity? Put differently: how might we redesign the whole lot nowadays, to get a less complex end result?

It's hard to answer, and partly it depends on what you value. On the surface, it seems like linking should be conceptually simple. Digging deeper though, it becomes apparent that there are a lot of concerns involved, and they are often in tension, making it very tricky for any simple solution to satisfy all requirements. That doesn't mean that simpler solutions are not possible, but we should not underestimate the intrinsic complexity either.

What intrinsic complexity am I talking about? I can see four concerns central to the design of ELF dynamic linking. One is link speed: linking should be as fast as possible. Another is memory sharing: shared libraries were introduced (in part) to save memory. Another is flexibility, achieved through interposability: it should be possible to customise the link to add or replace features without modifying binaries. The last is federation: ELF is shared by many vendors' toolchains and many OSes, so the ELF runtime necessarily has a descriptive role, which is the basis for cross-toolchain runtime protocols for stack walking, exception handling and the like.

Part of the problem is that many people don't value all four of these concerns simultaneously. Link speed is valued by software distributors (who want to cut down startup times) and people who use embedded platforms (who can't afford overheads), but not workaday developers or, directly, users. Memory sharing (hence position independence) is valued by people who run the kind of systems where it saves memory, like desktop systems, and not by those who don't, like most servers. (Plan 9's decision not to bother with shared libraries is both reasonable, if you run mostly servers, and unreasonable, if you run mostly large desktop applications.) Interposability is valued by people who write tools and those with unusual requirements—which often entail reimplementing selected parts of core libraries (think malloc())—but not by those who don't. Federation is valued by those who value “openness”—competition among vendors, interoperation of tools, multi-language programming, and so on—but not by those who are comfortable with closed platforms, VM-style designs and/or “one true language”.

Few people care about all things simultaneously, so linking seems overcomplicated to everyone. But all these concerns add to complexity individually, and the tension between them also increases complexity. Link speed concerns create incentives for doing more work in the compiler or compile-time linker. This creates “premature optimisation” phenomena which complicate linker implementations (by adding to the variety of relocations) and also complicate the user-facing parts of the linker (such as getting the link options right). Sharing memory entails position independence; this creates complexity for the compiler author, who must support multiple code models, and the user who must grok the related compiler options. Interposition creates complexity for the linker author and user, and costs in performance since it prevents some binding happening earlier than it otherwise could. Federation adds complexity for the compiler author, who must take care to correctly follow conventions and emit metadata such as unwind information and debugging information. It also brings some cost at run time, such as how a cross-language cross-vendor exception protocol is slower than a custom one.

What design decisions can we rethink to get a system that satisfies all these requirements but with less incidental complexity? I believe that a smarter linker could help. The Unix-like linker design, as evolved to elaborate formats such as ELF, can be described as “smart format, dumb linker”. A key “dumbness” of the linker is that it does not know about instruction streams. Rather, the linking format describes all the ways that relocations can be applied as simple arithmetic on byte sequences. The linker's output is a mostly-deterministic function of its inputs (including linker scripts).

If we had a smart linker, we could have slightly dumber compilers—a win for simplicity since there are many more compiler implementations than linker implementations. We could also have less user-facing complexity: the compiler could just output code according to the slowest but most flexible code model, and the linker could take care of optimisations at the instruction level, including position independence. The key difference between such a linker and the current one is that a linker that can do these things must be able to parse and rewrite instruction streams. Current designs have been crafted to avoid this, but I believe this is a good fit for linking and not the burden that it first appears to be.

A smarter linker needn't yield slower programs than a dumb one. In fact it could speed them up. Link-time optimisation is rarely used at present, and not at all across dynamic linking. Meanwhile, interposition has nontrivial run-time costs in some cases, since symbol bindings must be computed at load time and run time. A smarter linker could use its awareness of instruction streams to deploy a variety of strategies to speculatively bind, cache and unbind references, both in disk images (which can be thought of as a less ad-hoc prelink) and in memory (e.g. inlining calls that currently go through the PLT). The former option can be thought of as link-time deoptimisation: we store on disk a binary that has been adaptively optimised for the common case on the particular system, and then if we find that an unusual link requirement comes along, the linker can produce a more flexible deoptimised version. The idea of “linkers as servers” has been explored before (e.g. in the OMOS linker, from Utah) but not in this particular way.

Compatibility concerns have also greatly complicated linking. Before shared libraries, things were fairly simple. To link against the C library, you link with -lc. This is just a bunch of object code in an archive. Contrast that with now. Shared libraries have been added in a “compatible” way that sits alongside, but doesn't replace, archives. As a result, we use both, and get double the complexity. For example, on GNU platforms we mostly use shared libraries, but libc_nonshared and -lgcc et al complicate all that. Since “compatible” shared libraries could not quite offer a complete solution for replacing the old-style linking, we have bifurcated the previous clean static linking model into a mostly-shared-but-partly-static model. A smarter linker could internalise the goal of code sharing. By rewriting instruction streams at load time, any compiled code is potentially shareable, and it's the linker's job to extract an adequate degree of sharing. There's no need for the user-facing complexity of distinguishing shared from non-shared linking.

Shared libraries themselves must bring some added complexity, since being (logically) shareable is a stronger requirement on a chunk of code than having it work only in a single context. In particular, robust shared libraries require symbol versioning or some other notion of “semantics”, to account for the inevitable evolution of software. Current designs for symbol versioning are themselves a great example of unnecessary complexity. Again, they were designed to be a transparent extension to what already existed, namely versionless shared libraries. This has yielded an overly circuitous solution. By pushing symbol versions into the symbol name, a lot of complexity is pushed to other contexts. Most irritatingly, clients of dlsym() are essentially broken by it, because they must deal with symbol names munged to include versions (or must switch to using dlvsym()). The “exact match” semantics offered by dlsym() no longer offer the right abstraction when looking up symbols provided by versioned libraries. But this solution was accepted because it didn't syntactically break anything (and semantics are the user's problem, right?). The ideal solution would look up symbols not (just) by name but also (partly) by their semantics.
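
To illustrate what versioned lookup asks of clients, here is a hedged sketch. It assumes an x86-64 glibc system, where memcpy happens to be exported at more than one version (GLIBC_2.2.5 and GLIBC_2.14 are the usual pair there); the version strings are an assumption of the example, not a portable interface, and older systems may need -ldl at link time.

/* versions.c: looking up a versioned symbol by hand.
   Build (assumes GNU/Linux with glibc):
     gcc -o versions versions.c    (add -ldl on glibc older than 2.34) */
#define _GNU_SOURCE
#include <stdio.h>
#include <dlfcn.h>

int main(void)
{
    void *libc = dlopen("libc.so.6", RTLD_NOW);
    if (!libc) { fprintf(stderr, "%s\n", dlerror()); return 1; }

    /* Plain dlsym(): we get whichever definition the library marks as
       the default version; the bare name cannot select any other. */
    void *dflt = dlsym(libc, "memcpy");

    /* dlvsym() names a version explicitly. These strings are what
       glibc happens to use on x86-64; they are an assumption here. */
    void *v_old = dlvsym(libc, "memcpy", "GLIBC_2.2.5");
    void *v_new = dlvsym(libc, "memcpy", "GLIBC_2.14");

    printf("default %p, GLIBC_2.2.5 %p, GLIBC_2.14 %p\n", dflt, v_old, v_new);
    return 0;
}

The point is not that dlvsym() is hard to call; it's that ordinary dlsym() clients silently get whichever default the library author chose, and any code that needs to name a version has already left the portable interface behind.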

Interposition is also arguably not a fully realised idea. One problem is that not every symbol is interposable, because of the static/dynamic split I mentioned earlier: some programs, and some parts of most programs, are still linked statically. Moreover, “interposition misses” easily occur where the interposer doesn't interpose on all the def–use edges it's supposed to. For example, in glibc, many library-internal calls to low-level functions like mmap() don't go via the publicly exposed symbol. Calls that appear to be to open() might actually be linked to a mysterious alias of that symbol, such as __open_2(). Although interposition is still useful, the possibility of “missed” calls and the need to grok undocumented binary interfaces like __open_2() make interposition-based software fragile and expensive to maintain. This could be fixed by a more uniform dynamic linking model and a more semantically founded notion of interposition. One that might work would be interposition based on code paths, not entry points. Again, implementing such a notion inside a linker would require that linker to understand instruction streams and the control flows they allow. When we ask to interpose on a (function) symbol, what we're really asking for is something more like inserting a new call-graph node that dominates all those in the interposed-on function's call graph. (It's a bit more complicated than that, since this alone doesn't solve the glibc problem I mentioned, but this basic idea can likely be extended to yield the interposition semantics we really want.)
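
For readers who haven't written one, the kind of interposer in question usually looks something like the following LD_PRELOAD sketch (a minimal example, assuming GNU/Linux and glibc). It catches calls that bind to the public open symbol; calls made internally within libc, or through aliases like __open_2(), sail straight past it, which is exactly the “miss” problem described above.

/* interpose-open.c: a minimal LD_PRELOAD interposer.
   Build and use (illustrative):
     gcc -shared -fPIC -o interpose-open.so interpose-open.c
     LD_PRELOAD=./interpose-open.so some-program
   (add -ldl to the first command on glibc older than 2.34) */
#define _GNU_SOURCE
#include <stdarg.h>
#include <stdio.h>
#include <fcntl.h>
#include <dlfcn.h>

int open(const char *path, int flags, ...)
{
    /* The mode argument is present only when O_CREAT is set. */
    mode_t mode = 0;
    if (flags & O_CREAT) {
        va_list ap;
        va_start(ap, flags);
        mode = va_arg(ap, mode_t);
        va_end(ap);
    }

    fprintf(stderr, "open(\"%s\", flags=%d)\n", path, flags);

    /* Bind to the 'next' definition of open, i.e. the one we displaced. */
    int (*real_open)(const char *, int, ...) =
        (int (*)(const char *, int, ...)) dlsym(RTLD_NEXT, "open");
    return real_open(path, flags, mode);
}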

Hardware complexity also leaks into linkers. The root cause of complexity in relocation and code models is the diversity of addressing modes offered by hardware. Linker authors and others working at the toolchain level tend to take on trust the good sense of the hardware designers' decisions. Whether this complexity is intrinsic or incidental depends on whether we allow ourselves to rethink the hardware too. For the moment I'm going to take this diversity as a hard constraint, but I'd welcome views from more hardware-minded people.

All this is not to say that old-fashioned complexity creep isn't behind its share of problems. Debugging information is a great example of “limb growth” and is crying out for a cleaner design, ideally one that helps reduce the burden on compiler authors and/or improve the accuracy of the debugging information they generate. For linking proper, though, the old-fashioned Unix model of “dumb linking” has stayed relatively clean. It's been keeping that model, in the face of new requirements such as shared libraries, versioning and so on, that has complicated things. I believe it's no longer the right model. Luckily, we can produce a smarter linker that is nevertheless a drop-in replacement for the standard one, so we have an evolutionary path towards something better.
