Rambles around computer science

Diverting trains of thought, wasting precious time

Mon, 18 Oct 2021

ELF dynamic linking: a brief introduction

[I wrote this to help current or future students who might want to do projects involving dynamic linking. But it may be of interest more widely. There may well be an equivalent or better article out there; if anyone knows one, please let me know.]

In old-fashioned static linking, the executable file captures a complete starting “memory image” of the program being loaded. The kernel simply maps the binary into memory, creates an initial thread stack, dumps some useful information onto it (the auxiliary vector) followed by the program name and argument strings, then transfers control to the entry point identified in the ELF binary's header. So far, so straightforward. I'll assume readers familiar with this. (If not, among other resources, you could read section 3 of the OOPSLA 2016 paper I co-authored.)

In dynamic linking, the executable no longer provides a complete image; rather, it is one of a collection of “dynamic shared objects”, or DSOs, that provide the starting memory image. (Normally, the executable itself is counted as a DSO, even though it may not be “shared”.)

The dynamic linker composes this image at load time, by loading both the executable and the (transitive closure of) depended-on library DSOs.

Dynamic linking is delegated by the kernel to user space: the kernel still only knows how to load static binaries, and the dynamic linker is a special static binary that knows how to load (and link) dynamic binaries. The kernel is extended just enough to split this case off: it must figure out, after loading an executable, that actually the dynamic linker is needed to complete the loading. The dynamic linker can be any binary nominated by the executable file, but usually the standard one (for me: /lib64/ld-linux-x86-64.so.2) is the only one on the system. In practice, the kernel will map both the executable and the dynamic linker into memory, so the dynamic linker only normally needs to load the libraries, not the executable. (Sometimes the dynamic linker can also be run as a program, in which case it has to load the executable too.)

DSOs have dependencies on other DSOs, in the form of NEEDED records. You can see these in the dynamic linking information dumped by readelf -d. For example, here is what I get from a simple hello C program, which needs only one library.

Dynamic section at offset 0x2df8 contains 26 entries:
  Tag        Type                         Name/Value
 0x0000000000000001 (NEEDED)             Shared library: [libc.so.6]
 0x000000000000000c (INIT)               0x1000
 0x000000000000000d (FINI)               0x11b4
 0x0000000000000019 (INIT_ARRAY)         0x3de8
 0x000000000000001b (INIT_ARRAYSZ)       8 (bytes)
 0x000000000000001a (FINI_ARRAY)         0x3df0
 0x000000000000001c (FINI_ARRAYSZ)       8 (bytes)
 0x000000006ffffef5 (GNU_HASH)           0x308
 0x0000000000000005 (STRTAB)             0x3d8
 0x0000000000000006 (SYMTAB)             0x330
 0x000000000000000a (STRSZ)              130 (bytes)
 0x000000000000000b (SYMENT)             24 (bytes)
 0x0000000000000015 (DEBUG)              0x0
 0x0000000000000003 (PLTGOT)             0x4000
 0x0000000000000002 (PLTRELSZ)           24 (bytes)
 0x0000000000000014 (PLTREL)             RELA
 0x0000000000000017 (JMPREL)             0x548
 0x0000000000000007 (RELA)               0x488
 0x0000000000000008 (RELASZ)             192 (bytes)
 0x0000000000000009 (RELAENT)            24 (bytes)
 0x000000006ffffffb (FLAGS_1)            Flags: PIE
 0x000000006ffffffe (VERNEED)            0x468
 0x000000006fffffff (VERNEEDNUM)         1
 0x000000006ffffff0 (VERSYM)             0x45a
 0x000000006ffffff9 (RELACOUNT)          3
 0x0000000000000000 (NULL)               0x0

These records give just the names of the depended-on DSOs (actually their soname—more below). Where in the filesystem these binaries are to be found depends on the linker search path, which is controlled by your LD_LIBRARY_PATH environment variable, by a configuration file, and by any RUNPATH records in the loaded file (also visible using the command above, although our hello does not contain any) and, transitively, those in any files loaded to satisfy a NEEDED dependency. The ldd command asks the dynamic linker to do a trial run and show where it located the dependencies (or failed to!).

In memory, each DSO gets a range of the virtual addess space, starting at some load address. So in the DSO's ELF binary on disk, all “addresses” are actually offsets relative to the load address. It used to be that executables always had a load address of 0, so the executable file's addresses were bona fide virtual addresses. However, nowadays executables are typically position-independent and get loaded at a non-zero address too.

(DSOs are mostly contiguous in memory. However, nothing prevents them from containing holes in between their mapped segments. In edge cases it is possible for holey DSOs to be loaded such that they overlap. This is rare, but I did have to go to considerable lengths to support it, or rather prevent it, in one of my runtime projects.)

After loading, there may be some load-time relocation required. This is performed by the dynamic linker and is very much like the relocation that happens during static linking, but is used to resolve inter-DSO references. (In some degenerate cases, it also resolves intra-DSO references that have been deferred until load time.) Many of the entries in the output shown above are telling the dynamic linker where to find the relocation and symbol information for this.

A “shared library” is a library DSO whose text segment does not need load-time relocation. In other words, it works fine as-is when loaded at any address. This flexibility to map the file simultaneously at many addresses, without the need to fix it up in memory after loading, is what allows memory to be shared between many processes under ELF dynamic linking. (Some other linking systems systems, including Windows DLLs, instead force the library to be placed at the same virtual address in all address spaces, hence the infamous process of “registering” DLLs.) Elimination of any need for load-time relocation is achieved by indirecting inter-DSO references that would otherwise be (say) direct call instructions, so that they occur instead via a couple of (non-shared) tables in the data segment: the Global Offset Table (GOT) and Procedure Linkage Table (PLT). Roughly, the latter is used for calls, the former for data access. Others have written about those (see Eli Bendersky's very nice in-depth article), so let me just say that the GOT is exceptionally poorly named: it is not global (there is one per DSO), and does not contain offsets (it contains pointers, at least after relocation). They should both arguably be called “vectors” rather than tables, because they don't have multiple columns.

Despite the “S” in “DSO”, there is no requirement that library DSOs satisfy this shareability property. Instead, ELF linkers usually complain if you try to create a DSO that does not (often via an inscrutable message ending with “recompile with -fPIC”). The dynamic linker doesn't care, however. If it has to fix up your text, it can do so, relying on copy-on-write to allocate memory as needed. Shared libraries are “shared” only because that copying doesn't happen.

Since libraries and executable are now deployed as separate files, they are subject to version skew. Libraries often try to offer backward compatibility, such that a newer library can still support an older executable (or, equally, support some old library declaring it as NEEDED). A vaguely standard symbol versioning mechanism is defined for ELF, to allow a library to offer older and newer versions of “the same” symbol and have references get bound to the right one, preserving binary compatibility. Alternatively, a binary-incompatible version of a shared library may be issued, in which case it's good practice to reflect that in the library name. There is a convention for recording version bumps in a number that forms part of the “soname”, or shared object name, hence “libc.so.6” in the above listing. This suggests six version-bumps have occurred in the GNU C library. In this way, many distinct incompatible verisons of a library may be installed on the same system. Symbol versioning takes a bit more discipline and more build-time ceremony to set up, but it tends to be preferred over sonames, especially for libraries that are widely deployed. The GNU C library has not bumped its soname for many years.

DSOs have identity at run time. The dynamic linker maintains a doubly linked list called the “link map”. It is possible to load new DSOs, such as plug-ins, during execution, via the dlopen() library function. This adds them to the link map, alongside all the libraries loaded earlier. Indeed the dynamic linker itself shows up in the link map as just another library, providing (among other things) the dlopen call (or at least its back end; on GNU systems there is a separate libdl as a front-end). It also provides a symbol that gives access to the link map itself (_r_debug). Over this, a breakpoint-based protocol allows debuggers to walk the link map, so they can find out about the libraries that are loaded. DSOs can also designate initialization and finalization routines to be run when they are loaded and unloaded.

That's enough for a brief introduction. To get more details about what's going on at run time, you could read my older post about ELF introspection.

A few years back, Simon Ser did some great work on an executable semantics for dynamic linking, as a summer internship of the REMS project. Since then I have been struggling to get time to resurrect and publish this work, and it no longer falls directly within my prioritised interests. But if anyone is interested in taking it up, do contact me... it is begging to be polished off and written up.

[/devel] permanent link contact

Thu, 14 Oct 2021

Tracing system calls in-process, using a chain loader

Some months back I wrote about a possible better alternative to LD_PRELOAD using chain loading. And I promised a follow-up with more details... so here it is. We'll walk through a little system call tracer built using this technique. Unlike strace and similar, it will do its tracing in-process, without using ptrace() or equivalents. I should add that it's a proof-of-concept only so isn't nearly as good as strace in practice; in particular it can't yet accurately print the arguments to most syscalls. (I have some work towards doing that generatively, using kernel DWARF; more later.)

System call tracing is a good demo of chain loading, for three reasons.

Firstly, it's basically impossible to do with LD_PRELOAD, which works at the higher level of the C library and dynamic linker, so doesn't have sight of system calls.

Secondly, it's close to what people very often want to do with LD_PRELOAD, which is to change the behaviour of particular system calls. No longer do we have to make do with overriding the C library wrappers; we can intercept the system calls themselves, catching certain cases that would not be visible at the C library level.

Thirdly, intercepting system calls specifically provides the foundation for a more general bootstrapping approach to arbitrary binary instrumentation of programs, as I hinted last time. If we can catch all the system calls which introduce new instructions to the address space, and prevent instructions from being introduced otherwise, we can ensure that literally all code is instrumented.

Some years ago, Patrick Lambein and I wrote a library that can patch and emulate most system calls on x86_64 Linux, by a really simple double-trap technique: overwrite the system call with ud2, x86's “always-illegal” instruction, then handle the resulting SIGILL. The handler can do the real syscall or whatever else it wants, before returning. This is not the fastest, but it's reliable and easy(-ish) to implement—at least to 90% of the way. So, for example, if we disassemble the C library's _exit function which once contained the following instructions

44 89 c0                mov    %r8d,%eax
0f 05                   syscall  
48 3d 00 f0 ff ff       cmp    $0xfffffffffffff000,%rax
76 e2                   jbe    0x7f173bf269c0 <_exit+32>

under our library this gets clobbered to instead be

44 89 c0                mov    %r8d,%eax
0f 0b                   ud2  
48 3d 00 f0 ff ff       cmp    $0xfffffffffffff000,%rax
76 e2                   jbe    0x7f173bf269c0 <_exit+32>

which will always cause a SIGILL. Our SIGILL handler scrapes up the syscall arguments into a structure and then passes this to whatever registered handler is installed. For example, a tracing handler, which prints some stuff and then resumes the system call, looks like this.

void print_pre_syscall(void *stream, struct generic_syscall *gsp, void *calling_addr, struct link_map *calling_object, void *ret) {
        char namebuf[5];
        snprintf(namebuf, sizeof namebuf, "%d", gsp->syscall_number);
        fprintf(stream, "== %d == > %p (%s+0x%x) %s(%p, %p, %p, %p, %p, %p)\n",
                calling_object ? calling_object->l_name : "(unknown)",
                calling_object ? (char*) calling_addr - (char*) calling_object->l_addr : (ptrdiff_t) 0,
// what actually gets called from the SIGILL handler
void systrap_pre_handling(struct generic_syscall *gsp)
        void *calling_addr = generic_syscall_get_ip(gsp);
        struct link_map *calling_object = get_highest_loaded_object_below(calling_addr);
        print_pre_syscall(traces_out, gsp, calling_addr, calling_object, NULL);

To ensure that all loaded code gets instrumented appropriately with the ud2s, I've recently expanded this to use a bootstrapping technique: after initially applying our trapping technique to hook mmap(), mremap() and mprotect() calls, our hooks ensure that if the guest application is asking to create any further executable memory, we will instrument its contents before we make it executable.

But how can we get all this code into the process in the first place?. What I did initially was use our old familiar LD_PRELOAD. We preload a library whose initializer stomps over the loaded text, clobbering syscalls into ud2 and installing the SIGILL handler. This is fine as far as it goes, but it misses the very beginning of execution. A tracer doing this cannot trace anything happening before the initializer runs, for example in other libraries' initializers or inside the dynamic linker. What we get is the following.

$ LD_PRELOAD=`pwd`/trace-syscalls.so /bin/true
== 26439 == > 0x7f6fb2a4f9d4 (/lib/x86_64-linux-gnu/libc.so.6+0xc99d4) exit_group(0, 0x3c, 0, 0x7fff766064b0, 0xe7, 0xffffffffffffff80)

It's neat that it looks like true only does one syscall, to exit_group(), as we might think it should. But if we try strace we find that in truth the process does a lot more than that one system call.

$ strace /bin/true
brk(0)                                  = 0x2447000
mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f6ba773c000
access("/etc/ld.so.preload", R_OK)      = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/usr/lib/x86_64-linux-gnu/libfakeroot/tls/x86_64/x86_64/libc.so.6", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
stat("/usr/lib/x86_64-linux-gnu/libfakeroot/tls/x86_64/x86_64", 0x7ffd49ca38e0) = -1 ENOENT (No such file or directory)
... snipped a *lot*! ...
arch_prctl(ARCH_SET_FS, 0x7f6ba7578740) = 0
mprotect(0x7f6ba7732000, 12288, PROT_READ) = 0
mprotect(0x605000, 4096, PROT_READ)     = 0
mprotect(0x7f6ba7765000, 4096, PROT_READ) = 0
exit_group(0)                           = ?
+++ exited with 0 +++

All these other, earlier system calls are happening during dynamic linking, or in general, before our constructor that clobbers the syscalls into ud2. (The strace manual page laments: “it is a pity that so much tracing clutter is produced by systems employing shared libraries”. My take is that it's better to know the truth!)

Let's repackage our code as a chain loader, as described last time. That way it will always run first, and because it's responsible for loading the dynamic linker, we can ensure that all its syscalls are trapped from the very start. This is a non-trivial delta to the donald-based code I mentioned in the last post. After we map the “real” dynamic linker, we also need to instrument its code. We do this by wrapping the enter() function from donald.

void __wrap_enter(void *entry_point)
        /* ... */
        trap_all_mappings(); // only ld.so is loaded right now
        /* ... */

What about code that the dynamic linker loads? Well, we use our bootstrapping approach to catch that. Whenever new code is loaded into the process, some system call is responsible for doing that. So as long as we catch that call, we can trap the system calls at load time. A bit of macro magic helps generate the wrappers, and some elided hackery is required to divine the ELF-level structure of the file being mapped.

#define mmap_args(v, sep) \
    v(void *, addr, 0) sep   v(size_t, length, 1) sep \
    v(int, prot, 2) sep      v(int, flags, 3) sep \
    v(int, fd, 4) sep        v(unsigned long, offset, 5)
    if (!(prot & PROT_EXEC)) return mmap(addr, length, prot, flags, fd, offset);
    // else... always map writable, non-executable initially
    int temporary_prot = (prot | PROT_WRITE | PROT_READ) & ~PROT_EXEC;
    void *mapping = (void*) mmap(addr, length, temporary_prot, flags, fd, offset);
    if (mapping != MAP_FAILED)
        /* ... */
        if (/* needs trapping*/)
            inferred_vaddr = (uintptr_t) mapping - matched->p_vaddr;
            trap_one_executable_region_given_shdrs(mapping, (void*) mapping + length,
                    filename, /* is_writable */ 1, /* is_readable */ 1,
                    shdrs, ehdr.e_shnum,
            // OK, done the trap so let's fix up the protection (always different)
            ret = mprotect(mapping, length, prot);
            if (ret != 0) goto fail;
            /* ... */
        /* ... */
    /* ... */

Making this work on Linux/x86 (32- and 64-bit) required some “interesting” code in a few cases, over and above the chain loader trick. The current version still has some unsupported features (sigaltstack() among them) but can support a lot of stuff thanks to the following gyrations.

The good news is that things mostly work. In particular, I can do:

$ ./trace-syscalls-ld.so /bin/true

and get “full” output analogous to that of strace, instead of the old output that was missing many system calls.

What about correctness of all this? This bootstrapping approach only gets us 100% coverage of system calls on some assumptions.

Assumption 1 isn't true for gettimeofday() in Linux's vdso, which implements it as a read from kernel memory. No “interesting” system calls are implemented that way. Do we really just mean “system calls”, though? One could argue that fault-handling paths are in some ways like system calls; we can't trace entries and exits to/from these, though neither does strace. Signal delivery is also a bit like a return from a system call, and we don't trace that, in contrast to strace (which gets any event ptrace exposes, and that includes signals). We could trace some of it by installing our own handlers for all signals, though that would require a lot of care to preserve the program's original semantics... it starts to become a full-on “transparency” problem, of the kind DynamioRIO's authors have articulated.

For assumption 2a to be true, we must forbid mappings that are simultaneously writable and executable. Some programs, such as the just-in-time compilers of V8 or PyPy, still likes to create executable memory and write into it. If we allow such mappings, we have no synchronous opportunity to modify code before it becomes executable. My best guess is to allow these programs to carry on with an emulation of writable-and-executable mappings, adaptively switching between really-writable and really-executable and handling the segmentation faults in user space. The overhead would be great if done naively, but some kind of adaptive hysteresis might handle it well enough (even maybe as simple as “flip after N faults”).

A more subtle version of the same problem is that we need to avoid distinct simultaneous W and X mappings of the same underlying object, such as a file that is MAP_SHARED. That would have the same effect as a writable-and-executable mapping. Such a check is easy in principle, but requires tracking the inode and device number behind every existing file mapping. (File descriptors are OK because we can fstat() them on demand.) Attempts to remap the same object should probably return -ETXTBSY. I've left that as an exercise for someone else, for the moment.

Assumption 2b brings the issue of overt instructions versus punny instructions. On x86 there is no alignment requirement for instructions, so we can jump into the middle of instructions and thereby do something quite different than intended, perhaps including system calls. For example, the innocuous instruction

 mov    $0x9090050f,%eax

assembles to

 b8 0f 05 90 90

but if we jump one byte into this byte sequence, we do a syscall followed by two nops. Preventing this is tricky. Perhaps we trust code not to do this. Perhaps we built the program using some heavyweight control-flow integrity mechanism (and can also check that dynamically loaded code satisfies its requirements). Perhaps we are on an alignment-preserving architecture (so, not x86). For x86, once we have a sufficiently general jump-based trampoline mechanism, perhaps itself making use of instruction punning, we can forcibly stub out any instruction or instruction pair that internally contains a system call byte sequence. Doing this is sufficient only if our trampoline itself is guaranteed not to include such instructions. Such a guarantee is quite hard when immediates are involved... we would have to forbid them to contain the bytes 0f 05, for example. The same is true for addresses or offsets, but these are easier under PC-relative addressing (we get to choose where to place the trampoline). Most likely we can rely on such immediates being so rare that if we simply clobber the instruction with a ud2 and emulate it with something like libx86emulate, nobody will notice any slowdown.

Erk. That was a deeper dive than I planned. But the take-home is that early experience suggests this seems to be possible. We can intercept system cals using a chain loader, bootstrappingly, and so get control of the program's interactions across the system call interface. This lets us build all manner of tools rather like LD_PRELOAD, but directly at the system call level and with stronger coverage properties.

[/devel] permanent link contact

Mon, 04 Jan 2021

Chain loading, not preloading: the dynamic linker as a virtualization vector

Suppose you want to run a pre-existing binary program but in some kind of instrumented or modified form. On ELF-based systems, people often do this with LD_PRELOAD: you preload a library that interferes with some C library calls or somehow tweaks the process state at startup (say to install a signal handler). The program itself is (hopefully) oblivious to what you've done, but its behaviour is modified accordingly. Some years back I saw a nice survey of applications in a presentation by Kevin Pulo; there may be other resources out there.

This post is about some alternative ELF-level trickery, that is sometimes a better option for doing this. Here it's shown on GNU/Linux, but any ELF platform should allow it.

If you've tried LD_PRELOAD, you'll know that there are lots of problems and gotchas with it. Your code has to deal with being initialized at an unpredictable time; the C library's binary interface evolves from version to version, so is prone to additions that sidestep your preloads; it doesn't work for programs that do direct system calls (not via the libc) or are statically linked. What I'm about to describe avoids many of these problems, by creating what I'll call an ELF chain loader. This loader must do any program instrumentation at the instruction level, rather than the linkage level as with LD_PRELOAD. So in a sense, chain loading is less powerful than preloading: it's working at a lower level, and all your tweaks must be somehow “installed” at startup, rather than as part of the “dialogue” of library calls. However, there's a pattern we can use to overcome that limitation, so in fact I'd argue the result is more powerful than LD_PRELOAD, although (for now) harder to use.

Suppose you have a command invocation

$ ./myprog

and want your tweaked command to be invocable something like

$ LD_PRELOAD=mymods.so ./myprog

or (nicer)

$ ./mymods ./myprog

In the latter, mymods is a chain loader binary. A chain loader is basically a stub dynamic linker. It doesn't actually do dynamic linking; it's invoked like a dynamic linker, but it just loads another dynamic linker—after doing whatever tweaks and fiddles you'd like it to do.

Dan Williams has written (on a page now sadly disappeared, but WayBack-available) about how to build your own dynamic linker or ld.so. Building a binary of the right form is the easy part. Building one that actually links and runs real programs is a big job, especially if you're using a complex C library (like glibc) or complex run-time features (like thread-local storage). But in outline, a dynamic linker basically looks like this.

  1. do bootstrap relocation
  2. initialize some stuff (link map, TLS, ...)
  3. locate and map libraries, transitively, resolving symbols, applying relocations, etc.
  4. call constructors
  5. jump to the main program's entry point

The chain loader that we'll build is basically just a stub, which will call onward to the real ld.so to handle most of the above stuff. So it is much simpler:

  1. do bootstrap relocation
  2. mess with the process however we like
  3. locate and map the next loader (e.g. the real dynamic linker)
  4. (optional) disguise the fact that we ever existed
  5. jump to the next loader's entry point

If we do it right, the next loader won't even know that the stub loader has run. We can think of a chain loader as a “virtualization vector”. It sets up a (somehow-) virtualized environment, in which the entire original process, from dynamic linking onwards, then proceeds to execute.

(The “disguise” part is interesting. It makes us almost indistinguishable from malware! Virtualization is about transparently subverting a pre-existing process's execution, and there's a large class of malware that also does this.)

A few years ago I wrote a skeleton of a dynamic linker, called donald. It can load only very simple statically linked programs—hardly a real dynamic linker! Nevertheless it's a good vehicle for understanding what goes on during process start-up, and is a basis for building our chain loader. Caveat: it's very quick-and-dirty; it doesn't always use robust coding patterns, and it uses some unportable hacks to keep the code simple. That makes it possible to show almost all of it in a single blog post! Following Dan Williams's recipe, the command we use to link donald is the following.

$ cc -std=gnu99 -g -fPIC -fno-stack-protector \
  -fuse-ld=bfd -Wl,-Bsymbolic -nostdlib -nostartfiles -shared \
  -o "donald.so" premain.o main.o entry.o load.o \
  -Wl,-Bstatic /path/to/musl/lib/libc.a -Wl,-Bsymbolic \
  -T donald.lds -Wl,-soname=ld-linux.so.2 

where we're using a nice simple C library implementation, namely musl, to provide the libc calls that we use internally. The donald.lds file is a lightly tweaked linker script that adds a symbol that'll help us do bootstrap relocation, as per Dan Williams's article. The work of donald is done in the C files premain.c, main.c, entry.c and load.c.

“Bootstrap relocation” refers to the solution to an essential problem: who links the dynamic linker? In old-fashioned static linking scenarios, binaries contain a ready-to-go memory image. The program is loaded at address 0, all address bindings are known, and references have been fixed up when the binary was linked. By contrast, in dynamic linking scenarios, a library might be loaded at any address, meaning it still contains some relocation records describing fixups needed at load time. These are done by the dynamic linker. But what if you are the dynamic linker? The dynamic linker itself is just ld.so, a shared library; it, too, can be loaded anywhere. So who relocates the relocator? The answer: it relocates itself. The entry path of the dynamic linker consists of code specially crafted to run correctly regardless of where it is loaded. One of its first tasks is to locate the table of its own relocation records, relocate itself using them, and then continue execution in a “normal” environment.

On the x86-64 architecture, this is pretty straightforward because most data and code references are PC-relative. (On other architectures it's often much hairier, because it's harder to avoid using any absolute address references before you've completed bootstrap relocation.) Usually the entry path is written in per-arch assembly, precisely to accommodate the less straightforward architectures. In donald, which works only on x86-64, we have the option to write it directly “in C”. It's not really C, because we're bound by a very wacky contract: don't write any code that will cause the compiler to use non-relative addressing modes! The entry point of donald, called directly by the kernel, looks like the following.

/* The function prologue pushes rbp on entry, decrementing the stack
 * pointer by 8. Then it saves rsp into rbp. So by the time we see rbp, 
 * it holds the entry stack pointer *minus 8 bytes*. */
#define BP_TO_SP_FIXUP 0x8
void *rsp_on_entry HIDDEN;
/* This isn't the usual "main"; it's the raw entry point of the application.
 * We link with -nostartfiles. We then define our own main. */
int _start(void)
    /* gcc doesn't let us disable prologue/epilogue, so we have to fudge it.
     * We assume rsp is saved into rbp in the prologue. */
    register unsigned char *bp_after_main_prologue;
    __asm__ ("movq %%rbp, %0\n" : "=r"(bp_after_main_prologue));

    int argc;
    char **argv;
    rsp_on_entry = bp_after_main_prologue + BP_TO_SP_FIXUP;
    preinit(rsp_on_entry, &argc, &argv); // get us a sane environment

    printf("Hello from " DONALD_NAME "!\n");

    int ret = main(argc, argv);

    /* We're executing without startfile code, so returning 0 would not make sense. 
     * Calling exit() brings a dependency on fini_array stuff that we need to avoid
     * since it depends on the startup files. So just do the syscall directly. */
    syscall(SYS_exit, ret);


When the kernel jumps to the above code, the stack pointer is pointing at a bunch of useful information: the program arguments and so forth. But the C compiler will generate a prologue that moves it away from that. So the first thing we do is use a hard-coded hack to “undo” that, and get the “stack pointer on entry”, which we pass to our preinit function.

Using this pointer, preinit can navigate the initial stack to find all that data of interest. The data starts with argc; above that is argv (the vector of pointers to argument strings), above that envp (the same but for environment variables), then the auxiliary vector of records containing “stuff the kernel wanted to tell us” and finally the strings these argv and envp vectors point to. You can read more about auxiliary vectors here. Our preinit parses this chunk of memory, and extracts a couple of facts that we'll need to do our work: the page size and our own base address.

Bootstrap relocation uses the latter of these. We find our own dynamic linking information (symbol table and relocation table) using the _DYNAMIC symbol. But remember “this isn't C”: simply referring to _DYNAMIC may not work yet, because it's defined outside the current compilation unit (by the linker, in fact) so the compiler may emit a reference that needs relocation. By definition, the relocation hasn't been applied yet. In donald I used an unorthodox (and unportable) hack: taking the address of _DYNAMIC in our C code before we relocate ourselves gets us the unrelocated (file-relative) address, to which we manually add our own base address as snarfed from the auxiliary vector. (Better code would probably declare _DYNAMIC as file-local, using the ELF “hidden” or “internal” visibility, which would allow the linker to fix this up into rip-relative addressing.) Either way, we can get to the _DYNAMIC vector, which gets us to our symbol and relocation tables; then we walk the relocation table and apply each reloc.

static inline void __attribute__((always_inline)) bootstrap_relocate(unsigned char *at_base)
    /* We scan _DYNAMIC to get our own symbol table.
     * HACK: we manually relocate &_DYNAMIC
     * by our load address to get its actual address. */
    ElfW(Dyn) *p_dyn = (void*)(at_base + (uintptr_t) &_DYNAMIC);
    ElfW(Sym) *dynsym_start = NULL;
    unsigned long dynsym_nsyms = 0;
    ElfW(Rela) *rela_dyn_start = NULL;
    ElfW(Rela) *rela_plt_start = NULL;
    unsigned long rela_dyn_sz = 0;
    unsigned long rela_dyn_entsz = 0;
    unsigned long rela_dyn_nents = 0;
    unsigned long rela_plt_sz = 0;
    while (p_dyn->d_tag != DT_NULL)
        if (p_dyn->d_tag == DT_SYMTAB) dynsym_start = (void*)(at_base + p_dyn->d_un.d_ptr);
        else if (p_dyn->d_tag == DT_SYMENT) dynsym_nsyms = p_dyn->d_un.d_val;
        else if (p_dyn->d_tag == DT_RELA) rela_dyn_start = (void *)(at_base + p_dyn->d_un.d_ptr);
        else if (p_dyn->d_tag == DT_RELASZ) rela_dyn_sz = p_dyn->d_un.d_val;
        else if (p_dyn->d_tag == DT_RELAENT) rela_dyn_entsz = p_dyn->d_un.d_val;
        else if (p_dyn->d_tag == DT_JMPREL) rela_plt_start = (void *)(at_base + p_dyn->d_un.d_ptr);
        else if (p_dyn->d_tag == DT_PLTRELSZ) rela_plt_sz = p_dyn->d_un.d_val;
    if (rela_dyn_entsz > 0) rela_dyn_nents = rela_dyn_sz / rela_dyn_entsz;
    /* We loop over the relocs table and relocate what needs relocating. */
    ElfW(Rela) *p_rela = rela_dyn_start;
    for (int i = 0; i < rela_dyn_nents; ++i)
        do_one_rela(rela_dyn_start + i, at_base, dynsym_start);
    p_rela = rela_plt_start;
    /* Also do .rela.plt */
    for (int i = 0; i < (rela_plt_sz / sizeof (Elf64_Rela)); ++i)
        do_one_rela(rela_plt_start + i, at_base, dynsym_start);

I haven't shown the function that applies a single relocation record, do_one_rela, but it's pretty simple. In practice it only needs to know a few flavours of relocation.

static inline void __attribute__((always_inline)) 
do_one_rela(ElfW(Rela) *p_rela, unsigned char *at_base, ElfW(Sym) *p_dynsym)
#define SYMADDR(r_info) (p_dynsym[ELF64_R_SYM((r_info))].st_value)
    Elf64_Addr *reloc_addr = (Elf64_Addr *)(at_base + p_rela->r_offset);
    switch (ELF64_R_TYPE(p_rela->r_info))
        case R_X86_64_RELATIVE: // no symbol addr, because we're RELATIVE
            *reloc_addr = (Elf64_Addr)(at_base + p_rela->r_addend); 
        case R_X86_64_64: 
            *reloc_addr = (Elf64_Addr)(at_base + SYMADDR(p_rela->r_info) + p_rela->r_addend);
        case R_X86_64_JUMP_SLOT:
        case R_X86_64_GLOB_DAT:
            *reloc_addr = (Elf64_Addr)(at_base + SYMADDR(p_rela->r_info));
            /* We can't report an error in any useful way here. */
#undef SYMADDR

We're now getting ready to take on the main task of being a chain loader, which plain donald is not (at least not intentionally). Ideally we want to be invocable in two ways: as a bona fide dynamic linker requested directly by an executable, as we see here:

$ readelf -Wl /bin/blah

Elf file type is EXEC (Executable file)
Entry point 0x401432
There are 9 program headers, starting at offset 64

Program Headers:
  Type           Offset   VirtAddr           PhysAddr           FileSiz  MemSiz   Flg Align
  PHDR           0x000040 0x0000000000400040 0x0000000000400040 0x0001f8 0x0001f8 R E 0x8
  INTERP         0x000238 0x0000000000400238 0x0000000000400238 0x00001c 0x00001c R   0x1
      [Requesting program interpreter: /path/to/my/ld.so]

and as a command-line tool

$ /path/to/my/ld.so /bin/blah

These two cases work out slightly differently at run time. The first is the “normal” case, and the kernel has already mapped both the program itself and us the dynamic linker. In the latter, the kernel thinks it's loading us as a statically linked binary (curiously not caring that we're ET_DYN not ET_EXEC), and it's up to us to parse our arguments, identify the program we want to run, and load it. The contents of the auxiliary vector are slightly different in the two cases, and indeed that's how we'll tell the difference and behave accordingly.

int main(int argc, char **argv)
    // were we invoked by name, or as a .interp?
    // use AT_ENTRY to find out: it's _start if we were invoked as a program,
    // otherwise it's the program's _start
    int argv_program_ind;
    uintptr_t entry = (uintptr_t) &_start;
    _Bool we_are_the_program = 1;
    for (ElfW(auxv_t) *p = p_auxv; p->a_type; ++p)
        switch (p->a_type)
            case AT_ENTRY:
                if (p->a_un.a_val != (uintptr_t) &_start) we_are_the_program = 0;
                entry = p->a_un.a_val;
    fprintf(stderr, "We think we are%sthe program\n", we_are_the_program ? " " : " not ");

Since we're a chain loader, most of what we want to do is load the next loader and jump to it. To load the next loader, we'll need to memory-map some chunks of the file, by reading its ELF header and program headers. We can simply do this by file I/O.

    /* We always chain-load the ld.so and let it load the program. Let's read it. */
    const char ldso_path[] = "/lib64/ld-linux-x86-64.so.2"; // HACK: sysdep
    int ldso_fd = open(ldso_path, O_RDONLY);
    if (ldso_fd == -1) { die("could not open %s\n", ldso_path); }
    struct stat ldso_stat;
    int ret = fstat(ldso_fd, &ldso_stat);
    if (ret != 0) { die("could not stat %s\n", ldso_path); }

    // read the ELF header
    ssize_t nread;
    ElfW(Ehdr) ehdr; nread = read(ldso_fd, &ehdr, sizeof (ElfW(Ehdr)));
    if (nread != sizeof (ElfW(Ehdr))) die("could not read ELF header of %s\n", ldso_path);
    // check it's a file we can grok
    if (ehdr.e_ident[EI_MAG0] != 0x7f
        || ehdr.e_ident[EI_MAG1] != 'E'
        || ehdr.e_ident[EI_MAG2] != 'L'
        || ehdr.e_ident[EI_MAG3] != 'F'
        || ehdr.e_ident[EI_CLASS] != ELFCLASS64
        || ehdr.e_ident[EI_DATA] != ELFDATA2LSB
        || ehdr.e_ident[EI_VERSION] != EV_CURRENT
        || (ehdr.e_ident[EI_OSABI] != ELFOSABI_SYSV && ehdr.e_ident[EI_OSABI] != ELFOSABI_GNU)
        // || phdr->e_ident[EI_ABIVERSION] != /* what? */
        || ehdr.e_type != ET_DYN
        || ehdr.e_machine != EM_X86_64
        die("unsupported file: %s\n", ldso_path);
    off_t newloc = lseek(ldso_fd, ehdr.e_phoff, SEEK_SET);

    // process the PT_LOADs
    ElfW(Phdr) phdrs[ehdr.e_phnum];
    for (unsigned i = 0; i < ehdr.e_phnum; ++i)
        off_t off = ehdr.e_phoff + i * ehdr.e_phentsize;
        newloc = lseek(ldso_fd, off, SEEK_SET);
        if (newloc != off) die("could not seek to program header %d in %s\n", i, ldso_path);
        size_t ntoread = MIN(sizeof phdrs[0], ehdr.e_phentsize);
        nread = read(ldso_fd, &phdrs[i], ntoread);
        if (nread != ntoread) die("could not read program header %d in %s\n", i, ldso_path);

This is all fairly common to any dynamic linker. Now we've collected the PT_LOAD program headers, which tell us which parts of the file to map into memory. To rule out partial failures, we reserve enough memory to be sure we can map the loader in a big contiguous chunk, even though that might be split over many PT_LOADs.

    ElfW(Addr) max_vaddr = 0;
    for (unsigned i = 0; i < ehdr.e_phnum; ++i)
        ElfW(Addr) max_vaddr_this_obj = phdrs[i].p_vaddr + phdrs[i].p_memsz;
        if (max_vaddr_this_obj > max_vaddr) max_vaddr = max_vaddr_this_obj;
    uintptr_t base_addr_hint = 0x555555556000;
    void *base = mmap((void*) base_addr_hint, max_vaddr, PROT_NONE, MAP_PRIVATE,
        ldso_fd, 0);
    if (base == MAP_FAILED) die("could not map %s with PROT_NONE\n", ldso_path);
    uintptr_t base_addr = (uintptr_t) base;

Once we've done that, we have reserved space at some load address which could be arbitrary (but we'll use a hint of 0x555555556000)... all we need to do is turn each PT_LOAD header into the (usually) one corresponding mmap() invocation.

    uintptr_t phdrs_addr = 0;
    for (unsigned i = 0; i < ehdr.e_phnum; ++i)
        if (phdrs[i].p_type == PT_LOAD)
            _Bool read = (phdrs[i].p_flags & PF_R);
            _Bool write = (phdrs[i].p_flags & PF_W);
            _Bool exec = (phdrs[i].p_flags & PF_X);

            if (phdrs[i].p_offset < ehdr.e_phoff
                    && phdrs[i].p_filesz >= ehdr.e_phoff + (ehdr.e_phnum + ehdr.e_phentsize))
                phdrs_addr = base_addr + phdrs[i].p_vaddr + (ehdr.e_phoff - phdrs[i].p_offset);
            ret = load_one_phdr(base_addr, ldso_fd, phdrs[i].p_vaddr,
                phdrs[i].p_offset, phdrs[i].p_memsz, phdrs[i].p_filesz, read, write, exec);
            switch (ret)
                case 2: die("file %s has bad PT_LOAD filesz/memsz (phdr index %d)\n", 
                        ldso_path, i);
                case 1: die("could not create mapping for PT_LOAD phdr index %d\n", i);
                case 0: break;
                    die("BUG: mysterious error in load_one_phdr() for PT_LOAD phdr index %d\n", i);

Now we've mapped the program we want to run, i.e. the next loader. In a real dynamic linker, our work would only just be beginning: we've mapped the next program, but still have to walk its depended-on libraries and load those (transitively), apply relocations in all the above, and so on. But in our case, we're most of the way there. All we have to do is do whatever tinkering is demanded by our use case, then transfer control to the next loader, using the entry point in its ELF headers. The next loader is not just any old program, however. To cover our tracks and ensure it can find the auxiliary vector just as we ded, we also return the stack pointer to where it was when we started.

void __attribute__((noreturn)) enter(void *entry_point)
    fprintf(stderr, DONALD_NAME ": jumping to system ld.so entry point %p with rsp %p\n",
        (void*) entry_point, rsp_on_entry);
    __asm__ volatile ("movq %0, %%rsp\n"
          "xorq %%rbp, %%rbp\n" /* clear rbp to avoid confusing stack walkers */
          "jmpq *%1\n" : : "m"(rsp_on_entry), "r"(entry_point));

In fact, we need to cover our tracks a bit better. If we're being invoked as a command, not as a requested dynamic linker, then we want to make the auxiliary vector look as if the next loader is the one that was invoked as a command too. So we do a pass over the vector to fix it up. One such fixup is required in the opposite (“requested”) case too.

for (ElfW(auxv_t) *p = p_auxv; p->a_type; ++p)
        switch (p->a_type)
            case AT_ENTRY:
                if (we_are_the_program) p->a_un.a_val = entry_point;
            case AT_PHDR:
                if (we_are_the_program) p->a_un.a_val = phdrs_addr;
            case AT_PHENT:
                if (we_are_the_program) p->a_un.a_val = ehdr.e_phentsize;
            case AT_PHNUM:
                if (we_are_the_program) p->a_un.a_val = ehdr.e_phnum;
            case AT_BASE:
                if (!we_are_the_program) p->a_un.a_val = base_addr;
            case AT_EXECFN:
                if (we_are_the_program) p->a_un.a_val = (uintptr_t) &argv[0];

Note that in the “requested” case, since the AT_BASE vector entry is set to the loader's load address, it needs to be fixed up to that of the next loader. If we're invoked as a command, it'll be absent or 0 and can stay that way. The other cases are ones that only need fixing up if we're run as a command, as the vector needs to reflect the next loader's structure not our own (and not the program's, which it only reflects if we're run as a requested loader).

There's more bit of track-covering we can do, and in fact need to, if we're “requested” rather than invoked. One interesting quirk of what we have so far is that the next loader, although its name is /lib64/ld-linux-x86_64.so.2, will think it's called /path/to/my/ld.so, and will create a link map entry accordingly. This will confuse debuggers, e.g. because they will look at the wrong file's symbols. It took me a while to figure out where the dynamic linker gets its own name, but it's obvious in hindsight: from the .interp string in the mapped program binary! (If we were invoked, it more obviously comes from argv.) So to avoid confusion, we want to overwrite the .interp string before we call the next loader. That takes some jiggery-pokery at link time, but we're already doing some of that to be an alternative requested linker—i.e. to link with --dynamic-linker=/path/to/my/ld.so. Linking in a small additional file assembled from

#if defined(__linux__) && defined(__ELF__)
.section .note.GNU-stack,"",%progbits
.section .interp, "aw"

is enough to make the .interp writable, and can also be used to add more padding if the stock path (which we'll be writing) might be longer than our requested path.

There's another wrinkle with this. Since the default linker script will place .interp among stuff that is otherwise read-only, this extra linked file will also have the side-effect of making writable the program headers, various notes and other stuff appearing in the segment. I'm not yet decided on the best way to deal with this. We could mprotect() this segment after we're finished with it, if we're sure we're the only reason why it should be writable. We could skip the above and simply use a pair of mprotect() calls to do the write, at least if we don't need to add padding. We could try to move .interp into the RELRO (read-only after relocation) part of the executable. Or perhaps we could arrange that our ld.so has two .interp-like chunks, one for its own name and one for the delegated ld.so make them page-aligned, and do a remap rather than a write. This would waste some disk space in the binary, but avoid wasting memory in systems where many processes are loaded this way—a common trade-off in dynamic linking.

We could go further with track-covering, such as attempting to unmap ourselves as we hand over to the next loader. That would involve more sneaky malware-like tricks, and is overkill for most purposes.

So why is all this better than LD_PRELOAD? I can think of several reasons.

Why's it not so good?

The “but...” refers to a possible “big-hammer” way around the setuid problem: make our loader setuid root, albeit dropping root privileges super-early. That sounds crazy, but I believe it'd be an interesting demo for software verification. It only needs to be root just long enough to open the target binary and test for which UID to switch down to (either the invoking user or the file owner). The test could be done in a very short code path making probably just two system calls: one to open a file descriptor on the loaded file, and another to fstat() it. On many architectures it could be done before bootstrap relocation. Can we prove that code correct? Doing so is feasible if we have trustworthy formal specifications of instruction sets, system calls and (if needed) relocation. Not coincidentally, I've been involved in some research on these; there is more to do.

In a future post I'll demo a very simple syscall tracer built using this combination of chain-loading and bootstrapping instrumentation. It's basically an in-process strace. System call instrumentation is usually what people want when they use LD_PRELOAD, and are forced to settle for libc instrumentation instead.

Code is available (though definitely not out-of-the-box demoable; maybe by the next post) in my donald repository, and a bit more (mostly to be explained next time) in my liballocs (see allocsld) and libsystrap.

[/devel] permanent link contact

Thu, 23 Jul 2020

Building a simple toolchain extension, the subversive way

I recently ran into a nasty problem, arguably a bug, in GCC. I decided to write a little tool to help me detect when I'm triggering this bug. This post is about how I did that, using some simple but handy helper scripts I've been growing for a while.

The bug is to do with “constructor” functions in GNU C (not to be confused with C++ constructors!). I'll quote from my bug report, slightly edited.

I was surprised to find that my __attribute__((constructor(101))) on a function was
not giving it priority over another constructor whose attribute used a higher number.
It turns out this was owing to an additional prototype that did not mention the
function as being a constructor.

This caused a really subtle initialization bug in my code, and was tough to track
down. At the least, I would expect a warning when the information is discarded.

$ cat test.c
// comment these for correct behaviour, i.e. 1 then 2
static void do_init1(void);
static void do_init2(void);

static void do_init2(void) __attribute__((constructor(102)));
static void do_init2(void)
        printf("Init 2!\n");

static void do_init1(void) __attribute__((constructor(101)));
static void do_init1(void)
        printf("Init 1!\n");

int main(void)
        return 0;

$ cc -save-temps    test.c   -o test
$ ./test
Init 2!
Init 1!
$ grep -B2 init_array test.s
        .size   do_init2, .-do_init2
        .section        .init_array,"aw"
        .size   do_init1, .-do_init1
        .section        .init_array

The attribute's priority argument is being discarded. It should be visible in the
section name (".init_array.0101" etc.). If the extra prototypes are removed, the
programmer-intended behaviour is restored.

I can see how this would happen. Normally, constructor functions are not called explicitly, so don't need to be prototyped. In my liballocs project, however, sometimes my code can be called during very early initialization of the program. This is mostly because it preloads a wrapper for malloc(), which is called by the dynamic linker before it even gets around to running constructors. To deal with these incoming calls at such an early time, the library can be initialized on demand, earlier than the usual initialization sequence. To allow this, several constructors are also protoyped functions that can be called explicitly across module boundaries.

Although I could do a one-time pass over my code to fix the problem, this wouldn't prevent the same problem recurring in the future. Getting a diagnostic into GCC would be a long, slow process, and wouldn't help if I switched to another compiler that takes some other questionable interpretation of multiply prototyped constructors. So instead, I thought I'd write a simple and fairly standalone tool that can detect the problem during build. In short, I want to check that if a priority is given in a prototype, then that priority is reflected in the assembly code; if not, an error is raised.

My toolsub repository, although in somewhat rough-and-ready shape, provides a collection of scripts that let us interfere with the compilation and linking toolchain in useful ways. Most of the tools it provides take mechanisms provided by existing interfaces, such as the GCC driver's -wrapper option for supplying a wrapper script, but make them easier to use. It's tricky to use -wrapper because you have to parse various command lines, and some of these are compiler-internal commands like cc1 rather than the public cc, cpp and so on.

Firstly I want to do a pass over the preprocesssed C code, outputting a summary of any constructors that are declared with a priority. This priority information is what the compiler is discarding; I want to catch when this happens. To fish out these priorities, I wrote a short OCaml program using CIL. This post isn't about CIL, but you can see that tool here; it's about 80 lines of commented OCaml. (To build it, you'll need to use my branch of CIL, because it uses one API feature that I haven't yet contributed back to mainline.)

This tool calculates what I'll call a “generous” summary of the constructor priorities declared in the file—if you gave a priority on any prototype, it will always include a priority in its output. (Currently it doesn't help much with the case where multiple distinct priorities are used. There is also a bug about that.) What GCC does is the “mean” version: unless you're consistent, it will silently drop the priority; this is the problem we want to catch.

We could just report an error if we see inconsistent prototypes, which is the case that provokes the bad behaviour in GCC. However, I wanted to be a bit more “ground truth” than that: can we test what the compiler's output actually says about the constructor's priority? To do that, we need to know how the constructor priority mechanism works, and indeed about how the constructor mechanism works at all. It's roughly as follows.

For each constructor, the compiler generates assembly code to output a pointer to that function in the .init_array section. At run time, initialization code will walk this section and call each pointed-to function in order (this is done by the dynamic linker, if you're linking dynamically). To give constructors a priority, the compiler tweaks the name of this section so that it is something like .init_array.0123, for a constructor with priority 123. It is then the linker's job to sort these into order. The linker script rule that generates the final .init_array section is something like this.

  .init_array     :  { *(SORT_BY_NAME(.init_array.*)) *(.init_array) }

The real script is actually a bit messier than that, but the point is that for init-array sections named with a suffix, we link them in sorted order of their name, so lower-numbered sections will go first. Section named without a suffix come last.

To check that the compiler didn't discard the constructor priority, we need to find out what section name is used in the assembly code. That's pretty easy if we can run our tool immediately after the compilation step proper, i.e. the step that inputs preprocessed source and outputs assembly code. Doing that is pretty easy with the wrapper feature of toolsub. The following shell script fragment does the work; dumpconstr is the OCaml tool I mentioned, which gives the “generous” summary.

# Our check works by observing a single cc1 invocation...
# already-preprocessed source goes in, and assembly comes out.
# We cross-check each. We pass the cc1 preprocessed input
# to a CIL tool that dumps the with-priority constructors.
# Then we check that the assembly output contains the expected
# incantations in each case. If it doesn't, we flag an warning.
my_cc1 () {
    # drop the literal "cc1" argument
    # run the compiler
    # Now do our check... our caller has told us what infiles and outfile are
    for f in "${infiles[@]}"; do
        case "$f" in
                # use the CIL tool to find out what priorities are at source level
                constrs="$( "$(dirname "$0")"/dumpconstr "$f" )"
                # for each one, check the assembly agrees
                while read func constr prio; do
                    if [[ -n "$prio" ]]; then
                        # the symbol should be referenced in a section
                        # named ".init_array.0*%04d"
                        # where %04d is the priority, zero-padded
                        section="$( sed 's/^[[:blank:]]*\.section[[:blank:]]*/\f/g' "$outfile" | \
                            tr '\n' '\v' | tr '\f' '\n' | \
                            sed -rn "/$regex/ p" )"
                        actual_prio="$( echo "$section" | sed -nr "/${regex}.*/ {s//\2/;p}" )"
                        if ! [ ${actual_prio:-0} -eq $prio ]; then
                            echo "Error: inconsistent use of constructor attributes on function $func" 1>&2
                            exit 1 
                done <<<"$constrs"
                # we process at most one input file
            (*) # skip files that don't look like preprocessed C/C++

# delegate to the generic wrapper -- we've set WRAPPER so it won't re-source the funcs
. ${TOOLSUB}/wrapper/bin/wrapper

It works by defining a shell function, here called my_cc1 which is declared to the main wrapper logic via the CC1WRAP variable. The main wrapper calls this whenever the compiler wants to compile one file to assembly code, after preprocessing. The shell function's arguments are literally the command line that the compiler wants to run, and we run exactly that command. But immediately after that, we also run our OCaml tool on the input file (preprocessed C), then scrape the relevant section names from the output file (assembly code), and check the priorities match.

The tool works! I can drop it into the Makefile of the project I'm using it in, by adding

CFLAGS += -no-integrated-cpp -wrapper /path/to/wrapper

and hey presto, my compilation job complains that my priorities are getting dropped, forcing me to fix the bug. I'm providing the compiler a wrapper script, but most of the wrapper logic is generic stuff provided by toolsub; I only had to write the shell function shown above, and the OCaml program I mentioned, to implement my tool.

So, toolsub's core wrapper logic is pretty useful, but there are at least a couple of things I'd like to fix with it. One is that wrappers should not be dealing in terms of the cc1 command, which is a compiler-internal thing. It'd be better if the core wrapper could translate such commands into “documented” interfaces like the compiler's top-level cc command line. It already does this for preprocessing, translating from a cc1 to a cpp command line (er, mostly). For the assembler it's also handled, trivially since the driver runs this directly. It's not yet done for linking, nor for the compiler proper... the shell function above still gets given a cc1 command, albeit assisted by some other logic that has pre-extracted the input and output filenames. That logic itself remains a bit rough.

The other thing I could fix is that arguably, this wrapper logic should not be using shell scripts. They're neither fast nor pleasant. But actually... it's hard to beat them for simple things... I hate how clunky subprocesses quickly become in Python, for example. In the above example, we call out to an OCaml tool to do the analysis of C source. Scraping the assembler file using regex hackery doesn't seem so bad, since the assembler is line-oriented.

My goal with toolsub is to have it become an umbrella project for small utilities, like the wrapper being used here, that help people modify their toolchain's behaviour in ways that are easy to “drop in” to existing builds. Think “prototyping and personal use”, not production; research prototypes are my particular motivation. The scripts are there to help avoid getting bogged down in compiler-specific things, such as compiler internals or details of the command line. Another of the tools in there is cilpp which again uses CIL, this time packaged to enable CIL passes to be invoked as “preprocessor plugins” (using something like -Wp,-fplugin,/path/to/cil/pass.cmxs). Injecting these as arguments is easier than using an alternative driver like cilly.

A final tool I want to mention is cccppp, unfortunately just a proof-of-concept for now, written in the very fun five days I spent at the Recurse Center in April last year. It was motivated by the lack of anything comparable to CIL for C++. I want to prototype tools that instrument C++, but I don't want my prototypes to get caught up in the fast churn of LLVM and Clang internals. So the idea is to provide a simple, fairly stable tool that can “virtualize” C++ code by lifting all the built-in operators, like pointer dereference or array subscripting, into templates that the tool invoker controls. By default, these templates just do the normal thing. But, for example, to instrument array accesses, the invoker can supply a custom definition for the built-in array access operator (as a template specialization); I'm using this to provide bounds checking. The key idea is to work using only source-level rewrites of the C++ code... a relatively stable feature of the Clang tooling API. And it's a source-to-source pass, so as long as the tool can parse your code, you can run any compiler on what comes out. Expressing these source-to-source rewrites using Clang's API is, however, pretty nasty. It still needs more work before it can work on large programs; I'm hoping it won't need loads more.

What's the lesson here? One could still argue that I should just get this problem fixed in the compilers I care about, rather than writing hacky wrapper scripts. But I tend to think the opposite: during development, every systems codebase should be compiled via a hacky wrapper script! It's a good way to add custom sanity checks for properties that your project cares about, even if most others don't. It suits properties that are quick to check, are reasonably syntactic or local to specific code (at least to specific files or functions), justify a “fail fast” error report (e.g. because it's critical to correctness), and therefore are better expressed as a static check than in a test case. Link-time invariants are another thing that liballocs could use some checking of—though sadly I haven't implemented wrapper support for the linker yet. Some other properties, like naming conventions or checking for ABI stability, arguably fit this category, depending on how strict you want to be. Anyway, we'll see if I end up using this facility more often.

After subverting your toolchain, you may also want to subvert your runtime. I'll write more about that in future posts.

[/devel] permanent link contact

Tue, 14 Feb 2017

Custom ELF program headers—what, why and how

A couple of times recently I've found myself wanting to create ELF files with custom program headers. This post explains what that means, why you might want it, why there's a practical difficulty in doing so using a standard linker, and my slightly hacky way of solving it.

Program headers are metadata within ELF files. They describe a contiguous chunk of the file's address space. There are several different kinds of program header, of which the most critical is LOAD, signifying that the relevant part of the file should be loaded (or mapped) into memory, forming a segment within the program's memory image. Besides LOAD, other program headers are used to attach particular meanings to specific ranges within the binary. For example, the DYNAMIC program header tells the loader where the dynamic-linking metadata resides. The INTERP program header identifies to the (in-kernel) loader a string naming the program interpreter (the dynamic linker, which will actually do the loading proper). Some vendor extensions exist for adding features within the loading process, such as GNU_RELRO (which records a range that can be made read-only after relocation, bringing security benefits) or GNU_STACK (which does not identify an associated range, but exists as a flag to tell the loader whether the stack needs to be made executable). If you run readelf -l on some ELF binary, you'll see a dump of its program headers. On my machine I see something like this.

$ readelf -Wl /bin/true
Elf file type is EXEC (Executable file)
Entry point 0x401432
There are 9 program headers, starting at offset 64

Program Headers:
  Type           Offset   VirtAddr           PhysAddr           FileSiz  MemSiz   Flg Align
  PHDR           0x000040 0x0000000000400040 0x0000000000400040 0x0001f8 0x0001f8 R E 0x8
  INTERP         0x000238 0x0000000000400238 0x0000000000400238 0x00001c 0x00001c R   0x1
      [Requesting program interpreter: /lib64/ld-linux-x86-64.so.2]
  LOAD           0x000000 0x0000000000400000 0x0000000000400000 0x0058b4 0x0058b4 R E 0x200000
  LOAD           0x005e10 0x0000000000605e10 0x0000000000605e10 0x000404 0x0005f0 RW  0x200000
  DYNAMIC        0x005e28 0x0000000000605e28 0x0000000000605e28 0x0001d0 0x0001d0 RW  0x8
  NOTE           0x000254 0x0000000000400254 0x0000000000400254 0x000044 0x000044 R   0x4
  GNU_EH_FRAME   0x004b4c 0x0000000000404b4c 0x0000000000404b4c 0x00023c 0x00023c R   0x4
  GNU_STACK      0x000000 0x0000000000000000 0x0000000000000000 0x000000 0x000000 RW  0x10
  GNU_RELRO      0x005e10 0x0000000000605e10 0x0000000000605e10 0x0001f0 0x0001f0 R   0x1

Two reasons for wanting to add custom program headers are to add custom features to the loading process, or to make new uses of features it already provides. My first use case was an instance of the latter: to create a heap area at load time, as an additional segment, rather than the usual way of making explicit syscalls during execution. I was building ELF files for the ppcmem emulated execution environment. This doesn't implement any system calls, but does implement an ELF loader. By adding an extra LOAD program header providing an allocation arena, I could write a malloc() that does not require system calls—its heap arena is mapped at start-up by the ELF loader inside ppcmem. (Although I could have allocated space using a big char array, or by hacking the linker script to leave a big gap, a separate segment is logically cleaner, and a disjoint address range for heap allocations is helpful.)

The second use case for custom program headers comes from liballocs. When passed through the liballocs toolchain, a binary also generates extra metadata binaries which can optionally be loaded at run time to supply extra metadata similar to debugging information. Currently these files live outside the main binary, in a separate /usr/lib/allocsites hierarchy, which is a pain for deployment: if you move or copy an executable, you must move or copy its metadata as an extra step. I'd like to bundle the metadata into the binary somehow, and record its presence using a custom program header.

The way to specify the program headers is in the linker script. The problem with creating custom program headers is that a GNU-style linker's script language requires you to specify the entire set of program headers if you specify any program header at all. You can't just add one extra. Worse, the default set of headers is not even made explicit anywhere. The PHDRS command is what specifies the program headers, but the default linker script (try ld --verbose to see it) doesn't use it. Rather, it falls back on the linker's hard-coded default behaviour, which is to create all the familiar headers that we see above.

If we write our own PHDRS command we can create all the program headers we want—but we have to create them all ourselves, meaning manually creating all the ones that the linker would otherwise create for us. The set of headers created by hard-coded logic in the linker has grown over the years, to include things such as the GNU_STACK and GNU_RELRO headers I mentioned earlier. Re-creating this “standard” set of program headers is therefore quite a detailed platform-specific task, and is subject to change as new features are added; a burden we don't want to take on.

So, I came up with a way of getting the effect we want: “please create this extra program header, in addition to all the normal ones”. It is not pretty and not perfect, but it works as follows. Run the linker once, without any custom program header directives. Remember what program headers the linker generated, and what sections it assigned to them. Dump the linker script used during that link, and annotate it so that each output section is labelled with its program header assignment. Then, apply a diff to the linker script adding only the program header we care about. Finally, re-link with the diffed script.

The obvious downside is that we link twice. We're also a bit fragile to the organisation of the default linker script. Still, I'm prepared to pay these prices. I've implemented this approach with a pleasingly nasty (but small) pile of shell scripts, make rules and m4 macros. It now works pretty well for me. Here's a brief tour.

First, we have some makerules to dump the default linker script.

default-dynexe.lds: $(shell which ld)
    $(LD) --verbose 2>&1 |\
      sed -e '/^=========/,/^=========/!d;/^=========/d'\
          -e 's/\. = .* + SIZEOF_HEADERS;/& _begin = . - SIZEOF_HEADERS;/'\
      > "$@"

(Thanks to Dan Williams for the sed script, which I took from his page about creating a custom dynamic loader, which is well worth a read.)

Next, we have some makerules to turn such a script into an m4'ised macro-script. Essentially, this macroisation adds two “hooks” into the script: one to include a file describing the “normal” program headers and insert any new ones; and one to wrap each output section in a macro invocation that will decorate it with its corresponding program header names. This macroisation rewrite is easily done with regular expressions.

%-dynexe.lds.phdrify.m4: %-dynexe.lds
    cat "$<" | sed 's#SECTIONS#PHDRS\n{\n\tinclude(phdrs_dynexe.inc)\n\tinclude($(srcdir)/phdrs_extra.inc)\n}\nSECTIONS#' | \
    tr '\n' '\f' | sed -r \
    's@(\.[-a-zA-Z0-9\._]+)[[:space:]\f]+(([^\{]*[[:space:]\f]*)?\{[^\}]*\})@expand_phdr([\1], [\2])@g' | \
    tr '\f' '\n' > "$@"

Next, our phdrified linker script contains output section blocks that used to look like this

  .rodata         : { *(.rodata .rodata.* .gnu.linkonce.r.*) }

now wrapped in macro invocations, like this.

  expand_phdr([.rodata], [: { *(.rodata .rodata.* .gnu.linkonce.r.*) }])

Once we have a macroised linker script, we need to m4-process it with some macro definitions that turn it back into the script we want. We need two kinds of definition for this. The first is a list of the program headers in the file as it was originally linked. The easiest way is to use readelf to dump the existing ones. As a wart: since (on my machine at least) the linker doesn't understand the full range of symbolic (named) values for the program header's type field, we use the C preprocessor to macro-expand the symbolic names down to hex literals (as defined in elf.h), which it does understand. The second kind of definition is going to expand each output section to explicitly mention which program headers it belongs to. Note that a section can be in more than one program header. I used a bash script with associative arrays, again operating over readelf output, which handily outputs the mapping from segments to section lists. The script inverts this mapping, into a section-to-segment-list mapping. This generates a single m4 macro whose body is a giant if-else chain testing the argument section name against all the sections in the file, and yielding the list of program headers that the section is in. The easiest way is to use readelf to dump the existing ones. The result is a chunk of m4-ified text that describes the program headers in the original link, looking something like this.

phdr0 6;
phdr1 3;
phdr2 1;
phdr3 1;
phdr4 2;
phdr5 4;
phdr6 0x6474e550;
phdr7 0x6474e551;
phdr8 0x6474e552;

After all that. we then simply m4-include a second file describing the custom ones, and use m4 to expand out the program header assignments. Our output section before has now become

  .rodata : { *(.rodata .rodata.* .gnu.linkonce.r.*) } :phdr2

i.e. recording that the read-only data is in program header named phdr2 (the first LOAD in the dump we saw early on; this is the text segment, which also holds read-only data).

When we re-link with this script, we can add whatever extra linker inputs we want to go into our custom segment, and also apply a diff to the linker script. A simple diff looks like this, where the stuff in sections whose names begin  .insert will end up in a new segment.

--- default-dynexe.lds  2016-09-14 23:12:31.000000000 +0100
+++ insert-dynexe.lds   2016-09-14 23:13:46.000000000 +0100
@@ -187,6 +187,9 @@
   . = ALIGN(64 / 8);
   _end = .; PROVIDE (end = .);
   . = DATA_SEGMENT_END (.);
+  /* srk: new segment! */
+  .insert  : ALIGN(0x1000) { *(SORT_BY_NAME(.insert*)) } :insert
   /* Stabs debugging sections.  */
   .stab          0 : { *(.stab) }
   .stabstr       0 : { *(.stabstr) }

The .insert output section will also be aligned to a 4kB boundary, and be placed in the program header called “insert” by the script. If the input section is allocatable, it will also go into a LOAD header. By varying this patch, you can put whatever section you like in your program header, aligned however you like.

There are some warts in all this. One is that I found increasing the number of program headers would confuse the GNU linker into placing some of its implicitly generated sections, .gnu.build-id and .note.ABI-tag, at a silly offset in the file, and that would mess up the address assignment, by causing the file header no longer to be mapped at virtual address 0. To hack around this, I inserted a one-line sed command which explicitly places the .note.ABI-tag section. Another wart was that the PHDR header was no longer generated properly: it was coming out with an address of 0, not of the executable's start address (0x400000). This turned out to be because no LOAD header was including the region of the file containing the file and program headers themselves. To hack around this, we add a little more rewriting magic to our shell scripts, to put the PHDR and FILEHDR linker attributes to any program header entry with type PT_PHDR, and also to the text segment's LOAD program header (usually the first), then updating the latter's file offset and memory addresses so that the segment actually spans that part of the file.

I mentioned that I wanted to use custom program headers to embed one ELF file (a metadata object) inside another. How can we do this? Firstly, we embed the metadata ELF file into the enclosing ELF file and mark it with a custom program header. Secondly, at run time, when we want to load the metadata, we locate the program header and get a file descriptor pointing at the embdded content. Then we need a dynamic linker that can dlopen() from a file descriptor; we can hack the existing GNU dynamic linker for this fairly easily (not shown, but I've done it; ask me). Finally though, we need to handle the fact that file offsets in the embedded content will not match file offsets on this file descriptor! So we need to translate all the ELF file offsets in the embedded file by a fixed amount. File offsets live in the ELF header, program headers and section headers. (They might also live in other sections, but I'm not aware of any.) I wrote a simple C program to add a fixed offset to these. There's also some guesswork needed to figure out what that offset should be, since the linker will be in control of the in-file placement of the extra output section. This could be made reliable using one more link pass (bringing the total to three).

There's a GitHub repository containing these recipes and examples, here. It's all a bit GNU-specific and underdocumented for now.

All this effort to achieve a fairly simple task: package one ELF object inside another so that the “inner” object is loadable when running the “outer” object. What's another way of looking at this problem? We can think of ELF files as describing an initial state of the program (or, in the case of a library, a fragment of state). What we'd like them to be able to do is specify that this state includes a given file of a particular name and contents. In other words, we'd like them to describe not only aspects of the memory state, but also aspects of the filesystem state. Linkers already do something a bit this, namely static linking: objects that might not be on the destination filesystem can be bundled into the file, albeit losing their identity as files. Thinking of binaries as “partial initial program states” in this way, there's a clear relationship between ELF executables, Docker images and other collections of memory(-mappable) objects.

Our “embedded ELF file” requirement would be trivial if segments were nameable as files in the first place. This is of course a design feature of Multics, as was dynamic linking. Although the latter has now been picked up by Unix, files and segments have still not been fully re-unified. Maybe it's time? Dynamic linking is one of two key mechanisms, along with filesystems in userspace (whether this or this) that allow us to prototype extensions to Unix without pushing our changes into the kernel. There are usually performance, security or correctness reasons why such extensions are better off ultimately residing at least partly in the kernel. But making a playground entirely in user space is valuable for research and experimentation, and is something I'm working up to. The “embed part of a filesystem inside your binary” idea is one example of the sort of feature that we want to be easily prototyped in such a system. The core features we need—serving the content, and then binding it into the process's namespace”are already provided by the OS, so it's a matter of joining them together.

[/devel] permanent link contact

Mon, 30 Jan 2017

Debugging with the natives, part 2

Some aeons ago, I wrote a piece giving a rough outline of how a native debugger like gdb works, and promised a follow-up that looks at the same pile of techniques in a more principled way. I can't really excuse the delay, but if anyone's still listening, here goes.

Source-level debugging of native code is supported on (I make it) four different different levels in a complete hardware–software stack.

We can split these four into two pairs. The first two implement machine-level primitives and their “upward mapping” (virtualisation) into the operating system's process abstraction. The second two implement the source-level view that the programmer usually prefers, again mapping upwards from binary- to source-level.

Let's take the machine-level pair to begin with. From the operating system's point of view, all debugging support is designed around the following principles. (Perhaps I shouldn't say “designed” since in reality they “grew”— and only became principles later.)

Some surprisingly strong properties result from this design. Firstly, debugging can be done from a remote process, perhaps on a separate machine from the debuggee, perhaps even a machine of a different architecture. Secondly, debugging can be done post-mortem. Thirdly, the same infrastructure works for many source languages—albeit trivially so far, since we've only seen how to get an assembly-level view. There are some contrasts here with most language virtual machines (think JVM): these implement debugging using in-VM debug servers. These can work across the network, but don't support post-mortem debugging, and typically bake in concepts from source languages.

That's enough about the machine-level part. To go from machine- or assembly-level debugging to source-level debugging, we need help from the compiler. This is designed around the following principles.

Let's do a more concrete run-through of how it works. So far I've been fairly generic, but let's fix on GNU/Linux as our modern Unix—though all ELF-based systems are pretty similar—and a familiar architecture (x86-64) and specific metadata format (DWARF).

When I compile a program with -g it means “please generate metadata”. First, let's try without.

$ cc -o hello hello.c
$ readelf -S hello | grep debug  # no output! no debugging sections

You can still debug this program, at the assembly level, because the OS debugging mechanisms remain available. It's as if the compiler-generated assembly is code that you wrote manually by yourself. You can set breakpoints, watchpoints, single step, and so on.

$ gdb -q --args ./hello
Reading symbols from ./hello...(no debugging symbols found)...done.
(gdb) break main
Breakpoint 1 at 0x40052a
(gdb) run
Starting program: /tmp/hello 

Breakpoint 1, 0x000000000040052a in main ()
(gdb) disas
Dump of assembler code for function main:
   0x0000000000400526 <+0>:     push   %rbp
   0x0000000000400527 <+1>:     mov    %rsp,%rbp
=> 0x000000000040052a <+4>:     mov    $0x4005c4,%edi
   0x000000000040052f <+9>:     callq  0x400400 <puts@plt>
   0x0000000000400534 <+14>:    mov    $0x0,%eax
   0x0000000000400539 <+19>:    pop    %rbp
   0x000000000040053a <+20>:    retq 

Now let's compile with debug information.

$ cc -g -o hello hello.c
$ readelf -S hello | grep debug
  [28] .debug_aranges    PROGBITS         0000000000000000  00001081
  [29] .debug_info       PROGBITS         0000000000000000  000010b1
  [30] .debug_abbrev     PROGBITS         0000000000000000  00001142
  [31] .debug_line       PROGBITS         0000000000000000  00001186
  [32] .debug_frame      PROGBITS         0000000000000000  000011c8
  [33] .debug_str        PROGBITS         0000000000000000  00001210

What's in these debug sections? There are three main kinds of information. Firstly, there are files and line numbers (.debug_line). These encode a mapping from object code addresses to source coordinates (file, line, column). You can dump it fairly readably, as follows.

$ readelf -wL hello 
Decoded dump of debug contents of section .debug_line:

CU: hello.c:
File name                            Line number    Starting address
hello.c                                        4            0x400526
hello.c                                        5            0x40052a
hello.c                                        6            0x400534
hello.c                                        7            0x400539

Secondly, there is frame information (often this comes out in a section called .eh_frame so I cheated a bit above; to get exactly the above with a current gcc, you should add the -fno-dwarf2-cfi-asm switch). This tells the debugger how to walk the stack. What does walking the stack mean? Roughly it means getting a sequence of stack pointers paired with their program counter values, for each call site that is active on the callchain. The old stack pointer and program counter are always saved somewhere, otherwise we couldn't return from the call. To walk the stack we start from the current “live” register file given to use by ptrace(), which holds the “end” stack pointer and program counter. The DWARF then describes how to “rewind” these register values, and/or any other registers whose job is to record the callchain (rbp on x86-64; other callee-saves are often included too) back to their state at the previous call site in the chain. The description of this unwinding is logically a table, which you can see below for the main function. The cells are expressions describing how to compute the caller's value for a register, here rbp value (frame pointer) and also the caller's program counter, i.e. the return address (given as ra). The computations are both factored into two. Firstly we calculate a “canonical frame address” from the current frame's register values (see the CFA column): it's a fixed offset from rsp and rbp, and is actually a fixed address on the stack, but the expression changes from instruction to instruction as the stack pointer gets adjusted. Secondly we obtain the saved values we want by reading from the stack at fixed offsets from that address (c-8 means 8 bytes down). This factoring helps compactness, because the CFA-relative offsets don't change when the stack pointer moves; only the CFA column needs to describe that. However, although “stored at some offset from the CFA” covers a lot of cases, sometimes more complex computations are required, which usually appear as DWARF bytecode expressions.

$ readelf -wF hello
00000088 000000000000002c 0000001c FDE cie=00000070 pc=0000000000400526..000000000040053b
   LOC           CFA      rbp   ra    
0000000000400526 rsp+8    u     c-8 
0000000000400527 rsp+16   c-16  c-8 
000000000040052a rbp+16   c-16  c-8 
000000000040053a rsp+8    c-16  c-8 

The .debug_info section is the biggie. It describes the structural detail of the source program along both source and binary dimensions. It has a list of source files, but also a list of compilation units. The latter is where most of the structure is. It describes functions/methods, data types, and all the language-implementation decisions that the compiler took when generating binary code: how data types are laid out, which registers or stack slots hold each local variable over its lifetime, and so on. Although not shown much in the simple case shown below, addresses of program variables are described in a Turing-powerful stack machine language which is essentially a bytecode; the DW_OP_call_frame_cfa below is one operation, which simply says “push the address of the frame base, as recorded by the frame info”. The tree-like structure of the information also describes detailed static structure of code, including function inlining, the in-memory locations corresponding to particular lexical blocks in the code, and so on. (It's worth asking whether DWARF info it might usefully bundle the source code itself. I've never seen this done, but it would make a lot of sense to me.)

$ readelf -wi hello
Contents of the .debug_info section:

  Compilation Unit @ offset 0x0:
   Length:        0x8d (32-bit)
   Version:       4
   Abbrev Offset: 0x0
   Pointer Size:  8
 <0><b>: Abbrev Number: 1 (DW_TAG_compile_unit)
    <c>   DW_AT_producer    : (indirect string, offset: 0x2f): GNU C 4.9.2 -mtune=generic -march=x86-64 -g -fno-dwarf2-cfi-asm -fstack-protector-strong
    <10>   DW_AT_language    : 1        (ANSI C)
    <11>   DW_AT_name        : (indirect string, offset: 0x88): hello.c
    <15>   DW_AT_comp_dir    : (indirect string, offset: 0xb5): /tmp
    <19>   DW_AT_low_pc      : 0x400526
    <21>   DW_AT_high_pc     : 0x15
    <29>   DW_AT_stmt_list   : 0x0
 <1><2d>: Abbrev Number: 2 (DW_TAG_base_type)
    <2e>   DW_AT_byte_size   : 8
    <2f>   DW_AT_encoding    : 7        (unsigned)
    <30>   DW_AT_name        : (indirect string, offset: 0x0): long unsigned int
 <1><34>: Abbrev Number: 2 (DW_TAG_base_type)
    <35>   DW_AT_byte_size   : 1
    <36>   DW_AT_encoding    : 8        (unsigned char)
    <37>   DW_AT_name        : (indirect string, offset: 0xa2): unsigned char
 <1><3b>: Abbrev Number: 2 (DW_TAG_base_type)
    <3c>   DW_AT_byte_size   : 2
    <3d>   DW_AT_encoding    : 7        (unsigned)
    <3e>   DW_AT_name        : (indirect string, offset: 0x12): short unsigned int
 <1><42>: Abbrev Number: 2 (DW_TAG_base_type)
    <43>   DW_AT_byte_size   : 4
    <44>   DW_AT_encoding    : 7        (unsigned)
    <45>   DW_AT_name        : (indirect string, offset: 0x5): unsigned int
 <1><49>: Abbrev Number: 2 (DW_TAG_base_type)
    <4a>   DW_AT_byte_size   : 1
    <4b>   DW_AT_encoding    : 6        (signed char)
    <4c>   DW_AT_name        : (indirect string, offset: 0xa4): signed char
 <1><50>: Abbrev Number: 2 (DW_TAG_base_type)
    <51>   DW_AT_byte_size   : 2
    <52>   DW_AT_encoding    : 5        (signed)
    <53>   DW_AT_name        : (indirect string, offset: 0x25): short int
 <1><57>: Abbrev Number: 3 (DW_TAG_base_type)
    <58>   DW_AT_byte_size   : 4
    <59>   DW_AT_encoding    : 5        (signed)
    <5a>   DW_AT_name        : int
 <1><5e>: Abbrev Number: 2 (DW_TAG_base_type)
    <5f>   DW_AT_byte_size   : 8
    <60>   DW_AT_encoding    : 5        (signed)
    <61>   DW_AT_name        : (indirect string, offset: 0xb0): long int
 <1><65>: Abbrev Number: 2 (DW_TAG_base_type)
    <66>   DW_AT_byte_size   : 8
    <67>   DW_AT_encoding    : 7        (unsigned)
    <68>   DW_AT_name        : (indirect string, offset: 0xb9): sizetype
 <1><6c>: Abbrev Number: 2 (DW_TAG_base_type)
    <6d>   DW_AT_byte_size   : 1
    <6e>   DW_AT_encoding    : 6        (signed char)
    <6f>   DW_AT_name        : (indirect string, offset: 0xab): char
 <1><73>: Abbrev Number: 4 (DW_TAG_subprogram)
    <74>   DW_AT_external    : 1
    <74>   DW_AT_name        : (indirect string, offset: 0xc2): main
    <78>   DW_AT_decl_file   : 1
    <79>   DW_AT_decl_line   : 3
    <7a>   DW_AT_prototyped  : 1
    <7a>   DW_AT_type        : <0x57>
    <7e>   DW_AT_low_pc      : 0x400526
    <86>   DW_AT_high_pc     : 0x15
    <8e>   DW_AT_frame_base  : 1 byte block: 9c         (DW_OP_call_frame_cfa)
    <90>   DW_AT_GNU_all_tail_call_sites: 1
 <1><90>: Abbrev Number: 0

That's it for the tour. Let me finish with some reflections on what's good and bad about this way of doing things.

Descriptive debugging: the good

Why do debugging this convoluted, long-winded way, with all this metadata, instead of the apparently simpler VM way of doing things? In VMs, debug servers are integrated into the runtime, and offer a fixed, high-level command interface. This allows the VM's compile-time implementation decisions to stay hidden. There is no need to tell the debugger how storage is laid out, where local variables live, and so on, because the command interface is pitched at a higher level of abstraction; those details remain internal to the VM. This is convenient for the VM implementer, since generation of debugging information is onerous to implement. But it also requires cooperation from the debuggee, couples debuggers to a fixed command language or wire protocol, and presents strictly less information to the developer. While VM debuggers are designed around the abstraction boundaries of the source languages, metadata-based debugging actively enables descending through these boundaries. It is sometimes very useful for debugging tools to expose implementation details in this way. The most obvious case is when faced with a compiler or VM bug; the user would like to “shift down” to the lower level to inspect the assembly code or VM state. At other times, there are performance bugs that the developer has a hunch are about cache or paging effects; being able to see the raw addresses and raw memory contents can help here, even when the program is running on a VM.

Being highly descriptive, debugging metadata documents a large number of implementation decisions taken by the compiler, so is useful not only to debuggers but also to profilers, language runtimes (C++ exception handling is usually built on DWARF frame information), other dynamic analysis tools such as Valgrind family, and so on.

Debugging optimised code (without deoptimising)

Debugging metadata must describe optimised code. By contrast, VM debug servers typically arrange that debug-server operations only need to deal with unoptimised stack frames and at most simply-generated code (e.g. from a template-based JIT). Confusingly, even the “full-speed debugging” feature of HotSpot uses dynamic deoptimisation to get back to unoptimised code—the earlier approach was to run the whole program under the interpreter whenever you wanted a debuggable execution. In general, a debuggable VM instance must either refrain from optimisation, or know how to dynamically undo that optimisation when a debugger is attached. So, dynamic deoptimisation is not exactly “full speed”—unlike with native debuggers, execution still slows down significantly when a debugger is attached. Having the VM implement debug operations only over unoptimised code is a restriction that helps make the debug server simple, at some cost in debug-time performance.

The flip side is that VM debugging is pretty good at precisely maintaining the source-level abstraction (modulo VM bugs), without complicating the task of implementing optimisations. Meanwhile, in Unix-land, the debugging experience remains best-effort and only as good as the compiler-generated metadata, which is sometimes wrong or incomplete following complex transformations. When optimisation and debuggability are in tension, debuggability usually takes the hit, so a smooth debugging experience still sometimes relies on a separate unoptimised “debug build”. Tail call optimisations are a classic debugging-hindering optimisation, since they rely on eliding stack frames, meaning the debugger cannot know how many recursive calls are logically on the (source-level) stack. Instruction scheduling is another: the order of operations in the executed code need not match source program order, and this can make for debug-time confusion.

The control problem

Some other properties are incidental but tend to be true of current Unix-style debugging. Debugger support for exercising the target language's control features (exceptions, threads, allocations, ...) is uneven, because this can't be described purely as metadata; there needs to be some interaction with the host runtime. DWARF and similar debugging information is good at the “statics” of describing how to decode the program state, but not good at describing the protocols of interaction with a language runtime, necessar for performing operations such as spawning a thread, allocating an object or throwing an exception. These tend to be difficult unless these happen to be cleanly exposed as entry points in the language runtime. In practice debuggers usually achieve these things by having magic knowledge of particular implementations.

At least one semi-portable interface has emerged with the aim of encapsulating run-time control operations for debuggers' benefit. I'm thinking of libthread_db, best described by Joe Damato's excellent article. Unfortunately it's an abomination, because it violates the principle that implementation details are described by architecture-independent metadata. An odd but cleaner and more consistent alternative would be to bundle snippets of DWARF bytecode for doing these runtime interactions—perhaps in the debug information of a language runtime, either simply calling into the runtime (for cleanly abstracted operations) or doing something more complex. But that is only a technical possibility; there are no proposals or working demos of that as far as I'm aware (maybe I'll make one). This might sound wacky, but if you know about the early history of Java, in Oak and the Green Project, you'll see a certain uncanny similarity in these ideas.

Levels of abstraction

Debugging at multiple levels of abstraction is a neat facility of Unix-style debugging, but is also a difficult power to control. It can be useful to switch down to the assembly-level view, or to switch languages, but this capability doesn't generalise to the case where many abstraction layers are built up within the same language (think C++). The debugger will let you “continue to the next source line”, but it doesn't know how to keep you at the same abstraction level. If the next source line is deep inside a library implementing something fairly basic like a smart pointer, it will skip only as far as that line, whereas you probably wanted to stay roughly at the same level of abstraction, or perhaps within the same codebase. Things get particularly bad when there is a lot of inlining (again with C++). The traditional “step over” and “step into” are hints at this need, but are too crude.

Doing better is currently beyond the debugger's ken, but this problem could be solved: perhaps by bringing in the knowledge of libraries and source file trees that the debugger already has, or perhaps most simply by allowing programmers to manually mark out the boundaries between layers. This could be a simple partitioning over source files and binary objects, or could be something more complex, perhaps sensitive to calling context or argument values (consider the case of the same library used from two places in the same application). “Next”-style operations could then be defined in terms of these layers. I'd love to see this, although the details would take a lot of working out.

To be continued?

There's plenty of research to be done on more debuggable language implementations. This research area doesn't seem to get the attention it deserves. One problem is that debugging is usually an afterthought for researchers, even though it is essential for programmers. Another problem is that not many PL researchers have sight of the design space; they're familiar with either the Unix-style approach or the VM-style one. I hope that in the future we can figure out how to get the best of both worlds.

[/devel] permanent link contact

Thu, 25 Feb 2016

Debugging with the natives, part 1

By popular request, I'm going to run a couple of articles explaining how native Unix debuggers like gdb or lldb work. In this one, we'll take a quick cross-section of what's going on. I'll say “gdb” but I could be talking about any similar debugger.

Starting things up

When you start a program in a debugger like gdb, it does a variation of the usual fork()exec() trick, in which it sets itself up as a tracer of the child process before it exec()s the program you want to run. Alternatively, the same API can attach to a process already running. Either way, the effect is that any signals received by the “tracee” process will be delivered first to the debugger, and any system calls made by the tracee will also generate a signal. These signals can be inspected, discarded or modified as the debugger wishes, using the ptrace() API. In effect, the process (tracee) and the debugger (tracer) execute by mutual hand-off, suspending and resuming each other.

Walking the stack

Using the same API, gdb can access the raw memory image of the debugged program, and its register file. (Note that when gdb is active, the program is suspended, so its registers are saved in memory.) It can use this to walk the stack and print a backtrace: the register file gives us the program counter and stack pointer. If we have a nice simple stack which saves frame pointers, it simply walks the chain in memory. To print symbolic function names, it can use either the symbol table (if we must) or the compiler-generated debugging information (better; more on that in a moment).

Reading local state

To print a local variable, the debugger uses the same compiler-generated debugging information to look up where that variable is currently located. This lookup is keyed on the program's current program counter, to handle the way that locals get moved around between registers and the stack from instruction to instruction. The result of the lookup might be a memory location, a register number, or even, sometimes, something composite like “first 8 bytes in register X, rest in memory location Y”.

Setting breakpoints

To set a breakpoint on a particular source line, gdb uses a different part of the debugging information which encodes a mapping between binary locations (program counter values) and source file/line/column coordinates. A small subset of program counter values are specially identified as “beginning of statement” instructions; these are normally the ones the user wants to break on. Sometimes, hardware debug registers can be used to implement breakpoints (up to some maximum); the CPU generates a trap on reaching the program counter value stored in the register. More classically, the “soft” method is for the debugger to overwrite the instructions with trap instructions, saving the old instruction. This trap will cause a signal to be generated when the program executes it.

Resuming from breakpoints

To resume from a software breakpoint, a bit of juggling is required. Recall that we overwrote the original instruction. We can replace it, then ask the OS to single-step the program (using the hardware's single-step mode), then re-set the trap. This is racy in multithreaded programs when the other threads aren't stopped. So instead a good debugger will emulate the instruction itself! This means using an internal interpreter for the target instruction set, backed by the memory and register file access that it already has via ptrace().

Conditional breakpoints and expression evaluation

If we have a conditional breakpoint, we also need to be able to evaluate expressions when we take the trap (and silently resume if the expression is false). For this, the debugger has a parser and interpreter for each source language it supports. These interpreters are very different from an ordinary interpreter: the expression's environment comes from the binary contents of the debugged program's image, rather than being a structure the interpreter itself maintains. (The interpreter keeps a small amount of local state, such as for subexpression results, but it's mostly in the debuggee.)

Calling functions

To evaluate expressions containing function calls, we need to be able to run code in the debuggee. To do so, we craft a special frame on the debuggee's stack, a safe distance away from its actual top-of-stack. Having evaluated the arguments as usual, we put them in the relevant registers (saving what was there), tweak the saved stack pointer to point to this crafted state, set the instruction pointer to the callee, and set the debuggee going again. The function runs as normal, and on return it uses an on-stack return address that the debugger supplied. This points at a breakpoint-like instruction that raises a signal, returning control to the debugger. This then cleans up the stack as if nothing happened, and resets the registers accordingly. (Disclaimer: I haven't actually dug through gdb's code here, so I might have one or two things subtly wrong. At least, the above is how I'd implement it. Let me know if you know better.)

It goes on...

There's lots more, but if you understand the above, you've probably got the gist.

All that seems like a big bag of tricks. What are the principles at work in this design? Although the system has “grown” rather than being “designed”, and has a generous share of quirks and unstandardised corners, there are some surprisingly strong principles lurking in it. I've written a little about these in my paper at Onward! last year. Next time, I'll cover briefly some of the same observations, and a few others, hopefully thereby re-explaining how debuggers work, a bit more systematically .

[/devel] permanent link contact

Thu, 16 Jul 2015

ELF introspection, robustly and portably

I often find myself writing code that does introspection at the ELF level. This is for things like looking up symbols, walking the link map, figuring out which shared object a particular memory address belongs to, which memory regions are executable, and so on. These are fairly common things to want to do in ptrace()- or LD_PRELOAD-based tools, or binary interpreter-like (Valgrind-style) ones. This post is about how to do these things robustly from within a process, where “robustly” brings some particular restrictions.

What introspection?

What does our address space look like? To spare myself the pain of drawing diagrams, we can ask Linux to dump a somewhat-pictorial view of a simple cat process by doing cat /proc/self/maps. I have reversed the order of the lines so that higher addresses go higher up, and tweaked the formatting slightly.

          address range           flgs  offset   dev   inode                  pathname or other name   
ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0                  [vsyscall]

    7fff301fe000-    7fff30200000 r-xp 00000000 00:00 0                  [vdso]

    7fff301a5000-    7fff301c7000 rw-p 00000000 00:00 0                  [stack]

    7f4a574bc000-    7f4a574bd000 rw-p 00000000 00:00 0 
    7f4a574bb000-    7f4a574bc000 rw-p 00023000 08:01 889388             /lib/x86_64-linux-gnu/ld-2.19.so
    7f4a574ba000-    7f4a574bb000 r--p 00022000 08:01 889388             /lib/x86_64-linux-gnu/ld-2.19.so
    7f4a57493000-    7f4a574ba000 rw-p 00000000 00:00 0 

    7f4a57298000-    7f4a572bb000 r-xp 00000000 08:01 889388             /lib/x86_64-linux-gnu/ld-2.19.so
    7f4a57293000-    7f4a57298000 rw-p 00000000 00:00 0 
    7f4a57291000-    7f4a57293000 rw-p 001bf000 08:01 889403             /lib/x86_64-linux-gnu/libc-2.19.so
    7f4a5728d000-    7f4a57291000 r--p 001bb000 08:01 889403             /lib/x86_64-linux-gnu/libc-2.19.so
    7f4a5708e000-    7f4a5728d000 ---p 001bc000 08:01 889403             /lib/x86_64-linux-gnu/libc-2.19.so
    7f4a56ed2000-    7f4a5708e000 r-xp 00000000 08:01 889403             /lib/x86_64-linux-gnu/libc-2.19.so
    7f4a56c09000-    7f4a56ed2000 r--p 00000000 08:05 2750795            /usr/lib/locale/locale-archive
        00838000-        00859000 rw-p 00000000 00:00 0                  [heap]
        0060c000-        0060d000 rw-p 0000c000 08:01 286233             /bin/cat
        0060b000-        0060c000 r--p 0000b000 08:01 286233             /bin/cat
        00400000-        0040c000 r-xp 00000000 08:01 286233             /bin/cat

This picture is a refinement of the usual OS textbook diagram of a Unix address space. As usual, we have the executable's text segment (bottom), read-only and then writable data (next two up), heap (next up), a stack (much higher) and the kernel region (up top). (Actually the kernel region is not shown, except for one small part, the vsyscall page, which is the only user-accessible area.) The biggest difference is that we have not only the executable but some libraries—primarily the C library (since cat doesn't use others), but also the dynamic linker (itself a library) and the mysterious vdso. All these are ELF objects. In general, a large subset of a process's structure is actually defined by the collection of loaded ELF objects and a little per-process structure also defined by ELF. Our problem is now: how can we acquire a view on this ELF-level structure, from within the program itself at run time, efficiently and robustly?

(The above listing also shows a few other flavours of mapping that won't concern us here: a memory-mapped file /usr/lib/locale/locale-archive, and two anonymous mappings, beginning at 0x7f4a57493000 and 0x7f4a57293000, which are almost certainly heap areas—we can have many heaps, not just the one that grows out of the executable's data segment. And of course, although we see only one mapping labelled [stack], in a large multithreaded process there will be many such.)

What's ELF?

The basics of ELF linking were first defined in System V Release 4 in 1988, and eventually adopted by non-commercial Unices starting in the mid-1990s (1995 for Linux, with kernel 1.2; 1998 for FreeBSD, with version 3; NetBSD and OpenBSD in 2001 and 2003 respectively). The “classic” picture of a loadable ELF file (direct from the spec) looks like the following.

Execution View
ELF Header
Program header table
Segment 1
Segment 2
Segment 3
Section header table

But actually, that omits some of the most interesting stuff! Let's try again.

Execution View
ELF Header
Program header table
Dynamic linking metadata
Dynamic symbol table
Dynamic relocation table
Segment 1
Segment 2
Segment 3
Section header table
Symbol table

Dynamic symbols, program headers and the miscellaneous dynamic linking metadata are what we're interested in introspecting on. The POSIX libdl API offers one interface for doing introspection at the ELF level, at least for looking up symbols by name. Non-portable extensions, like SunOS/Solaris's dladdr() and the later-added dladdr1(), together with GNU's dlvsym() and dl_iterate_phdr(), get us a bit further, walking program headers and looking up symbols by address or version. We get further still if we allow ourselves OS-specific features, like Linux's helpful maps file in the /proc filesystem that generated the dump above, together with the auxv file or its associated library call getauxval(). If we're happy to parse the ELF file itself, by doing file I/O, we can get at anything we like.

What's “robustly”?

There are problems with all of the preceding approaches. Portability is lacking from many—only the POSIX APIs and ELF file I/O are widely portable. Robustness is lacking from all of them, in the sense that it may not be safe or reliable to use them from “sensitive contexts” where ELF introspection may be required. The main kind of “sensitive context” is one where making system calls or library calls would cause unwanted reentrancy. For example, if we want access to metadata from instrumentation logic that runs in the middle of a malloc() call, calling out to a library which might itself call malloc() is a bad idea. If we're instrumenting the code paths that make system calls (ask me how!), calling out to the OS might also be a bad idea. These unintended reentrancies can easily cause infinite loops, deadlock or state corruption. There are also security reasons why system calls and library calls are less reliable than working entirely within the process: the filesystem might have changed since the filenames in our maps or link map were recorded, for example.

So, how much ELF introspection can we do without the help of system calls, library calls or non-portable code? The answer surprised me: quite a lot, but not quite everything. There is some low-hanging fruit, then some that we can just about reach (provided a couple of assumptions which, although not strictly portable, probably hold for most platforms), then some that absolutely no portable method can let us get (without resorting to file I/O).

Let's state our problem slightly more precisely. Our address space includes a selection of objects (the executable and all loaded libraries), mapped at particular addresses. We'd first like to enumerate these mappings, giving us a look-up from load addresses to the object's originating binary file (pathname)—more-or-less the structure we saw in the /proc/self/maps listing earlier. Furthermore, each object defines some symbols. We'd like to be able to resolve symbols in each object, as well as any other useful other metadata, like their soname, the libraries they depend on, the locations of support structures like procedure linkage table (PLT), and the relocations performed when they were loaded. This is what we saw in the elaborated ELF file picture. The symbols-and-addresses stuff is the most important; the relocation and miscellaneous information isn't useful as often, but we'd like to have it anyway. Finally, there's some useful metadata that the operating system has passed down about the process, in an ELF-defined structure called the auxiliary vector. It turns out that we'll use this to get at some of the other metadata, but some other parts of it are useful in its own right—they tell us the process ID, page size, and various properties of the hardware.

The link map

To enumerate the mapped objects, we need access to a structure called the link map. The portable interface for walking the link map is the mysterious r_debug protocol. This is surprisingly murky in its origins, but it probably comes from SunOS; it was definitely included in System V release 4 (the only solid reference I know is the relevant Programmer's Guide). It's actually not specified in most ABIs, although some seem to do so. The r_debug protocol defines a structure containing pointers to the link map. From an executable, it's easy to find this structure: we usually have a DT_DEBUG entry in PT_DYNAMIC, and the dynamic linker updates to points to a r_debug structure it allocates. This contains a pair of pointers to the link map, represented as a doubly-linked list.

What about from a shared library? Getting a pointer to the r_debug is trickier here, because shared libraries don't usually have a DT_DEBUG entry. This is a chicken-and-egg problem: we need the executable's r_debug to get to other objects, but if we're not starting in the executable, how do we get there? So here comes our first somewhat-non-portable assumption: that a global symbol _r_debug is available, and marks the r_debug structure in the dynamic linker. This symbol is commonly present, but I don't believe it's standard—POSIX generally specifies programmer-facing APIs, not linker-level symbols. (As we'll see shortly, another method for getting at it is available, but it is also non-portable and imposes some other requirements. Anyway, by combining these two options, we have a pretty good chance that we can find the link map. So, let's continue on the assumption that we have the link map.)


Once we have the link map, we'd like to do symbol lookups in each object. For this, we need (at least) the dynamic symbol tables, which we can find via the PT_DYNAMIC segments of each object. (We probably also want the symbol hash tables; the method is similar.) The link map helpfully keeps a pointer to this specific segment (rather than to the overall program header table). The segment's content consists of key-value pairs, and it's the pair whose key is DT_SYMTAB that points to the dynamic symbol table. (On most ABIs, a short cut to get a pointer to the containing object's PT_DYNAMIC segment is to make a link-time reference to the symbol _DYNAMIC. But this only lets us get the containing object's PT_DYNAMIC, whereas walking the link map gets us all of them.)

One caveat about symbol lookups is that only dynamic symbols are looked up. The other symbols, if there are any (not needed for dynamic linking, but useful for debugging), live in a different table, .symtab, which needn't be mapped at run time. Depending on how your object was linked, perhaps most symbols became dynamic symbols (--export-dynamic) or perhaps only a minimal selection did. If you've ever tried to look up stat() in the GNU C library, using dlsym(), and been surprised not to find it, this is why. The stat() function is one of a few that are always linked statically, even when the rest of the C library is linked dynamically. In general, not all code or data in our dynamic objects is actually described by dynamic symbols. I'll return to this shortly.

The auxiliary vector

The link map gave us a list of loaded objects and their base addresses, but it doesn't tell us where each individual segment is mapped. This is where we begin our voyage into the not-really-portable. In the case of the executable, we can get the program headers via another useful metadata structure: the auxiliary vector. This is created by the operating system to supply various information of use to startup code (typically the bowels of the C library) and the dynamic linker (which, if one is used, also mutates the metadata to make it look as though the executable was loaded directly). (If we failed to get the link map using the previous _r_debug symbol method, we can also use the auxiliary vector to get to the executable's DT_DEBUG, and hence to the link map.) Here's a very nice article all about auxiliary vectors.

The auxiliary vector lives high on the initial thread's stack, a fixed offset above the stack pointer when the program's first instruction receives control. If we could walk the main thread's stack right to the top, we could find the auxiliary vector easily. However, nowadays, since we lack frame pointers, that means making a library call to something like libunwind, and that might allocate memory. Even then, it's not guaranteed that we can walk the stack all the way to the top, since unwind information might be patchy up there. So we'll need a different approach.

I devised a nasty but ultimately pleasing hack for this. When we say “auxiliary vector” we really mean specifically a list of key-value pairs containing the useful metadata I mentioned. But there's also a bunch of other things up there, in a large contiguous structure (see the helpful article for a diagram): the environment strings (a blob of characters), argument strings (ditto), and vectors of pointers into these blobs (our initial environ and argv). Instead of walking the stack, can we get hold of a pointer into one of these blobs, and then parse the surrounding memory to find the start of the key-value pairs? It's very likely that we can, since we can expect to find a handy global variable, environ, that points into them.

Now, it's important to note that this isn't completely robust. The auxiliary vector only records the initial environment. A program can modify the current environment strings to its heart's content by using C library calls like putenv() and getenv(). These will mutate and, if necessary, reallocate the vector pointed at by environ. However, unless a program deletes its entire initial environment, and assuming the C library doesn't eagerly copy things out of it, that vector should always contain one or more pointers into the initial environment blob. So, we want to walk environ and look for such a pointer.

The next problem: how will we recognise such a pointer when we see it? For this, we need to fix some more very-likely-true assumptions. Firstly, we assume either that the initial stack is the highest user mapping in the process, or that if something is higher, it's a shared object (perhaps Linux's vdso) which we can identify in the link map. (Put differently: what's not allowed is for there to be a memory-mapped file above the initial user stack, other than a shared object. This doesn't seem too restrictive; we could even imagine an ABI-level guarantee from the kernel that it will never create such mappings.) This assumption gives us an upper bound to compare our environment pointers against. What about a lower bound? For that, we assume that the caller can supply us one, in the form of a pointer higher than any other stack-allocated string in environ. The address of a local variable in main() could be one such pointer. In fact, any address on the initial stack will probably be good enough, since putting a stack-allocated string in your environment would be a very quirky thing to do. Anyway, we suppose that our caller, the client code which wants to get hold of the auxiliary vector, can give us a such a pointer.

Now we're ready to go. We scan *environ for pointers to strings within these bounds. Then we navigate downwards in memory to find the end of the auxiliary vector, then keep going to find its beginning, after which we'll see the terminator of the argument or environment pointer vector.

Once we have the auxiliary vector, we have the executable's program headers, from which it's easy to get at most other things we need, again using the link map to access other objects.

Program headers

One challenge remains for us: getting at shared objects' program headers. The problem here is that these headers needn't, in principle, be found anywhere in memory. (The ELF format allows a file to require that its program headers are mapped, using PT_PHDR, but it's not required and its use is uncommon.) Typically a dynamic linker will always keep each object's program headers in memory somewhere, but it's not obliged to do so. Even if it does, that memory might be anywhere. The GNU dynamic linker sometimes stores just a malloc'd copy of the data, perhaps originally read through a file descriptor rather than a memory mapping. Either way, the pointer it maintains to the header data lives in private state (in extra non-public fields in the link map structure); there's no portable access to it. So, to get shared objects' program headers there's no avoiding the use of some platform-specific (and loader-specific) code to fish it out.

The major limitation of doing without program headers is that we don't have a complete map of the address space. The link map lets us enumerate objects and their start addresses, and we can then use the symbol tables to do a “forward lookup” from symbols to addresses. We can invert the symbol tables to give us a partial backward lookup from addresses to symbols. But we can't do quite as complete a backward lookup as the dladdr() function can. If you give dladdr() an address that falls within an object's mappings but isn't covered by an exported symbol, it can still tell you which object it's in. Only the program headers contain the information necessary for this. Another thing we can't figure out is which memory access permissions a segment was mapped with—again, that means looking at the relevant program header.

Maximising ELF introspectability

Armed with these observations, we could imagine dynamically rewriting the binaries on our system slightly to optimise for ELF introspectability. Firstly we would insert a PT_PHDR, and define a symbol (maybe _PHDR), roughly analogous to _DYNAMIC), to help us find it. Secondly, going back to the restriction that only dynamic symbols were available for lookup, we could export all .symtab entries as “hidden” dynamic symbols. The obvious objection to this is that it will slow down linking. However, symbol lookup by name happens via the hash table (DT_HASH or GNU_HASH). I haven't checked all the details yet, but it appears that adding them to the symbol table needn't necessarily mean including them in the hash table. There has to be a hash-chain “next pointer” per symbol, but it's permitted not to hook that entry into any bucket. Since link-time lookups occur via the hash table, leaving the hidden symbols out of the hash table seems to defeat the objection. They'd be present for a “deep” introspector to find, by scanning the symbol table. But they wouldn't interfere with the symbol lookup process, nor with code using dlsym().


If you'd like to use this introspection goodness, I've coded it all up in a single header, relf.h, which is fairly self-contained and should be self-explanatory. Please let me know if you find it useful!

[/devel] permanent link contact

Wed, 19 Nov 2014

How to write vaguely acceptable makefiles

It's not hard to write a clear, orderly makefile if you respect it as a programming job in its own right, and adopt some principles. I rarely see such principles written down. Here's what I've gravitated towards as I've learnt to use make more effectively.

[/devel] permanent link contact

Tue, 01 Jul 2014

Drop-in debugging

If you're debugging a problem that occurs in some process deep in a process tree, getting a debugger onto that process can be hard. I recently wrote a script called gdbrun that can help with this. The idea is that for any command cmd args, you can do gdbrun cmd args and it will run the program under a debugger, while externally behaving just like the original program. The exception is if the program doesn't terminate properly: in that case the debugger sticks around for you to interact with. (If you need a more complex condition than “doesn't terminate properly”, you can code that into the script too.)

This isn't ideal—say, if you only want to debug only one invocation of that binary among many. But it works in many cases. (I'm also working on a solution for the harder case, but one thing at a time.) The script is deceptively simple. Sadly it needs GNU bash, but for any non-GNU people it should be easy enough to port.


exec 7<&0
exec 8>&1
exec 9>&2

quote () {
    sed "s/\([\`\\\\\\\"\\\$]\)/\\\\\\1/g"

while true; do
    if [[ -n "$1" ]]; then
        # don't use 'echo' to generate input for quote, otherwise "-e" will be disappear'd
        argstring="${argstring:+${argstring} }\"$( cat <<<"$1" | quote )\""
    ctr=$(( $ctr + 1 ))
    shift || break


# can add -hold here to help debugging
xterm -e gdb \
    --eval-command "file $exe" \
    --eval-command "run $argstring </dev/fd/7 >/dev/fd/8 2>/dev/fd/9" \
    --eval-command "set logging file ${exit_status_file}" \
    --eval-command "set logging on" \
    --eval-command "print \$_exitcode" \
    --eval-command "quit \$_exitcode" </dev/null

inferior_status=$(cat ${exit_status_file} | sed 's/.* = //' )
exit ${inferior_status:-0}

What was hard? Firstly, it took some head-bashing to realise that sharing a terminal between the debugger and inferior wasn't going to work. That's why the script uses xterm—popping up terminal windows for an instant is a minor annoyance, but tolerable. I can't say for certain that a shared-terminal solution isn't possible, but the usual trick of stty -tostop wasn't enough to stop child processes hanging from contention for the terminal, so I gave up.

The second hard thing, and the hardest, was realising that I could thread duplicated file descriptors right down to the inferior, and open them with /dev/fd/7 and friends. This was non-obvious to poor old me. I first tried a solution with named pipes and cat processes forwarding data from/to the gdbrun script's streams to the inferior's streams. After much gnashing of teeth, this turned out to be unworkable. The reason is that pipes are asynchronous channels, meaning they're buffered. A consequence is that even if the inferior doesn't read from its input, some input data will be buffered in the pipe when it gets opened (actually it's opened by the shell that launches the inferior, using gdb's run <infile syntax), and then discarded when the inferior dies. So if you have a command like echo blah |(gdbrun true; cat) your blah will get gobbled into oblivion. The moral of the tale is that introducing cats and pipes is a recipe for mischief.

[/devel] permanent link contact

Sat, 23 Feb 2013

A curiously recurring explanation

In conversation recently(-ish), I tried to explain the infamous Curiously Recurring Template Pattern (CRTP) of C++ by relating it to mixins. That just turned one surprisingly tricky problem into two. Here I will try to rectify both problems by proivding a somewhat coherent explanation of how to realise mixins using CRTP.

/* defining a mixin */
template <class Instantiator>
class my_mixin
	int some_state;
	void invoke_feature() { /* using some_state, somehow */ }

/* "mixing in" a mixin */
class built_from_mixins
 : public 
      my_mixin<built_from_mixins> /* , other mixins ... */
    /* ... */

Unlike normal base classes, a mixin is designed to attach at an arbitrary point in the subtyping hierarchy. It's an orthogonal feature that can be “mixed in” to many potential classes, rather than an increment on a particular class. Therefore, rather than referencing a specific base class, a mixin leaves this reference unbound. Nevertheless, it still gives a name to the class it is extending. Here it is called Instantiator. This can be thought of as a “forward reference” to the class that is “mixing it in”. The name is a placeholder for the class that the programmer will extend using the mixin.

(There are other variations on this theme in other mixin-like constructs. Anyone interested in the many meanings of “mixin” could do worse than to start with Richard Gabriel's essay which is based around this subject— though I note that its actual point about incommensurability is deeper, completely distinct, and very interesting!)

Looking at the code, we see there is a cyclical reference chain: from the mixin user built_from_mixins to a specialisation of the mixin itself my_mixin<built_from_mixins> and back (via Instantiator, which is instantiated to built_from_mixins). These references are a bit strange. We haven't even defined built_from_mixins at the point where we use it to parameterise our mixin instance. Why does the compiler even allow this?

The answer is that of course it's allowed, and the compiler allows it simply by virtue of the usual C++ (and C) rules about incomplete types. Cyclic reference among data type definitions is not unique to mixins. Consider a linked list, where it's no problem to create a recursive “next” pointer inside the list node type. Since pointers-to-incompletes are allowed, and the list node type is just another incomplete type at that location in the code, the code compiles fine.

It takes only a small generalisation to apply this not just to incomplete pointer target types, but more generally to incomplete types used as template parameter instances. Of course we can refer to built_from_mixins inside its own inheritance clause—but we can only do things that we can do with an incomplete type. Using it as a template parameter is one such thing—so long as the template's definition is consistent with its parameter being incomplete. In particular, possible usages of Instantiator inside my_mixin, above, are limited if we want to use the incomplete my_mixin as our Instantiator: we can only do the things we can do with any other incomplete types inside a class definition. Happily, my_mixin's definition sticks to this regime, so is well-formed. Moreover, it itself is a complete data type! (Similarly, if you defined a mutually recursive pair of data types using pointers, whichever one of them you defined first in your source file would be complete immediately, even though it contains a pointer to the yet-to-be-defined second data type.) Being complete, our instantiation of my_mixin is fair game for deriving from. This is what allows us to derive build_from_mixins from my_mixin<built_from_mixins>: the latter is complete, even though its type parameter built_from_mixins (known as Instantiator inside the mixin) isn't.

In fact, we haven't used Instantiator at all inside my_mixin. So, why include it? What can we do with it? Well, we can safely use it in any way we can use any other incomplete type: as a pointer (or reference) target type, or as a type parameter. An example is boost's enable_shared_from_this, a mixin which adds the necessary state and member functions for allowing a class to provide a shared_ptr version of its this pointer. You can't safely create a shared_ptr from a regular pointer in general because you don't know where the target object's reference count lives. The enable_shared_from_this mixin fixes this by embedding a pointer to the refcount, in the guise of a weak_ptr subobject, into the mixing-in class. The guts of enable_shared_from_this are basically as follows.

template<class T> class enable_shared_from_this
    mutable weak_ptr<T> weak_this_;
    shared_ptr<T> shared_from_this()
        shared_ptr<T> p( weak_this_ );
        return p;

Just as in our first example, we have some private state and a public interface which implement an orthogonal feature that can be “mixed in” to any class. The mixin-instantiating class T is referenced only in an incompleteness-tolerant way, to instantiate other templates and (eventually, inside the definition of weak_ptr, which is not shown) to define a pointer target type.

I've also seen CRTP described as “virtual functions without polymorphic behaviour”. What does that mean? Since our mixin has a reference to its instantiating class, it can call its methods—even though the binding to that specific class has not yet been formed. In other words, we have deferred the binding of methods—but not until run time. Rather, we have deferred them to later in compile time, when our templates are elaborated. Let's look at an example.

Let's try to get this deferred binding without using CRTP, but also without using virtual functions. Unsurprisingly, it doesn't work. The best we can do is to try non-polymorphic inheritance, by writing something like this.

class X
	void f() { /* unimplemented */ }
	void g() { f(); }

class Y : public X
	void f() { cout << "Done something"; }

Of course, if we call X::g(), it calls X::f() and not G::f(). Using CRTP, we can get it to call G::f() without resorting to virtual functions.

template <class Derived>
class X
	void f() { /* unimplemented */ }
	void g() { Derived::f(); }
class Y : public X<Y>
	void f() { cout << "Done something"; }

CRTP allows the base class to leave certain functions undefined, for definition later in many possible derived classes. The derived classes are not derived the usual way, though: they are derived using CRTP, passing the derived class back to the base class as a type parameter.

This sets up a now-familiar kind of cyclic reference: the base class refers to its (potentially many) derived classes, through its template parameter. Having this “forward” reference, down the inheritance hierarchy, as well as the usual backward reference up it, is what allows derivation-time binding. It's also limiting: we can't subsequently “re-override” Y::f(). Y's method bindings are fixed at derivation time. We have to create a new specialization of X and derive immediately from that, using some other means to get at Y's functionality if we need it.

Interestingly, note that it's okay for us to do Derived::f() in our X class template. This is surprising because at the point where we instantiate X, Derived is an incomplete type. I mentioned earlier that we were restricted in what we could do with incomplete template parameters, but in this case, here we are happily calling a method of ours. At definition time, there are at least two possibilities for the code that must be generated at the site of our call to Derived::f(), because f() might be an instance member function or a static. (It could also be a function object overloading operator().) When we instantiate X, if we read the code strictly top-to-bottom, we haven't yet got to the body of Y, so it is not yet decided whether it will define f() as a static or an instance member function. Somehow, the compiler is examining looking ahead at the definition of Y at the point where it elaborates X<Y>, even though Y cannot be complete yet (because we're in the middle of elaborating its base class). I must confess, this is where my understanding of the C++ language runs out. Thte standard contains complicated rules about name lookup in templates—principally the infamous “two-phase name lookup”. In short, “unqualified names” in templates are looked up at template definition time, whereas “qualified names” are looked up at instantiation time. Clearly our use of Derived::f() is a qualified name. No doubt there is some part of the standard detailing how the second-phase lookup is permitted (and required) to look into an incomplete type, namely Y in our example (incomplete at the time of X's instantiation), to understand the meaning of Y::f(). I haven't yet found a reference to it though. If anyone can point me to it, I'd be much obliged.

[/devel] permanent link contact

Wed, 27 Jun 2012

32 bits should be enough for anyone

For a brief while, 32-bit Intel machines were the de facto standard in commodity hardware , and life was simple. Certainly, it's an ugly architecture, gigantically overcomplicated by backwards-compatibility. Its virtual addressing features are terrifying. But the subset of it which user-level programmers on modern OSes use is fairly comprehensible. There is an ISA-defined stack with its own registers and a well-defined calling convention. Pointers and most integers are both 32 bits, meaning that the “word” is a useful and well-understood unit of storage.

All this changed in the early 2000s as AMD's 64-bit extension of the ISA came into popularity. Upon us were forced bigger integers, bigger pointers, and a more complicated stack and calling convention (in the name of “performance”, but at huge cost in complexity). I believe these were completely retrograde steps. Now that pointers are 64 bits, our software's memory footprint and disk footprint are bloated considerably. To “alleviate” this, and to avoid certain paradoxes about the size relationships between short, int and long, an int in most C compilers stayed at 32 bits. Unfortunately, this is completely braindead, because int is very much supposed to be a word-sized type. This is the reason that C's “defaults to int” semantics, as applying to unprototyped functions and untyped variables, are sane.

Does this matter? Yes! Here is some code that was mysteriously segfaulting for me this morning. It's from DTrace, or more specifically, Paul Fox's Linux port of it.

if ((P->rap = rd_new(P, P->pid)) != NULL)
  (void) rd_loadobj_iter(P->rap, map_iter, P);

Somehow, the pointer returned by the rd_new---which just wraps a simple calloc() call---gets corrupted immediately after returning. Suspiciously, said corruption is that the top four bytes are 0xffffffff, whereas the lower four bytes are those of the pointer returned by calloc(). Inspecting the disassembly around the rd_new call, we see something suspicious.

   0x000000000047ed12 <+150>:   callq  0x462bc6 <rd_new>
=> 0x000000000047ed17 <+155>:   cltq   

What's this cltq thing? Well, it takes the lower 32 bits of %rax (the 64-bit register holding the return value from rd_new()) and sign-extends them to fill the full 64 bits. This is exactly the corruption I was seeing. Why did the compiler insert this unwanted instruction? The answer is revealed if we recompile the file with -Wall.

Psymtab.c:645:3: warning: implicit declaration of function `rd_new' [-Wimplicit-function-declaration]

The function is not prototyped, so its return value defaults to int. But because int is now 32 bits wide, and the register holding the return value is 64 bits wide, the compiler helpfully obliterates the top 32 bits of the return value for us by sign-extending the lower 32 bits into them. If the compiler implementors had stuck with the intention of the int data type, that it be exactly a word in size, and therefore that “defaults to int” is sensible, this would not have arisen.

Now, this is clearly sloppy code. We should just fix it so that rd_new() is prototyped. It probably seems a bit of a nonsequitur that I am blaming this problem on 64-bit architectures. But on the other hand, how often in your code have you actually wanted integers that can store values in excess of 232? If you are a systems programmer, you might value the ability to encode large offsets. But we already had long long for this. In other cases, the vast bulk of our code deals with small integers, characters and pointers. Giving us an extra 32 bits of width in our ALU operations is a waste of transistors.

Why did we waste them this way? Well, we had to waste them somehow. In the early 2000s, we didn't really know what else to do with them, because (I suspect) there was little perceived demand for multiple cores in the commodity market (outside of servers). Nowadays, we have even more transistors, and even hardware guys realise that giving us 128-bit architectures would be a pointless waste. So, they spent some effort convincing us that we really did want multiple cores after all. And now we are busy complicating our software so that we can “exploit” these too. I have ranted before about how that risks generating a generation's worth of bad software. Perhaps I should say “another generation's worth”.

By the way, I'll concede that 64-bit address spaces can be useful, if they are used to support persistence or sharing. No need for pointer swizzling! But AMD's 64-bit x86 extensions do not provide the separation of protection from mapping to realise the sharing use-case. In other words, switching protection domains still means invalidating the TLB entries of shared mappings. Meanwhile, I haven't seen anyone using the extra address space for accessing persistent storage in a radically new way, although I'd love to see some approaches in this space.

I don't completely doubt the value of multiple cores either. The right way to see parallelism is as an enabler for radically more computation-intensive applications---likely to be in domains such as scientific computation, machine learning, and simulation---than what we can currently support. As I have also ranted about before, I am deeply disturbed by the fervour for mass rewriting of everyday software, and the disruption to the infrastructure it runs on, that is resulting from mindless multicore mania, in the same way that the 64-bit shift has disrupted our infrastructure. It's all in the name of performance, but it costs us far more of human beings' time and energy than it saves.

[/devel] permanent link contact

Fri, 01 Jun 2012

Link order

Initialization order of static state is a thorny problem. It's particularly tricky to get right portably. But until recently I didn't realise how tricky it could be even when restricting oneself to GNU tools on Unix platforms. Consider the following three-part program, consisting of an executable prog and two shared libraries lib1 and lib2. The dependency order is left-to-right in that list: prog depends on lib1 which depends on lib2.

/* prog.c */
#include <stdio.h>

/* from lib1 */
void greeting(void);

/* constructor */ 
static void init(void) __attribute__((constructor));
static void init(void)
	fprintf(stderr, "Initializing prog\n");

int main(void)
	return 0;

/* end prog.c */

/* lib1.c */
#include <stdio.h>

/* from lib2 */
void hello(void);

/* constructor */ 
static void init(void) __attribute__((constructor));
static void init(void)
	fprintf(stderr, "Initializing lib1\n");

void greeting(void)

/* end lib1.c */

/* lib2.c */
#include <stdio.h>

/* constructor */ 
static void init(void) __attribute__((constructor));
static void init(void)
	fprintf(stderr, "Initializing lib2\n");

void hello(void)
	printf("Hello, world!\n");

/* end lib2.c */

Here's a GNU Makefile to tie it all together.

CFLAGS := -g -fPIC
LDFLAGS := -L$$(pwd) -Wl,-R$$(pwd)
LDLIBS := -l1 -l2

default: lib1.so lib2.so prog

%.so: %.c
	$(CC) $(CFLAGS) -shared -o "$@" "$<"

	rm -f lib1.so lib2.so prog

Now when you do make (or gmake) it will build a program that initializes its libraries in right-to-left order: from the “most depended on” to the “least depended on”. We can verify this by running the program.

$ ./prog
Initializing lib2
Initializing lib1
Initializing prog
Hello, world!

Moreover, if you try flipping around the link order in the LDLIBS line, the link will fail with undefined reference to `hello', because the reference to hello (in lib1) is introduced after the reference to lib2, and the linker's defined behaviour is to avoid re-scanning for new undefined references---it's up to the invoker to order the libraries so that this works.

Let's try this on a BSD system. I have a NetBSD 5.0 VM hanging around, so I'll try that. It has recent GNU make, GCC and GNU binutils installed.

$ gmake
cc -g -fPIC -shared -o "lib1.so" "lib1.c"
cc -g -fPIC -shared -o "lib2.so" "lib2.c"
cc -g -fPIC  -L$(pwd) -Wl,-R$(pwd)  prog.c  -l1 -l2 -o prog
$ ./prog
Initializing lib1
Initializing lib2
Initializing prog
Hello, world!

Strangely, our initialization order is flipped. This doesn't matter for our program, but if lib1 consumed some static state in lib2, it would matter quite a bit. What happens if we flip the link order around to compensate? We edit the LDLIBS line and re-make.

$ nano Makefile
$ gmake clean && gmake
rm -f lib1.so lib2.so prog
cc -g -fPIC -shared -o "lib1.so" "lib1.c"
cc -g -fPIC -shared -o "lib2.so" "lib2.c"
cc -g -fPIC  -L$(pwd) -Wl,-R$(pwd)  prog.c  -l2 -l1 -o prog
$ ./prog
Initializing lib2
Initializing lib1
Initializing prog
Hello, world!

This has done what we want. But what's going on? This link order didn't even work on GNU/Linux. Not only does it work on BSD, but it's required if we want a sensible initialization order. Our initializers run in left-to-right order, so we need to put the “most depended on” libraries first, not last. This isn't a BSD quirk per se, because we're using the GNU linker in both cases. I suspect the linker scripts are nevertheless different in the two cases. However, I haven't had time to look into the details of why. I'd be interested to hear, if anyone knows. I guess this is the sort of pecularity that gives libtool a reason to exist.

[/devel] permanent link contact

Tue, 13 Dec 2011

Load addresses

For reasons I will only hint at, I want to predict the load address of a set of shared libraries, given an executable that links against them. Naturally, I have turned off address space layout randomization.

At first, I thought I could use ldd for this. It seems to work.

$ ldd /usr/local/src/git-
        linux-vdso.so.1 =>  (0x00007ffff7fdd000)
        libz.so.1 => /lib/x86_64-linux-gnu/libz.so.1 (0x00007ffff7dc3000)
        libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007ffff7ba5000)
        libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007ffff7806000)
        /lib64/ld-linux-x86-64.so.2 (0x00007ffff7fde000)

But also, there is an environment variable called LD_TRACE_LOADED_OBJECTS that is supposed to have the same effect. As it happens, ldd is just a small wrapper script which sets this variable and invokes the dynamic linker, which on my system is /lib64/ld-linux-x86-64.so.2. Let's try doing this directly.

$ LD_TRACE_LOADED_OBJECTS=1 /usr/local/src/git-
        linux-vdso.so.1 =>  (0x00007ffff7ffb000)
        libz.so.1 => /lib/x86_64-linux-gnu/libz.so.1 (0x00007ffff7bc4000)
        libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007ffff79a7000)
        libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007ffff7608000)
        /lib64/ld-linux-x86-64.so.2 (0x00007ffff7ddc000)

That seems to work too. But wait! It's given us different load addresses than ldd did. Have I really turned off randomization? Well, yes. In fact, repeating either of these commands will reliably yield the output above, and they are reliably different from one another. What is going on?

Let's hack ldd so that it prints exactly what command it is going to execute.

$ ldd /usr/local/src/git-
About to eval:  LD_TRACE_LOADED_OBJECTS=1 LD_WARN= LD_BIND_NOW= LD_LIBRARY_VERSION= LD_VERBOSE= /lib64/ld-linux-x86-64.so.2 /usr/local/src/git-
verify_out is 
        linux-vdso.so.1 =>  (0x00007ffff7fdd000)
        libz.so.1 => /lib/x86_64-linux-gnu/libz.so.1 (0x00007ffff7dc3000)
        libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007ffff7ba5000)
        libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007ffff7806000)
        /lib64/ld-linux-x86-64.so.2 (0x00007ffff7fde000)

So, it has set a bunch of other environment variables to empty strings. They look innocuous enough. But also, it is invoking the loader directly, whereas we were just letting execve call the loader for us. Can we reproduce the result of ldd by running the same command it does?

$ LD_TRACE_LOADED_OBJECTS=1 LD_WARN= LD_BIND_NOW= LD_LIBRARY_VERSION= LD_VERBOSE= /lib64/ld-linux-x86-64.so.2 /usr/local/src/git-
        linux-vdso.so.1 =>  (0x00007ffff7fdd000)
        libz.so.1 => /lib/x86_64-linux-gnu/libz.so.1 (0x00007ffff7dc3000)
        libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007ffff7ba5000)
        libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007ffff7806000)
        /lib64/ld-linux-x86-64.so.2 (0x00007ffff7fde000)

Yes, we can. Now, the big question: which one is correct? Let's run our program under gdb and inspect the memory map.

$ gdb --args /usr/local/src/git-
... snipped ...
(gdb) break main
Breakpoint 1 at 0x404bf0: file git.c, line 509.
(gdb) run
Starting program: /usr/local/src/git- 
[Thread debugging using libthread_db enabled]

Breakpoint 1, main (argc=1, argv=0x7fffffffddd8) at git.c:509
509     {
(gdb) print getpid()
$1 = 27023
(gdb) shell cat /proc/27023/maps | grep 'lib.*\.so.*'
7ffff7608000-7ffff779d000 r-xp 00000000 08:07 356590                     /lib/x86_64-linux-gnu/libc-2.13.so
... snipped ...

So, libc-2.13.so has been loaded at address 0x7ffff7608000, which is what we got from running with just the LD_TRACE_LOADED_OBJECTS flag set, and not what we got with ldd or with invoking ld-linux.so.2 specially.

Why the difference? Clearly, first execveing the loader perturbs the address assignment. It's not clear why this should be---isn't the loader itself the first thing to be loaded anyway? I'm not yet sure what is going on.

Another question: is predicting the load address even a sound thing to do? Given that we had to disable randomization in the first place, it seems like a bad idea. In my case, this approach will do for now, but ultimately I should defer my work until application start-up time. Then we can discover the actual loaded addresses of the various libraries, which is much more robust.

[/devel] permanent link contact

Thu, 01 Dec 2011

Weak dynamic symbols

Although I know more than the average bear about linkers, there's always things I don't know. Until now I never had cause to understand the following: how does the linker know which symbols to link dynamically, and which to link statically?

A bit of head-scratching reveals the only possible answer. Given a command-line, it does the usual thing: look at the command-line options, perform the standard library look-up procedure, gathering a list of object files---some static, some dynamic. If a symbol is defined by a dynamic library, make it dynamic. Otherwise, it stays static.

That sounds fairly sensible. But it can mean surprises. Suppose you have a C program that wants to support an optional feature to be linked in at load time. You might write something like the following.
int optional_function(int arg1, void *arg2) __attribute__((weak));

/* ... */

void do_something(void)
    if (optional_function) optional_function(42, &some_obj);
    /* else skip the optional part... */

If you pull this sort of trick within a shared library, it works fine. But inside an executable: no dice! If you compile this into an executable and look for optional_function in your dynamic symbol table, you'll be disappointed.

$ objdump -T my-program | grep optional_function

What is going on? Well, it's in the static symbol table, silly.

$ objdump -t my-program | grep optional_function
0000000000000000  w      *UND*  0000000000000000          optional_function

What does it mean to have an undefined symbol in your executable's static symbol table? It means it will silently take the value zero! In fact, the relocation records referencing your symbol have already been discarded.

$ objdump -rRd my-program | grep -A1 -B1 callq
  400549:      bf 2a 00 00 00        mov    $0x2a,%edi
  40054e:      e8 ad fa bf ff        callq  0 <__init_array_end>
  400553:      b8 00 00 00 00        mov    $0x0,%eax

Cheerily, the linker has inserted a direct-bound call to address zero in your code. That's not what we want! So, how can we fix it?

The trick is in the linker's (or at least the GNU linker's) --dynamic-list option. First, create a file called whatever you like (mine's called dynamic-list), containing the following.

{ optional_function; };

Now link your program passing --dynamic-list <your-dynamic-list> to the linker.

gcc -Wl,--dynamic-list -Wl,<your-dynamic-list> -o my-program my-program.c

Hey presto! You should now have your weak symbol in the dynamic symbol table.

$ objdump -t my-program | grep optional_function
0000000000000000  w   D  *UND*  0000000000000000          optional_function

That's a bit ugly. Recalling the linker behaviour I described at the beginning, the simpler way to do it is just to link your executable against a shared library defining optional_function.

You might wonder (as I do): what is the point of putting undefined symbols in an executable's static symbol table? Once the executable is output, it's too late to link anything with them. Surely they should all be “promoted” to dynamic symbols? [Update, 2012-5-19: there is of course a linker option for doing this, which in the GNU case is --export-dynamic. Still, I'm not sure why it isn't the default.]

It would also be nice to have an objcopy option for adding dynamic symbols in this way, so we can do it after the fact, rather than changing the linker command like we did above. However, this is nontrivial for the reason I mentioned---the relocation records that you would want have already been eliminated. So, we would need to re-create them. This is similar to something I began work on before. At some point I might resurrect my objcopy patches and try to repurpose them to this problem. For now, I will just hack in the extra linker options.

[/devel] permanent link contact

Thu, 06 Oct 2011

LLVM structural typing

I'm learning about the LLVM compiler infrastructure at the moment.

LLVM bitcode includes a notion of data types. These are used to control implicitly the size and encoding of values generated by various operations, to hint about mappings to underlying machine data types (e.g. on architectures that distinguish floating-point from integer registers) and to implicitly cause certain transformations to be effected, such as padding or sign extension. (I'm not yet sure whether all such operations need to be explicitly rendered as an LLVM “bitcast” operation or not. At least, LLVM's notion of types can be used to define validity of these operations, whether or not they happen implicitly.)

Moreover, addresses (pointers) are typed according to the type of the values they reference (point to). The data types are in this sense higher-order. (This is a weaker case of “higher-order” than types specifying the behaviour of functions. But it has some things in common. I will blog about this more in the future.) These data types control implicitly how much data is read or written by indirect loads and stores.

A typical C front-end will encode C data types directly into this type system. However, this is just a convenience. The encoding discards some of the semantics of the C type system, because in LLVM, composite types are treated purely structurally, whereas in C, they are always treated nominally. Consider this program.

#include <stdlib.h>

struct Foo {
  int a;
  int b;

struct Bar {
  int x;
  int y;

int main(void)
  struct Foo *f = (struct Foo *) malloc(sizeof (struct Foo));
  struct Bar *b = (struct Bar *) malloc(sizeof (struct Bar));


  return 0;

In LLVM bitcode, using llvm-gcc, we get the following.

; ModuleID = 'test.o'
target datalayout = "e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v64:64:64-v128:128:128-a0:0:64-s0:64:64-f80:128:128-f128:128:128-n8:16:32:64"
target triple = "x86_64-unknown-linux-gnu"

%struct.Bar = type { i32, i32 }
%struct.Foo = type { i32, i32 }

define i32 @main() nounwind {
  %retval = alloca i32
  %0 = alloca i32
  %f = alloca %struct.Bar*
  %b = alloca %struct.Bar*
  %"alloca point" = bitcast i32 0 to i32
  %1 = call noalias i8* @malloc(i64 8) nounwind
  %2 = bitcast i8* %1 to %struct.Bar*
  store %struct.Bar* %2, %struct.Bar** %f, align 8
  %3 = call noalias i8* @malloc(i64 8) nounwind
  %4 = bitcast i8* %3 to %struct.Bar*
  store %struct.Bar* %4, %struct.Bar** %b, align 8
  %5 = load %struct.Bar** %f, align 8
  %6 = bitcast %struct.Bar* %5 to i8*
  call void @free(i8* %6) nounwind
  %7 = load %struct.Bar** %b, align 8
  %8 = bitcast %struct.Bar* %7 to i8*
  call void @free(i8* %8) nounwind
  store i32 0, i32* %0, align 4
  %9 = load i32* %0, align 4
  store i32 %9, i32* %retval, align 4
  br label %return

return:                                           ; preds = %entry
  %retval1 = load i32* %retval
  ret i32 %retval1

declare noalias i8* @malloc(i64) nounwind

declare void @free(i8*) nounwind

Notice that although the compiler has emitted two LLVM type definitions, one for each of our struct types, it then proceeds to use only the first one of them. The second is redundant, because the two are structurally equivalent. This starts to look even more peculiar when we make our data types recursive.

#include <stdlib.h>

struct Foo {
  int a;
  int b;

struct Bar {
  int x;
  int y;

struct FooRecursive {
  int a;
  struct FooRecursive *next;

struct BarRecursive {
  int a;
  struct BarRecursive *next;

int main(void)
  struct Foo *f = (struct Foo *) malloc(sizeof (struct Foo));
  struct Bar *b = (struct Bar *) malloc(sizeof (struct Bar));

  struct FooRecursive *fr = (struct FooRecursive *) malloc(sizeof (struct FooRecursive));
  struct BarRecursive *br = (struct BarRecursive *) malloc(sizeof (struct BarRecursive));
  return 0;

This gives us the following.

; ModuleID = 'test.o'
target datalayout = "e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v64:64:64-v128:128:128-a0:0:64-s0:64:64-f80:128:128-f128:128:128-n8:16:32:64"
target triple = "x86_64-unknown-linux-gnu"

%struct.Bar = type { i32, i32 }
%struct.BarRecursive = type { i32, %struct.BarRecursive* }
%struct.Foo = type { i32, i32 }
%struct.FooRecursive = type { i32, %struct.BarRecursive* }

define i32 @main() nounwind {
  %retval = alloca i32
  %0 = alloca i32
  %f = alloca %struct.Bar*
  %b = alloca %struct.Bar*
  %fr = alloca %struct.BarRecursive*
  %br = alloca %struct.BarRecursive*
  %"alloca point" = bitcast i32 0 to i32
  %1 = call noalias i8* @malloc(i64 8) nounwind
  %2 = bitcast i8* %1 to %struct.Bar*
  store %struct.Bar* %2, %struct.Bar** %f, align 8
  %3 = call noalias i8* @malloc(i64 8) nounwind
  %4 = bitcast i8* %3 to %struct.Bar*
  store %struct.Bar* %4, %struct.Bar** %b, align 8
  %5 = call noalias i8* @malloc(i64 16) nounwind
  %6 = bitcast i8* %5 to %struct.BarRecursive*
  store %struct.BarRecursive* %6, %struct.BarRecursive** %fr, align 8
  %7 = call noalias i8* @malloc(i64 16) nounwind
  %8 = bitcast i8* %7 to %struct.BarRecursive*
  store %struct.BarRecursive* %8, %struct.BarRecursive** %br, align 8
  %9 = load %struct.Bar** %f, align 8
  %10 = bitcast %struct.Bar* %9 to i8*
  call void @free(i8* %10) nounwind
  %11 = load %struct.Bar** %b, align 8
  %12 = bitcast %struct.Bar* %11 to i8*
  call void @free(i8* %12) nounwind
  %13 = load %struct.BarRecursive** %fr, align 8
  %14 = bitcast %struct.BarRecursive* %13 to i8*
  call void @free(i8* %14) nounwind
  %15 = load %struct.BarRecursive** %br, align 8
  %16 = bitcast %struct.BarRecursive* %15 to i8*
  call void @free(i8* %16) nounwind
  store i32 0, i32* %0, align 4
  %17 = load i32* %0, align 4
  store i32 %17, i32* %retval, align 4
  br label %return

return:                                           ; preds = %entry
  %retval1 = load i32* %retval
  ret i32 %retval1

declare noalias i8* @malloc(i64) nounwind

declare void @free(i8*) nounwind

Notice that the self-referencing structure of FooRecursive has been lost, again because a different type is structurally equivalent.

Now for a final experiment: what about singleton structs? Are they structurally equivalent to a single element? I'll throw in a typedef too, to see whether that appears.

#include <stdlib.h>

struct Foo {
  int a;
typedef int Baz;

int main(void)
  struct Foo *f = (struct Foo *) malloc(sizeof (struct Foo));
         Baz *b =        (Baz *) malloc(sizeof        (Baz));
  return 0;

Here's the code it generates.

; ModuleID = 'test.o'
target datalayout = "e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v64:64:64-v128:128:128-a0:0:64-s0:64:64-f80:128:128-f128:128:128-n8:16:32:64"
target triple = "x86_64-unknown-linux-gnu"

%struct.Foo = type { i32 }

define i32 @main() nounwind {
  %retval = alloca i32
  %0 = alloca i32
  %f = alloca %struct.Foo*
  %b = alloca i32*
  %"alloca point" = bitcast i32 0 to i32
  %1 = call noalias i8* @malloc(i64 4) nounwind
  %2 = bitcast i8* %1 to %struct.Foo*
  store %struct.Foo* %2, %struct.Foo** %f, align 8
  %3 = call noalias i8* @malloc(i64 4) nounwind
  %4 = bitcast i8* %3 to i32*
  store i32* %4, i32** %b, align 8
  %5 = load %struct.Foo** %f, align 8
  %6 = bitcast %struct.Foo* %5 to i8*
  call void @free(i8* %6) nounwind
  %7 = load i32** %b, align 8
  %8 = bitcast i32* %7 to i8*
  call void @free(i8* %8) nounwind
  store i32 0, i32* %0, align 4
  %9 = load i32* %0, align 4
  store i32 %9, i32* %retval, align 4
  br label %return

return:                                           ; preds = %entry
  %retval1 = load i32* %retval
  ret i32 %retval1

declare noalias i8* @malloc(i64) nounwind

declare void @free(i8*) nounwind

Predictably, the typedef has gone away entirely, because it introduces no new structure. However, our singleton struct has stayed around. This isn't surprising either, because LLVM has instructions for accessing field members, whose semantics are affected by these structural differences. Composite types are not just sugar for arrays of bytes or words.

This does mean that if we wanted to encode nominal types into our LLVM bitcode, we could do it by wrapping nominally distinct types in differing depths of layered singleton structs. This would affect the bitcode that came out, e.g. inserting extra GetElementPtr operations, but shouldn't affect the optimised compiler output.

Overall, we can say that LLVM's data types are an annotation useful primarily for propagating and/or inferring data size and encoding information contextually through load and store operations. They are also used for checking: the bitcode is checked for first-order type errors. Since they are raised at the point of pointer use (e.g. a pointer assignment without appropriate bitcast is an error), they can catch likely first-order errors early, i.e. when a pointer is generated or stored, rather than later when it is dereferenced and its target data is misinterpreted. Here by “first-order type errors” I roughly mean those at the machine level, meaning they always concern the misinterpretation of bit-patterns encoding primitive values (integers, addresses, floating-point numbers). Since nominally distinct data types are conflated when they are structurally equivalent, then without using the encoding trick I mentioned, the bitcode will not capture, say, that one variable encodes polar coordinates by x and r, and another by r and theta. Detecting violation of these abstractions is beyond the reach of any analysis based (only) on LLVM data types using the standard encoding.

This includes dynamic analyses. Right now I'm writing an is_a function for KLEE. KLEE is (effectively) a fancy LLVM interpreter. Without recourse to the source code and/or hacking the C front-end, we can only do structural is_a, which is slightly disappointing. I should add that I'm not criticising LLVM here at all. Intermediate representations are not the place for fancy type systems, and the structural approach works nicely. It just means more work for me, when it looked for a moment as though I could abuse pre-existing functionality.

[/devel] permanent link contact

Wed, 01 Jun 2011

Memtable again

I've finally got a nicely packaged implementation of memtables, the data structure I introduced in a previous blog post. It's in a single header---memtable.h. I've also fixed a couple of stupid bugs that crept into malloc_hooks.c just before I released the initial version. You can see an example of combining these two in heap_index_hooks.c---which you can compile into a simple LD_PRELOADable shared library that will instrument the glibc malloc to keep an indexed table of allocated chunks, keyed on address. It's pretty easy to search the memtable to find the heap chunk for a given pointer anywhere into an object. I'll integrate this into my Cake runtime implementation soon, and the whole lot will appear here in due course (i.e. eventually).

If you use any of these files, please drop me a line to say how you got on---I'd really appreciate it.

[/devel] permanent link contact

Thu, 19 May 2011

Namespace problems

It's always been a theoretical problem with C that there is no namespacing. I'd often wondered how much of a practical problem this really was, with “nothing major” my tentative answer. I've finally run into my first bona-fide gotcha arising out of this problem. In short: wxWidgets and GLib both define a GSocket.

Annoying as this is, it wouldn't be in my top ten grumbles about the experience of programming in C. A far bigger related problem is versioning. This doesn't get cited as a weakness of C because it's also a weakness of most other programming languages. The very reason I ran into the namespace problem was because I had to compile wxWidgets 2.6, rather than using the 2.8 revision that's packaged for my distribution. Version mismatches can be seen as namespace collisions too. Instead of getting the version you want, the namespace has been populated with slightly different stuff that is, despite its close relationship to what you actually require, still incompatible, much the same as if the namespace were polluted with random third-party stuff.

Versioning issues could perhaps be brought more under the programmer's control. Most programming languages don't have an explicit notion of “version” when importing stuff. But when explicitly consuming some target API, you are always assuming at least something about its version. Having the programmer declare which version of a set of declarations they want to import would be straightforward. In C, it could even be done quite neatly with just the preprocessor---say, #define __LIBFOO_REQUESTED_VERSION 4.2) before the relevant #include.

Of course, pessimistically refusing to link across nominal mismatches of version would be a bad solution. We want a more structural and, indeed, behavioural or “semantic” approach. With the C preprocessor approach I outlined, it becomes the header file author's responsibility to embed a test about which prior API version the associated implementation is compatible with, most likely using a simple #if test. This responsibility is not unreasonable I'd say---the developers are in the best place to say what has changed with a new revision. And since it's in a header file, if the maintainers are lazy, the client programmer can override it.

One shortcoming of this approach is that the client programmer might be too lazy to work out which is the earliest library version their code will work with, and will instead select whatever version they are developing with. This is safe, but prevents some valid compositions. On a different system with a slightly older version of the library, the header might conservatively conclude that it's not compatible with the client, even though it could work. Anyway, I don't worry about this too much. Lots of researchers have thought about versioning before, so there's probably some good solutions knocking around.

Back to the sockets example, it's perhaps unsurprising that the name collision occurred when linking two chunks of infrastructure code. Name collisions are most likely when abstracting the same domain, having the same natural language vocabulary---namely sockets in this case. This is much more likely to happen in infrastructure software (i.e. modelling system resources) than application level software (modelling circles or ellipses or airline reservations or health records and so on), simply because you're less likely to link multiple instances of the latter together. Whereas application-level code is at or near the top of the software dependency graph, the infrastructure stuff is lower down so more likely to get sucked into a program through dependency.

I was interested to note Nick Nethercote's recent blog entry about a problem with duplication (generally) and bloat (specifically) associated with multiple wrapper layers for system calls and other nonportable interfaces. He was talking about mmap(), but the socket abstraction is another example. I have some research proto-ideas that might help with this problem. Essentially I'm interested in recovering a more finer-grained style of interface description from code, based on the idea of “relational interfaces”. You could then use this description to infer that two sets of functions had very similar behaviour, and factor out the duplication (with appropriate refactoring or adaptation tools).

This whole problem is another consequence of our fragile direct-interfacing, in-order methods for constructing of software. If we had a more flexible way of constructing software, the problem wouldn't arise. Rather than slavishly building on predefined interfaces that are specific to one underlying component---like one mmap() abstraction layer, or one socket abstraction--- we need smarter tools for specifying our requirements abstractly and finding customised ways of satisfying them using a range of “found” code. This is what my Onward! '09 proto-paper was ranting about. I guess it's good that I'm still ranting. Interface hiding is as good an idea as ever, and more work on it will happen, when I get time....

[/devel] permanent link contact


At the MMNet workshop in Glasgow last week, I talked about memtables. These are an efficient associative data structure, built using virtual memory support on modern OSes (currently implemented for Linux only), that are useful whenever you want to key a table on addresses in memory. See my slides for more.

Since entries with numerically similar keys are stored close to each other, memtables are, like certain other associative data structures, amenable to searching within a key range as well as exact-match lookups. By contrast, hash tables can't do this. (That said, a hash table supporting duplicate keys can be used to store items grouped into small equivalence classes. This is sometimes good enough, and could be made to work in my case. Nonuniform key duplication will mess up the O(1) nature of hash tables though.)

Memtables seem like they could be useful in lots of places. I invented them for DwarfPython as a fast way of storing and retrieving metadata given a key that may be an interior pointer (hence the searching requirement). I'm also (soon) using them in Cake as a fast way of tracking what objects are associated with what other objects.

The key space doesn't have to be addresses. It's possible we could even use memtables for free chunk binning, since large sizes are sparsely used. I need to do some more experiments to establish this.

The implementation comes in two parts:

  • A generic set of malloc hooks for glibc: these hooks aree “generic” in that they're designed to be easily specialised for various conceivable kinds of instrumentation. They're not generic with respect to the allocator---sadly they're specific to glibc, but most mature allocators should have some similar mechanism. The key usefulness in these hooks is factoring the various cases of the malloc API ---specifically the complex behaviour of realloc, but also other annoyances including memalign and null frees--- into an easy-to-use set of higher-level hooks. These are likely (but not guaranteed) to be a better match for whatever your instrumentation is doing. For example, defining a post_successful_alloc() function will hook all events that allocate a new heap block, whether they originated in a malloc(), a realloc() or a memalign().
  • a generic memtable library: this will appear soon! It's a set of hook definitions that maintain a memtable, and a lookup function.
  • Memtables are strictly faster than a hash table, at least for lookups, because they are basically a hash table without the hashing. At least for most applications of memtables, the table itself acts as an index for a bunch of linked lists---call them bins or buckets. Rather than mapping the key space onto a smaller hash space in order to keep the index small, we index directly by the key, and rely on the virtual memory trick to keep the index small. Since we can only save page-sized chunks of space, the key-space really needs to be used in a clustered and/or very sparse fashion. Just one used key per page of index is enough to allocate the whole table in physical memory, which we don't want. So if your table has four-byte entries, say, uniform key usage should be a lot less than one per thousand possible keys---but clusters of many are okay, so long as they're thousands apart.

    [/devel] permanent link contact

    Tue, 22 Mar 2011

    How much memory could an mmap() map...

    ...if an mmap() could map memory? I've had to ask this question of Linux on 64-bit x86 platforms recently.

    For reasons I will only hint at, I want to allocate a huge region of virtual address space for a structure a bit like a linear page table (called a “virtualized page table” on Wikipedia). We rely on a particular behaviour of Linux's mmap(): that mapping some memory isn't the same as committing to the availability of any underlying physical memory. Passing the MAP_NORESERVE flag means that memory will only be allocated when written to, hence allowing us to create a sparse table easily.

    I decided my table should have one word per 4KB of memory. For a 32-bit machine, which has 4-byte words and 220 such (aligned) regions, this means I need 4MB of virtual address space for my table (i.e. about a thousandth of the VAS). If we ask mmap() for such a region, it will clearly oblige us. On a 64-bit machine, which has 8-byte words and 252 such regions, I need 255 bytes of virtual address space for my table---32 petabytes, or about eight billion times as much as in the 32-bit case, but again, only a small fraction of the total address space (in this case a 512th, because words are twice as big).

    Here's a quick program you can run to test whether you can do an mmap() of a given size.

    #include <stdio.h>
    #include <errno.h>
    #include <string.h>
    #include <stdlib.h>
    #include <sys/mman.h>
    #include <assert.h>
    int main(int argc, char **argv)
        assert(argc > 1);
        size_t mapping_size = strtol(argv[1], NULL, 0);
        assert(errno != ERANGE);
        assert(mapping_size > 0);
        assert(sizeof(size_t) == 8);
        void *ret = mmap(NULL, mapping_size, PROT_READ|PROT_WRITE, 
        if (ret == MAP_FAILED)
            fprintf(stderr, "error: %s\n", strerror(errno));
            return 1;
            fprintf(stderr, "success!\n");
            return 0;

    And here's a shell script to drive it with powers of two until it fails.

    for exp in `seq 10 50`; do
        ./test $(( 2 ** $exp )) || break;
        echo "Succeeded for 2 ^ $exp"

    I'd be curious to know whether anyone on an x86-64 Linux machine maxes out anywhere different than 246 bytes. The kernel source will have the answer, but I can't be bothered wading through it right now. Interestingly, turning off the overcommit check (i.e. writing "1" to /proc/sys/vm/overcommit_memory) doesn't increase the limit for me.

    By the way, I'm using strtol because atol seemed to be narrowing the result to 32 bits even though a long is 64 bits. Instead of 231 I got -231, which unsurprisingly made mmap() fail. This seems like a bug, but probably isn't for some reason (possibly including a stupid mistake by me).

    As you might have guessed, I'm using this huge region of memory as a big flat structure to record metadata about memory. The main trick of a linear page table is that we can use virtual memory to encode large sparse arrays, without allocating memory for page-sized regions of the table that are empty. This generalises to sparse tables other than page tables. The one I'm building is for tracking allocated heap regions.

    [Update, 2011-3-23: thanks to Malcolm Scott who pointed out that my problem might be more tractable, because current x86-64 processors only implement a 48-bit address space. This also means that the 46-bit limit makes more sense---my mmap() successfully allocated a quarter of the usable virtual address space! Now I'm wondering: are those 48 bits something I can rely on for the nominal x86-64 architecture, or will running the same binaries on future hardware silently issue addresses from the larger 64-bit space? For now it doesn't really matter, but send answers if you have them (on a postcard, if you like) please.]

    [/devel] permanent link contact

    Fri, 19 Feb 2010

    C++ Gotme number 4: constant surprise

    When I was less experienced with C++, I would avoid creating any sort of constness in my code because it invariably led to headaches. When I'd tried, all too often I'd found myself trawling through the source code either adding or removing constness all over the place in order to make things compile. It's certainly true that constness has to be done right or not at all. Like a lot of C++, it's not resilient either to mistakes or to change.

    These days I usually make a reasonable job. I've found the best way to understand constness is as follows. Every object in a C++ program exposes exactly two variants of its interface: the “full” one and the const one. The actual boundary between these, at least in the case of user-defined types, is somewhat flexible (as witnessed by mutable, and by your forgetfulness to add const to method definitions!).

    Grokking the difference between a const pointer and pointer-to-const is of course a rite of passage in C and C++ programming. A more obscure subtlety about const in C++ is that const references let you pass around temporaries' lvalues, but you can't do this in a non-const fashion. It's not clear why not, except that it's something you'd rarely want to do ---unless, of course, you had omitted to add the const annotation in the signature of whatever function you wanted to pass the lvalue to. That's one reason why getting const right is something you can't really opt out of.

    Today I was wrestling with an unfortunate side of const---untraceable compilation errors. I'm using the boost graph library so I can run graph algorithms over my DWARF data structure. Since all my “nodes” live in a std::map, I was specialising the graph_traits giving the map's value_type as the node type. Here's what the compiler said.

    ude/c++/4.4.3/bits/stl_pair.h: In member function ?std::pair<const long long uns
    igned int, boost::shared_ptr<dwarf::encap::die> >& std::pair<const long long uns
    igned int, boost::shared_ptr<dwarf::encap::die> >::operator=(const std::pair<con
    st long long unsigned int, boost::shared_ptr<dwarf::encap::die> >&)?:
    ude/c++/4.4.3/bits/stl_pair.h:68:   instantiated from ?boost::concepts::VertexLi
    stGraph<G>::~VertexListGraph() [with G = dwarf::encap::dieset]?
    /home/srk31/opt/include/boost/graph/graph_concepts.hpp:166:   instantiated from 
    ?static void boost::concept::requirement<Model>::failed() [with Model = boost::c
    /home/srk31/opt/include/boost/concept_check.hpp:43:   instantiated from ?void bo
    ost::function_requires(Model*) [with Model = boost::concepts::VertexListGraphCon
    test-6.cpp:8:   instantiated from here
    ude/c++/4.4.3/bits/stl_pair.h:68: error: non-static const member ?const long lon
    g unsigned int std::pair<const long long unsigned int, boost::shared_ptr<dwarf::
    encap::die> >::first?, can't use default assignment operator
    In file included from test-6.cpp:1:
    /home/srk31/opt/include/boost/graph/graph_concepts.hpp: In destructor ?boost::co
    ncepts::VertexListGraph<G>::~VertexListGraph() [with G = dwarf::encap::dieset]?:
    /home/srk31/opt/include/boost/graph/graph_concepts.hpp:166:   instantiated from 
    ?static void boost::concept::requirement<Model>::failed() [with Model = boost::c
    /home/srk31/opt/include/boost/concept_check.hpp:43:   instantiated from ?void bo
    ost::function_requires(Model*) [with Model = boost::concepts::VertexListGraphCon
    test-6.cpp:8:   instantiated from here
    /home/srk31/opt/include/boost/graph/graph_concepts.hpp:188: note: synthesized me
    thod ?std::pair<const long long unsigned int, boost::shared_ptr<dwarf::encap::di
    e> >& std::pair<const long long unsigned int, boost::shared_ptr<dwarf::encap::di
    e> >::operator=(const std::pair<const long long unsigned int, boost::shared_ptr<
    dwarf::encap::die> >&)? first required here 
    make: *** [test-6.o] Error 1

    The bottom error refers to line 188 of graph_concepts.hpp, which is doing an innocuous assignment from a vertex iterator v = *p.first;. The variable v is an instance of my map entry type. Somehow, the compiler can't synthesise a copy constructor for the pointed-to vertex. This was incredibly difficult to track down because it wasn't due to any use of const in my code directly. I scoured my code for suspicious const, but to no avail. What had escaped me is that the value_type in a std::map is as follows (from Stroustrup C++PL section

    typedef pair<const Key, T> value_type;

    In other words, maps are designed so that you can't update the key of an entry once they're created. This is very sane, because to do so might violate some invariant of the containing structure (imagining the popular red-black tree implementation). I hadn't noticed this limitation before because although it's common to initialise new pair objects, it's rare to update an existing one. Even though I had purposely avoided making the whole pair const in my code, there was already a const lurking in the header.

    That the compiler couldn't give me a better error message (i.e. one actually pointing to the code at fault) is certainly disappointing, and will be fixed in a better compiler. Whether all this fiddling constness is a weakness with the C++ language I'm not sure. It's certainly self-consistent and well-motivated. Other languages might not have me worrying about const, but they also might not catch certain errors. Is C++'s trade-off good value? It's also not clear whether a better definition of map might not have lifted the const into individual references to the value_type... or not. I'd like to be able to rail and propose some better language design, but part of me suspects that sometimes in programming, the devil really is inescapably in the details.

    [/devel] permanent link contact

    Mon, 01 Feb 2010

    Smart pointers: smart pointers (C++ Gotme number 3)

    I've been writing some C++ code using boost's smart pointers recently. Naturally, I tripped myself up. Memory corruption appeared, and I noticed that my shared objects were the problem. They were getting destructed even though I had a shared_ptr pointing to them the whole time. Here's how it happened.

    create_subprogram(Die_encap_base& parent, boost::optional<const std::string&> name)
        return boost::shared_ptr<Die_encap_subprogram>(new Die_encap_subprogram(parent, name));

    The return statement is evaluated as follows.

    So far, so good. Here's the calling code.

    	std::cerr << "dies size before: " << ds().size();
    	std::cerr << "dies size after: " << ds().size();

    Although my factory method “helpfully” returns a shared pointer, we discard the return value of the factory method. So, for the object to stay alive, we are relying on there being a copy of the boost::shared_ptr<Die_encap_base> somewhere. The code I've shown doesn't take a copy, so where could one come from? Well, one is created inside the object constructor, where it registers itself with a containing data structure. Here's the code which creates that second shared pointer.

    Die_encap_base(Dwarf_Half tag,
            Die_encap_base& parent, 
            boost::optional<const std::string&> name)
         :	die(parent.m_ds, parent.m_offset, tag, parent.m_ds.next_free_offset(), 0, attribute_map(), die_off_list()) // FIXME
            m_ds.insert(std::make_pair(m_offset, boost::shared_ptr<dwarf::encap::die>(this)));
            m_attrs.insert(std::make_pair(DW_AT_name, dwarf::encap::attribute_value(

    If this second shared pointer is going to be sufficient to keep the object alive, we require some “action at a distance”: the first shared pointer must somehow become aware that we have just created another shared pointer to its object. How might this work? The shared_ptr implementation would have to keep its reference counts in a shared table, accessible to all shared pointers, so that they can look up the relevant refcount by from the pointer. The table must be shared between all shared pointer instances, not just those of a given type parameter, to allow upcasting and downcasting---notice how the target type is different in this snippet than in the earlier one, because we're in the base class. For this reason, our hypothetical table implementation would have to catch pointers-to-subobjects, probably by storing address ranges not just object start addresses. Even this is tricky if the shared pointer was created from an imprecisely-typed pointer (i.e. one whose static type is a supertype) because we might be oblivious to some larger enclosing allocation (i.e. the superobject actually allocated). This is doable, albeit not portably, given boost::shared_ptr's caveat that it only applies to pointers that were returned by new---so we could get the full address range using the heap metadata.

    Instead of all this table business, boost::shared_ptr just uses a shared count created when constructing a smart pointer from a non-smart pointer. Each pointer carries the address of a separate object holding the shared refcount. When you copy the shared pointer, it increments the count and propagates the shared count's address into the new copy of the pointer. Naturally, since it relies on continuity of propagation through shared pointers, it doesn't work when that chain is broken---you end up with multiple counts for the same object. My code suffers exactly that problem: it creates a completely separate shared pointer to what is “coincidentally” the same object referenced by some shared_ptr (i.e. communicated one or more intervening non-shared pointers, in my case the this pointer in the constructor call).

    The fix is simple enough: have the factory return a regular pointer or reference to the constructed object. Of course, what if some client code innocently wants to use shared_ptr on that object? That's not necessarily a problem, but it won't magically extend the life of objects that would otherwise be reclaimed according to the object lifetime policies which my own code's use of shared_ptr creates. This ability to create, out of innocuous-looking code, multiple conflicting policies about when an object should reclaimed, is a limitation of the smart pointer approach, and is stronger than simply “shared_ptr won't collect cycles”.

    Slogan-wise, this is perhaps an argument in favour of “smart objects, not smart pointers”. True per-object reference counts can be implemented in at least two ways: either a Python-like in-band solution, where each object has a refcount built-in, or else the unconventional out-of-band table solution I outlined above. The former isn't an option for C++, because we can't retroactively add a count field to an object, and don't want all instances of all classes to pay the overhead. However, the latter solution is quite appealing to me. Keeping object descriptions out-of-band is a powerful technique for retaining binary compatibility while adding run-time facilities, and is the basis of my proposed implementation of Python, which I might just outline in a future post.

    [/devel] permanent link contact

    Wed, 14 Oct 2009

    C++ Gotme number 2: templates, typename and specialization syntax

    Here's some innocuous-looking template definitions in C++.

    // specialize this template!
    template <unsigned spec_id> class Extended_By 
        typedef struct {} spec;
    // specialization for Dwarf 3
    template <> class Extended_By<0U>
        typedef empty_def spec;
    template <unsigned spec_id> 
    class table_def : public Extended_By<spec_id>::spec
        typedef Extended_By<spec_id>::spec Extended;
    // ...

    I am implementing a delegation hierarchy of tables: lookup methods on the tables delegate to their base implementation if the lookup fails. The hierarchy is rooted at a special-cased “empty” definition (not shown). Each non-root table has a similar implementation, but must be a separate class (as it may want to override certain member functions), hence my use of templates. To define a new table, we specialize the latter template with a new integer argument denoting the concrete table being defined. (In case you're wondering, tables are singletons---but they get pointed-to at runtime, hence virtual dispatch.) The Extended_By template is simply a compile-time mapping from integers (which denote tables) to the type of the underlying table (i.e. the table which that table delegates to, if its lookup fails).

    Unfortunately, the code above doesn't compile.

    g++ -Wall -O0 -g3 -I/home/srk31/opt/include -I/usr/include/python2.5 -I/usr/incl
    ude -I../../../libsrk31c++ -c -o "spec.o" "spec.cpp"
    In file included from spec.cpp:8:
    spec.hpp:242: error: type `dwarf::spec::Extended_By<spec_id>' is not derived fro
    m type `dwarf::spec::table_def<spec_id>'

    The error message makes no sense at all. In fact, the problem is that the compiler can't tell that Extended_By<spec_id> is a type. If you insert the typename keyword into the latter template definition, as follows....

    template <unsigned spec_id> 
    class table_def : public Extended_By<spec_id>::spec
        typedef typename Extended_By<spec_id>::spec Extended;
    // ...

    ...then it all magically starts working. I must confess I'm not entirely sure why---surely the usual name lookup on Extended_By would work just fine?

    On a similar theme, I was specializing some predicates defined by the table template, using code as follows.

    template<unsigned spec_id> 
    bool table_def<0U>::attr_describes_location(int attr) const
    	return this->Extended::attr_describes_location(attr)
    		|| attr == DW_AT_location
    		|| attr == DW_AT_data_member_location
    		|| attr == DW_AT_vtable_elem_location
    		|| attr == DW_AT_string_length
    		|| attr == DW_AT_use_location
    		|| attr == DW_AT_return_addr;

    This also fails to compile, with a very confusing message.

    g++ -Wall -O0 -g3 -I/home/srk31/opt/include -I/usr/include/python2.5 -I/usr
    /include -I../../../libsrk31c++ -c -o "spec.o" "spec.cpp"
    spec.cpp:272: error: prototype for `bool dwarf::spec::table_def<0u>::attr_d
    escribes_location(int) const' does not match any in class `dwarf::spec::tab
    spec.hpp:278: error: candidate is: bool dwarf::spec::table_def::at
    tr_describes_location(int) const [with unsigned int spec_id = 0u]
    make: *** [spec.o] Error 1
    The solution is to eliminate the template parameter.

    bool table_def<0U>::attr_describes_location(int attr) const
    	return this->Extended::attr_describes_location(attr)
    		|| attr == DW_AT_location
    		|| attr == DW_AT_data_member_location
    		|| attr == DW_AT_vtable_elem_location
    		|| attr == DW_AT_string_length
    		|| attr == DW_AT_use_location
    		|| attr == DW_AT_return_addr;

    It's not really clear why the latter is a valid syntax for specialization while the former isn't. However, a useful rule of thumb would seem to be that you should only list template arguments up-front (after “template”) if your definition depends on them. It doesn't matter if the originating class template uses more arguments. By contrast, the argument list attached to the classname is what nails down the particular template instance you're specializing for---this argument list should correspond directly to that of the template class definition.

    [/devel] permanent link contact

    Tue, 29 Sep 2009

    Shell Gotme number 1 -- the Heisenbergian process trees

    On my Lab machine, the X login system doesn't clean up stray child processes that your X session may have left behind. (I've a feeling the Debian xinit scripts do this, but I'm not sure.) This was bothering me, because I start some background jobs in my .xsession which I want to die naturally when the session ends. So I put the following towards the end of my .xsession.

    echo -n "End of .xsession: detected living child pids: " 1>&2
    ps --no-header f -o pid --ppid $$ | \
    	while read pid; do kill ; done 2>/dev/null

    Unfortunately, I found that those pesky children were still alive. (Can you tell what's wrong? Yes, that's right.) Both the ps process and the while subshell are among the children which are being killed. So one way or another, the pipeline gets broken before the loop has managed to kill the children I actually wanted to kill. A version which doesn't suffer this problem is as follows.

    child_pids=$( ps --no-header f -o pid --ppid $$ )
    for pid in ; do kill  2>/dev/null; done

    It's another instance of the familiar Heisenbergian nature of inspecting process trees from the shell: doing many basic things from the shell entails creating processes, so inspecting the shell's own process tree is likely to yield perturbed results. Usually it's just an unwanted entry in ps (as with the old ps | grep foo gotcha) but the above shows it sometimes causes subtler problems.

    [/devel] permanent link contact

    Standalone ksmserver, part 2: getting it right

    I blogged previously about my attempts to use KDE's session manager in a standalone fashion (with fvwm, in my case). What I presented there isn't entirely satisfactory, mainly because not all window sizes and positions get restored when restoring the session. I've finally worked out why. Unfortunately, this has also revealed that some content in the previous post is wrong. Allow me to correct myself here.

    So, all you need in your fvwm config is something like the following.

    AddToFunc SessionInitFunction
     + I Exec dcop klauncher klauncher autoStart 3; \
    dcop ksmserver ksmserver autoStart0Done; \
    dcop ksmserver ksmserver kcmPhase1Done; \
    dcop ksmserver ksmserver autoStart1Done; \
    dcop ksmserver ksmserver kcmPhase2Done; \
    sleep 4; dcop ksmserver ksmserver autoStart2Done
    # assuming you quit using a Quit-Verify menu...
    AddToMenu Quit-Verify   "Quit"  SaveQuitSession

    [/devel] permanent link contact

    Fri, 18 Sep 2009

    Python, threading and the madness of language implementors

    I just stumbled across this video of a talk by David Beazley about the implementation of Python and, in particular, threading. In a nutshell, threading in Python is implemented as a complete minimal-effort hack on top of an interpreter that is essentially single-threaded by design. Static state is again the culprit---there's a big lock, called the Global Interpreter Lock, much like Linux used to have the Big Kernel Lock. But, rather than just protecting some carefully-selected critical regions, it protects all Python computation!

    So, alarmingly, the performance of Python threading is absolutely disastrous. It's ironic that Google are heavy users of Python, given that they work with some of the largest datasets on the planet and generally have to know a thing or two about optimisation and concurrency. Of course they don't use Python for their core systems, and are sponsoring an improvement effort called Unladen Swallow.

    I have some research ideas that predicated heavily on the position that language implementors often don't do a great job, particularly in the case of dynamic languages. So if we have to rewrite a bunch of dynamic language implementations, that's not really a terrible thing. I'm glad to have yet more evidence of this.

    [/devel] permanent link contact

    Tue, 15 Sep 2009

    clone() and the unusual process dynamics

    I had a weird problem with my X login scripts recently on my Lab machine---I noticed that for X sessions, the LD_LIBRARY_PATH environment variable wasn't being set, even though every other variable set from my ˜/.profile was still appearing correctly. After I bit of digging, I discovered why, but that only opened up an extra can of mystery.

    The basic problem was that ssh-agent is a setuid program. so at the start of loading, the Linux loader removes all dynamic linker options (including LD_PRELOAD and LD_LIBRARY_PATH) from the environment to avoid running user code with elevated privileges. (It would make more sense just to ignore them, but still to propagate them to children... but anyway.) I was still a bit puzzled though, because I wasn't knowingly running my session as a child of ssh-agent---my login scripts do start one up if it's not already running, but it's supposed to run as a sibling of my main X session script, rather than using the ssh-agent <command> mechanism to start a child session. And ps agreed with me---my session wasn't descended from an ssh-agent process... but I did have two running, where I thought my scripts went to pains to ensure there was only one.

    19233 ?        Ss     0:00 /bin/bash -l /home/srk31/.xsession
    19573 ?        S      0:00  \_ ssh-agent
    19589 ?        S      0:00  \_ fvwm
    19593 ?        Ss     0:00      \_ <... more X clients>
    19433 ?        Ssl    0:00 /bin/dbus-daemon --fork --print-pid 4 --print-address
    19432 ?        S      0:00 /usr/bin/dbus-launch --exit-with-session /home/srk31/

    The explanations for this were somewhat convoluted. Firstly, the Fedora xinit scripts (FC7) do this (in /etc/X11/xinit/xinitrc-common, sourced from /etc/X11/xinit/Xsession):

    # Prefix launch of session with ssh-agent if available and not already running.
    if [ -x /usr/bin/ssh-agent -a -z "" ]; then
        if [ "x" != "x" ]; then
            SSH_AGENT="/usr/bin/ssh-agent /bin/env TMPDIR="

    ...and later (in /etc/X11/xinit/Xsession)...

    # otherwise, take default action
    if [ -x "/.xsession" ]; then
        exec -l  -c "  /.xsession"
    elif [ -x "/.Xclients" ]; then
        exec -l  -c "  /.Xclients"
    elif [ -x /etc/X11/xinit/Xclients ]; then
        exec -l  -c "  /etc/X11/xinit/Xclients"
        # should never get here; failsafe fallback
        exec -l  -c "xsm"

    In other words, they test whether an ssh-agent is running, and arrange to start one if not. But in between testing and starting one, they run a shell---which naturally starts my login scripts. These check for ssh-agent themselves and, finding none, start one. Then later, the Fedora scripts start another one. It's a classic “unrepeatable read” race condition, but without any concurrency---just interleaving of foreign code (my login scripts).

    Next, why wasn't my session showing up as a child of one of the ssh-agent processes? ps's output was doubly confusing because the top of my process tree was a bash -l .xsession process, when that's the last to be launched by the sequence initiated in the Fedora scripts! Well, strace revealed that my processes were using the clone() system call to spawn new processes (which is subtly different from fork(), in that it allows shared address spaces and hence multithreaded programming). As we know, when a process clones itself in order to start a new process, one of the resulting pair replaces itself with the new process image, while the other continues on its merry way. In the case of both ssh-agent and dbus-launch, the parent process was the one which replaced itself, leaving the child to continue the work of SSH agentery or DBUS launchery or whatever. This is really confusing because it contradicts the usual expectations about causal ordering from parent to child processes---but it's perfectly allowed, and has always been possible in Unix.

    What was the fix? Sadly, there isn't a good one---I don't have the permission to edit the Fedora scripts on my Lab machine, and there's no configurable flexibility for disabling the ssh-agent launching or fixing the racey logic. So I added a hack to my .xsession shell script which detects the case where SSH_AGENT_PID is already set to a child of the shell process (since the ssh-agent my scripts created is a sibling) and if so, kills that process and re-sets SSH_AGENT_PID to the one I created earlier (which, handily, I keep stored in ${HOME}/.ssh/agent-$(uname -n)). As usual, completely horrible.

    [/devel] permanent link contact

    Sat, 22 Aug 2009

    C++ Gotme number 1: operator visibility

    I'm doing some C++ coding at the moment. It's a language well-known for “gotchas”, so although it'd be presumptuous for me to assume they'd get you as well, dear reader, I'm going to document some of the things that've caught me out.

    namespace dwarf {
    	namespace encap {
    		typedef Dwarf_Loc expr_instr;
    		bool operator==(const expr_instr& e1, const expr_instr& e2);
    		bool operator!=(const expr_instr& e1, const expr_instr& e2);
    		typedef struct expr
    			Dwarf_Half hipc;
    			Dwarf_Half lopc;
    			std::vector m_expr;
    			expr(const Dwarf_Locdesc& desc) : hipc(desc.ld_hipc), lopc(desc.ld_lopc),
    				m_expr(desc.ld_s, desc.ld_s + desc.ld_cents) {}
    			bool operator==(const expr& e) const 
    				//expr_instr e1; expr_instr e2; // test
    				return hipc == e.hipc &&
    					lopc == e.lopc &&
    					//e1 == e2; // test
    					m_expr == e.m_expr; // error!
    			bool operator!=(const expr& e) const { return !(*this == e); }
    		} loc_expr;

    Here we have some code from my C++ binding of libdwarf. I'm trying to use std::vector's builtin == operator to define equality on my struct expr, which is essentially a wrapper for vectors of Dwarf_Loc objects, where Dwarf_Loc is a structure defined by libdwarf. Here's what g++ makes of the above code:

    ude/c++/4.3.3/bits/stl_algobase.h: In static member function ?static bool std::_
    _equal<_BoolType>::equal(_II1, _II1, _II2) [with _II1 = const Dwarf_Loc*, _II2 =
     const Dwarf_Loc*, bool _BoolType = false]?:
    ude/c++/4.3.3/bits/stl_algobase.h:824:   instantiated from ?bool std::__equal_au
    x(_II1, _II1, _II2) [with _II1 = const Dwarf_Loc*, _II2 = const Dwarf_Loc*]?
    ude/c++/4.3.3/bits/stl_algobase.h:956:   instantiated from ?bool std::equal(_II1
    , _II1, _II2) [with _II1 = __gnu_cxx::__normal_iterator > >, _II2 = __gnu_cxx::__normal_itera
    tor > >]?
    ude/c++/4.3.3/bits/stl_vector.h:1111:   instantiated from ?bool std::operator==(
    const std::vector<_Tp, _Alloc>&, const std::vector<_Tp, _Alloc>&) [with _Tp = Dw
    arf_Loc, _Alloc = std::allocator]?
    dwarfpp.hpp:341:   instantiated from here
    ude/c++/4.3.3/bits/stl_algobase.h:795: error: no match for ?operator==? in ?* __
    first1 == * __first2?
    make: *** [dwarfpp.o] Error 1

    Clearly, my definition for operator== isn't being found. But if I uncomment the blamed line and replace it with some dummy code (commented above) which invokes the operator for two dummy objects, rather than vectors, it works fine. Why can I compare two Dwarf_Locs but not two vectors thereof, when vector defines a perfectly good operator== that should just invoke mine?

    The answer is that vector's operator== can't see my operator== definition, because of namespaces. According to Stroustrup (C++PL, third edition, section 11.2.4):

    Consider a binary operator @. If x is of type X and y is of type Y, x@y is resolved like this:

    So the code in std::vector clearly can't see my definitions in dwarf::encap::. However, the reason that this isn't such a common problem is that I'm defining operators on a type, namely Dwarf_Loc, that was defined not by me but in the pre-existing C library that I'm wrapping. I lazily dumped all of libdwarf's definitions into the global namespace, so the quick fix is to define my operator in the global namespace too.

    namespace dwarf {
    	namespace encap { typedef Dwarf_Loc expr_instr; }
    bool operator==(const dwarf::encap::expr_instr& e1, const dwarf::encap::expr_instr& e2);
    bool operator!=(const dwarf::encap::expr_instr& e1, const dwarf::encap::expr_instr& e2);
    namespace dwarf {
    	namespace encap {
    		//typedef Dwarf_Loc expr_instr;
    		typedef struct expr
    			Dwarf_Half hipc;
    			Dwarf_Half lopc;
    			std::vector m_expr;
    			expr(const Dwarf_Locdesc& desc) : hipc(desc.ld_hipc), lopc(desc.ld_lopc),
    				m_expr(desc.ld_s, desc.ld_s + desc.ld_cents) {}
    			bool operator==(const expr& e) const 
    				//expr_instr e1; expr_instr e2;
    				return hipc == e.hipc &&
    					lopc == e.lopc &&
    					//e1 == e2;
    					m_expr == e.m_expr; // okay
    			bool operator!=(const expr& e) const { return !(*this == e); }
    		} loc_expr;

    Note that if I'd done the right thing and wrapped libdwarf's definitions into some non-global namespace, say dwarf::lib, I'd still need to do something similar, because my operator definition won't be found if I put it in a different namespace like dwarf::encap, even though that's the namespace containing the code which actually needs the operator to be defined.

    Well, it certainly got me... now, on with coding.

    [/devel] permanent link contact

    Mon, 17 Aug 2009

    Ruby fails

    I'm sure Ruby is a nice language, but I get slightly annoyed by two things whenever I try to install a Ruby program. One is that it has its own build system, as a replacement for Make et al---there are things called Rakefiles in the source distributions. The other is that it also has its own configuration and deployment system, based on the setup.rb script. These things annoy me because they're yet more tools that I need to learn, and for no good reason. Python is guilty of the same sins too. A new language shouldn't entail a whole new build system. (I won't go into why, but hopefully you don't need me to. I would also complain that there are too many languages, but I'll save that too.)

    What really annoys me about Ruby is that setup.rb is broken, because it doesn't deal with prefixes properly. If I do ruby setup.rb config --prefix=${HOME}/opt, it still tries to install most of the program under /usr. So I tried giving the --prefix option to ruby setup.rb install too, but that doesn't do the right thing either. Instead it creates me a ${HOME}/opt/usr hierarchy and puts the stuff it was putting in /usr there.

    I might as well come clean and admit that the only Ruby program I routinely install is iplayer-dl. Anyway, my next attempt: configure everything using paths relative to ./, then supply the overall prefix at install time. That doesn't work either---setup.rb interprets the ./ passed with --prefix, so you get install destinations relative to your configure directory. But only for files affected by --prefix, which isn't all of them.

    Next attempt: configure everything relative to the root directory, then supply the prefix at install time. This does work, but you wouldn't guess from the output of ruby setup.rb install.

    $ rubyver=$( ruby --version | sed 's/ruby \([0-9]*\.[0-9]*\)\..*/\1/' )  # horrible
    $ rubyarch=$( uname -i )-$( uname -s | tr '[:upper:]' '[:lower:]' )           # horrible
    $ ruby ./setup.rb config --prefix='/' --sysconfdir=/etc \
    --libruby=/lib/ruby --librubyver=/lib/ruby/ --librubyverarch=/lib/ruby// \
    --siteruby=/lib/ruby/site_ruby --siterubyver=/lib/ruby/site_ruby/ --siterubyverarch=/lib/ruby/site_ruby//
    $ ruby $ ruby ./setup.rb install --prefix=${HOME}/opt
    rm -f InstalledFiles
    ---> bin
    mkdir -p /home/srk31/opt//bin
    install iplayer-dl //bin//iplayer-dl
    install iplayer-dl-gui //bin//iplayer-dl-gui
    <--- bin
    ---> lib
    mkdir -p /home/srk31/opt/lib/ruby/site_ruby/1.8
    install iplayer.rb /lib/ruby/site_ruby/1.8/
    ---> lib/iplayer
    mkdir -p /home/srk31/opt/lib/ruby/site_ruby/1.8/iplayer
    install downloader.rb /lib/ruby/site_ruby/1.8/iplayer
    install subtitles.rb /lib/ruby/site_ruby/1.8/iplayer
    install metadata.rb /lib/ruby/site_ruby/1.8/iplayer
    install preferences.rb /lib/ruby/site_ruby/1.8/iplayer
    install browser.rb /lib/ruby/site_ruby/1.8/iplayer
    install version.rb /lib/ruby/site_ruby/1.8/iplayer
    install errors.rb /lib/ruby/site_ruby/1.8/iplayer
    ---> lib/iplayer/gui
    mkdir -p /home/srk31/opt/lib/ruby/site_ruby/1.8/iplayer/gui
    install main_frame.rb /lib/ruby/site_ruby/1.8/iplayer/gui
    install app.rb /lib/ruby/site_ruby/1.8/iplayer/gui
    <--- lib/iplayer/gui
    <--- lib/iplayer
    <--- lib
    So, handily it's told me that it installed a bunch of stuff in /lib/ruby, which is exactly what I didn't want it to do. But, since I ran it without elevated permissions, I know that it can't have succeeded in doing that---yet there were suspiciously no error messages. Lo! Despite what it printed out, it has actually put the lib files in ${HOME}/opt/lib/ruby just as I wanted. Now, why was that so difficult?

    To make matters worse, you of course have to set your Ruby path to get the deployed program to work, and that is also horrifying:

    $ export RUBYLIB=${HOME}/opt/lib/ruby:${HOME}/opt/lib/ruby/site_ruby:${HOME}/opt/lib/ruby/site_ruby/1.8:
    ---disgusting, especially embedding the 1.8 version number in the path which will be seen (and interpreted) by any version of Ruby at all. It's following the pattern established Python, of course, and since Python has some reasonably sane people behind it I'm tempted to suspect that this ludicrous scheme has been selected for a reason---but even if this suspicion is correct, it'll have to be a very good one, and I somehow doubt that.

    [/devel] permanent link contact

    Fri, 13 Mar 2009

    ANTLR recipes

    I've been fiddling with ANTLR recently, trying to create a parser for my Cake language. Unfortunately I found it trickier than I was hoping, and this forced me to take a step back and create some simpler grammars so I could get the hang of how to realise some common grammatical features in ANTLR. Specifically, I looked at four features: recursive prefixing (i.e. a recursive unary prefix operator, so right-recursive), recursive suffixing (the same but suffixing, left-recursive), right-associative binary operators and left-associative binary operators.

    The appeal of ANTLR has been its ability to build ASTs in special-purpose syntax (agnostic towards the host language) rather than relying on semantic actions. Unfortunately it took me a while to get the hang of these tree construction features. The basics are here, but here's some helpful extra snippets and emphasis that took me a while to discover.

    I think that's all for now. I still have to get a good handle on operator precedence, which may or may not spawn another post.

    [/devel] permanent link contact

    Fri, 27 Feb 2009

    Standalone ksmserver, part 1

    I've been looking for an X11 session manager that actually works recently (since sadly xsm doesn't fit that bill) and was experimenting with KDE's session manager. It's peculiarly underdocumented but seemed like it should have all the functionality I needed, and should also be reasonably well-used and well-tested. So I was a bit disappointed when my basic set-up attempt at integrating it with fvwm appeared not to work. I was simply replacing the fvwm command in my .xsession with the following:

    ksmserver -r -w fvwm

    which should launch fvwm rather than kwin. Then in my fvwm configuration, to end the session properly when I quit the window manager, I tried the following.

    AddToFunc SessionInitFunction
     + I Exec echo hello from SessionInitFunction
    AddToFunc SessionExitFunction # kill ksmserver when we exit
     + I Exec dcop ksmserver ksmserver logout 0 2 2  
    ## from http://andrejserafim.wordpress.com/2008/05/16/kde-shutdown-logout-restart/
    #First parameter: confirm
    #Obey the user?s confirmation setting: -1
    #Don?t confirm, shutdown without asking: 0
    #Always confirm, ask even if the user turned it off: 1
    #Second parameter: type
    #Select previous action or the default if it?s the first time: -1
    #Only log out: 0
    #Log out and reboot the machine: 1
    #Log out and halt the machine: 2
    #Third parameter: mode

    (Thanks to Andrej Kazakov for the summary of logout's invocation that I pasted above, in turn extracted from the KDE source code.)

    Of course, it didn't work, so I put my developer hat on and had a look. Attaching gdb revealed that the session manager was entering a (non-tight) endless loop when I requested the logout---setting a timer which fired after one second, disabled itself and then recursively re-started the process. The problem was that the session manager has an internal state machine, and in my case it was stuck in state AutoStart0, whereas it was expected to end up in Idle after a while---the only state from which it can initiate a shutdown.

    To get out of AutoStart0, and subsequent start-up states, you have to manually call a bunch of other DCOP methods with names like autoStart0Done, kcmPhase1Done and so on. This is among what startkde does for KDE users, after performing various KDE-specific configuration updates. (These updates are exactly what a non-KDE user doesn't want to happen, at least in my case---since one major reason I don't use KDE is that I like to stick with the simple X11-provided ways of controlling backgrounds, cursors, input devices and so on.) We can manually invoke the DCOP methods to signal that it's time for the next state.

    We successfully avoid most of the KDE interference in this way, because the relevant other processes (like kcminit) haven't been launched. However, we do get some, because the klauncher process does exist -- it's started when running any KDE app if not already running. In particular, ksmserver's signals to klauncher have the unwanted consequence of starting up a bunch of “autostart” applications, like the KDE desktop background and panel, that have shortcuts in ~/.kde/Autostart and /share/autostart/. To avoid this, we tell the launcher to bump our start-up “phase”, which is just an integer, to a high value---since autostart apps are marked with a “phase” attribute, and are only started when moving through that phase. ksmserver only covers as far as phase 2, so we can set the phase to 3 and they won't be started up. So, here's my final fvwm config for use with ksmserver.

    [Update, 2009-03-02: not so fast! What I had here earlier didn't quite work, because fvwm doesn't necessarily execute each part of a compound function in sequence. So instead, it needs to all be rolled into one command. What's more, I had to hack in an arbitrary sleep because, depending on how long it takes to start up all the saved clients, the kcmPhase2Done call may come too soon (state machine not yet progressed into FinishingStartup, I think) and be ignored (causing ksmserver to get stuck in the latter state). Now, what follows does seem to work.]

    AddToFunc SessionInitFunction
     + I Exec dcop klauncher klauncher autoStart 3; \
    dcop ksmserver ksmserver autoStart0Done; \
    dcop ksmserver ksmserver kcmPhase1Done; \
    dcop ksmserver ksmserver autoStart1Done; \
    dcop ksmserver ksmserver kcmPhase2Done; \
    sleep 4; dcop ksmserver ksmserver autoStart2Done
    # signal ksmserver when we exit
    AddToFunc SessionExitFunction 
     + I Exec dcop ksmserver ksmserver saveCurrentSessionAs "saved at previous logout"; \
     dcop ksmserver ksmserver logout 0 0 3 

    I still haven't actually made the session management functionality work, which I think is going to take some more commands along “save” and “restore” lines. [Update, 2009-03-02: turns out to need just the one “save” command at exit, as shown above -- restoring happens by default, using the magic session name shown.] My ultimate goal is the ability to start multiple ksmserver processes, so that I can independently save and restore separate groups of clients within a single X login session. Fingers crossed... if it's nontrivial I'll write a follow-up post. There's also the joy of KDE 4 to think about later, which has exchanged DCOP for D-BUS and may well alter the undocumented behaviour I'm relying on.

    The Important Science to take away from all this is I suppose that interface protocol specifications and timing specifications are useful! Not just for checking, which seems a long way off here, but just for understanding. It'd also be useful to have more convenient support for interposing on DCOP communication, by having client-specific message routing (or name-bindings) so that the “phase” hack could be handled a bit more elegantly by cutting off the connection to klauncher.

    [/devel] permanent link contact

    Thu, 19 Feb 2009

    Initialization order

    I just spent an age tracking down a bug in a very simple tool I'm writing using the GNU BFD. It turns out that when opening a file for reading with BFD, various API calls are not valid until bfd_check_format_matches has been called and has returned successfully. This doesn't appear to be documented in the current texinfo manual. In fact, even the need to call bfd_init is not very clearly documented---but at least that one is fairly predictable. Most libraries, particularly those which maintain a lot of state behind the scenes, have an init function. Additional initialization steps, like the one I ran afoul of, are not at all guessable and should really be clearly signposted in the documentation. It was only by careful examination of what objcopy does that I could track down the bug.

    It's well-known in the research world that interfaces come with protocol constraints, and it'd be nice to check client code against these just as we already do with function signatures. Initialization constraints are by far the most common example of protocol constraint. But rather than just checking satisfaction of these constraints, why should we even have to write the code at all? Why can't our toolchain insert initialization calls in the right places automatically? In fact at least one research linkage tool (Knit) can already do this.

    How far can we generalise this constraint-solving? Usually protocol specifications use a finite-state model of component state, which are clearly good for a wide range of protocols but not all (consider a stack). Of course, we want our checking/solving to be decidable, and more complex models quickly lose this property, although I'm not sure whether a pushdown automaton model would necessarily be undecidable. More practically, Knit essentially supports a single bit of state (initialized or not) for each component (or actually a tri-state, because it supports finalizers too). If we wanted to generalise this even to a general finite-state model, we'd get ambiguity: say we had to call function A one or more times before calling function B, how many calls would we insert? Maybe “the smallest string” would be a sensible policy, but that would be ambiguous in some cases, and it's not clear it's always a good choice anyway. In protocol adaptation, there is invariably some programmer-provided “specification” to answer these questions, usually by constraining the generated automaton.

    The BFD example is unusual in another way, in that it requires a successful call to the init function, not just any call. So our protocol checker/solver has to understand return values (and of course arguments) as well as function names. It's about time I re-examined the research literature on protocol adaptation and worked out how much of it has actually been turned into practical tools that address these issues. In due course I'm hoping to build a practical subset (or superset) of protocol adaptation functionality into Cake. One thing at a time though... first I must master that pesky BFD.

    [/devel] permanent link contact

    Fri, 23 Jan 2009

    Library path strangenesses

    It's about time I began sharing the hard-learnt arcane development knowledge I've managed to pick up. One of the most annoying “features” I've had to reverse-engineer is in the behaviour of Linux's dynamic linker, ld-linux. I have a lot of directories in my LD_LIBRARY_PATH environment variable.

    srk31@font:~/scratch/kde$ echo 

    They all get searched as you'd expect. Each directory is expanded to include searches for various architecture-specific variant subdirectories like tls, sse2, nosegneg etc.

    srk31@font:~/scratch/kde$ LD_DEBUG=libs ldd bin/konsole 2>&1 | head
          5666:     find library=libtinfo.so.5 [0]; searching
          5666:      search path=/home/srk31/scratch/opt/lib/tls/i686/sse2/nosegneg:/home/srk31/scratch/

    (snipped some -- view source for more)

    /lib            (LD_LIBRARY_PATH)
          5666:       trying file=/home/srk31/scratch/opt/lib/tls/i686/sse2/nosegneg/libtinfo.so.5
          5666:       trying file=/home/srk31/scratch/opt/lib/tls/i686/sse2/libtinfo.so.5
          5666:       trying file=/home/srk31/scratch/opt/lib/tls/i686/nosegneg/libtinfo.so.5

    Notice that in my LD_LIBRARY_PATH, /usr/lib and lib are right at the end. Once I tried tweaking the ordering so that some paths came after these -- they were “backup” libraries I'd compiled myself in case the machine I was using didn't have them, but I wanted to pick up any local ones in preference. Unfortunately, this doesn't work. ld-linux will ignore all directories after the first directory it encounters that is part of the “system library path”.

    srk31@font:~/scratch/kde$ LD_LIBRARY_PATH=/usr/lib:${LD_LIBRARY_PATH} LD_DEBUG=libs ldd bin/konsole 2>&1 | head
          5687:     find library=libtinfo.so.5 [0]; searching
          5687:      search path=/usr/lib/tls/i686/sse2/nosegneg:/usr/lib/tls/i686/sse2:/usr/lib/tls/i68
    b/sse2/nosegneg:/usr/lib/sse2:/usr/lib/nosegneg:/usr/lib                (system search path)
          5687:       trying file=/usr/lib/tls/i686/sse2/nosegneg/libtinfo.so.5
          5687:       trying file=/usr/lib/tls/i686/sse2/libtinfo.so.5
          5687:       trying file=/usr/lib/tls/i686/nosegneg/libtinfo.so.5
          5687:       trying file=/usr/lib/tls/i686/libtinfo.so.5
          5687:       trying file=/usr/lib/tls/sse2/nosegneg/libtinfo.so.5
          5687:       trying file=/usr/lib/tls/sse2/libtinfo.so.5
          5687:       trying file=/usr/lib/tls/nosegneg/libtinfo.so.5
          5687:       trying file=/usr/lib/tls/libtinfo.so.5

    I have no idea why it does this, but would guess it's intended behaviour for some reason. It's annoying though.

    Oh, and another gotcha relating to LD_LIBRARY_PATH is that if you're using gdb, it seems you must set solib-search-path to match your LD_LIBRARY_PATH, because gdb seems to ignore the latter (again, probably for some reason or other). And if you ever use a frontend to gdb, like DDD, that probably has its own setting too. There is so much fun to be had with these things.

    [/devel] permanent link contact

    Powered by blosxom

    validate this page