Rambles around computer science

Diverting trains of thought, wasting precious time

Thu, 14 Oct 2021

Tracing system calls in-process, using a chain loader

Some months back I wrote about a possible better alternative to LD_PRELOAD using chain loading. And I promised a follow-up with more details... so here it is. We'll walk through a little system call tracer built using this technique. Unlike strace and similar, it will do its tracing in-process, without using ptrace() or equivalents. I should add that it's a proof-of-concept only so isn't nearly as good as strace in practice; in particular it can't yet accurately print the arguments to most syscalls. (I have some work towards doing that generatively, using kernel DWARF; more later.)

System call tracing is a good demo of chain loading, for three reasons.

Firstly, it's basically impossible to do with LD_PRELOAD, which works at the higher level of the C library and dynamic linker, so doesn't have sight of system calls.

Secondly, it's close to what people very often want to do with LD_PRELOAD, which is to change the behaviour of particular system calls. No longer do we have to make do with overriding the C library wrappers; we can intercept the system calls themselves, catching certain cases that would not be visible at the C library level.

Thirdly, intercepting system calls specifically provides the foundation for a more general bootstrapping approach to arbitrary binary instrumentation of programs, as I hinted last time. If we can catch all the system calls which introduce new instructions to the address space, and prevent instructions from being introduced otherwise, we can ensure that literally all code is instrumented.

Some years ago, Patrick Lambein and I wrote a library that can patch and emulate most system calls on x86_64 Linux, by a really simple double-trap technique: overwrite the system call with ud2, x86's “always-illegal” instruction, then handle the resulting SIGILL. The handler can do the real syscall or whatever else it wants, before returning. This is not the fastest, but it's reliable and easy(-ish) to implement—at least to 90% of the way. So, for example, if we disassemble the C library's _exit function which once contained the following instructions

44 89 c0                mov    %r8d,%eax
0f 05                   syscall  
48 3d 00 f0 ff ff       cmp    $0xfffffffffffff000,%rax
76 e2                   jbe    0x7f173bf269c0 <_exit+32>

under our library this gets clobbered to instead be

44 89 c0                mov    %r8d,%eax
0f 0b                   ud2  
48 3d 00 f0 ff ff       cmp    $0xfffffffffffff000,%rax
76 e2                   jbe    0x7f173bf269c0 <_exit+32>

which will always cause a SIGILL. Our SIGILL handler scrapes up the syscall arguments into a structure and then passes this to whatever registered handler is installed. For example, a tracing handler, which prints some stuff and then resumes the system call, looks like this.

void print_pre_syscall(void *stream, struct generic_syscall *gsp, void *calling_addr, struct link_map *calling_object, void *ret) {
        char namebuf[5];
        snprintf(namebuf, sizeof namebuf, "%d", gsp->syscall_number);
        fprintf(stream, "== %d == > %p (%s+0x%x) %s(%p, %p, %p, %p, %p, %p)\n",
                calling_object ? calling_object->l_name : "(unknown)",
                calling_object ? (char*) calling_addr - (char*) calling_object->l_addr : (ptrdiff_t) 0,
// what actually gets called from the SIGILL handler
void systrap_pre_handling(struct generic_syscall *gsp)
        void *calling_addr = generic_syscall_get_ip(gsp);
        struct link_map *calling_object = get_highest_loaded_object_below(calling_addr);
        print_pre_syscall(traces_out, gsp, calling_addr, calling_object, NULL);

To ensure that all loaded code gets instrumented appropriately with the ud2s, I've recently expanded this to use a bootstrapping technique: after initially applying our trapping technique to hook mmap(), mremap() and mprotect() calls, our hooks ensure that if the guest application is asking to create any further executable memory, we will instrument its contents before we make it executable.

But how can we get all this code into the process in the first place?. What I did initially was use our old familiar LD_PRELOAD. We preload a library whose initializer stomps over the loaded text, clobbering syscalls into ud2 and installing the SIGILL handler. This is fine as far as it goes, but it misses the very beginning of execution. A tracer doing this cannot trace anything happening before the initializer runs, for example in other libraries' initializers or inside the dynamic linker. What we get is the following.

$ LD_PRELOAD=`pwd`/trace-syscalls.so /bin/true
== 26439 == > 0x7f6fb2a4f9d4 (/lib/x86_64-linux-gnu/libc.so.6+0xc99d4) exit_group(0, 0x3c, 0, 0x7fff766064b0, 0xe7, 0xffffffffffffff80)

It's neat that it looks like true only does one syscall, to exit_group(), as we might think it should. But if we try strace we find that in truth the process does a lot more than that one system call.

$ strace /bin/true
brk(0)                                  = 0x2447000
mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f6ba773c000
access("/etc/ld.so.preload", R_OK)      = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/usr/lib/x86_64-linux-gnu/libfakeroot/tls/x86_64/x86_64/libc.so.6", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
stat("/usr/lib/x86_64-linux-gnu/libfakeroot/tls/x86_64/x86_64", 0x7ffd49ca38e0) = -1 ENOENT (No such file or directory)
... snipped a *lot*! ...
arch_prctl(ARCH_SET_FS, 0x7f6ba7578740) = 0
mprotect(0x7f6ba7732000, 12288, PROT_READ) = 0
mprotect(0x605000, 4096, PROT_READ)     = 0
mprotect(0x7f6ba7765000, 4096, PROT_READ) = 0
exit_group(0)                           = ?
+++ exited with 0 +++

All these other, earlier system calls are happening during dynamic linking, or in general, before our constructor that clobbers the syscalls into ud2. (The strace manual page laments: “it is a pity that so much tracing clutter is produced by systems employing shared libraries”. My take is that it's better to know the truth!)

Let's repackage our code as a chain loader, as described last time. That way it will always run first, and because it's responsible for loading the dynamic linker, we can ensure that all its syscalls are trapped from the very start. This is a non-trivial delta to the donald-based code I mentioned in the last post. After we map the “real” dynamic linker, we also need to instrument its code. We do this by wrapping the enter() function from donald.

void __wrap_enter(void *entry_point)
        /* ... */
        trap_all_mappings(); // only ld.so is loaded right now
        /* ... */

What about code that the dynamic linker loads? Well, we use our bootstrapping approach to catch that. Whenever new code is loaded into the process, some system call is responsible for doing that. So as long as we catch that call, we can trap the system calls at load time. A bit of macro magic helps generate the wrappers, and some elided hackery is required to divine the ELF-level structure of the file being mapped.

#define mmap_args(v, sep) \
    v(void *, addr, 0) sep   v(size_t, length, 1) sep \
    v(int, prot, 2) sep      v(int, flags, 3) sep \
    v(int, fd, 4) sep        v(unsigned long, offset, 5)
    if (!(prot & PROT_EXEC)) return mmap(addr, length, prot, flags, fd, offset);
    // else... always map writable, non-executable initially
    int temporary_prot = (prot | PROT_WRITE | PROT_READ) & ~PROT_EXEC;
    void *mapping = (void*) mmap(addr, length, temporary_prot, flags, fd, offset);
    if (mapping != MAP_FAILED)
        /* ... */
        if (/* needs trapping*/)
            inferred_vaddr = (uintptr_t) mapping - matched->p_vaddr;
            trap_one_executable_region_given_shdrs(mapping, (void*) mapping + length,
                    filename, /* is_writable */ 1, /* is_readable */ 1,
                    shdrs, ehdr.e_shnum,
            // OK, done the trap so let's fix up the protection (always different)
            ret = mprotect(mapping, length, prot);
            if (ret != 0) goto fail;
            /* ... */
        /* ... */
    /* ... */

Making this work on Linux/x86 (32- and 64-bit) required some “interesting” code in a few cases, over and above the chain loader trick. The current version still has some unsupported features (sigaltstack() among them) but can support a lot of stuff thanks to the following gyrations.

The good news is that things mostly work. In particular, I can do:

$ ./trace-syscalls-ld.so /bin/true

and get “full” output analogous to that of strace, instead of the old output that was missing many system calls.

What about correctness of all this? This bootstrapping approach only gets us 100% coverage of system calls on some assumptions.

Assumption 1 isn't true for gettimeofday() in Linux's vdso, which implements it as a read from kernel memory. No “interesting” system calls are implemented that way. Do we really just mean “system calls”, though? One could argue that fault-handling paths are in some ways like system calls; we can't trace entries and exits to/from these, though neither does strace. Signal delivery is also a bit like a return from a system call, and we don't trace that, in contrast to strace (which gets any event ptrace exposes, and that includes signals). We could trace some of it by installing our own handlers for all signals, though that would require a lot of care to preserve the program's original semantics... it starts to become a full-on “transparency” problem, of the kind DynamioRIO's authors have articulated.

For assumption 2a to be true, we must forbid mappings that are simultaneously writable and executable. Some programs, such as the just-in-time compilers of V8 or PyPy, still likes to create executable memory and write into it. If we allow such mappings, we have no synchronous opportunity to modify code before it becomes executable. My best guess is to allow these programs to carry on with an emulation of writable-and-executable mappings, adaptively switching between really-writable and really-executable and handling the segmentation faults in user space. The overhead would be great if done naively, but some kind of adaptive hysteresis might handle it well enough (even maybe as simple as “flip after N faults”).

A more subtle version of the same problem is that we need to avoid distinct simultaneous W and X mappings of the same underlying object, such as a file that is MAP_SHARED. That would have the same effect as a writable-and-executable mapping. Such a check is easy in principle, but requires tracking the inode and device number behind every existing file mapping. (File descriptors are OK because we can fstat() them on demand.) Attempts to remap the same object should probably return -ETXTBSY. I've left that as an exercise for someone else, for the moment.

Assumption 2b brings the issue of overt instructions versus punny instructions. On x86 there is no alignment requirement for instructions, so we can jump into the middle of instructions and thereby do something quite different than intended, perhaps including system calls. For example, the innocuous instruction

 mov    $0x9090050f,%eax

assembles to

 b8 0f 05 90 90

but if we jump one byte into this byte sequence, we do a syscall followed by two nops. Preventing this is tricky. Perhaps we trust code not to do this. Perhaps we built the program using some heavyweight control-flow integrity mechanism (and can also check that dynamically loaded code satisfies its requirements). Perhaps we are on an alignment-preserving architecture (so, not x86). For x86, once we have a sufficiently general jump-based trampoline mechanism, perhaps itself making use of instruction punning, we can forcibly stub out any instruction or instruction pair that internally contains a system call byte sequence. Doing this is sufficient only if our trampoline itself is guaranteed not to include such instructions. Such a guarantee is quite hard when immediates are involved... we would have to forbid them to contain the bytes 0f 05, for example. The same is true for addresses or offsets, but these are easier under PC-relative addressing (we get to choose where to place the trampoline). Most likely we can rely on such immediates being so rare that if we simply clobber the instruction with a ud2 and emulate it with something like libx86emulate, nobody will notice any slowdown.

Erk. That was a deeper dive than I planned. But the take-home is that early experience suggests this seems to be possible. We can intercept system cals using a chain loader, bootstrappingly, and so get control of the program's interactions across the system call interface. This lets us build all manner of tools rather like LD_PRELOAD, but directly at the system call level and with stronger coverage properties.

[/devel] permanent link contact

Powered by blosxom

validate this page