Rambles around computer science

Diverting trains of thought, wasting precious time

Tue, 26 May 2015

Polymorphism and observability

[In an earlier post, I talked about debugging, and more generally “observability”, in ML-family languages. Later, I also clarified what I think “polymorphism” most usefully means. This post explores the less obvious relationships between polymorphism and debugging.]

When I talk to language-minded people about debugging code in ML-family languages, they tend to think that the difficulties are something to do with “the type system” and “polymorphism”. This is only half right. Polymorphism does complicate debugging, but it does so in every language, even ones with no “type system” of any kind. (As a sanity check, ask: does BCPL support polymorphism? The answer is clearly yes, at least according to my earlier definition.)

The axe I'm grinding in this series of posts is that “polymorphism” or “type-mumble” is no excuse for a lack of decent observability in ML-like languages. There are no major obstacles to implementing a nicely debuggable ML, and certainly none regarding polymorphism. Intuitively, this makes sense if we remind ourselves that polymorphism is to do with abstraction, whereas observation is done on the concrete: we're observing a concrete program state, laid out in front of us. (Of course, there are some unfortunate decisions taken by existing implementations of ML-like languages that make retrofitting debuggability more difficult than it might be. That's a very different problem!)

A similar viewpoint explains why other kinds of fancy types present no obstacle. Even using fancy features of OCaml or Haskell, like GADTs or type classes, our program state still boils down to a big pile of atoms, sums, products and functions. The innovations in “type systems” have been in reasoning, specifically about safety. This is a question of dynamics: checking at compile time that invalid structures will never be constructed at run time. Observability isn't concerned with dynamics; it's about looking at the present, not the future. All we want to do is decode a static snapshot of the program. (Here I'm excluding the debugger feature of “altered execution”, i.e. enacting debug-time side-effects on the debugged process. How to do this safely is an interesting question, but I'm not going to dwell on it here.)

Can we expose more clearly why polymorphism isn't a problem? As I covered last time, “polymorphism” is a fancy word for deferring specialisation. Specialisation can be done in a compiler or at run time. At run time, specialisation means execution: execution is a specialisation process that culminates in the program's result. We can also think of this process as “decision-taking” in response to input.
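
To make this concrete, here's a tiny OCaml example (the names are just for illustration). The compiler leaves 'a undecided in pair; each call during execution takes the decision.

    let pair x = (x, x)        (* 'a -> 'a * 'a: the choice of 'a is deferred *)
    let _ = pair 42            (* execution decides: 'a = int *)
    let _ = pair "hello"       (* and here: 'a = string *)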

Polymorphism during execution, witnessed in program snapshots, is very different from polymorphism in source programs. In source programs, the whole program is oriented around an unknown future: programs describe dependency on an “input”, supplied later. By contrast, observing a program state at run time is about decoding a present, not a future. Moreover, to help us do that decoding, we can exploit all the decisions that have been taken so far, i.e. all the specialisation that has occurred, during both compilation and execution, to reach the present state. Some of this specialisation can be called “monomorphisation”, because it has taken generic code and applied it in a specific context.

As before, I'll focus on OCaml. The OCaml compiler turns polymorphic source-level functions into generic run-time function objects (instruction sequences). Similarly, for polymorphic data types in source code, the compiler selects a size and layout, independent of any type variables that might be parameterising the definition. As we would expect, this is achieved using indirection: the fixed size and layout ensure that locally, storage can always be allocated and accessed generically. The specialised part of the data—for example, the payload of a list node—is indirected away, using pointers. If we were talking about C, these pointers would be pointers to void. OCaml's source language of data types lets us be more precise about these, by introducing type variables. But that's a meta-level bonus that helps us reason about dynamics. A snapshot of an OCaml program still reveals it as consisting of allocated objects and pointers between them, just like in C.
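
For instance, consider a hypothetical list-like type. Whatever 'a turns out to be, a cell has the same fixed layout, because the payload occupies exactly one word—a tagged integer or a pointer, the ML analogue of a void*.

    (* One header word plus two fields, for any 'a: the payload field is
       always a single word, either a tagged integer or a pointer to the
       payload object. Generic code can therefore allocate and access
       cells without knowing 'a. *)
    type 'a cell = { payload : 'a; next : 'a cell option }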

Viewed as source code, a program consists largely of functions and data types. But during execution, we have other things too: activations of functions and instances of data types. It's usually these that we want to inspect when debugging. For example, a backtrace is a list of function activations. The heap is a collection of values—instances of some type. (It also contains closures, which are a bit of both; I'll ignore them for simplicity.)

Here is the key observation about polymorphism at run time. Whenever a polymorphic function is activated, or when a polymorphic data type is instantiated, some instantiation of its type parameters is morally decided. “Morally” means that we could define an oracle, using the semantics of the language, that tells us how they are being instantiated. For example, it could tell us that at some given allocation site, we're creating a list of int rather than a list of 'a (whereas the latter is all the source can tell us). Exactly what this means, however, is slightly subtle.
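
As a sketch—with the oracle's answers written as comments, since OCaml gives us no such oracle:

    let singleton x = [x]      (* 'a -> 'a list: one allocation site *)
    let _ = singleton 1        (* oracle: this activation allocates an int list *)
    let _ = singleton true     (* oracle: this one, a bool list *)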

One subtlety is that the code doing the allocation doesn't necessarily know what this instantiation is. That code might itself be generic! So maybe we're building an int list list out of some int lists. The code doing this might only know it's building an 'a list list, but our oracle would still tell us that the allocation “morally” has the more precise type int list list. Another subtlety is that, of course, there's no guarantee at the implementation level that our runtime actually defines any such oracle for us, nor that the compiler has emitted anything that would help us implement one. In the case of OCaml, it does neither—and that's precisely why it's difficult to add debugging support to the current OCaml toolchain.
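
Here's the same subtlety in two lines: nest itself never gets to know the instantiation that the oracle would report.

    let nest xs = [xs]         (* 'a list -> 'a list list: nest only "knows" 'a *)
    let _ = nest [1; 2; 3]     (* oracle: this allocation is morally an int list list *)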

Another subtlety is that the “instantiation” does not necessarily yield something free from type variables. Although our int list list example got rid of all the variables, in some other cases we might find the best we can do is to instantiate 'a with 'b -> 'c, say. But this turns out not to stop us from observing anything we might logically be able to observe. I'll return to this shortly.

One way to make OCaml debuggable might be to implement this oracle directly, by maintaining extra state at run time. Whenever we call a polymorphic function or instantiate a polymorphic data type, we could stash information somewhere that explicitly records how the type parameters are being instantiated. Something quite similar was done a while back in the HashCaml project. Unfortunately, it's a fairly invasive change to the compiler. It's likely to meet resistance via a performance argument: you can think of this as the “frame pointer” debate, but for type information. Pushing around extra information creates a bit more pressure on registers and memory, so typically shaves a few percent off performance. In return, we make observability massively more straightforward. Apparently opinions differ on whether this is a good trade. All I'll say is that if frame pointers are good enough for the Linux kernel, they're good enough for me.
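
To give a flavour of the idea—and this is only a source-level simulation, not HashCaml's actual mechanism; the names ty, describe and singleton_tracked are hypothetical—we can thread an explicit runtime representation of the instantiation alongside a polymorphic call:

    (* A runtime representation of (a few) types, as a GADT. *)
    type _ ty =
      | Int : int ty
      | Bool : bool ty
      | List : 'a ty -> 'a list ty

    let rec describe : type a. a ty -> string = function
      | Int -> "int"
      | Bool -> "bool"
      | List t -> describe t ^ " list"

    (* A polymorphic allocator that also records how 'a was instantiated.
       A real implementation would stash this in a side table keyed by
       the allocation; here we just print it. *)
    let singleton_tracked : type a. a ty -> a -> a list = fun ty x ->
      print_endline ("allocating: " ^ describe ty ^ " list");
      [x]

    let _ = singleton_tracked Int 42        (* prints "allocating: int list" *)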

Instead of tracking allocation types up-front, one could do a deeper analysis to recover the same information on demand. If we assume we have a backtrace, that gives us a powerful chunk of context: it describes a nest of function activations. The top-level activation (which would be main(), if OCaml had main()) is always monomorphic, so we should be able to figure out all the subsequent instantiations all the way down the stack. Or, we can flip that around: starting from a given activation, we should be able to figure out any type variable instantiations by looking some distance up the stack, and in the worst case, all the way to the top. Currently this is what my OCaml-implementing colleagues prefer; they expect it can work by looking no more than a few frames up the stack in the common case. The logic involved is basically the same as that of the compile-time type checker—which now needs to be replicated in the debugger and/or the language runtime. That's an annoying chunk of replicated stuff, which I find distasteful. Also, this inference might be expensive—fine for an interactive debugger, but poor for other applications of run-time type information (like serialization routines or a tracing tool, say). The advantage is that it requires fewer changes to the compiler.
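
A small example of the type-level logic involved (not, of course, of the frame-walking machinery itself):

    let rec len = function [] -> 0 | _ :: tl -> 1 + len tl   (* 'a list -> int *)
    let go xs = len xs                                       (* still generic *)
    let () = print_int (go [10; 20; 30])                     (* monomorphic call *)
    (* From inside an activation of len, neither its own frame nor go's
       pins down 'a; walking up to the monomorphic top-level call does:
       'a = int. *)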

A third option would be to relax our aim of recovering source-level types. In practice, we don't necessarily care, at debug time, that we're looking at an int list. It might be enough to look at each list node individually, seeing that it's a Cons, and then, separately, discover that each Cons points to an int. In this way we've avoided “typing” at the same granularity that the OCaml language does typing, but we've still recovered a somehow “typed” view of the program (i.e. one interpreted in terms of source-level data types). Put differently, source-level types like list encode information about a whole structure, spanning multiple allocations. Perhaps all we really need is a piecewise, per-allocation view. Currently, OCaml's tagged-pointer implementation ensures that at word granularity, we can distinguish integers from pointers. That's not enough, because we can't, say, distinguish the first variant of type T from the first variant of type U, nor from the integer 0: all are encoded as a zero word. But if we add local tracking of ADT variants and a few other things, that might be enough for observability purposes, and would be less invasive than a full HashCaml-style solution. I find this promising, although I'm still working through the consequences.
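
We can see the ambiguity directly, using OCaml's unsafe, implementation-exposing Obj module:

    type t = A | B
    type u = X

    let () =
      (* All of these are immediates (tagged words), not heap pointers... *)
      assert (Obj.is_int (Obj.repr 0));
      assert (Obj.is_int (Obj.repr A));
      assert (Obj.is_int (Obj.repr X));
      assert (Obj.is_int (Obj.repr []));
      (* ...and the first variant of t, the first variant of u, the
         integer 0 and the empty list are all the very same word. *)
      assert (Obj.repr A == Obj.repr 0);
      assert (Obj.repr X == Obj.repr []);
      assert (Obj.repr B == Obj.repr 1)    (* second variants encode as 1, etc. *)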

Suppose we stick with our oracle-based approach, tracking a source-level type for each allocated value. There seems to be a complication. I mentioned that type parameters are decided at instantiation points, but also that we might only be deciding that 'a becomes 'b -> 'c, say—we're not fully monomorphising them. This makes sense, and just reflects the nature of functions. Suppose we have a list of functions. A perfectly valid such list might contain the list head function hd. That's a generic function of type 'a list -> 'a. When we instantiate our 'a list to one that can hold this function, we've specialised type parameter 'a to 'b list -> 'b. Our list is still polymorphic: we haven't got down to a monomorphic type. Does that mean we're lacking the ability to observe something in our program state? The answer is a resounding “no”! I mentioned that when debugging, we're looking at the present and not the future. The polymorphism in hd encodes the unknown future: we don't yet know what types of arguments the functions in the list will be applied to (it hasn't happened yet!). So, these polymorphic-at-run-time values in our heap represent the residual genericity in delayed computations, i.e. in functions. Functions encode things our program hasn't done yet, but might. They don't present an obstacle to decoding the current program state. In practice, any function has a name, even if it's a fake one generated by the compiler from the source code coordinates of a lambda. If we're in a debugger, getting the name of that function (or those coordinates) is comfortably good enough.
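
Concretely (fs is just an illustrative name):

    (* The element type of fs is instantiated — but only to 'b list -> 'b,
       which still contains a variable. *)
    let fs : ('b list -> 'b) list = [List.hd]

    (* The residual genericity is the future: 'b gets decided only when one
       of these functions is eventually applied. *)
    let _ = (List.hd fs) [1; 2; 3]          (* here, 'b = int *)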

There's a final apparent complication. What about the empty list, or the null pointer? These seem to be polymorphic values. But unlike functions or data types, they're not going to get specialised further by activation or instantiation. The simplistic answer is that these values are okay because they're degenerate cases. It's not a practical loss of observability at run time if we can't answer the woodchuck-esque question of “what type of non-empty list would this empty list be if it wasn't empty?”. A more subtle answer is that these values aren't really polymorphic at all. If we think of how we would define the data type 'a list, we see that the Nil constructor, viewed in isolation, isn't polymorphic—it doesn't use 'a. In the context of this constructor, 'a is a “don't care” or ignored argument. An unparameterised constructor is only vacuously polymorphic: its meaning doesn't actually depend on the parameter. (This view, which sees constructors as somewhat independent of the type definition that encloses them, is one which OCaml's polymorphic variants directly build on.)
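
For example—with polymorphic variants making the constructor-independence explicit:

    (* In an ordinary ADT, Nil is only vacuously polymorphic: it ignores 'a. *)
    type 'a mylist = Nil | Cons of 'a * 'a mylist

    (* With polymorphic variants, `Nil exists independently of any enclosing
       type definition, and belongs to many otherwise unrelated types. *)
    let _ : [ `Nil | `Leaf of int ] = `Nil
    let _ : [ `Nil | `Node of string ] = `Nil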

Finally, I alluded to some parallels with C. Just as the pointers which allow generic layouts for ML data types are equivalent to void pointers, so we have a similar problem when debugging C code: what's on the end of a void*? If I'm looking at a generic linked list node in my debugger, say, the debugger won't let me follow the pointer to the payload data. For that, we would need some run-time service that can look up some metadata about arbitrary memory locations and tell us what's stored there. Java-style VMs solve this problem using object headers. Clearly we don't have this in C; we need some extra infrastructure to answer these questions. I've been working on it: it's called liballocs. By dynamically tracking allocations in our running program, and using some carefully crafted associative data structures, we can build up a fast mapping from arbitrary pointers to metadata about the pointed-to allocation.

In fact the reason I got interested in this topic was that I wanted to make liballocs understand allocations made by OCaml programs. One of the complications liballocs has to deal with is polymorphic allocation sites. These sometimes occur in C code. For example, we might malloc() an array of generic void*, say, but actually use it to hold some specific kind of pointer. Genericity like this is occasional in C, and commonplace in ML. But there's no fundamental difference: code can be generic regardless of whether our source language includes a type language for describing that genericity. Genericity itself is what makes debugging tricky, because it indirects away concrete details in a way that some implementations (both C and OCaml) make hard to recover at run time. The presence of a fancy type system isn't the problem.


