Rambles around computer science

Diverting trains of thought, wasting precious time

Thu, 24 Jul 2008

More rambles of the form “what does X mean?”

At Kathy's talk a couple of weeks ago, the question arose of just what people mean by “scripting language”. Kathy said that she probably meant an “untyped” language. Alan suggested it might mean languages with a read-eval-print loop. In my survey on software adaptation (in submission), I define them as languages characterised by “trade-off of safety and performance for brevity and dynamism”. “Any interpreted language” was also suggested, together with the immediate and contradictory suggestion that ML was also a scripting language. I think my definition, although vague, is the best among these -- but then I would.

What about “typed”? Together with “safe”, “type” is a word whose vagueness has begun to irritate me recently. They're both very imprecise, yet a lot of people, not least programming language researchers, will use them with little concern for making their intended meaning clear, at least to a questioning outsider such as me. I'm on the side of Pierce (in his “Types and Programming Languages”) in claiming that “types” are classifications of “phrases” or (my word) expressions in program representations, and are therefore firmly a static entity. It follows that values in a running program do not have types, and that, as Pierce says, “dynamically typed” is a misnomer.

Expressions denoting values do have types, of course, and since most concrete values found in running programs also have trivial literal expressions in most programming languages, it's easy to extend the notion of types to include values in this way. We can talk about the “value set” of a type, meaning the set of values which could possibly be denoted by an expression of that type, and we can then trivially classify values by whether or not they fall within that set. Personally, however, I find these concepts different enough that I'd rather we had a word other than “type” for this latter property of values... any suggestions?

People often describe C and C++ as “weakly typed”. They sometimes mean that the type system is relatively inexpressive -- true for C but hardly for C++. It can also mean that the type-checker can be manually and explicitly overridden, risking undefined behaviour -- this is true of both languages, and relates to “safety” as I will discuss shortly. Or it can mean that they allow many “implicit conversions”, effectively relaxing the typing rules in some way. Some of these implicit conversions may, in the case of C, risk undefined behaviour -- whereas one of C++'s key improvements is largely to exclude this latter case, by, for example, prohibiting making a pointer from an integer without a cast. This forces the overriding to be explicit, and it therefore becomes the programmer's problem.
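
A tiny sketch of that difference (the file name and values are mine, and exact diagnostics vary by compiler): compiled as C, typical compilers accept the commented-out initialisation with only a warning, whereas any C++ compiler rejects it, leaving only the explicit-cast form.

    // cast_sketch.cpp: illustrative only
    int main() {
        // int *p = 12345;         // implicit int-to-pointer conversion: typically a warning in C, an error in C++
        int *q = (int *) 12345;    // the explicit override: accepted by both languages, and now visibly the programmer's risk
        return q != nullptr;       // inspecting the (implementation-defined) value; dereferencing it would be another matter
    }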

These “conversions” are clearly operations on values, being performed at run time, but also frequently modify the type of the expression in which they statically appear. To understand this, I separate conversions into three main classes: coercions, reinterpretations and value assertions. “Coercion” is yet another word used sloppily, but in my internal lexicon it's an operation which attempts, dynamically, to produce a value of equivalent meaning but possibly differing representation from its argument. The typing rules associated with coercion expressions, such as (float) 42, formalise the implementation-enforced fact that the value resulting from such an expression will always fall within the target type's value set. By contrast, consider another class of expression, which I, like C++, call a reinterpretation, for example (int *) 0xdeadbeef or (unsigned) -1. Since language specifications rarely state much about how values should be represented, these conversions almost always result in implementation-dependent code. Of course, that's not always a bad thing.
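
To make the first two classes concrete, here is a small sketch (mine, in C-flavoured C++, reusing the expressions above):

    #include <cstdio>

    int main() {
        float f = (float) 42;          // coercion: equivalent meaning, different representation (42.0f)
        unsigned u = (unsigned) -1;    // reinterpretation: the result depends on how wide unsigned is on this implementation
        int *p = (int *) 0xdeadbeef;   // reinterpretation: an implementation-dependent pointer value
        std::printf("%f %u %p\n", f, u, (void *) p);
        return 0;
    }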

The third kind of conversion, which I call a “value assertion”, does not change any run-time representation, but does depend on run-time metadata being available to verify the assertion. The assertion simply checks that a value's meaning satisfies some predicate, where that predicate is normally specified, statically, as a type. So, if the value of the argument is one which would, if expressed statically within a program, satisfy the specified type, then the assertion succeeds. This naturally entails that a valid type of an expression containing a value assertion is the type used as the predicate -- since, if the expression is successfully evaluated, the resulting value will always satisfy that predicate. Checked downcasts are the classic example of this conversion.
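
Here is the same idea sketched with C++'s checked downcast (the class names are mine; a Java downcast throwing ClassCastException would illustrate it equally well). The dynamic_cast consults run-time metadata and leaves the value's representation alone: it only checks that the value satisfies the predicate named by the target type.

    #include <iostream>
    #include <typeinfo>

    struct Animal { virtual ~Animal() = default; };
    struct Dog : Animal { void bark() const { std::cout << "woof\n"; } };
    struct Cat : Animal {};

    int main() {
        Animal *a = new Dog;
        if (Dog *d = dynamic_cast<Dog *>(a))   // assertion succeeds: the value really does satisfy "Dog"
            d->bark();

        Animal *b = new Cat;
        try {
            dynamic_cast<Dog &>(*b);           // assertion fails: throws std::bad_cast
        } catch (const std::bad_cast &) {
            std::cout << "value assertion failed\n";
        }
        delete a;
        delete b;
        return 0;
    }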

This latter conversion is simply an extension, trading a little performance for a little flexibility, of what has historically been the purpose of type systems in programming languages, which is to ensure, statically and conservatively and efficiently, that value representations are never manipulated incorrectly. Here “incorrectly” refers, again conservatively, to manipulations which may result in undefined behaviour. It is this serious class of error -- the kind of error which can manifest itself in arbitrary ways, from segfaults to Byzantine failure -- to which “safety” often refers, especially when people describe C and C++ as “unsafe”. When Vijay Saraswat wrote that “Java is not type-safe” he was referring to a bug in class loading (now long-since fixed) which allowed exactly this class of error to occur.

However, what held historically does not hold currently, and this is not the only definition of safety. In the Java-speaking world, where this “undefined-behaviour safety” is now taken for granted, “safety” is used to refer to things variously called “type errors”, ClassCastExceptions or, using my previously outlined terminology, failed run-time assertions. I have quite an interest in language interoperability (though I wish there were a better name for it), and in doing related reading, it's this latter kind of error that Wadler and Findler are talking about when they say that “Well-typed programs can't be blamed”. Among my handful of grumbles about this paper, and some of the other programming languages literature, is that the authors bandy about terms like “safety”, “types”, “casts” and “coercion” with a vagueness which it shouldn't take a systems researcher like me to correct. In general, it is meaningless to talk about “safety” without also saying to what class of error the safety applies, since one can conceive of many classes of error, such as deadlock or starvation or nontermination, from which a program can usefully be said to be “safe”. I can live with it if you're talking about the traditional kind of safety, i.e. “undefined-behaviour safety”, but if you mean another one you really should say so.

(Sometimes, safety can refer to a risk that is not “undefined behaviour” but just “possibly surprising behaviour”. For example, in the context of “type-safe linking”, it would be thought unsafe if I declared a variable as a pointer in one module and as an integer in another module. Yet the C machine model entails that there is a useful set of operations which can be performed both on pointers and on integers. As such, I might write a program in this way whose behaviour is perfectly well-defined. However, it would normally be considered “unsafe” because the behaviour is likely not to be what was intended.)
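
To make that concrete, here is a sketch with file and symbol names of my own invention: each file compiles cleanly on its own, and a typical Unix linker resolves “flag” without complaint. The program only tests the value against zero/null, an operation meaningful under both the integer and the pointer view, so on a common-or-garden platform it does just what it looks like it does; the “unsafety” is that the mismatch is almost certainly unintended.

    // ---- module_a.cpp ----
    unsigned long flag = 0;        // defined here as an integer

    // ---- module_b.cpp ----
    #include <cstdio>
    extern void *flag;             // declared here as a pointer
    int main() {
        if (flag == nullptr)       // a null/zero test makes sense for both views
            std::puts("flag is unset");
        return 0;
    }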

My other major grumbles about that paper are twofold. One is that they don't actually explain what their target problem is. They want, they say, to “integrate static and dynamic typing into a single framework”. Leaving aside my hatred of the word “framework” for now, if you buy what I wrote at the start of this piece you'll know that there's nothing to integrate -- they are already completely orthogonal. (You could, for example, write your Java code declaring every variable as Object and sprinkling downcasts everywhere. Et voilà: “dynamic typing” -- as Wadler and Findler, but neither I nor Pierce, would call it.) It appears that the problem they actually solve is one of proof rather than concept: they introduce a typed calculus including value assertions (just as I described earlier, only they call them “casts”) whose typing rules ascribe one of two flavours of “blame” to each assertion. The flavours of blame are positive and negative, corresponding either to the enclosed (asserted-about) expression or to its context. They prove, for a simple calculus with optional type annotations, that whenever an assertion fails, the blame labels ascribed by the type rules will always trace a path to an expression which “caused” the failure, and that that expression is always “dynamic” in that its type is “Dyn”, the “could be anything” type. This is no doubt a useful contribution, but I wish they had said it up-front, and not confused provability (which is what they enable) with practical implementability (which has been possible for decades).
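
For what it's worth, here is that “declare everything as the top type and sprinkle downcasts” trick rendered in C++ rather than Java (my own analogue, nothing from the paper), with std::any standing in for Object or Dyn and each std::any_cast acting as a value assertion that either succeeds or throws:

    #include <any>
    #include <iostream>
    #include <string>

    int main() {
        std::any x = std::string("hello");
        std::any y = 42;

        std::cout << std::any_cast<std::string>(x) << "\n";     // assertion succeeds
        try {
            std::cout << std::any_cast<std::string>(y) << "\n"; // assertion fails
        } catch (const std::bad_any_cast &) {
            std::cout << "who gets the blame: the cast, or its context?\n";
        }
        return 0;
    }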

The second grumble is that their introduction claims to offer a new insight into subtyping, but if I'm understanding it correctly, the insight is in fact into a new and different notion of subtyping, defined in order to prove the property they care about. Unfortunately it is a notion which appears to contradict existing intuitive notions such as the Liskov substitution principle. Specifically, they extend the “subtyping” familiar to me, a relation on types, i.e. on predicates over expressions, into a relation on predicates over expressions paired with their enclosing contexts. Ultimately it feels as though they are using a logic resembling a type system, but whose predicates range over a different domain (i.e. expressions with contexts, as opposed to just expressions) in order to prove properties which, again, are different from those we normally associate with typing (the absence of some class of errors). This isn't a bad thing to do by any means, but given the already-large space of meanings for “type” I'd rather it wasn't called a type system. The only possible excuse I can think of is that it works out the same in the case of a whole program: since there is no containing expression, the “type” of the whole program is still a predicate solely over the expression which defines that program. But then, whole programs don't really exist anyway, as I may rant about in a future post.

The positive and negative subtyping thing can be thought of as the “type of the expression” versus the “type of the hole”, the hole left when you remove the expression. Normally we only consider the type of the thing, i.e. the expression, by asking: what values might this expression evaluate to? The type of the hole corresponds to a different question: what type of expression, when put in this hole, would make the containing expression well typed? This latter seems like a much more complex question. For example, what if two different expressions, when substituted, both gave well-typed containing expressions, but with different overall types? Unless I'm mistaken, this can easily happen in polymorphic languages, and probably in other cases too -- but my brain is finally reaching its current limits as far as type systems go. Perhaps to an expert, this entire post is trivial and unnecessary. But I may as well throw in my final minor grumble: instead of the unintuitive “positive” and “negative” blame, why not call them “content” and “context” blame? When reading the paper, I kept getting positive and negative mixed up. Perhaps there is a deeper reason why their choices are apposite. If you, dear reader, can shed any light on anything I've been talking about, or correct my poor understanding in any way, please do let me know.
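
To make the question about the hole slightly more concrete, here is the smallest illustration I can come up with, using C++'s auto in place of genuine polymorphism, so treat it only as an approximation: the “hole” is the left operand of + 1, and two different fillings give two well-typed containing expressions of different overall types.

    int main() {
        auto a = 2 + 1;     // hole filled with an int literal: the containing expression has type int
        auto b = 2.5 + 1;   // hole filled with a double literal: the containing expression has type double
        return static_cast<int>(a + b);
    }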

So there you go: a more-or-less complete dump of my mental models and taxonomies concerning “scripting”, “typed”, “cast” and “safe”, and some literature grumbles to boot.
