Diverting trains of thought, wasting precious time
In a recent chat with my internal examiner, Andy Rice, I had a few thoughts which I decided to write down. It turns out he reads my blog---along with (his words) “everyone in the department”---so, hi Andy and everyone. One day I might stop writing as if my audience consists only of myself, but not right now.
In summary, I want to rant about two weird things that go on in the research world. One is that certain memes seem to have a real influence on how PhDs are examined, despite having no origin other than folklore and being at odds with the standards used to judge other research. The second rant, and perhaps the more oft-repeated, is that we actively encourage boring research.
(I should add that although this post is rather ranty, the chat was not an argumentative one. So, this is mostly post-hoc ranting about related topics, and not a direct reflection of our conversation.)
A thesis is judged on criteria from folklore, beyond what applies to “normal” research. At various points in my PhD, I heard it said that “a thesis should... [do X]”. Usually, X was something to do with telling a complete story, strongly substantiating a succinct hypothesis, and so on. And now I have heard the same from my examiners. Unfortunately, these statements continue to be just that---hearsay. They're different from the ways in which other research is judged. There are no regulations or official guidance to support them. There are no clear scientific or moral justifications for them either. The research community happily publishes many papers that do not tick these boxes, and at good venues---my own OOPSLA '10 paper is one example, but there are lots of others. Despite this, PhD examination seems to give a lot of currency to these criteria, for apparently no reason other than their having been handed down through the generations.
During my PhD I didn't worry myself much about this, since, like most researchers, I don't put much weight on unsourced claims. Besides, there seemed to be enough data downplaying their significance anyhow---several other theses seemed to break the rules, and plenty of published, respected research papers did too. Surely, if a PhD is training for research, the qualifying criterion should focus on doing good research? From my very limited experience, and from what I gather from listening to others, this is not how things currently are. Fortunately, I am of the bloody-minded type. I was aware that I might be “creating trouble” for myself, but I personally preferred to risk creating that trouble, thereby at least gathering some evidence about it, rather than swerving to avoid an obstacle that was at best nonexistent (I didn't know it would cause trouble) and at worst worth challenging. So, consider it challenged! If you think a thesis needs to be anything more than, or different from, good research, I challenge you to justify that position.
Now, on to my second rant. The evaluability problem has an irrational hold on many practical computer scientists, to the extent that research into many important problems is deliberately avoided. I spoke to many experienced researchers about my PhD work as it went along. Several of them suggested that I might have some trouble at examination. This seemed odd to me, for the reasons I just ranted about. Nevertheless, I didn't disbelieve them. But I had no intention of applying the fix they suggested. Rather than suggesting that I develop an alternative evaluation strategy, or (the best advice, in hindsight) maximise the persuasiveness of the presentation of whatever evaluation data I did have, the only “advice” I ever received on this point was a not-so-veiled encouragement to abandon my current problem and work on something else. “Up and to the right” was what one researcher told me---about the kind of graph that should be in my evaluation chapter. (My evaluation chapter has no graphs, and is staying that way.)
This attitude is the tail wagging the dog. If a problem is important, and we do some research that is not conclusive, we should damn well work harder at it, not give up. The problems and curiosities of humankind are not regulated by how easy it is to collect data and draw graphs about them. If we avoid working on important but difficult-to-evaluate problems, or discourage such work, it shows the worst kind of ivory tower mentality. It is far from a pragmatic position, despite how (I'm sure) many of its adopters would try to spin it. What is pragmatic about ignoring the real problems?
I'm not downplaying the importance of evaluation. It goes without saying that measuring the value of innovations matters, and our ability to measure is something we need to work on actively. After all, many of those physicists and other “hard” scientists seem to spend nearly all their time working out ways of measuring stuff. So I'm completely in favour of rigorous evaluation. On the other hand, I'm not sure that a lot of the evaluation that currently passes muster is really rigorous anyway. We need to recognise evaluation as a problem in its own right, whose hardness varies with the problem---and make allowances for that. For many hard problems, evaluating a solution is comparably hard. That shouldn't mean we give up any attempt to tackle those problems. The preference for conclusive results in published research has a deceptive influence; it is essentially the same phenomenon as the “decline effect”, described in this very interesting article from The New Yorker.
There are some other problems with evaluation in particular kinds of CS research. One is what I call “evaluation by irrelevant measurement”: if you develop something that is supposed to help programmers, but you can't measure that, how about measuring its performance or proving its type-soundness? Neither says anything about whether you've achieved your goals, but they still tick the evaluation boxes. And of course we have a big problem with reproducibility of experimental results---at the VMIL workshop at SPLASH, Yossi Gil gave a great talk about the non-reproducibility of VM-based microbenchmarks, and Jeremy Singer's Literate experimentation manifesto was a nice counterblast to the wider problem.
I have found programming language researchers to be more sympathetic than “systems” researchers to work “towards” a goal, as distinct from work telling a complete story about some problem. This is partly because the nature of programming language research makes reliable evaluation a very high-latency endeavour. In other words, until real programmers have used your idea in a large number of projects, there will be no clear experience of how well it works. So, being computer scientists, we mitigate that latency using pipelining. Rather than a slow stop-and-wait algorithm which waits 20 years between research projects, we have to be more amenable to two things: argument, in the sense of paying attention to the reasoning that justifies the approach of a particular piece of work, and speculation, meaning allowing the research discourse to explore many alternative approaches concurrently and letting time tell which ones will “stick” out of the many that have been given a chance. The job of the researcher is then less to show conclusively that a problem is solved, and more to show that a technique is feasible and has some potential for wide and successful application.
Going back to the first point, perhaps I should add that I'm not claiming my thesis would have stood up any more strongly under “good research” criteria. But having said that, a very large chunk of it appeared at a top-tier venue, so it can't be all that bad. Both of my examiners seemed to miss this fact, so the lesson is: always put a prominent summary of your publications in your thesis! Personally I can be very critical of my thesis work. But it seems bizarre to me that folklore should have so much sway over the way theses are examined.