Rambles around computer science

Diverting trains of thought, wasting precious time

Thu, 17 Mar 2011

Everything is a somehow-interpreted file

My last post ranted about the difficulty of maintaining and debugging the configuration of complex software. I somewhat naively advocated using the filesystem to the greatest degree possible: as the structuring medium as well as a container of data. This is good because it maximises the range of existing familiar tools that can be used to manage configuration. But at least in some cases, the filesystem---as an implementation, although sometimes as an interface too---is manifestly not good enough (e.g. in terms of space-efficiency for small files). Instead, we want to make our old tools capable of accessing, through the filesystem interface they already know and love, these diverse and wonderful implementations of structured data.

I have heard many researchers claim that binary encodings of data are bad, that XML is good, or even that revision control systems are bad ways of storing code, for the same recurring semi-reason that good things are things that conventional tools work with---sometimes the phrase “human-readable” crops up instead---and bad things are things they don't work with. You can search XML or plain source code using grep, or so the reasoning goes; you can't generally do the same for binary or diffed-text or otherwise munged encodings. This argument is classic tail-wagging-dog material. Plain text is only “human-readable” because there is some widely deployed software that knows how to decode binary data representing characters (in ASCII or UTF-8 or some other popular encoding) into glyphs that can be displayed graphically on a screen or on a terminal. If other encodings of data do not have this property, it's foremost a problem with our software and not with the encoding.

Unix is arguably to blame, as it is obsessed with byte-streams and the bizarre claim that byte-streams are “universal”. The myriad encodings that a byte-stream might model are less often mentioned. I'd argue that a major mistake of Unix is its failure to build any descriptive channel into its model of communication and storage. Without this, we're left with what I call the “zgrep problem”: each file encoding requires its own tool, and the system offers no help in matching files with tools. If I have a mixture of gzipped and uncompressed text files in a directory---like /usr/share/doc on any Debian system---recursively searching is a pain because for some files we need a different grep than for others. Some approach combining analogues of HTTP's Content-Type and Accept-Encoding with make's inference of multi-hop file transformations could allow the operating system to transparently construct the correct input filters for searching this mixture of files from a single toplevel instance of our plain old grep tool.
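To make the zgrep problem concrete, here is a minimal sketch in Python of a recursive search that chooses its own input filter per file. It is only an illustration under stated assumptions: the names open_decoded and rgrep are made up for this example, and sniffing the two-byte gzip magic number stands in for the descriptive channel that Unix lacks. It handles exactly one encoding and one hop, so it is nowhere near the general, make-like inference described above.

```python
import gzip
import os
import re

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def open_decoded(path):
    """Open path as text, inserting a gunzip filter when the content asks for it.

    Hypothetical helper: the decision is made from the file's bytes, not its name.
    """
    with open(path, "rb") as f:
        magic = f.read(2)
    if magic == GZIP_MAGIC:
        return gzip.open(path, "rt", encoding="utf-8", errors="replace")
    return open(path, "r", encoding="utf-8", errors="replace")


def rgrep(pattern, root):
    """Yield (path, line_number, line) for each matching line under root."""
    regex = re.compile(pattern)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            with open_decoded(path) as stream:
                for lineno, line in enumerate(stream, 1):
                    if regex.search(line):
                        yield path, lineno, line.rstrip("\n")
```

Something like `for hit in rgrep("copyright", "/usr/share/doc"): print(hit)` would then search both kinds of file in one pass. The catch, of course, is that the filter selection lives inside this one tool rather than in the system, which is exactly the arrangement the paragraph above complains about.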

For this to work we need an actually sane way of describing data encodings. (Call them “file formats” if you will, although I prefer the more generic term.) Later attempts at describing the structure of byte-streams, notably file extensions or MIME types, have been ill-conceived and half-baked, mainly because they are based on a simplistic model where a file has “a type” and that's that. So we get the bizarre outcome that running file on /dev/sda1 says “block special” whereas taking an image of the same disk partition and running file on that will say something like “ext3 filesystem data”. Doing file -s might be the immediate workaround, but in reality, any given filesystem object has many different interpretations, any of which may be useful in different contexts. A zip file, say, is a bunch of bytes; it's also a directory structure; it might also be a software package, or a photo album, or an album of music.
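The /dev/sda1 oddity can be sketched as two separate questions asked of the same path: what kind of inode is it, and what do its bytes look like? The Python below is a rough illustration, not how file is implemented. The function names and the tiny SIGNATURES table are assumptions made for this example; the only facts relied on are the gzip and zip magic numbers and the ext2/3/4 superblock magic 0xEF53, stored little-endian at byte offset 1080.

```python
import os
import stat

# (offset, magic bytes, label): a deliberately tiny, illustrative table.
SIGNATURES = [
    (0, b"\x1f\x8b", "gzip compressed data"),
    (0, b"PK\x03\x04", "zip archive"),
    (1080, b"\x53\xef", "ext2/3/4 filesystem data"),
]


def inode_type(path):
    """Answer the question that plain `file` stops at for device nodes."""
    mode = os.stat(path).st_mode
    if stat.S_ISBLK(mode):
        return "block special"
    if stat.S_ISDIR(mode):
        return "directory"
    if stat.S_ISREG(mode):
        return "regular file"
    return "other"


def content_types(path, limit=2048):
    """Answer the other question: which known signatures do the bytes match?"""
    with open(path, "rb") as f:
        head = f.read(limit)
    return [label for offset, magic, label in SIGNATURES
            if head[offset:offset + len(magic)] == magic]
```

Given suitable read permission, a partition and an image of that partition should give different answers to inode_type but the same answer to content_types. Neither answer is “the type”; both are interpretations, and a tool that reports only one of them is throwing information away.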

Another interesting observation is that these encodings layer on top of one another: describing a way of encoding an album of music as a directory structure needn't target a particular encoding of that structure, but presumes some encoding is used---any that conceptually models the album encoding is sufficient. So we want to capture them in a mixin-like form, and have some tool support for deriving different compositions of them. What would be really neat is if a tool like file, instead of doing relatively dumb inode-type and magic-number analyses, actually did a search for encodings that the file (or files, or directory) could satisfy. Each encoding is a compositional construction, so the search is through a potentially infinite domain---but one that can usually be pruned to something small and finite by the constraints provided by the file data. And by giving it more mixins to play with, as configuration data, it could find more valid interpretations of our data. This sort of discovery process would solve the zgrep problem in far more complex cases than the one I described. Implementing (and, er, somehow evaluating) these ideas might make a nice student project at some point.
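As a toy version of that search, the Python sketch below treats each encoding layer as a function that either decodes a value into the next representation or returns None when it does not apply. Everything here is an assumption for illustration: the layer names, the LAYERS table, the crude “photo album” test based on file extensions, and the depth bound that keeps the search finite. The pruning is done by the data itself, since a layer that fails to decode ends that branch.

```python
import gzip
import io
import zipfile
import zlib


def gunzip(value):
    """bytes -> bytes, if the value is a gzip stream."""
    if isinstance(value, bytes) and value[:2] == b"\x1f\x8b":
        try:
            return gzip.decompress(value)
        except (OSError, EOFError, zlib.error):
            return None
    return None


def utf8_text(value):
    """bytes -> str, if the value is valid UTF-8."""
    if isinstance(value, bytes):
        try:
            return value.decode("utf-8")
        except UnicodeDecodeError:
            return None
    return None


def unzip(value):
    """bytes -> {name: bytes}, i.e. a directory structure."""
    if isinstance(value, bytes) and zipfile.is_zipfile(io.BytesIO(value)):
        with zipfile.ZipFile(io.BytesIO(value)) as zf:
            return {name: zf.read(name) for name in zf.namelist()}
    return None


def photo_album(value):
    """Directory -> sorted list of pictures; note it does not care how the
    directory was encoded, only that some layer produced one."""
    if isinstance(value, dict) and value and all(
            name.lower().endswith((".jpg", ".png")) for name in value):
        return sorted(value)
    return None


# The "mixins": adding entries here lets the search find more interpretations.
LAYERS = {
    "gzip": gunzip,
    "utf-8 text": utf8_text,
    "zip directory": unzip,
    "photo album": photo_album,
}


def interpretations(data, layers=LAYERS, max_depth=4):
    """Breadth-first search for every chain of layers that the data satisfies."""
    found = []
    frontier = [((), data)]
    for _ in range(max_depth):
        next_frontier = []
        for chain, value in frontier:
            for name, decode in layers.items():
                result = decode(value)
                if result is not None:
                    found.append(chain + (name,))
                    next_frontier.append((chain + (name,), result))
        frontier = next_frontier
    return found
```

Feeding it gzipped text should yield both the one-layer chain ("gzip",) and the two-layer chain ("gzip", "utf-8 text"), while a zip of pictures should yield a chain ending in "photo album". A real tool would need a far better way of describing layers than Python functions, but the shape of the search is the same.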

These ideas have been with me right from my earliest forays into research-like work. I've yet to really nail them though. My bachelor's dissertation was about a filesystem exposing arbitrary in-memory data. And my PhD work addressed some of the problems in making software accessible through interfaces other than the ones it exposes as written. I've yet to do anything with persistent data or other IPC channels, but it has been on the list for a long time. (It's become clear with hindsight that this agenda is a characteristically postmodern one: it's one of building tools and systems that provide room for multiple interpretations of artifacts, that don't insist on a grand and neatly-fitting system being constructed in one go, and that accommodate incomplete, open or partially described artifacts.)

The power-management problem that provoked my previous post actually gets worse, because even when not on AC power, closing the lid only usually makes the laptop suspend. Sometimes it just stays on, with the lid closed, until the battery runs down. For my next trick, I may be forced to come up with some bug-finding approaches that apply to scripts, filesystems and the like. If we're allowed to assert that a file has a particular abstract structure, and check that separately, then we can probably factor out large chunks of state space from real software. In turn, that might shift our focus away from “inputs of death”, fuzz-testing and other crash-focussed bug-finding techniques, and towards the harder but more interesting ground of finding functionality bugs. I'll rant about that in a forthcoming post.

[Update, 2011-03-18: I just ran into an interesting example of this same debate, where Ted Ts'o is actively advocating using a database in preference to maintaining lots of small files, for performance and robustness reasons, in relation to ext4's semantic differences from ext3. Understandably, he provokes a lot of discussion, notably here, where people complain that binary files are difficult to maintain. There is a mix of views and a wide variance of clue level, but the real issue is that no one interface---where “interface” includes robustness and performance characteristics---is the globally “right” one.]
