Corpus tree surgery

Posted on 18 March 2013
Tags: nlp, corpora, stac, git

Problem: you have a large, unfamiliar corpus under construction which needs to be reorganised. You need to proceed carefully because you’re not entirely sure you understand the problem domain. You also need to account for the fact that the corpus is a moving target.

I can’t claim to have a good solution to this problem, but having worked on it recently, I thought I should share some of the techniques I used and my thinking behind them.

Distributed version control

Cultivate a fussy personality. A little bit of obsessiveness can be helpful here. The kind of person you want to be for this particular job is the kind of person who has opinions about good version control hygiene and who gets upset at meaningless commit logs; that kind of fussy. If this sort of fussiness doesn’t come naturally to you, find somebody to whom it does.

Get better at distributed version control. If the corpus is stored in SVN, use git-svn to create a local git copy. This is crucial: you’re going to make a large number of mistakes along the way, and you need the ability to review your work before checking it in. As a Darcs fan, I’ll add that at the time of this writing, Git is a much better tool for this particular job than Darcs is (mostly for SVN support).

Work in a branch of the corpus. You may find yourself needing to create multiple branches, for example, one where you perform a first draft of the reorganisation that people can look at. Be careful in your naming conventions. If you have something like corpus-reorg and corpus-reorg2, people will (naturally) assume they should look at the latter, when perhaps the second one was work in progress and the former was the stable one. If the corpus is in SVN, use SVN branches. Git-svn can track SVN branches as remotes.

Scripting

Script the reorganisation effort. Resist the temptation to move things around by hand. Instead, write a script that does these moves using version control operations (e.g. git mv foo bar instead of mv foo bar). This provides a few things:

DUBIOUS Use script generation rather than writing the script by hand. This is a technique I used by accident. I started the work in Python (as a way of getting to grips with the language), and initially had it in mind to have a --dry-run mode which emitted a log of things it would do, and a live mode which actually did the moves. Somehow the dry-run mode evolved into the real deal, so that my workflow became using my Python script to generate a shell script.
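To make the idea concrete, here is a minimal sketch of what such a generator could look like. It is not the script I actually used: the function name generate_script and the example paths are hypothetical, and it assumes every move is a plain git mv whose destination directory may need creating first.

```python
import shlex
from pathlib import PurePosixPath


def generate_script(moves):
    """Return the text of a shell script that performs the given moves.

    `moves` is a list of (source, destination) path pairs. Nothing on
    disk is touched here; we only emit text to be reviewed and run later.
    """
    lines = ["#!/bin/sh", "set -e  # stop at the first failing command"]
    made = set()  # destination directories we have already emitted a mkdir for
    for src, dst in moves:
        parent = str(PurePosixPath(dst).parent)
        if parent != "." and parent not in made:
            lines.append("mkdir -p " + shlex.quote(parent))
            made.add(parent)
        # shlex.quote protects paths containing spaces or shell metacharacters
        lines.append("git mv " + shlex.quote(src) + " " + shlex.quote(dst))
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical moves, for illustration only.
    example = [
        ("raw/game1.txt", "game1/unannotated/game1.txt"),
        ("raw/game 2.txt", "game2/unannotated/game2.txt"),
    ]
    print(generate_script(example), end="")
```

Redirecting the output to a file gives you something you can read through, commit, and only then execute with sh.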

I’m not exactly proud of this, but I think one reason I had this two-layer approach is that it allowed me to differentiate my high-level work from the low-level nitty-gritty operations behind it. The thing is that you need to be able to study the log of low-level operations to see if they make sense. It’s also reassuring to be able to save the generated script in version control, as in “this is the actual set of operations I executed when I did the move in r3810”.

The downside to this approach is that it can be difficult to reason about. Your script generation code is navigating the file structure and generating a script that manipulates the file structure. So you need to keep in mind that the script generator does not see any of the manipulations. If you generate code to rename a directory, you’re not actually renaming the directory. Likewise, you also need to keep the opposite thing in mind when thinking like the generated script. This difficulty in two-layer reasoning probably makes it not worth the trouble.
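One way to soften this problem, which I did not try at the time, is to have the generator keep its own model of the tree and update that model every time it emits a move, so that later planning steps see what the generated script will see. The sketch below is an assumption-laden illustration: the class PlannedTree is hypothetical, paths are plain strings, and it assumes the destination never already exists (in which case git mv would move the source inside it instead).

```python
import shlex


class PlannedTree:
    """In-memory model of the file tree as the generated script will see it."""

    def __init__(self, paths):
        self.paths = set(paths)  # file paths, relative to the corpus root
        self.commands = []       # shell commands emitted so far

    def move(self, src, dst):
        """Plan a move of a file or directory, updating the model to match."""
        src, dst = src.rstrip("/"), dst.rstrip("/")
        prefix = src + "/"
        # A directory move affects every file underneath it.
        affected = {p for p in self.paths if p == src or p.startswith(prefix)}
        if not affected:
            raise ValueError(f"{src} does not exist at this point in the script")
        for p in affected:
            self.paths.remove(p)
            self.paths.add(dst + p[len(src):])
        self.commands.append(f"git mv {shlex.quote(src)} {shlex.quote(dst)}")


if __name__ == "__main__":
    # Hypothetical layout, for illustration only.
    tree = PlannedTree(["old/a.txt", "old/b.txt"])
    tree.move("old", "new")                # rename the directory in the model
    tree.move("new/a.txt", "new/c.txt")    # later steps must use the new name
    print("\n".join(tree.commands))
```

Planning a move of old/b.txt after the directory rename raises an error in the generator rather than producing a script that fails halfway through.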

Bits and pieces
