When I was working on my thesis, and to a lesser extent when I was a postdoc, I had the habit of near-daily blogging as a way of thinking aloud. Now that I’m doing a bit more academic work, I thought I might see if I could build a similar habit, albeit on a less frequent basis.
Three bits of progress today:
span_tokenize
NLTK is a pretty handy thing.
I’m working on a toy discourse unit segmenter for our corpus, not so much because we’re interested in the segmentation per se (we are… but first things first). As an extremely crude first pass, I’m trying something fairly stupid, like “whatever NLTK considers to be a separate sentence, I will treat as a discourse unit”. Using the segmenter is easy enough; just call nltk.tokenize.sent_tokenize on a string and voilà, you have a list of sentences.
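For example, with a throwaway string of my own (not anything from our corpus):

from nltk.tokenize import sent_tokenize

text = "Hello, I am a bit of corpus. Why don't you segment me?"
print sent_tokenize(text)
# ['Hello, I am a bit of corpus.', "Why don't you segment me?"]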
Great, you have a list of segments! Now try folding them back into your corpus…
Trouble arises because this segmentation output loses information, particularly the arbitrarily long bits of text (whitespace) between each sentence. We have the usual broader problem of wanting to integrate different layers of annotation in our text, be they human-authored or the output of third-party tools like POS taggers. In short, standoff annotation is our friend, but to be a good friend to standoff annotation, we need to be able to grab text spans for all our annotations.
Luckily, in the case of the NLTK segmenter, this is not the sort of information we have to reconstitute; it was there all along, just dropped by the easy bit of the API. I did a tiny bit of digging into the API and found that I could use the span_tokenize method instead. It’s a bit more fiddly; you have to set up the tokenizer yourself (gist):
import nltk.data
text = "Hello, I am a bit of corpus. Why don't you segment me?"
# load the pre-trained Punkt sentence tokenizer directly,
# so that we can get at its span_tokenize method
tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
for start, end in tokenizer.span_tokenize(text):
    print "%d\t%d\t%s" % (start, end, text[start:end])
# 0 28 Hello, I am a bit of corpus.
# 29 54 Why don't you segment me?
And with spans at our disposal, standoff annotation is just a little bit of interval arithmetic away.
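A minimal sketch of what I mean, with a made-up two-paragraph document (the offset bookkeeping is mine, not anything NLTK does for you): if the string you hand the tokenizer is itself a slice of a larger document, adding the slice’s starting offset to each span gives you document-global standoff spans.

import nltk.data

# a toy document where the text we want to segment is only a slice of the whole
doc = "Title line, left alone.\n\nHello, I am a bit of corpus. Why don't you segment me?"
offset = doc.index("Hello")   # where our slice starts within the document
text = doc[offset:]

tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')

# span_tokenize gives spans relative to `text`; shifting them by `offset`
# turns them into standoff annotations over the whole document
spans = [(offset + start, offset + end)
         for start, end in tokenizer.span_tokenize(text)]

# spans == [(25, 53), (54, 79)], and doc[25:53] gives back
# "Hello, I am a bit of corpus." verbatim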
I managed to run the Glozz annotation platform on our corpus data, something anybody who’s been working on the project for a while should know how to do. Now that I know my way around the corpus a bit and am no longer scared off by the demand for a login (just click “anonymous”, silly), I do too.
Mac users may be interested in my Glozz Mac helper script. Just a silly thing. It might be good to patch their code to allow for better native UI integration, something I’ll get around to basically never.
Anyway, this is useful because it lets me check my work for the main bit of progress:
I’ve extended my educe library to write Glozz XML files from its annotation tree data structures. The code isn’t very nice, but hopefully it will improve as I grow more familiar with Python. I’ve verified that reading a file into educe data structures and writing it back out as XML round-trips: the result is diff-clean, with some minor exceptions. This required extending the model, of course, to account for things like the metadata associated with each annotation.
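For the curious, the check itself is nothing fancy. Here’s a rough sketch of the idea only; read_glozz_xml and write_glozz_xml are hypothetical stand-ins for whatever your reader and writer are actually called, not educe’s real API.

import difflib

def roundtrip_diff(path, read_glozz_xml, write_glozz_xml):
    # read the original file, push it through the reader and writer,
    # and report any lines that did not survive the round trip
    with open(path) as f:
        original = f.readlines()
    regenerated = write_glozz_xml(read_glozz_xml(path)).splitlines(True)
    return list(difflib.unified_diff(original, regenerated,
                                     fromfile=path,
                                     tofile=path + '.roundtrip'))

# an empty result means the file is diff-clean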
Next week, I’m going to see if I can emit my NLTK-derived segments as Glozz XML. My first attempt crashed Glozz, so there are probably a good few things missing here. One of the problems I’m going to need to think about in particular is generating identifiers for things…