Glozz annotation model

Posted on 3 April 2013
Tags: stac, glozz, corpus

About

To help build the educe package, I’m doing a small survey of the annotation models used by other tools. The hope is to find something I can just reuse, saying “here, we use the Foo model” (and format), thereby saving work and gaining potential interoperability with bits of the research world.

Barring that, I at least want to build some awareness of what kinds of decisions I’ll be making (sometimes without knowing I’m making them). If I can make them on some principled grounds, it increases my chances of future-proofing the work: not painting myself into a corner with some fatally inflexible decision, or conversely creating a conceptual mess through insufficient constraints.

I’ll probably have to revise pieces of this survey as I go along, hopefully getting a clearer idea of what sorts of questions I should be asking myself about the model.

I have four things in mind: Glozz, GATE, UIMA, and a paper by Bird and Liberman that NLTK mentions. Since we’re using Glozz as our annotation tool, I’ll start with that.

Structure

At the very bottom of the Glozz model, we have text (presumably a sequence of Unicode characters). On top of that, you have three kinds of annotation:

units
a contiguous span of text
overlapping, covering, and same-span units are OK
relations
a link between two annotations of any type (ie. units, relations, or schemas)
schemas
a link across an arbitrary set of annotations

It’s worth noting the flexibility you get from relations and schemas being able to point to annotations of any type: you can have, eg. relations between schemas and units, relations between relations, and so on (see the example after the code sketch below). This sort of thing sounds like it might be useful in the context of SDRT (Glozz was developed in an SDRT annotation context, if I understand correctly), where you have graphs formed of relations on utterances, but also relations on the graphs themselves. By default, I’ll assume flexibility (in a model) is a good thing and that it will be up to applications to impose sensible constraints appropriate to their needs.

Payload

Of course, if you’re annotating text you don’t just want to have connected bits of highlighted text, but the ability to say things about those bits.

All Glozz annotations (units, relations, schemas) are labelled with a set of attribute-value pairs. Features are not recursive (so we can think of the values as just strings, for example). Attributes can be constrained to a particular set of values if wanted (eg. colour ∈ { red, blue, green }), but can also be free form.

One thing not discussed is whether an attribute may be associated with more than one value, eg. to indicate disjunction: (‘colour’, ‘red’), (‘colour’, ‘green’). I’m guessing not, both because they’re called “features” (as in “feature structures”?) and from some of the stuff we’re doing in practice.
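
To make the constraint idea concrete, here is a quick Haskell sketch of how an application might restrict an attribute to a closed set of values. The checkAttribute helper is my own invention for illustration (it is not anything Glozz itself provides), and FeatureSet is the same flat attribute-value mapping as in the code sketch below:

import           Data.Map (Map)
import qualified Data.Map as Map
import           Data.Set (Set)
import qualified Data.Set as Set
import           Data.Text (Text)

-- | Flat attribute-value pairs, as in the sketch below
type FeatureSet = Map Text Text

-- | An application-level constraint: restrict an attribute to a closed
--   set of values, eg. colour ∈ { red, blue, green }.  Attributes that
--   are absent (or that we choose not to constrain) pass trivially.
checkAttribute :: Text -> Set Text -> FeatureSet -> Bool
checkAttribute attr allowed fs =
  case Map.lookup attr fs of
    Nothing -> True
    Just v  -> v `Set.member` allowed

-- eg. (with OverloadedStrings)
--   checkAttribute "colour" (Set.fromList ["red", "blue", "green"]) feats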

Code sketch

This is a rough sketch in Haskell of what I think the Glozz annotation model looks like. I’m not that great at modelling, and hope that I’m not mixing up core issues with more implementation-level issues. For now I think it should work as a snapshot of my understanding:

-- Imports needed for the sketch to compile
import Data.Map  (Map)
import Data.Set  (Set)
import Data.Text (Text)

-- | A document has some text and a mapping of identifiers
--   to annotations
data Document = Document Text (Map Guid Annotation)

-- | Annotations can be either units, relations, or schemas.
--   All annotations are associated with some labels ('AnnoData')
data Annotation =
    Unit     TextSpan       AnnoData
  | Relation (Guid, Guid)   AnnoData
  | Schema   (Set Guid)     AnnoData

-- | Character offsets within a text
type TextSpan = (Int, Int)

-- | Every annotation has these in common
data AnnoData = Anno Guid FeatureSet Metadata

-- | The Glozz manual says that identifiers used are unique to
--   the world (more on this below)
type Guid = Text

-- | I'm assuming you can't have repeat attributes
type FeatureSet = Map Text Text

-- | Not really mentioned in the doc
newtype Metadata = Metadata FeatureSet
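
As a quick illustration of the flexibility mentioned earlier (relations that point at schemas, and so on), here is a toy value of this model: a relation whose target is a schema rather than a unit. It assumes the definitions above plus the OverloadedStrings extension for the Text literals; the text, spans, identifiers and feature values are all made up for the example.

import qualified Data.Map as Map
import qualified Data.Set as Set

-- A relation ("r1") from a unit to a schema, where the schema ("s1")
-- groups two other units
example :: Document
example = Document "anna: hi  bob: hello  bob: want sheep?" annos
  where
    noMeta    = Metadata Map.empty
    anno i fs = Anno i (Map.fromList fs) noMeta
    annos     = Map.fromList
      [ ("u1", Unit (0, 8)   (anno "u1" [("type", "EDU")]))
      , ("u2", Unit (10, 20) (anno "u2" [("type", "EDU")]))
      , ("u3", Unit (22, 38) (anno "u3" [("type", "EDU")]))
      , ("s1", Schema (Set.fromList ["u2", "u3"])
                      (anno "s1" [("type", "CDU")]))
      , ("r1", Relation ("u1", "s1")
                        (anno "r1" [("type", "question-answer")]))
      ]

Whether a shape like this makes sense is exactly the kind of constraint I expect to be the application’s business rather than the model’s.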

Identifiers

One detail that may be of interest is the notion of identifiers. By identifier, I think I mean a name: a tiny bit of data that allows us to distinguish bigger blobs of data from each other. In the Glozz model, all annotations are associated with a globally unique identifier. The manual describes Glozz’s approach to generating identifiers (which I trust are opaque, ie. not parsed). Glozz-generated IDs are the tuple of annotator id and Java timestamp (ms since 1970-01-01T00:00Z), eg. ymathet_1290167040405, with global uniqueness seemingly resting on annotator ids being distinct and on no annotator creating two annotations within the same millisecond.
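
For what it’s worth, here is a sketch of my reading of that scheme; this is not Glozz’s actual code (Glozz is a Java tool), just an illustration:

import Data.Time.Clock.POSIX (getPOSIXTime)

-- | A Glozz-style identifier: annotator id plus a Java-style
--   millisecond timestamp, eg. "ymathet_1290167040405"
glozzStyleId :: String -> IO String
glozzStyleId annotator = do
  t <- getPOSIXTime                        -- seconds since 1970-01-01T00:00Z
  let millis = round (t * 1000) :: Integer
  return (annotator ++ "_" ++ show millis)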

This approach may be problematic from the standpoint of automatically derived annotations:

  1. A central registry of recognised tool names would not be flexible enough in practice: it would need to be amended every time you introduce a new tool, and it does not lend itself to small variants between, say, different parametrisations of a tool. Perhaps highly hierarchical naming conventions à la Java packages would do the trick.

  2. Automatically derived annotations can be produced at sub-ms resolution, so the timestamp alone cannot be relied on to tell them apart.

  3. It would be really nice if identifiers had a stability property, ie. that running the same tool on the same data gives you exactly the same results. Spurious trivial differences mean that you can’t easily tell whether two sets of data are the same, at least not with simple generic tools like diff.

Stepping back from this, I guess it’s important not to take the “globally” too literally: what you’re working with is an inherently restricted, task-based scope. You need only be as global as the set of objects that will ever co-exist. So within a single corpus with multiple documents, yeah, you want the annotations within these documents to be unique to the corpus; but you don’t care if they are globally unique so long as you never merge with other corpora. For automatic tools, the trick is to have a mechanism for generating ids that avoids timestamps (I use a counter), but which also provides some assurance of corpus-wide uniqueness (I use the document name).
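
A minimal sketch of that counter-plus-document-name scheme (mkId is just an illustrative name, not something from Glozz or educe):

import           Data.Text (Text)
import qualified Data.Text as T

-- | Deterministic identifier for an automatically derived annotation:
--   the document name gives corpus-wide uniqueness, the counter gives
--   within-document uniqueness, and with no timestamp involved the
--   same tool run on the same data produces the same ids
mkId :: Text   -- ^ document name
     -> Int    -- ^ running counter within the document
     -> Text
mkId docName n = docName `T.append` T.pack ("_" ++ show n)

-- eg. mkId (T.pack "mydoc") 42 == T.pack "mydoc_42"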

Summary

Recapping the Glozz model as I currently understand it: a document is some text plus a set of annotations; annotations come in three flavours (units over contiguous text spans, relations between two annotations of any type, and schemas over arbitrary sets of annotations); and every annotation carries a flat set of attribute-value pairs along with a globally unique identifier.
