UIMA annotation model

Posted on 12 August 2013

About

As part of my work on the educe library, I’m doing a small survey of annotation models used in a handful of text processing and annotation frameworks. So far, I’ve covered Glozz, the annotation tool used in our project; and GATE, a pluggable text processing framework.

The next model of interest is the one used by UIMA, the Unstructured Information Management Architecture. I tend to think of UIMA as being a generalisation of GATE. It was initially developed by IBM in 2001 to improve the interoperability of text (or other modality) analysis components scattered throughout IBM at the time, and apparently to good effect, if you recall IBM Watson defeating a couple of human stars at the gameshow Jeopardy! (2011). Somewhere in between, UIMA became an Apache project and an OASIS standard to boot.

Type systems

My first impression of UIMA is a sense of extreme generality: you can have annotations on any content using any mode of communication, eg. text, audio, video. Behind this generality is the fact that the UIMA architecture is designed around object-oriented type hierarchies: for whatever annotations you make, you define your own classes and subclasses, but they must conform to some minimal specification.

Confusingly enough, the UIMA documentation uses the term “type system” to refer to a specific set of classes and their relationships, as opposed to the usual programmer understanding of it as the overarching typing rules and features (eg. whether your type system has subtyping, generics, etc…). In the UIMA world, the idea is that different kinds of documents have different kinds of annotations, and thus different “type systems” (sets of classes), and if you want to share annotations, you provide an encoding of your type system to go with them. So how does UIMA model such type systems then? As with many OO languages, it makes a distinction between entities on the type level (classes and features), and those on the data level (objects and slots):

classes: single inheritance, have multiple features
features: have a type (a class), and cardinality (lower and upper bound); features come in two varieties: attributes, whose type must be one of the primitive classes; and references, whose type is a class
objects: instances of classes, have a slot corresponding to each feature defined by their class
slots: a pair of a feature and some values (a list?). The number of values and the nature of their contents are determined by the feature’s cardinality and type. If the feature is an attribute, the values must be primitives; likewise if it’s a reference the values must be objects
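
To make the type level / data level split a bit more concrete, here is a minimal Python sketch of the four notions above. It is only my reading of the model, not UIMA’s actual API: the names Feature, TypeClass, Obj and set_slot are made up for illustration, and the primitive names are placeholders rather than the official UIMA identifiers.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Illustrative primitive class names; not the official UIMA identifiers.
PRIMITIVES = {"String", "Boolean", "Integer", "Float"}


@dataclass
class Feature:
    name: str
    type_name: str                # name of the class its values must belong to
    lower: int = 0                # cardinality lower bound
    upper: Optional[int] = 1      # cardinality upper bound (None = unbounded)

    @property
    def is_attribute(self) -> bool:
        # Attributes hold primitives; any other feature is a reference.
        return self.type_name in PRIMITIVES


@dataclass
class TypeClass:
    name: str
    parent: Optional["TypeClass"] = None          # single inheritance
    features: List[Feature] = field(default_factory=list)

    def all_features(self) -> List[Feature]:
        inherited = self.parent.all_features() if self.parent else []
        return inherited + self.features


@dataclass
class Obj:
    cls: TypeClass
    slots: dict = field(default_factory=dict)     # feature name -> list of values

    def set_slot(self, name: str, values: list) -> None:
        by_name = {f.name: f for f in self.cls.all_features()}
        if name not in by_name:
            raise KeyError(f"{self.cls.name} has no feature {name!r}")
        feature = by_name[name]
        too_few = len(values) < feature.lower
        too_many = feature.upper is not None and len(values) > feature.upper
        if too_few or too_many:
            raise ValueError(f"{name!r} takes {feature.lower}..{feature.upper} values")
        # Attributes take primitive values, references take objects.
        if any(isinstance(v, Obj) == feature.is_attribute for v in values):
            raise TypeError(f"wrong kind of value for {name!r}")
        self.slots[name] = list(values)


token = TypeClass("Token", features=[Feature("pos", "String")])
noun = TypeClass("Noun", parent=token, features=[Feature("head", "Token")])

dog = Obj(noun)
dog.set_slot("pos", ["NN"])            # attribute slot: primitive value
dog.set_slot("head", [Obj(token)])     # reference slot: object value
```

The sketch checks cardinality and the attribute/reference distinction when a slot is filled, but it does not check that a referenced object actually belongs to the feature’s declared class; a fuller version would walk the inheritance chain for that.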

Note that this model can be captured in the Eclipse project’s ECore modeling language ¹, which also provides a file format to express UIMA “type systems”.

Primitives (base type system)

UIMA in its great generality does not mandate the use of any type system in particular; however it does provide a base type system from which other type systems can be derived and which is assumed to be common across all UIMA-compliant analytics, applications, and frameworks (sounds effectively mandatory to me!). The base type system includes some primitives (strings; bools; bytes; 16, 32, 64 bit ints; 32, 64 bit floats), plus the EObject type subclassed by everything including primitives.

Annotations (base type system)

Aside from the primitives, the base type system has the notion of an annotation. An annotation in the UIMA world is

an object (ie. arbitrary)
with regional references (eg. offsets)
on a subject of analysis, “sofa” for short (eg. some text)

In other words, we have the usual notion of standoff annotation on a document using offsets, but generalised so that instead of text we’re working on “sofas”, and instead of offsets, we’re working on “regional references”.

In the UIMA base type system, these annotations are represented with the Annotation and SofaReference classes. Basically each annotation has a pointer to the content it’s annotating. The SofaReference serves as that pointer and can be shared by annotations, which keeps things nice and factored out. Of course, merely pointing is far too sparse a notion of annotation to be useful in practice, so you would want to subclass it and supply the actual annotation data as well as the offsets.
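
As a rough illustration of that pointer-plus-subclass arrangement, here is a small Python sketch. Annotation and SofaReference are the names used above, but their fields here are my own guesses; TextSpan, with its begin/end/label fields and covered_text method, is a hypothetical subclass invented for the example, and the sofa is simplified to a plain string.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SofaReference:
    """Pointer to the subject of analysis (here simply a string of text)."""
    sofa: str


@dataclass
class Annotation:
    """Bare annotation: it only knows which sofa it is about."""
    sofa_ref: SofaReference


@dataclass
class TextSpan(Annotation):
    """Hypothetical subclass: character offsets act as the regional reference."""
    begin: int
    end: int
    label: str = ""

    def covered_text(self) -> str:
        return self.sofa_ref.sofa[self.begin:self.end]


# One SofaReference shared by several annotations.
ref = SofaReference("UIMA generalises GATE")
name = TextSpan(ref, 0, 4, label="framework")
verb = TextSpan(ref, 5, 16, label="verb")

print(name.covered_text(), "/", verb.covered_text())
```

For another modality, the subclass would presumably swap the character offsets for some other kind of regional reference (say, a time interval on an audio sofa) while keeping the same pointer.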

Putting it together: CAS

So far we’ve discussed the abstract building blocks for UIMA-style annotations, basically a small language of types (classes/features), a handful of primitive types, and an annotation/pointer generalisation of standoff annotation. These blocks can be assembled together to form a CAS (Common Analysis Structure), a sort of self-contained bundle consisting of

a document (called artifact in Uimaese, since it handles any kind of unstructured data, eg. video), perhaps represented as a URI?
annotations and metadata (UIMA considers annotations to be part of the document metadata)
the type system used by the annotations, perhaps represented as a link

There are a couple of details worth noting. First, keeping in mind the extreme generality of UIMA, these artifacts may not necessarily be single documents, but could also potentially be multiple objects (of different modalities even), for example perhaps an audio file and the text transcription it is paired with. At the end of the day, all we have is an object representing some content, and some annotations that refer to pieces of that object. Second, as a sort of general design pattern, UIMA suggests that annotations be grouped together into a set of views on their artifacts, which provide a particular interpretation or perspective on that artifact (for example, secret vs open annotations on a document, annotations on different language translations of it, or even just a subset of the annotations like “all the pronouns”). The notion of a view seems like a relatively minor point, but it’s useful to know in general that the pattern is there.
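
Here is how I picture the bundle and its views, as a minimal Python sketch. This is not how UIMA implements a CAS: the Cas class, its add and view methods, the dictionary-shaped annotations, and the example.org link are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Cas:
    artifact: str          # the content itself, or perhaps a URI pointing to it
    type_system: str       # stand-in for a link to the type system
    annotations: list = field(default_factory=list)
    views: dict = field(default_factory=dict)   # view name -> annotation indices

    def add(self, annotation, *view_names):
        """Store an annotation and register it in zero or more named views."""
        self.annotations.append(annotation)
        index = len(self.annotations) - 1
        for name in view_names:
            self.views.setdefault(name, []).append(index)
        return index

    def view(self, name):
        """Return the annotations grouped under the given view."""
        return [self.annotations[i] for i in self.views.get(name, [])]


cas = Cas(artifact="She saw it.", type_system="http://example.org/types.ecore")
cas.add({"begin": 0, "end": 3, "type": "Pronoun"}, "pronouns")
cas.add({"begin": 4, "end": 7, "type": "Verb"})
cas.add({"begin": 8, "end": 10, "type": "Pronoun"}, "pronouns")

for ann in cas.view("pronouns"):
    print(cas.artifact[ann["begin"]:ann["end"]])
```

In this sketch a view is nothing more than a named subset of the annotations, which matches the “all the pronouns” example; views over different translations or modalities would need each view to carry its own content as well.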

Overall, CASes are basically what UIMA components produce and consume; they have a standard XML representation based on the XMI specification. Perhaps given its generality and its reliance on pre-existing standards before it, it would be good to consider the UIMA format for storing CASes as an alternative to inventing a new annotation format? There would be a bit of overhead to pay, particularly finding or defining a type system to capture the annotations being used, but it sounds like this is work that would have to be done anyway.

Questions

I’m not entirely sure what lessons I can draw from UIMA for understanding how to do and represent annotations. I suppose the idea that offsets can be generalised into some unspecified notion of Reference is useful, and that being explicit about types makes sense.

Otherwise, I’m not entirely clear about what UIMA has to say (if anything) on issues related to how annotations are exploited/combined, and how annotations from different sources interoperate. The explicit type system has something to do with it. Presumably, the ability to publish your types and the desire for interoperability puts a little bit of pressure on folks doing similar kinds of annotation to converge on the same type systems.

I’d also be curious if UIMA has anything to say in particular about complex annotations, for example annotations on annotations (as with relations and schemas in Glozz), or hierarchical structures like parse trees. Searching around a bit, I found this paper on a fairly extensive-looking UIMA type system for clinical NLP applications (note also its implementation in the cTAKES system), which seems to cover syntactic structure and provides a notion of relations. Here, annotation objects just straightforwardly point to other annotation objects in their slots. This seems very useful to build off and is the sort of thing that the UIMA model would allow for, but I do wonder how it fits with, for example, the idea that annotations should have a subject of analysis, and a regional reference of some sort. Does the same apply to these annotations in a sensible way? Is the sofa of a parse tree constituent some piece of text, or some annotation, or something else?
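
To make the question concrete, here is a small Python sketch of an annotation whose slots point at other annotations. The Span and Relation classes are hypothetical, and the extent method is just one convention I could imagine (the smallest span covering both arguments); nothing here is taken from the clinical type system or from UIMA itself.

```python
from dataclasses import dataclass


@dataclass
class Span:
    begin: int
    end: int
    label: str


@dataclass
class Relation:
    label: str
    source: Span   # reference slot: the value is another annotation
    target: Span   # reference slot: the value is another annotation

    def extent(self):
        # Assumed convention, not UIMA's answer: the relation has no offsets
        # of its own, so derive a region from the spans it points to.
        return (min(self.source.begin, self.target.begin),
                max(self.source.end, self.target.end))


text = "Mary left because John arrived"
effect = Span(0, 9, "clause")      # "Mary left"
cause = Span(18, 30, "clause")     # "John arrived"
rel = Relation("explanation", source=cause, target=effect)

begin, end = rel.extent()
print(text[begin:end])
```

Whether a derived region like this counts as a regional reference in the UIMA sense, or whether the relation should instead treat its argument annotations as its subject of analysis, is exactly the part I’m unsure about.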

Summary

What makes the UIMA annotation model distinct as far as I can tell is (1) an effort to be as general as possible, through (2) an explicit notion of an extensible type hierarchy for annotations which (3) is encoded/exchanged along with the annotations themselves. In other words, UIMA annotations come with blueprints.

Comparing UIMA with other annotation models, I find that the four pivot points I’ve been using so far are not as helpful given UIMA’s agnosticism about everything.

substrate : unspecified (can be anything)
typology : annotations are arbitrary objects with slots (features), constrained by a type system local to its CAS
spans : unspecified
features : typed (either a primitive or some reference type), can be associated with multiple values

The layering seems to be something along the lines of

CAS : document(s) (artifact), annotations, metadata
annotation: some features, reference to content
features : rich attribute-value pairs

Also, the documentation mentions a couple of case studies demonstrating the generality of UIMA. The UIMA and GATE folks have produced an interoperability layer for example, making it possible to transparently recycle a UIMA analytic as a GATE processing resource and vice-versa. The UIMA folks also did a similar experiment importing OpenNLP components, although with a bit less generality (each component had to have its wrapper defined separately).

Acknowledgements

Most of my understanding of UIMA comes from a combination of Ferrucci et al.’s 2006 Towards an Interoperability Standard… IBM Research Report, and a higher-level overview from Ide and Suderman’s 2009 Bridging the Gaps….

¹ I believe ECore is a superset of the UIMA object model in that the former supports multiple inheritance.