Concept diagrams prototype

2014-06-27T00:00:00Z

Posted on 27 June 2014

For the past 6 months, I’ve had the pleasure of working with the University of Brighton Visual Modelling Group. With only a few days of my time there remaining, I thought I would write a bit about my work: an ontology editor prototype based on a visual logic language.

Note: This will be a fairly lengthy and detailed post covering three Eric-months of work (I was half-time). What I had initially intended as a quick little blog post quickly ballooned into an informal writeup for any future inheritors of this project. You may instead be interested in viewing the screencast, which provides an overview of what the editor does.

Background

The VMG are a friendly group of maths and computer science researchers who are working on visual languages and fleshing out the maths and logic behind them. The work they do combines maths and theoretical computer science (formal properties of the languages) with some human computer interaction research (making visualisations as clear and communicative as possible). One avenue they are exploring in their work is the use of visual languages to represent ontologies. With some theoretical work under their belts, the VMG were looking to further advance their research by producing a working tool that people could use and provide feedback on.

Ontologies and concept diagrams

An ontology is a formal way of describing a part of the world, often as a set of classes (eg. Dog) and their relationships with each other (eg. Puppy ⊑ Dog). There are loads of interesting applications for ontologies, as tools for thinking, communicating, or reasoning about the world. You can imagine ontologies playing a role, for example, in an interactive biology textbook, or medical diagnostic assistants, or an aircraft maintenance guide; all building off complex knowledge about how things interrelate. The downside to all this useful formally structured knowledge is that it can also be extremely complex and detailed; when we make explicit the many facts we would otherwise take for granted, the end result can often be an unwieldy mess. And even if we say that a given ontology is intended to be consumed by computers only, it still has to be built and maintained by humans.

So how do you make a big blob of formal structure fit for human consumption? One possibility the VMG propose is to make use of concept diagrams, a formal graphical notation with classes represented as shapes (circles or rounded rectangles), properties as arrows, individuals as dots, along with other notational details (labels, shading, a representation of the universe). For example, you might represent the class Dog as a rounded rectangle:

To represent the axiom Puppy ⊑ Dog (the class Puppy is subsumed by Dog), you would nest a smaller rectangle within it:

Likewise you might say Dog ⊑ ∀ wag. Tail (dogs wag only tails — such is the trickily precise world of ontologies) with a combination of nested shapes and an arrow:

It may be worth mentioning that what makes these diagrams interesting isn’t so much that they are pictures, or even particularly intuitive at first glance. In fact, being defined by a set of precise rules, concept diagrams actually require a bit of training on how to correctly interpret them (for example, a dashed arrow S ⤏ T asserts that some binary relationship exists between individuals in the set S and all individuals in the set T; and a solid arrow S → T strengthens this to say that it exists between all and only the individuals in T). Having such rules behind the diagrams means that we can more easily link them to the equally formal systems that define ontologies. But as a broader point, mass accessibility need not necessarily be what makes visualisations useful. Perhaps over time, the concept diagram will prove to provide enough clarity and communicative power to make it worth the learning curve for people working with ontologies.

WebProtégé

Having fleshed out and pinned down the behaviour of the concept diagram formalism, the VMG wanted to turn their LaTeX and whiteboard thinking into actual working code that people could try and use to build ontologies, which is where they brought me on board ¹.

One of the technical goals of the project is for the plugin to work with the WebProtégé ontology editor. (You may be more familiar with the desktop editor Protégé, which, while more mature, is slated to be phased out in favour of its web cousin.) WebProtégé is a powerful and extensible ontology editor that offers a wide range of ontology-editing components that can be added to and removed from the editor according to need.

The basic class and property components present a collapsible tree structure (to convey the subsumption hierarchy), which users can modify by dragging classes from one part of the tree to another. Selecting a class opens it in a class editor widget which allows you to modify its properties (relationships with other classes) and annotations. There are also more powerful components like the new OWL Entity Description Editor which make it possible to express more complex axioms such as Dog ⊑ ∀ wag. Tail. From a user interface standpoint, this division of WebProtégé into essentially standalone components makes it natural for us to develop the concept diagram prototype as yet another component that users could select.

Design

Pattern library

Concept diagrams can require a bit of training to use correctly, and it sometimes takes a bit of caution to express what seems like a straightforward idea at first glance. For example, it may be tempting to express the idea that members of Dog and Tail are wag-related with a simple arrow:

But which dogs and which tails? Do dogs wag other non-tail things? Are tails wagged by other non-dog things? What exactly are we saying? A more precise assertion would use the All Values From object property expression, as in Dog ⊑ ∀ wag. Tail. The pattern for expressing All Values From is somewhat more complicated. In addition to the two curves and an arrow, we have the arrow pointing to a smaller shape within Tail, because dogs don’t wag all the tails in the universe (eg. those that belong to cats), but they do wag all of an unknown subset of them:

To steer users around these sorts of notational pitfalls, the VMG felt that it would be best to encapsulate common ontology idioms in a set of ready-made patterns that one would only have to instantiate. For example, there might be an All Values From pattern with a Class1, a Class2, and a property label to apply: ²

As we are working within the context of a prototype with only three patterns implemented, we could straightforwardly express the notion of pattern library by simply reserving a side of the canvas for patterns, leaving the rest for diagrams proper:

This interim measure will need to be revisited in the future as the pattern library grows. For example, the pattern library may need to have a scrollbar, or as the number of patterns grows, some sort of thumbnail and magnification-based idiom, or perhaps pages or sections. How exactly to express this depends partly on the number of patterns we would end up with and also on how pattern instantiation works. Note that one minor technical challenge to overcome in a more sophisticated pattern library would involve dragging and dropping shapes from the library onto the main canvas. We currently skirt around the issue by simply having the library be part of the canvas, but once they are in separate logical spaces, managing the seamless transition of an object from one space into another may become a challenge.

Idioms (dropping vs. snapping)

To say that the user interface should be based on patterns is not enough. What exactly should a pattern do? How do we interact with it? Our current approach provides for two pattern idioms: drag and drop vs. drag and snap. Drag and drop is currently only used for class instantiation, the most basic of patterns. Here, the user selects a curve in the pattern and simply drags it out onto the canvas. This creates a new anonymous curve on the screen which can then be assigned a name, resized, and moved around:

Dragging and dropping works well for class instantiation because it does not require any interaction with objects already on the canvas. But for more complex patterns, we often want to express relationships between objects, and some of those objects may already be part of the diagram in progress. A simplistic drag and drop interface for such patterns may be cumbersome to use because of the redundancy it creates with pre-existing curves: one of the curves would have to be deleted. Alternatively, one could envisage a piecemeal drag-and-drop approach, where the user selects only parts of the pattern to drop in, leaving the already instantiated elements behind. But the pattern in such a scenario would serve only as a reminder and provide no guarantees that the diagram is being built correctly.

Our current approach for more complex patterns is to use a “drag and snap” idiom. Here the pattern is composed of multiple user-selectable “endpoints” where each endpoint is dragged onto a pre-existing curve in the canvas and “snaps” into place (a bit like unification, to give an analogy). Once all of the pattern's endpoints are snapped into place, the pattern is considered complete and its actions are triggered:

The drag-and-snap idiom generalises at least well enough for two somewhat different patterns (Subsumption and All Values From). However, there remain some potentially troubling open issues, at least in the current incarnation of this idiom. Broadly speaking, it’s not immediately obvious how to interact with patterns using such an idiom. Looking over users' shoulders, I’ve noticed them trying to pull drag-and-snap shapes out onto the canvas and trying to treat the endpoint curve as though it were a normal curve that you might see elsewhere on the canvas (in effect, treating drag-and-snap as though it were drag-and-drop). Likewise, if the pattern provides search boxes (to narrow down the snap candidates for a given endpoint), users can be strongly tempted to use them to name the curves. This can perhaps be avoided by labelling the search boxes with a magnifying glass icon, or by centralising them into a single global search box that always applies to the currently selected endpoint. As for the broader confusion, making the drag-and-snap endpoint more visually distinct (dashed borders) seems to help somewhat. Another solution may be to automatically reset endpoints that are brought out to the canvas but left unsnapped, thus avoiding the possibility of lingering endpoints being misread as part of the diagram.
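To make the drag-and-snap idiom a little more concrete, here is a minimal sketch in plain Java (no GWT), and not the plugin's actual code. The names SnapPattern, CompletionHandler, and SnapPatternDemo are all hypothetical, and curves are reduced to string identifiers. The pattern collects endpoint-to-curve bindings, fires its action once every endpoint is bound, and then resets itself.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** A pattern whose endpoints must all be snapped onto curves before it fires. */
class SnapPattern {
    /** Hypothetical callback, run once every endpoint has been snapped. */
    interface CompletionHandler {
        void onComplete(Map<String, String> endpointToCurve);
    }

    private final List<String> endpoints;
    private final CompletionHandler handler;
    private final Map<String, String> snapped = new LinkedHashMap<>();

    SnapPattern(List<String> endpoints, CompletionHandler handler) {
        this.endpoints = new ArrayList<>(endpoints);
        this.handler = handler;
    }

    /** Snaps an endpoint onto a canvas curve; returns true if that completed the pattern. */
    boolean snap(String endpoint, String curveId) {
        if (!endpoints.contains(endpoint)) {
            throw new IllegalArgumentException("Unknown endpoint: " + endpoint);
        }
        snapped.put(endpoint, curveId);
        if (snapped.size() < endpoints.size()) {
            return false;
        }
        handler.onComplete(new LinkedHashMap<>(snapped));
        reset();
        return true;
    }

    /** Returns every endpoint to the library, eg. after completion or an abandoned drag. */
    void reset() {
        snapped.clear();
    }
}

class SnapPatternDemo {
    /** Snaps the two subsumption endpoints in either order and collects the resulting axiom. */
    static List<String> run(boolean subsumedFirst) {
        List<String> axioms = new ArrayList<>();
        SnapPattern subsumption = new SnapPattern(
            Arrays.asList("subsumed", "subsuming"),
            m -> axioms.add(m.get("subsumed") + " SubClassOf " + m.get("subsuming")));
        if (subsumedFirst) {
            subsumption.snap("subsumed", "Puppy");
            subsumption.snap("subsuming", "Dog");
        } else {
            subsumption.snap("subsuming", "Dog");
            subsumption.snap("subsumed", "Puppy");
        }
        return axioms;
    }

    public static void main(String[] args) {
        System.out.println(run(true));
        System.out.println(run(false));
    }
}
```

Calling reset() when an endpoint is dropped on empty canvas would be one way to implement the "automatically reset unsnapped endpoints" idea mentioned above.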

Usage

The plugin is organised with a pattern library sitting beside a canvas. The current library is fairly small, providing for three basic patterns:

class instantiation (drag-and-drop)
subsumption (drag-and-snap)
all values from (drag-and-snap)

There is actually a much wider range of ontology-editing patterns proposed in the concept diagrams canon, but my hope is that many of these will be variations on the three above and relatively easy to implement.

Adding things to the canvas involves dragging something from one of the patterns onto the canvas. For class instantiation, this consists of dragging and dropping a curve out onto the canvas. Once instantiated, a curve can be moved around, assigned a label, resized, and deleted. Each curve stays on the canvas, fairly nondescript, until it is moused over. At that point, it is highlighted and various widgets come into focus (the label transforms into an editable text box, and a delete and a resize button appear near the class). For the other patterns, we instead drag curves from the pattern and “snap” them onto one of the curves already on the canvas.

Patterns

Class instantiation

Class instantiation works by pulling a curve out from the Class pattern onto the canvas. Assigning a label to the curve triggers an ontology update (an equivalent class is added to the ontology). Likewise, changing or removing the label also updates the ontology. In the other direction, some updates to the ontology are reflected in the diagram (removing or renaming a class); however, some updates (adding a class) are not. Much of the prototype is in this state, where there is a working front end that triggers updates to the underlying ontology, but not all updates from the other direction are supported.

Note that in the current implementation of the prototype, multiple curves are explicitly allowed to share a label, in which case they correspond to the same class. This makes renaming and deleting curves potentially subtle. When renaming or deleting a curve, we have to check whether the curve in question is the only one of its kind (if not, we must refrain from updating the ontology), and, when renaming, whether the new name for the curve already exists in the ontology.
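The bookkeeping this implies can be sketched as follows. This is an illustration under assumptions, not the plugin's code: CurveLabels is a hypothetical class, ontology updates are represented as plain strings, and "already exists" is approximated by "some curve on the canvas already carries the label" (the real check would also have to consult the ontology itself).

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Counts curves per label and decides which ontology updates (if any) an edit needs. */
class CurveLabels {
    private final Map<String, Integer> curvesPerLabel = new HashMap<>();

    private int count(String label) {
        Integer n = curvesPerLabel.get(label);
        return n == null ? 0 : n;
    }

    /** Labels a curve; the class is created only if no other curve has this label. */
    List<String> add(String label) {
        List<String> updates = new ArrayList<>();
        if (count(label) == 0) {
            updates.add("CREATE " + label);
        }
        curvesPerLabel.put(label, count(label) + 1);
        return updates;
    }

    /** Deletes a curve; the class is removed only when its last curve goes away. */
    List<String> remove(String label) {
        int n = count(label);
        if (n == 0) {
            throw new IllegalArgumentException("No curve labelled " + label);
        }
        List<String> updates = new ArrayList<>();
        if (n == 1) {
            curvesPerLabel.remove(label);
            updates.add("DELETE " + label);
        } else {
            curvesPerLabel.put(label, n - 1);
        }
        return updates;
    }

    /** Renames one curve; a true ontology rename needs a sole curve and an unused new name. */
    List<String> rename(String oldLabel, String newLabel) {
        List<String> updates = new ArrayList<>();
        if (oldLabel.equals(newLabel)) {
            return updates;
        }
        if (count(oldLabel) == 1 && count(newLabel) == 0) {
            curvesPerLabel.remove(oldLabel);
            curvesPerLabel.put(newLabel, 1);
            updates.add("RENAME " + oldLabel + " -> " + newLabel);
            return updates;
        }
        // Otherwise treat the edit as "this curve leaves one class and joins another".
        updates.addAll(remove(oldLabel));
        updates.addAll(add(newLabel));
        return updates;
    }
}
```

Treating a rename of a shared curve as "leave one class, join another" is one possible policy; a real editor might prefer to ask the user instead.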

Subsumption

Subsumption expresses the idea of generality, that members of the subsumed class are also members of the subsuming one (eg. we write Dog subsumes Puppy as Puppy ⊑ Dog). The pattern uses the more complex drag-and-snap interaction with two endpoints, the outer subsuming curve and the inner subsumed one, each of which is to be snapped onto an existing curve on the canvas. The subsumption pattern allows the user to snap endpoints in any order, resulting in the same updates to the underlying ontology. The visual effect is partly dictated by the last endpoint to be snapped. If the last endpoint represents the subsuming curve, we make a copy of the first curve and place it inside the second one. Conversely, if the last endpoint represents the subsumed curve, we copy the first curve and place it around the second one.

In its current state, the subsumption pattern comes with a hefty list of caveats. Just as with class instantiation, we only have a one directional link between the plugin and the ontology. Using the subsumption pattern triggers an ontology update; but updating the ontology does not trigger any reaction from the plugin. For that matter, the plugin does not detect free form subsumption events whereby the user drags and resizes one curve inside another without invoking the subsumption pattern. Also the fact that we copy curves (and more generally, allow multiple curves per class) greatly hurts the readability of the diagrams. I’ll say a few more words about these shortcomings at the end of this post, both the lack of free-form subsumption detection and the very unfortunate duplication of curves.

All Values From

The All Values From restriction in OWL (a standard ontology language) describes an anonymous class consisting of all individuals whose value for a given property belongs to a specific class (for example ∀ wag. Tail refers to all individuals who wag tails if anything at all ³). As we saw in an earlier example this restriction plays a role in one of the concept diagram patterns where we use an arrow to express the idea that a given class fits under this restriction (eg. Dog ⊑ ∀ wag. Tail):

The bulk of this pattern borrows from the drag-and-snap idiom in subsumption. The user controls two endpoints corresponding to the source and target side of an arrow (eg. Dog and Tail). When both endpoints are snapped, the plugin creates an anonymous curve inside the target one and points the source curve to it. The resulting arrow has a text entry that can be filled with a property name, which (similarly to how classes are added to the ontology when named) once entered triggers an update to the ontology backend, creating the property and class restriction as needed and expressing the appropriate subsumption relation (we can see the results in the OWL Entity Description Editor provided by WebProtégé).
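For readers less used to the notation, the meaning of the axiom that this pattern produces can be checked by hand on a tiny finite example. The sketch below is only an illustration: TinyModel and its methods are hypothetical names, and it evaluates the restriction over one fixed, fully known set of individuals, which is much simpler than what an OWL reasoner does.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** A tiny finite interpretation: class extensions plus one binary relation per property. */
class TinyModel {
    final Map<String, Set<String>> classes = new HashMap<>();
    final Map<String, Map<String, Set<String>>> properties = new HashMap<>();

    void addMembers(String cls, String... individuals) {
        classes.computeIfAbsent(cls, k -> new HashSet<>()).addAll(Arrays.asList(individuals));
    }

    void relate(String property, String subject, String object) {
        properties.computeIfAbsent(property, k -> new HashMap<>())
                  .computeIfAbsent(subject, k -> new HashSet<>())
                  .add(object);
    }

    /** True if everything the individual is related to via the property lies in the filler
     *  class; vacuously true when the individual is related to nothing at all. */
    boolean inAllValuesFrom(String individual, String property, String filler) {
        Set<String> successors = properties
            .getOrDefault(property, Collections.<String, Set<String>>emptyMap())
            .getOrDefault(individual, Collections.<String>emptySet());
        return classes.getOrDefault(filler, Collections.<String>emptySet())
                      .containsAll(successors);
    }

    /** True if every member of cls satisfies the restriction (cls SubClassOf forall property.filler). */
    boolean subsumedByAllValuesFrom(String cls, String property, String filler) {
        for (String member : classes.getOrDefault(cls, Collections.<String>emptySet())) {
            if (!inAllValuesFrom(member, property, filler)) {
                return false;
            }
        }
        return true;
    }
}
```

With Dog = {rex, fido}, Tail = {t1}, and rex wagging t1, the axiom holds even though fido wags nothing; it stops holding as soon as rex also wags something that is not a Tail.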

As with subsumption, the All Values From pattern mostly serves as a proof of concept with holes and missing functionality in critical places. The relationship with the ontology is once again unidirectional (applying the pattern updates the ontology, but the plugin does not notice changes made to the ontology from the outside). Likewise, perfectly reasonable actions such as moving the newly created anonymous curve around are not noticed by the plugin. Also, we currently do not provide any mechanism to modify the property arrows (neither detach nor delete); the somewhat slim silver lining is that this at least spares us yet another issue about keeping the ontology up to date.

Saving and loading

The plugin supports a limited notion of persistence. Changes to the ontology itself are already saved and loaded for free by WebProtégé. When opening a given ontology, the plugin checks to see if it has stored a representation of the diagram on the server and populates the canvas accordingly. Unfortunately, diagrams are not yet saved automatically — the user has to press a button to trigger a save — but it would be quite important to implement such a feature (at the moment, saving sends over the information needed to reconstruct the diagram in one batch; a better approach would likely send incremental updates on each manipulation). Also, persistence has not yet caught up with the more recent work on properties (eg. in the all values from pattern), so it only saves and loads curves/classes, and not labels.

Overall, the mismatch between the complete and invisible ontology saving supplied by WebProtégé, and the partial manual-only saving provided by the plugin means that the state of the canvas will tend to be out of synch with the ontology. The save functionality mostly works only as a convenience for testing the plugin and will need to be extended to be genuinely useful.

Development

My development process was very much exploratory and incremental, with a focus on hitting a series of concrete milestones (selecting further milestones as I went) and discovering what the core issues were along the way. This development style lent itself to the somewhat fast and fluid situation where (1) we had only a vague idea of what the prototype should do and so needed a concrete starting point from which to hold a more productive discussion on the design, and (2) we were working with limited time and unfamiliar technology and therefore wanted to minimise the upfront cost to the project. Most of the development was driven first by implementing the user interface for a new feature, mostly to further develop ideas about how the interaction should work, and later by fleshing out its backend before moving on to a new milestone.

Languages and libraries

WebProtégé itself is implemented in Java and GWT (a Java web programming framework that compiles client side code to JavaScript), and so much of the plugin is written in the same language.

When I first joined the project, I was eager to avoid Java (being a Haskell programmer at heart, I would prefer more concise programming coupled with a powerful type system) and jump to an alternative JVM language like Scala; however, the sticking point is that much of the plugin code is client-side, and efforts to build Scala GWT support appear to have stalled. Generally speaking, the compilation from Java to JavaScript ties our hands somewhat and can limit the Java libraries that we can adopt (a library and all of its dependencies must have GWT support). That said, at least the Guava libraries provide some useful additional features (sort of an augmented standard library), as does Project Lombok (useful Java annotations for automated constructor and getter/setter generation, and a @NonNull runtime check). These make for a fairly verbose and not particularly safe experience, but one that is better than nothing. Perhaps the most likely path forward in the future would be to see the WebProtégé project move to Java 8, and try to make do with the language improvements there.

Canvas management

The client side of WebProtégé and the plugin consist of JavaScript executed in the browser (producing a combination of HTML and SVG code along the way). The canvas end of this is cobbled together from a mixture of libraries and methods:

Widgets and handlers (GWT): widgets such as buttons and text boxes, as well as event handlers, are provided by GWT (the code for these is written in Java and automatically compiled down)
Shapes and curves (Raphaël): the rounded rectangles used to represent classes are drawn with the Raphaël JavaScript library, largely through the intermediary of raphaelgwt (not to be confused with a similar, likely equivalent project, raphael4gwt)
Arrows and dragging (JsPlumb): shapes are connected with the help of the JsPlumb JavaScript toolkit, with a thin layer of helper code written in JavaScript and exposed to the Java client code via GWT’s JSNI mechanism.
Highlighting and ephemera (Java and CSS): some visual effects such as a blue boundary around the currently moused-over curve are baked into the plugin stylesheet

This unwieldy combination of technologies is held together with a somewhat displeasing amount of string and duct tape. It may be attractive to find a more full-featured framework (perhaps d3) that can cover more ground with a single uniform model.

As an illustration of how these libraries are stitched together, consider the case of user-editable labels on arrows. We make use of JsPlumb’s ability to place custom “overlays” in the label position consisting of arbitrary DOM objects. We would typically express these as JavaScript functions that create and return the objects in question (a textbox); however, we also require the plugin (the Java code) to be able to keep track of what the user typed into the label. It took some trial and error, but in the end, I ended up generating a textbox widget and DOM identifier in GWT and passing the identifier on to JsPlumb via an anonymous function that always executes a jQuery selector for the textbox id.

Likewise, the Raphaël library provides us with the ability to draw and perhaps animate shapes, but does not appear to provide a way to connect these shapes, or react to their being moved around (connections to shapes should move with the shapes). That functionality can be addressed with JsPlumb; however, it can only manipulate HTML DOM objects (notably div elements) and not the individual SVG objects that Raphaël uses. The end result is that we have a canvas composed of many JsPlumb-managed div elements, inside of which is a miniature SVG document containing a single curve.

Integration

At the time of this writing, WebProtégé does not offer a mechanism for developing third party plugins (unlike its desktop cousin), but we believe that this will come in due course.

In the meantime, I have tried to keep the plugin as self-contained as possible. The code lives in a completely separate package from WebProtégé proper, as do its JavaScript, CSS modules, and GWT module files. As it stands, the places where I have modified the WebProtégé code itself include:

server side: registering actions implemented by the plugin with WebProtégé’s dispatch mechanism (will be obviated by future WebProtégé annotations)
client side: registering the concept diagram portlet with the UI factory (likewise to be obviated)
UI configuration: adding a user-defined tab to the UI configuration XML file (WebProtégé UI configuration mechanism to be rethought in the future)
build system: referencing the plugin GWT module file from the WebProtégé one (potentially difficult to avoid)
third party libraries: extending the WebProtégé Maven repository with dependencies used by the plugin (potentially hard to avoid; the alternative would be changing the build file)

Future work

In its current state, the plugin should be treated as more of a proof of concept than a useful ontology building tool. It shows us what ontology editing might look like using a formal visual language such as concept diagrams and hopefully motivates future work into building something closer to a production quality editor. Getting from prototype to usable tool will require progress in several areas:

filling out partial functionality
improvements to the user interface
critical features
backend improvements

Bugs and partially implemented features

Some of the partially working components in the plugin should be possible to flesh out with “just” a matter of work:

Saving and loading only supports curves and not properties
There is no way to interact with properties aside from changing their labels
Anonymous classes cannot be snapped to

The full list of bugs and half-features is fairly extensive and can be found on the issue tracker in my GitHub branch.

User interface shortcomings

Canvas manipulation is fairly awkward as currently implemented. From a development perspective, it would be a good idea to try and address the awkwardness before internalising it and forgetting to be bothered by it. For one thing, the plugin is quite sensitive to mouse positioning, and mousing out when in the middle of editing a label can be somewhat frustrating.

Likewise, resizing is implemented in a fairly non-standard way (largely for ease of implementation): the user (1) hovers over the class, (2) presses the resize button to initiate resize mode, (3) drags the mouse to resize, and (4) releases the mouse button. Aside from being non-standard for users accustomed to more common idioms (a mode-switch which happens immediately as the user mouses over a corner), the resizing action is also extremely sensitive to the placement of the mouse. It is easy to stop the resizing action accidentally by drifting the mouse over a different object in the canvas that happens to be in the vicinity.

A more serious issue is that the prototype does not account for object selection in the case of overlapping (any behaviour here is an accident of the DOM as much as anything else). This issue must absolutely be addressed if the plugin is to get any use in the real world. In low-level terms, the solution may have to involve promoting smaller objects along the Z-index so they are more accessible than the larger objects behind them. Overall I believe that my fuzziness around the mouse focus model in the DOM (or as exposed by GWT?) is partly to blame here. For example, I discovered only by accident that the mouse only ever hovers over one object at a time, the topmost object in case of stacking. This can make it difficult to implement behaviours that trigger whenever the mouse passes over an object, regardless of other objects that may lie in front of it. It would be interesting to see if there are any libraries out there that somehow capture these mouse focus and Z-index issues and expose them to the programmer in a somewhat higher level and/or more intuitive way. But library or no library, a more solid grasp of the underlying UI model would be a great help.
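The "promote smaller objects" idea could look something like the sketch below. It is only a guess at a policy, not something the prototype does: ZOrder and Box are hypothetical names, curves are approximated by their bounding boxes, and the resulting numbers would still have to be applied to the DOM as z-index values.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class ZOrder {
    /** Bounding box of a curve on the canvas. */
    static class Box {
        final String id;
        final double width, height;

        Box(String id, double width, double height) {
            this.id = id;
            this.width = width;
            this.height = height;
        }

        double area() {
            return width * height;
        }
    }

    /** Assigns z-indices so that smaller boxes sit in front of larger ones (largest gets 0). */
    static Map<String, Integer> assign(List<Box> boxes) {
        List<Box> sorted = new ArrayList<>(boxes);
        // Largest area first; List.sort is stable, so equal areas keep their original order.
        sorted.sort((a, b) -> Double.compare(b.area(), a.area()));
        Map<String, Integer> zIndex = new LinkedHashMap<>();
        for (int i = 0; i < sorted.size(); i++) {
            zIndex.put(sorted.get(i).id, i);
        }
        return zIndex;
    }
}
```

Sorting by area is a blunt instrument (a small curve far away from a large one does not need promoting), but it guarantees that a curve nested inside another is always the one under the mouse.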

Aside from the various UI bugs and flakiness, one thing which became quite clear with the availability of the subsumption pattern is that having to move shapes around individually can be slow and frustrating (especially given the Z-axis issues we have), and also a source of potential error if, for example, we were to move a subsuming shape and forget to take its embedded subsumed shape along for the ride. A key feature here may be some sort of grouping mechanism, or perhaps a simple lasso-based selection idiom.
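One very simple form of grouping would be to treat everything drawn inside a curve as travelling with it. The sketch below assumes rectangular bounding boxes and uses hypothetical names (GroupMove, Curve); it is not something the prototype implements.

```java
import java.util.ArrayList;
import java.util.List;

class GroupMove {
    /** A curve approximated by its bounding box; only the position is mutable. */
    static class Curve {
        final String id;
        double x, y;
        final double w, h;

        Curve(String id, double x, double y, double w, double h) {
            this.id = id;
            this.x = x;
            this.y = y;
            this.w = w;
            this.h = h;
        }

        boolean contains(Curve o) {
            return x <= o.x && y <= o.y && x + w >= o.x + o.w && y + h >= o.y + o.h;
        }
    }

    /** Moves target by (dx, dy), carrying along every curve that lay inside it beforehand. */
    static void move(Curve target, List<Curve> all, double dx, double dy) {
        // Collect the group first, so that moving the target does not change who is inside it.
        List<Curve> group = new ArrayList<>();
        for (Curve c : all) {
            if (c == target || target.contains(c)) {
                group.add(c);
            }
        }
        for (Curve c : group) {
            c.x += dx;
            c.y += dy;
        }
    }
}
```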

Diagram interpretation

Although we chose a patterns-based interface with the express purpose of constraining the interface (so that diagrams are more likely to be an accurate reflection of what the user intended), it became readily apparent that there will likely be no getting around the need to develop what is effectively a free-form editor. The problem is that we already support some free-form editing operations that would be hard to do without (the ability to delete, resize, reposition, and relabel curves), and with each of these operations it can be quite easy to change the meaning of a diagram. Moving shapes around can cause them to intersect. Resizing a shape can cause it to include other shapes (so in fact it’s perfectly possible to express subsumption without the subsumption pattern). Right now, the plugin makes no attempt at interpreting or noticing these changes, which makes it extremely easy to cause the diagram to mean something very different and to have it go out of synch with the ontology.

One way of addressing the issue may be to work out how to constrain the user interface even further, so that the only things the user can express are those that are easy for the editor to notice and treat as an explicit intentional event. For instance, we may get a great deal of distance by simply refusing to let any resize/move operation cross a curve boundary (be it entering or leaving). But it seems like this sort of approach would mean us getting caught up in a perpetual game of whac-a-mole, anticipating the sort of semantic-changing operations a user could perform.
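As a rough sketch of what "refusing to cross a curve boundary" might mean in code: compute how the edited curve relates to every other curve before and after the operation, and reject the operation if any relation changes. Everything here is an assumption for illustration (the names BoundaryGuard, Rect, and Relation, and the use of rectangular bounding boxes rather than true curve geometry).

```java
import java.util.List;

class BoundaryGuard {
    enum Relation { DISJOINT, CONTAINS, INSIDE, OVERLAPS }

    /** Axis-aligned bounding box standing in for a curve. */
    static class Rect {
        final double x, y, w, h;

        Rect(double x, double y, double w, double h) {
            this.x = x;
            this.y = y;
            this.w = w;
            this.h = h;
        }

        boolean contains(Rect o) {
            return x <= o.x && y <= o.y && x + w >= o.x + o.w && y + h >= o.y + o.h;
        }

        boolean disjointFrom(Rect o) {
            return x + w <= o.x || o.x + o.w <= x || y + h <= o.y || o.y + o.h <= y;
        }
    }

    /** Topological relation of a with respect to b. */
    static Relation relate(Rect a, Rect b) {
        if (a.disjointFrom(b)) return Relation.DISJOINT;
        if (a.contains(b)) return Relation.CONTAINS;
        if (b.contains(a)) return Relation.INSIDE;
        return Relation.OVERLAPS;
    }

    /** A move or resize is allowed only if it leaves every pairwise relation unchanged. */
    static boolean allowed(Rect before, Rect after, List<Rect> others) {
        for (Rect other : others) {
            if (relate(before, other) != relate(after, other)) {
                return false;
            }
        }
        return true;
    }
}
```

Note that this compares only the start and end positions; a drag that jumps from one side of a curve to the other without changing the relation would slip through.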

I believe we need a more general approach that guarantees that any possible diagram a user can draw will (A) be a well-formed concept diagram and (B) have an interpretation in the ontology. Roughly speaking, I believe we need a way for the plugin editor to read the canvas and convert what it sees to an abstract concept diagram along with the necessary translations needed to tie it in with the ontology backend. Furthermore, we will likely need a way to do this dynamically so that we are not just reading a canvas snapshot and dumping the results whole, but interpreting individual user actions and the corresponding changes to the abstract concept diagram and ontology.
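To give a flavour of what "reading the canvas" could involve, the sketch below recovers subsumption axioms purely from containment between labelled rectangles, and computes which axioms a user action has added by comparing two readings. It is a deliberately naive illustration with hypothetical names (CanvasReader, Curve); it ignores overlaps, anonymous curves, arrows, and everything else a real interpreter would need.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

class CanvasReader {
    /** A labelled curve approximated by its bounding box. */
    static class Curve {
        final String label;
        final double x, y, w, h;

        Curve(String label, double x, double y, double w, double h) {
            this.label = label;
            this.x = x;
            this.y = y;
            this.w = w;
            this.h = h;
        }

        boolean contains(Curve o) {
            return x <= o.x && y <= o.y && x + w >= o.x + o.w && y + h >= o.y + o.h;
        }

        double area() {
            return w * h;
        }
    }

    /** For each curve, emits one axiom naming the smallest curve that encloses it. */
    static List<String> subsumptions(List<Curve> curves) {
        List<String> axioms = new ArrayList<>();
        for (Curve inner : curves) {
            Curve parent = null;
            for (Curve outer : curves) {
                if (outer != inner && outer.contains(inner)
                        && (parent == null || outer.area() < parent.area())) {
                    parent = outer;
                }
            }
            if (parent != null) {
                axioms.add(inner.label + " SubClassOf " + parent.label);
            }
        }
        return axioms;
    }

    /** Axioms present after a user action that were not there before it. */
    static Set<String> added(List<String> before, List<String> after) {
        Set<String> result = new LinkedHashSet<>(after);
        result.removeAll(before);
        return result;
    }
}
```

Re-reading the whole canvas after every action is the snapshot approach; the dynamic version would aim to derive the same difference from the action itself.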

Automated layout

That it was possible to extend WebProtégé to support something like this plugin is a testament to its modular design. Components interact with each other using a basic blackboard architecture, each component listening for and publishing events as needed. So having the plugin publish changes to the ontology, and having these changes be reflected in other standard WebProtégé components (the class tree viewer, the OWL Entity Description Editor) is just a matter of events. By rights, we should also be able to go full circle. At the moment, most changes to the ontology that come from other components are completely ignored by the diagram editor.
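The publish/listen arrangement described above can be pictured with the plain-Java stand-in below. It is emphatically not WebProtégé's event API: Blackboard, BlackboardDemo, and the "classAdded" event name are all made up for illustration. The point is only that going "full circle" amounts to the diagram subscribing to the same events it already publishes.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

/** A minimal blackboard: components subscribe to event types and publish payloads. */
class Blackboard {
    private final Map<String, List<Consumer<String>>> listeners = new HashMap<>();

    void subscribe(String eventType, Consumer<String> listener) {
        listeners.computeIfAbsent(eventType, k -> new ArrayList<>()).add(listener);
    }

    void publish(String eventType, String payload) {
        List<Consumer<String>> interested =
            listeners.getOrDefault(eventType, Collections.<Consumer<String>>emptyList());
        for (Consumer<String> listener : interested) {
            listener.accept(payload);
        }
    }
}

class BlackboardDemo {
    /** Returns the class names the diagram component hears about from elsewhere. */
    static List<String> run(boolean diagramListens) {
        Blackboard bus = new Blackboard();
        List<String> treeView = new ArrayList<>();
        List<String> diagram = new ArrayList<>();
        bus.subscribe("classAdded", treeView::add);
        if (diagramListens) {
            bus.subscribe("classAdded", diagram::add);
        }
        // Some other component (say, the class tree) adds a class to the ontology.
        bus.publish("classAdded", "Cat");
        return diagram;
    }

    public static void main(String[] args) {
        System.out.println(run(false)); // the current situation: the diagram hears nothing
        System.out.println(run(true));
    }
}
```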

For the plugin to truly participate as a WebProtégé citizen, it must keep itself updated in light of any changes to the ontology that come its way. Among other things, this would entail some sort of automatic layout mechanism. In fact, the VMG have already done considerable work on the problem of automated layout (including implementations such as iCircles), so this may just be a matter of integration. That said, one potentially important twist is that we would presumably want diagrams to be as stable as possible, preserving as much of their original layout as new facts on the ground allow. Also, because the diagram will change (however minimally) as a result of the automated layout, it may be helpful to include some sort of animation for clarity, gradually shifting the parts of the diagram to accommodate new assertions.

No more duplicate curves

Finally, it’s worth taking a closer look at the decision to copy curves that participate in a subsumption event, and more broadly, to allow multiple curves for the same class. This is extremely unfortunate from a usability standpoint, because as shown in VMG experiments, the duplication severely impairs people’s ability to reason about diagrams. That said, in the context of building an interactive editor, it’s not clear what a better solution would be. I am confident there is one out there but it will require a lot more sophistication on the part of the plugin:

Moving one of the curves instead of copying it is a somewhat delicate affair because the relative location of a curve with respect to other curves (in a sort of topological sense) is semantically significant. We would need to account for other curves in the logical neighbourhood and recursively factor in any impact those curves have on those around them. That said, perhaps the right way to think about this problem in the long term is as one of automated diagram layout, which we will need anyway. In this case, using the subsumption pattern would simply correspond to asserting a subsumption axiom and laying it out accordingly.
We would change the subsumption pattern to trigger on the first endpoint and draw an anonymous curve either inside or outside of the first snapped curve (depending on which endpoint was selected first). We could then force the user to enter a name for the newly drawn curve (not one that already exists), or monitor the diagram for when the user does insert a label (at which point we would update the ontology backend). This could be worth exploring, but it comes at the potential cost of introducing yet another idiom to the user interface (that said, it’s too early to speculate on whether this is a bad thing or not).

Our current approach of simply copying the curves should be considered a stopgap measure, and a very unsatisfactory one at that. To go beyond that, we would likely need either automatic layout or diagram interpretation as prerequisites, neither of which should really be considered a dealbreaker because they are important to have in the general case anyway.

Abstract representation for diagrams

In developing the plugin, I adopted a working style which deliberately emphasised visible results over correctness or maintainability. But this will not be sustainable if we want to move beyond the proof-of-concept stage. One particular sore point is that we have tried to get away without much of an abstract representation of the concept diagram editor. What little passes for a model is a set of curves with an optional label/IRI along with some size/coordinate information, and in addition to this, an IRI to IRI parent relation mapping. Notice for example that although we have a notion of properties via the All Values From plugin, we don’t actually track them in any way.

It is going to be important to flesh out a much more robust and complete diagram model. In fact, some of the core functionalities we need (automated layout, diagram interpretation) also rest on our having modelled concept diagrams correctly. A natural place to start looking would be theoretical work by the VMG in formalising concept diagrams. The formal model may serve as a strong starting point, but we may also need to extend it or build a two-layer model that also reflects key physical properties, in other words, the sizes and positions of curves.
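To make the starting point concrete, here is a minimal sketch of the bare-bones model described above: curves with an optional label/IRI and some geometry, plus an IRI to IRI parent mapping. It is written in Python purely for illustration (the plugin itself is Java/GWT), and the names `Curve`, `DiagramModel` and `ancestors` are hypothetical. It also assumes at most one parent per IRI, which the real plugin may not.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Curve:
    """A drawn curve: geometry plus an optional label/IRI (hypothetical names)."""
    x: float
    y: float
    width: float
    height: float
    label: Optional[str] = None
    iri: Optional[str] = None


@dataclass
class DiagramModel:
    """Assumed shape of the minimal model: curves plus a child -> parent IRI map."""
    curves: list = field(default_factory=list)
    parents: dict = field(default_factory=dict)  # child IRI -> parent IRI

    def ancestors(self, iri):
        """Follow the parent mapping upwards, stopping if a cycle is detected."""
        chain = []
        seen = {iri}
        while iri in self.parents and self.parents[iri] not in seen:
            iri = self.parents[iri]
            seen.add(iri)
            chain.append(iri)
        return chain
```

Note what is missing from this sketch, just as in the prototype: properties, curve containment, and any link between the geometry and the logical structure. A two-layer model would presumably keep the logical layer and the `Curve` geometry separate.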

Better technologies

It is a design goal for the plugin to work with Protégé, or more to the point a popular and robust ontology editor. Our choice of WebProtégé in particular is motivated by our belief that this is where the future of the Protégé project lies. In a sense this somewhat ties us to the Java and GWT world, with even Scala (at the time of this writing) not being a helpful alternative. It seems that we are bound to Java to the extent that we wish to write client side code which translates to JavaScript, and to communicate with the WebProtégé server.

There are some avenues to explore still. Much of the user interface code relies on some fairly advanced JavaScript libraries, which are sometimes made even more useful by GWT bindings that expose the functionality with a friendlier Java API. It would be helpful to keep apprised of what’s new in the JavaScript user interface and graphics world. I’m attracted to the prospect of using D3 for example as an alternative to Raphaël and JsPlumb, mostly in the interest of reducing the number of distinct technologies out there, and also in the hope that introducing an element of animation can make the plugin easier to understand.

Being the inveterate Haskeller, I am still holding out hope for us exploring some way to write chunks of the plugin in Haskell (or anything better that comes along). What is at stake here is not so much direct productivity gains, but a kind of intellectual leverage provided by the combination of concision and an advanced type system. Considering how much unexplored territory there is in designing this sort of editor, any tool that helps us to think more clearly would be a great boon to the work. But then again what about the JavaScript frontend and the Java backend? Hope is not lost. There is a very intriguing subset of Haskell called Fay which from the looks of it generates very clean JavaScript. We already are using a fair amount of non-GWT JavaScript in the library, so why not take this to the logical conclusion, writing essentially the whole plugin as a giant ball of Haskell-generated JavaScript? Integrating this into the WebProtégé client can be handled with a small Java/GWT wrapper. As for the backend, maybe communication with it could be shifted over to some web standard which we could talk to directly in Haskell/Fay. Alternatively, we could simply put the calls in the Java/GWT wrapper layer and expose them via JSNI. Another possible benefit of this approach is that we can perhaps do it incrementally and tentatively, slowly increasing the amount of responsibility we delegate to the JavaScript side of the codebase until we gain confidence in what we are doing.

Conclusion

As is often the case in an academic project, work on this plugin has only opened up a much longer section on important future work. We still have a lot of progress to make, even to get to a minimum viable editor people can play with.

Meanwhile, I have greatly enjoyed and gotten much out of this opportunity to work with the VMG. In particular, it was quite helpful for me to face a sort of triple challenge: working in a completely different subject matter (I come from a natural language processing background, but that said, think that automatic visualisation plus language generation must be an interesting combo), a different kind of programming (web-based user-facing application), and a lot of mainstream tooling (Java, a fancy IDE) I was unfamiliar with. Programming is partly about versatility, an ability to plunge into the many forms of the unknown. Having emerged unscathed from the process gives me more confidence in my relationship with this craft. So, thanks VMG’ers!

Resources

For more information on concept diagrams and how you might use them for your ontology engineering needs, visit the Ontology Engineering homepage run by the VMG. Code for the plugin lives in a GitHub fork of webprotege.

Example concept diagrams have been generated with the most excellent Haskell diagrams library. See the source code to see how much easier it can be to draw diagrams with Haskell than by hand.

You can too. I’m a semi-academic freelance programmer. I like to make tools to help people do interesting research, and am currently available half-time.↩
The boxes in this pattern represent separate contexts in which to make assertions about objects. The boxes do not so much isolate the objects from each other (they are in one and the same universe) as isolate our claims about them. So in the example above, encasing Class1 and Class2 in boxes simply means that we do not accidentally imply they are disjoint.↩
An individual which does not have any values for a property trivially satisfies the all values from restriction↩

Multiple adjunctions in GenI

2013-09-14T00:00:00Z

Multiple adjunctions in GenI

Posted on 14 September 2013

Tags: geni

I’m just back from a few weeks’ holiday visiting my parents’ place in Florida. Between playing with cute toddlers and hanging out with family and friends, I managed to do a bit of work on the surface realiser GenI.

An itch unscratched

The problem I wanted to address had been nagging me since my PhD, namely that multiple adjunctions could cause a combinatorial search explosion and crash the realiser. This is actually a solved problem both in the TAG parsing sense in the 1980s, and in the NLG sense (other people have solutions for similar issues with their non-TAG realisers).

If I understand correctly, the solution at heart is to use some sort of packing strategy, which in providing a compact representation of similar paths through the search space, has the effect of steering you away from unproductively similar routes. You just cover the key points and can unfold any variations on those key points at your leisure. Algorithms like CKY/Earley do this by making it so that for the purposes of parsing, different intermediary paths to the same indices are treated as the same (if you’re not familiar with any of this, just think dynamic programming). Unfortunately, by the time I’d wrapped my head around the existing solutions, refactored GenI to allow different algorithms from its original (would have been a total rewrite), and made a first bad implementation of Earley, I’d already moved on. Thesis defended, postdoc started, GenI to the back burner. Itch frustratingly unscratched.

And still… not… properly… scratched. This thing likely isn’t going away until I manage to sit down properly and bang out a version of GenI with packing built in. On the other hand, what I did manage to do over the holiday was to provide hopefully a bit of low-cost practical relief. We have three basic issues for the surface realiser to contend with: lexical explodiness (having to choose from multiple lexical items to realise part of the input), permutative explodiness (multiple orderings of these items), and combinational explodiness (multiple subsets of these items).

Pieces of a solution

The two-penny optimisations I managed to implement assume a slight relaxation in requirements, that you don’t want all the surface realisation results GenI could offer, but just some result. We provide no relief from combinatorial explosion but merely reorder the surface realiser’s exploration of the search space. If you use the --maxresults flag to cap the number of results returned by the realiser, these optimisations should hopefully lead you to those first N results sooner than it would otherwise¹; but if not, it will still go kaboom.

Guided realisation

Suppose you have some way to group chart items into a “path”. Each path represents a set of items which “fit together” to form a potential solution. All solutions must exist in some path or another (and a path may contain more than one solution), although some paths may be duds (which is also OK). If you are familiar with the polarity filtering optimisation in GenI, a path corresponds to a path through the polarity automaton.

Guided realisation is basically the idea of using paths to steer the realiser towards producing a solution as early as possible, by making it focus on one path at a time. If good paths are relatively few and we don’t have a way to sort them, we could in the worst case laboriously step through a bunch of bad paths until we focus on a good one (although note that paths can share items, so work isn’t lost just delayed). If paths tend to contain solutions, then guided realisation should help things go a bit faster by focusing our attention in a more depth-first manner.

Implementation-wise, guided realisation is easy and non-invasive. Our chart generator already has a notion of an agenda, so all we have to do is to impose an order on the paths (NB: we need to be able to enumerate the paths for this, so they must be known beforehand), and explore the agenda in that path-order. (New chart items that are produced and added on to the agenda fit into this scheme just fine because they belong to whatever paths their parents did).
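As a rough illustration of exploring the agenda in path-order, here is a minimal sketch in Python (GenI itself is written in Haskell, and this is not its code). It assumes each agenda item is a pair of a name and the set of path identifiers it belongs to, and that `path_order` maps each path identifier to its rank. Both representations, and the function names, are hypothetical.

```python
def path_rank(item_paths, path_order):
    """Rank of the earliest-ordered path that an item belongs to."""
    return min(path_order[p] for p in item_paths)


def next_item(agenda, path_order):
    """Remove and return the agenda item belonging to the earliest-ranked path.

    Ties are broken by agenda position, so older items come out first.
    """
    best = min(range(len(agenda)),
               key=lambda i: path_rank(agenda[i][1], path_order))
    return agenda.pop(best)
```

An item shared between several paths is picked up as soon as the first of its paths comes into focus, which matches the observation that work on shared items is delayed rather than lost.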

Safe pruning

Guided realisation can be used to save time by guiding us to a solution, paths permitting. We can also use the same information to save a little bit of space. Chart items that correspond only to already-explored paths don’t contribute anything towards our quest for a(nother) solution, so we can just delete them. Doing so will at least cap any growth in memory that comes from chart items unique to a path. Note that if there is a lot of overlap among paths, this will have little effect as the shared items would not be deleted. I call this safe pruning because we are only deleting items that we already know have no more contribution to make whatsoever.

Nothing earth shattering here, but hopefully something to keep a little bit of a lid on the memory growth.
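The pruning rule itself fits in a few lines. This is a sketch under the same assumed representation as before (chart items as pairs of a name and a set of path identifiers), not GenI’s actual implementation; `safe_prune` is a hypothetical name.

```python
def safe_prune(chart, explored_paths):
    """Keep only items that still belong to at least one unexplored path.

    An item is dropped when its set of paths is a subset of the explored ones.
    """
    return [(item, paths) for item, paths in chart
            if not paths <= explored_paths]
```

Items shared with a path that has not been explored yet survive, which is why heavy overlap among paths limits the savings.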

Novelty sorting

Guided realisation (plus safe pruning) can be helpful if your adjunction pain is partly lexical. If on the other hand the pain comes from a permutation/combination issue (such as intersective modifiers), they are of no use. I have an example in the GenI distribution of a test case that blows up with just multiple instances of the auxiliary tree (covering different parts of the semantics) given free rein in the order of adjunction. What kills us here is a proliferation both of valid solutions (based on different permutations of the items) and of intermediary chart items along the way.

(An interesting variation on this, one which also introduces lots of spurious explodiness — a proliferation of dud chart items — is when you have multiple adjunctions… to multiple sites, ouch!)

What kills us here seems to be the possibility of exploring basically the same thing over and over again in different orders, generating lots of chart items and blowing up our memory usage along the way. So why not steer the realiser along just one of these permutations at a time? This is what novelty sorting tries to accomplish. First, we sort the agenda according to the completeness of its semantics. The more literals you cover, the more we want to look at you. If there is a tie, we also sort by the number of times we’ve seen an item bearing that semantics. We want the biggest, newest semantics first, hopefully items which represent a path away from boring permutations of the same thing.
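The two-level ordering described above can be expressed as a single sort key. The sketch below is in Python and assumes a hypothetical representation: each agenda item is a pair of a name and a frozenset of the semantic literals it covers, and `seen` counts how many items bearing each semantics have been encountered so far.

```python
from collections import Counter


def novelty_sorted(agenda, seen):
    """Sort by coverage (most literals first), then by novelty (least seen first)."""
    return sorted(agenda, key=lambda item: (-len(item[1]), seen[item[1]]))
```

For instance, with two items that each cover two literals, the one whose semantics has not been seen before comes out first:

```python
agenda = [("a", frozenset({"l1"})),
          ("b", frozenset({"l1", "l2"})),
          ("c", frozenset({"l2", "l3"}))]
seen = Counter({frozenset({"l1", "l2"}): 3})
names = [name for name, _ in novelty_sorted(agenda, seen)]
```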

There may be cases where this sort of thing might go horribly wrong. Maybe it’s possible to have lots of really promising long partial solutions that end up being dead ends, and by steering things towards more complete solutions first we may just have hillclimbed our way to nowhere. But I’m hoping in practice that the length + novelty sorting is the sort of thing that will break you out of a permutational bind.

Hope that helps!

That’s all I have perhaps until my next holiday. Maybe next time, I’ll actually get to write that packed algorithm I’ve always been hoping for. I’m thinking maybe we can define the equivalence classes according to the list of remaining adjunction-capable nodes (the idea being to pack different paths to get to that same list…) or maybe just rolling up my sleeves and pushing some Earley dots around. But I hope these lightweight additions to GenI will make life a bit nicer for users in the interim.

On the other hand it could perhaps cause it to explore dead ends sooner in a pathological case :-(↩

UIMA annotation model

2013-08-12T00:00:00Z

UIMA annotation model

Posted on 12 August 2013

Tags: stac, corpus, uima

About

As part of my work on the educe library, I’m doing a small survey of annotation models used in a handful of text processing and annotation frameworks. So far, I’ve covered Glozz, the annotation tool used in our project; and GATE, a pluggable text processing framework.

The next model of interest is the one used by UIMA, the Unstructured Information Management Architecture. I tend to think of UIMA as being a generalisation of GATE. It was initially developed by IBM in 2001 to improve the interoperability of text (or other modal) analysis components scattered throughout IBM at the time, and apparently to good effect, if you recall IBM Watson defeating a couple of human stars at the gameshow Jeopardy! (2011). Somewhere in between, UIMA was added as an Apache project and has become an OASIS standard to boot.

Type systems

My first impression of UIMA is a sense of extreme generality: you can have annotations on any content using any mode of communication, eg. text, audio, video. Behind this generality is the fact that the UIMA architecture is designed around object oriented type hierarchies: for whatever annotations you make, you define your own classes and subclasses, but they must conform to some minimal specification.

Confusingly enough, the UIMA documentation uses the term “type system” to refer to a specific set of classes and their relationships, as opposed to the usual programmer understanding of it as being the overarching typing rules and features (eg. if your type system has subtyping, generics, etc…). In the UIMA world, the idea is that different kinds of documents have different kinds of annotations, and thus different “type systems” (sets of classes), and if you want to share annotations, you provide an encoding of your type systems to go with them. So how does UIMA model such type systems then? As with many OO languages, it makes a distinction between entities on the type level (classes and features), and those on the data level (objects and slots):

classes: single inheritance, have multiple features
features: have a type (a class), and cardinality (lower and upper bound); features come in two varieties: attributes, whose type must be one of the primitive classes; and references, whose type is a class
objects: instances of classes, have a slot corresponding to each feature defined by their class
slots: a pair of a feature and some values (a list?). The number of values and the nature of its contents are determined by the feature’s cardinality and type. If the feature is an attribute, the values must be primitives; likewise if it’s a reference the values must be objects
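To make the type-level/data-level split concrete, here is a small sketch in Python. It is not the UIMA API: the names `Feature`, `TypeClass`, `Obj` and `check_cardinality` are hypothetical, slots are modelled as lists of values, and the sketch assumes that features are inherited from the parent class.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Feature:
    """Type level: a named feature with a type and a cardinality."""
    name: str
    type_name: str               # a primitive for attributes, a class for references
    lower: int = 0
    upper: Optional[int] = 1     # None stands for "unbounded"


@dataclass
class TypeClass:
    """Type level: a class with at most one parent (single inheritance)."""
    name: str
    parent: Optional["TypeClass"] = None
    features: list = field(default_factory=list)

    def all_features(self):
        # Assumption: a class also carries the features of its ancestors.
        inherited = self.parent.all_features() if self.parent else []
        return inherited + self.features


@dataclass
class Obj:
    """Data level: an instance with one slot (a list of values) per feature."""
    cls: TypeClass
    slots: dict = field(default_factory=dict)   # feature name -> list of values

    def check_cardinality(self):
        """True if every slot holds a number of values within its feature's bounds."""
        for feat in self.cls.all_features():
            n = len(self.slots.get(feat.name, []))
            if n < feat.lower or (feat.upper is not None and n > feat.upper):
                return False
        return True
```

A `Noun` class deriving from a `Token` class with a mandatory `pos` attribute would then accept `Obj(noun, {"pos": ["NN"]})` and reject an object with an empty `pos` slot.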

Note that this model can be captured in the Eclipse project’s ECore modeling language ¹, which also provides a file format to express UIMA “type systems”.

Primitives (base type system)

UIMA in its great generality does not mandate the use of any type system in particular; however it does provide a base type system on which other type systems can be derived and which is assumed to be common across all UIMA-compliant analytics, applications, and frameworks (sounds effectively mandatory to me!). The base type system includes some primitives (strings; bools; bytes; 16, 32, 64 bit ints; 32, 64 bit floats), plus the EObject type subclassed by everything including primitives.

Annotations (base type system)

Aside from the primitives, the base type system has the notion of an annotation. An annotation in the UIMA world is

an object (ie. arbitrary)
with regional references (eg. offsets)
on a subject of analysis, “sofa” for short (eg. some text)

In other words, we have the usual notion of standoff annotation on a document using offsets, but generalised so that instead of text we’re working on “sofas”, and instead of offsets, we’re working on “regional references”.

In the UIMA base type system, these annotations are represented with the Annotation and SofaReference classes. Basically each annotation has a pointer to the content it’s annotating. The SofaReference serves as that pointer and can be shared by annotations, which keeps things nice and factored out. Of course, merely pointing is far too sparse a notion of annotation to be useful in practice, so you would want to subclass it and supply the actual annotation data as well as the offsets.
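The pointer-plus-subclass arrangement might look something like the sketch below. The class names `Annotation` and `SofaReference` come from the base type system as described above, but everything else (the `sofa_id` field, the `TextSpan` subclass, the dictionary of sofas) is a hypothetical illustration in Python rather than the real UIMA object model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SofaReference:
    """Pointer to a subject of analysis; can be shared by many annotations."""
    sofa_id: str


@dataclass
class Annotation:
    """Bare annotation: nothing but a pointer to the content it annotates."""
    sofa: SofaReference


@dataclass
class TextSpan(Annotation):
    """Hypothetical subclass supplying offsets and some annotation data."""
    begin: int
    end: int
    label: str

    def covered_text(self, sofas):
        """Look up the sofa by its id and slice out the annotated region."""
        return sofas[self.sofa.sofa_id][self.begin:self.end]
```

Here character offsets play the role of the “regional reference”; an audio annotation would presumably subclass `Annotation` with timestamps instead.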

Putting it together: CAS

So far we’ve discussed the abstract building blocks for UIMA-style annotations, basically a small language of types (classes/features), a handful of primitive types, and an annotation/pointer generalisation of standoff annotation. These blocks can be assembled together to form a CAS (Common Analysis Structure), a sort of self-contained bundle consisting of

a document (called artifact in Uimaese, since it handles any kind of unstructured data, eg. video), perhaps represented as a URI?
annotations and metadata (UIMA considers annotations to be part of the document metadata)
the type system used by the annotations, perhaps represented as a link

There are a couple of details worth noting. First, keeping in mind the extreme generality of UIMA, these artifacts may not necessarily be single documents, but could also potentially be multiple objects (of different modalities even), for example perhaps an audio file and the text transcription it is paired with. At the end of the day, all we have is an object representing some content, and some annotations that refer to pieces of that object. Second, as a sort of general design pattern, UIMA suggests that annotations be grouped together into a set of views on their artifacts, which provide a particular interpretation or perspective on that artifact (for example, secret vs open annotations on a document, annotations on different language translations of it, or even just a subset of the annotations like “all the pronouns”). The notion of a view seems like a relatively minor point, but it’s useful to know in general that the pattern is there.

Overall, CASes are basically what UIMA components produce and consume; they have a standard XML representation based on the XMI specification. Perhaps given its generality and its reliance on pre-existing standards before it, it would be good to consider the UIMA format for storing CASes as an alternative to inventing a new annotation format? There would be a bit of overhead to pay, particularly finding or defining a type system to capture the annotations being used, but it sounds like this is work that would have to be done anyway.

Questions

I’m not entirely sure what lessons I can draw from UIMA for understanding how to do and represent annotations. I suppose the idea that offsets can be generalised into some unspecified notion of Reference is useful, and that being explicit about types makes sense.

Otherwise, I’m not entirely clear about what UIMA has to say (if anything) on issues related to how annotations are exploited/combined, and how annotations from different sources interoperate. The explicit type system has something to do with it. Presumably, the ability to publish your types and the desire for interoperability puts a little bit of pressure on folks doing similar kinds of annotation to converge on the same type systems.

I’d also be curious if UIMA has anything to say in particular about complex annotations, for example annotations on annotations (as with relations and schemas in Glozz), or hierarchical structures like parse trees. Searching around a bit, I found this paper on a fairly extensive looking UIMA type system for clinical NLP applications (note also its implementation in the cTAKES system), which seems to cover syntactic structure and provides a notion of relations. Here, annotation objects just straightforwardly point to other annotation objects in their slots. This seems very useful to build off and is the sort of thing that the UIMA model would allow for, but I do wonder how it fits with, for example, the idea that annotations should have a subject of analysis and a regional reference of some sort. Does the same apply to these annotations in a sensible way? Is the sofa of a parse tree constituent some piece of text, or some annotation, or something else?

Summary

What makes the UIMA annotation model distinct as far as I can tell is (1) an effort to be as general as possible, through (2) an explicit notion of an extensible type hierarchy for annotations which (3) is encoded/exchanged along with the annotations themselves. In other words, UIMA annotations come with blueprints.

Comparing UIMA with other annotation models, I find that the four pivot points I’ve been using so far are not as helpful given UIMA’s agnosticism about everything.

substrate : unspecified (can be anything)
typology : annotations are arbitrary objects with slots (features), constrained by a type system local to its CAS
spans : unspecified
features : typed (either a primitive or some reference type), can be associated with multiple values

The layering seems to be something along the lines of

CAS : document(s) (artifact), annotations, metadata
annotation: some features, reference to content
features : rich attribute-value pairs

Also, the documentation mentions a couple of case studies demonstrating the generality of UIMA. The UIMA and GATE folks have produced an interoperability layer for example, making it possible to transparently recycle a UIMA analytic as a GATE processing resource and vice-versa. The UIMA folks also did a similar experiment importing OpenNLP components, although with a bit less generality (each component had to have its wrapper defined separately).

Acknowledgements

Most of my understanding of UIMA comes from a combination of Ferrucci et al.’s 2006 Towards an Interoperability Standard… IBM Research Report, and a higher-level overview from Ide and Suderman’s 2009 Bridging the Gaps….

I believe ECore is a superset of the UIMA object model in that the former supports multiple inheritance↩

Feeling BLEU

2013-07-08T00:00:00Z

Feeling BLEU

Posted on 8 July 2013

Tags: evaluation

If you’re working in machine translation or natural language generation, you are likely familiar with a couple of automated metrics for evaluating the quality of their output, notably the BiLingual Evaluation Understudy (BLEU), and its NIST variant. To use these metrics, it can be helpful to learn what’s behind them, both the theory and maths behind them, and also the nitty gritty implementation details around their use.

The former is discussed in the papers, and to some extent in their Wikipedia articles. Details on the latter, however, seem to be somewhat lacking, so this post is written to fill the gap. We focus on one specific implementation ¹ of the script, version 13 of the NIST mt-eval script ². The information here is mostly gathered from reading the source code (plus a bit of Wikipedia and the papers for the theory), so there may be mistakes and misunderstandings on my part. This post may be less useful if you’re using a different implementation, although hopefully it at least provides a navigation guide to some of the concrete questions you find yourself asking down the line.

Rough idea

BLEU (n-gram precision with a twist)

BLEU is about n-gram precision at its very heart. We want to know the proportion of n-grams in the candidate text that also appear in a reference text. There are a couple of twists to account for:

clipped precision - to deal with overgeneration of common words, BLEU uses a “clipped” notion of precision that only accepts as many instances of a word as actually appear in some reference text. ³ If your text says “frog” 7 times, but that word only appears twice in the reference text, the clipped precision would be 2/7 instead of 7/7.
combined n-gram sizes - BLEU is about N-gram precision, but for what N? BLEU computes individual scores for a range of 1 to a maximum (by default 4), and computes the geometric mean of these via some logarithm magic (exp $ sum [ w * log s | s <- scores ])
brevity penalty - to game a precision metric, you might get away with just saying as little as possible so the little you say is right. For example, you might have a text that consists entirely of the word “the” and its precision would be 100%. Recall would be a way to capture these sorts of issues, but according to the paper is problematic when you have multiple reference texts (recall-everything-soup). Instead, BLEU uses a brevity penalty, which punishes you for having sentences that are shorter than the reference. The penalty allows for wiggle room, first if you have multiple reference texts by choosing the best matching text for each sentence; and second, by working over the corpus as a whole, rather than averaging over individual sentences. This way, your shorter sentences can effectively borrow a little length from their longer counterparts.

Basically: take n-gram precision, clip it, geometric-average it over different sizes of N-gram, and punish extreme brevity and you’ve got BLEU. The scores go from 0 to 1, as you might expect in a precision/recall type score.
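The recipe can be spelled out as a short Python sketch. This is an illustration written from the description above, not a transcription of the mt-eval script: the function names are hypothetical, equal weights are assumed for the geometric mean, and the reference length is taken from the reference closest in length to each candidate segment (implementations differ on this choice).

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """All n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def clipped_counts(candidate, references, n):
    """Return (clipped matches, number of candidate n-grams) for one segment."""
    cand = Counter(ngrams(candidate, n))
    max_ref = Counter()
    for ref in references:
        for gram, count in Counter(ngrams(ref, n)).items():
            max_ref[gram] = max(max_ref[gram], count)
    matches = sum(min(count, max_ref[gram]) for gram, count in cand.items())
    return matches, sum(cand.values())


def bleu(segments, max_n=4):
    """Corpus-level score; segments is a list of (candidate, [references])."""
    matches, totals = [0] * max_n, [0] * max_n
    cand_len = ref_len = 0
    for cand, refs in segments:
        cand_len += len(cand)
        # Assumption: use the reference closest in length (shorter one on ties).
        ref_len += min((abs(len(r) - len(cand)), len(r)) for r in refs)[1]
        for n in range(1, max_n + 1):
            m, t = clipped_counts(cand, refs, n)
            matches[n - 1] += m
            totals[n - 1] += t
    if min(matches) == 0:
        return 0.0          # the geometric mean collapses if any precision is zero
    log_mean = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    penalty = 1.0 if cand_len > ref_len else math.exp(1 - ref_len / cand_len)
    return penalty * math.exp(log_mean)
```

With the “frog” example, `clipped_counts(["frog"] * 7, [["the", "frog", "saw", "a", "frog"]], 1)` gives 2 matches out of 7 candidate unigrams.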

NIST

The NIST metric is a variation on BLEU which (according to the paper) provides better stability and reliability than its older counterpart. The scores look rather different (no longer from 0 to 1), and they use 5-grams by default (instead of 4), but otherwise keep in the same spirit as BLEU. The main differences in NIST are

n-gram information value - the main addition in the NIST metric is to give more weight to more informative (read infrequently occurring) n-grams, which apparently helps the metric to resist gaming. For a given N-gram the informativeness is based on the proportion of occurrences of the N-1 gram leading up to it and occurrences of the N-gram itself (so to compute the informativeness of “my yellow pet duck”, we count the number of times “my yellow pet” occurs, divide by the number of times the full n-gram occurs, and take the log)
changed brevity penalty - the penalty was tuned to “minimize the impact on the score of small variations in the length of the translations”, basically keeping the spirit of protecting the metric against gaming, but with less wobble.

If I understand correctly, the core idea of using combined clipped n-gram precisions is the same, the brevity penalty is a bit different, and the change in scale comes from an additional number multiplied in that gives you better scores for more of your matching n-grams being the ones that are rarer wrt their n-1 gram predecessors.
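The information value can be sketched as follows. The counts are taken over the reference segments, the logarithm is base 2 as in the NIST paper, and for unigrams the “N-1 gram” count is taken to be the total number of reference words. The function names are hypothetical and this is not the script’s own code.

```python
import math
from collections import Counter


def ngram_counts(reference_segments, max_n):
    """Count every n-gram (n = 1..max_n) across the tokenised reference segments."""
    counts = Counter()
    for tokens in reference_segments:
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                counts[tuple(tokens[i:i + n])] += 1
    return counts


def information(gram, counts, total_words):
    """log2(count of the leading (n-1)-gram / count of the full n-gram)."""
    prefix_count = counts[gram[:-1]] if len(gram) > 1 else total_words
    return math.log2(prefix_count / counts[gram])
```

If “my yellow pet” occurs twice in the references but “my yellow pet duck” only once, the 4-gram carries log2(2/1) = 1 bit of information; an n-gram that always follows its prefix carries none.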

Texts

Systems, documents, segments, n-grams

The script is used to evaluate the output generated by a set of systems. Systems can be people or software. Variants of some software (for example different parameterisations) can be treated as separate systems. When running these campaigns, I like to treat the reference texts as yet another system and feed it through the scoring pipeline.

System texts can be broken down into documents, which are further broken down into segments (eg. sentences), which in turn are broken down into n-grams.

Tokenisation and normalisation

Text is assumed to be in Unicode with the UTF-8 encoding.

Tokenisation seems to be done by splitting on whitespace. Prior to tokenisation the following pre-processing is done:

end-of-line hyphens are suppressed (line 7), eg in the word “representatives” below

We, therefore, the represen-
tatives of the United States
of America.
some notion of skipped tokens? (line 6 below), perhaps the idea is that you would replace the parts of text you want to skip with <skipped>
some SGML character entities converted to their ASCII equivalents (lines 9-12)
text is converted to lowercase. Note however that the default tokenisation only lowercases ASCII characters (line 16)
ASCII punctuation characters except for ',-. (apostrophe, comma, hyphen, and full stop) are tokenised. This is expressed as a series of ranges (line 17), which I’ve expanded by looking at the Wikipedia page on ASCII:
- {|}~ (curly braces, pipe, tilde)
- [\]^_` (square brackets, backslash, caret, underscore, backtick)
- ␠!"#$%& (space, bang, double quote, hash, dollar, percent, ampersand)
- ()*+ (brackets, star, plus)
- :;<=>?@ (colon, semicolon, angle brackets, equals, question mark, at)
- / (slash)
full stops and commas are tokenised unless they are surrounded by digits (lines 18-19), (NB: not 100% sure this is right)
dash is tokenised if it’s preceded by a digit (42- becomes 42 and -) (line 20)
whitespace compression (no leading/trailing space, at most one space between tokens, lines 21 to 23)

There is also an “international” tokenisation option which is off by default. I haven’t looked into it much, but I can say it uses the perl lc function for lower-casing, whereas the standard just dumbly folds A-Z to a-z.

sub tokenization
{
	my ($norm_text) = @_;

# language-independent part:
	$norm_text =~ s/<skipped>//g; # strip "skipped" tags
	$norm_text =~ s/-\n//g; # strip end-of-line hyphenation and join lines
	$norm_text =~ s/\n/ /g; # join lines
	$norm_text =~ s/&quot;/"/g;  # convert SGML tag for quote to "
	$norm_text =~ s/&amp;/&/g;   # convert SGML tag for ampersand to &
	$norm_text =~ s/&lt;/</g;    # convert SGML tag for less-than to <
	$norm_text =~ s/&gt;/>/g;    # convert SGML tag for greater-than to >

# language-dependent part (assuming Western languages):
	$norm_text = " $norm_text ";
	$norm_text =~ tr/[A-Z]/[a-z]/ unless $preserve_case;
	$norm_text =~ s/([\{-\~\[-\` -\&\(-\+\:-\@\/])/ $1 /g;   # tokenize punctuation
	$norm_text =~ s/([^0-9])([\.,])/$1 $2 /g; # tokenize period and comma unless preceded by a digit
	$norm_text =~ s/([\.,])([^0-9])/ $1 $2/g; # tokenize period and comma unless followed by a digit
	$norm_text =~ s/([0-9])(-)/$1 $2 /g; # tokenize dash when preceded by a digit
	$norm_text =~ s/\s+/ /g; # one space only between words
	$norm_text =~ s/^\s+//;  # no leading space
	$norm_text =~ s/\s+$//;  # no trailing space

	return $norm_text;
}
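
For readers who would rather not squint at Perl, the default tokenisation can be sketched in Python roughly as follows. This is only a transliteration under assumptions: the function name `tokenize` and the `preserve_case` keyword are labels chosen here (the latter mirroring the script's `$preserve_case` variable), and it has not been checked against the NIST script on real data.

```python
import re

def tokenize(text, preserve_case=False):
    """Approximate port of the default (non-international) tokenisation."""
    # language-independent part
    text = text.replace("<skipped>", "")   # strip "skipped" tags
    text = text.replace("-\n", "")         # undo end-of-line hyphenation
    text = text.replace("\n", " ")         # join lines
    for entity, char in [("&quot;", '"'), ("&amp;", "&"),
                         ("&lt;", "<"), ("&gt;", ">")]:
        text = text.replace(entity, char)

    # language-dependent part (assuming Western languages)
    text = " " + text + " "
    if not preserve_case:
        # fold A-Z only; non-ASCII letters are left untouched
        text = re.sub(r"[A-Z]", lambda m: m.group(0).lower(), text)
    text = re.sub(r"([{-~\[-` -&(-+:-@/])", r" \1 ", text)  # punctuation ranges
    text = re.sub(r"([^0-9])([.,])", r"\1 \2 ", text)  # . and , not preceded by a digit
    text = re.sub(r"([.,])([^0-9])", r" \1 \2", text)  # . and , not followed by a digit
    text = re.sub(r"([0-9])(-)", r"\1 \2 ", text)      # dash preceded by a digit
    return re.sub(r"\s+", " ", text).strip()           # whitespace compression
```

For instance, `tokenize("42-")` gives `42 -`, and numbers like `1,000.50` survive as a single token because their comma and full stop have digits on both sides.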

Scores

Blocks of text

The mt-eval script computes segment-level, document-level, and system-level scores (reporting system-level scores by default). If I understand correctly, the scores on larger units are not averages of their smaller counterparts, but aggregates. The core idea here is to compute the counts that go behind the scores separately for each sentence, but then lump them all together for the whole corpus to compute the score. So if you have a set of fractions, rather than taking the average, you compute something more like this:

 n_1 + n_2 + .. n_x
 ------------------
 d_1 + d_2 + .. d_x

Digging a bit into the script implementation, the script runs segment by segment, populating a map from N-gram sizes (ie. 1 to 4) to various counts (eg. number of matching n-grams, number of reference n-grams). The scoring algorithm takes these maps as inputs, and the difference between the segment-level, document-level and system-level scores is that the counts on the system level are the sums of the counts on the document level, and the counts on the document level are the sums of the counts on the segment level.
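
To make the difference between averaging and aggregating concrete, here is a small Python sketch. The counts are made up, and the helper name `pool` is hypothetical; it only illustrates the idea of summing per-segment count maps before dividing, not the script's actual data structures.

```python
from collections import Counter
from fractions import Fraction

def pool(units):
    """Sum per-unit count maps (n-gram size -> count) into one map."""
    total = Counter()
    for counts in units:
        total.update(counts)  # Counter.update adds counts rather than replacing them
    return total

# made-up unigram/bigram counts for two segments
matches = [{1: 1, 2: 0}, {1: 9, 2: 7}]    # matching n-grams
totals = [{1: 2, 2: 1}, {1: 10, 2: 9}]    # n-grams available

doc_matches, doc_totals = pool(matches), pool(totals)

# aggregate: divide the summed counts
pooled = {n: Fraction(doc_matches[n], doc_totals[n]) for n in doc_totals}

# average: divide per segment, then take the mean
averaged = {n: sum(Fraction(m[n], t[n]) for m, t in zip(matches, totals)) / len(matches)
            for n in doc_totals}
```

With these numbers the pooled unigram ratio is 10/12 whereas the average of the two segment ratios is 7/10: the longer segment weighs more in the aggregate.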

Missing segments

Missing segments will cause version 13 of the mt-eval script to crash (divide by zero). This is probably better than something silently happening behind the scenes that you’d have to dig through some documentation to find out about. There are a couple of ways to deal with this, for example, counting it as a zero (which makes me nervous). Our approach has been to score quality separately from coverage and basically omit the missing segments from each side. Note that this means generating a separate reference text for each system with potentially different segments missing.

Cumulative vs individual

The script distinguishes between a “cumulative” and an “individual” score. As far as I can tell, for a given N-gram length:

the individual score is the score based on N-grams of that length alone,
whereas the cumulative score (the one which is reported) is the score which also factors in the smaller N-grams.

Whereas the individual 4-gram score would just be based on counting occurrences of 4-token sequences, the cumulative score would also include trigrams, bigrams, and unigrams (it would be the geometric mean of these scores).
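
As a rough illustration of that geometric mean, here is a Python sketch. It assumes the individual scores are plain per-length precisions and leaves out everything else the script does (the brevity penalty in particular); the name `cumulative_score` and the numbers are made up.

```python
import math

def cumulative_score(individual):
    """Geometric mean of the individual scores for N-gram sizes 1..N."""
    if any(p == 0 for p in individual):
        return 0.0  # a single zero drags the whole mean down to zero
    return math.exp(sum(math.log(p) for p in individual) / len(individual))

# made-up individual scores for 1-grams up to 4-grams
individual = [0.8, 0.6, 0.4, 0.2]
cumulative = [cumulative_score(individual[:n]) for n in range(1, 5)]
```

The cumulative unigram score is just the individual unigram score, and each further entry folds in one more N-gram length.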

As an aside, there are at least two stances you can adopt to the use of evaluation software. First, that it is safer to use the NIST script as a sort of conservative default (this way you’re really using the same metric as everybody else), or second, that we’re better off seeing different implementations of the same metric in the wild, which may help flush out unspecified idiosyncrasies, bugs in one version, and so forth. Despite finding compelling the argument that we should resist letting a particular implementation become the de-facto definition of a standard, I have chosen the “conservative” default of going with the NIST implementation.
Mt-eval does not seem to have a homepage of its own, but seems to be inherited by evaluation campaigns from one year to the next, for example OpenMT 2009.
If you have more than one reference text, we take the max count for each text. In other words, the clipping becomes a bit more forgiving, or a little less pronounced, although presumably not by a whole lot… This seems to apply to other aspects of BLEU as a rule of thumb; when there are multiple reference texts, choose the text which gives the best score for this particular count.
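
The multiple-reference clipping in the last note can be sketched in Python as below. This is one reading of the rule (cap each candidate N-gram at the largest count found in any single reference), with hypothetical helper names; it is not lifted from the script.

```python
from collections import Counter

def ngrams(tokens, n):
    """Count the N-grams of size n in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def clipped_matches(hyp, refs, n):
    """Matching N-grams, each capped at its max count over the references."""
    max_ref = Counter()
    for ref in refs:
        for gram, count in ngrams(ref, n).items():
            max_ref[gram] = max(max_ref[gram], count)
    return sum(min(count, max_ref[gram]) for gram, count in ngrams(hyp, n).items())

hyp = "the the the cat".split()
one_ref = ["the cat sat".split()]
two_refs = one_ref + ["the cat and the dog".split()]
```

Here `clipped_matches(hyp, one_ref, 1)` is 2, but adding the second reference (which contains “the” twice) raises it to 3, which is the sense in which the clipping gets more forgiving.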

GATE annotation model

2013-06-12T00:00:00Z

GATE annotation model

Posted on 12 June 2013

Tags: stac, corpus, gate

About

To help build the educe package, I’m doing a small survey of annotation models used by tools I know about. I’ve done a model for Glozz so far, as it’s what we’re using as our annotation tool. But the work I’m doing involves trying to work with annotations from third party tools (for example, part of speech taggers), and to do this work properly, I’d like to make sure I’m asking the right questions about integrating such heterogeneous annotations. Yes, standoff annotation may well be the answer, but what exactly do we have in mind?

If you’re working with natural language processing pipelines, you’re likely familiar with at least one of GATE or UIMA. Both frameworks provide a model, an API, and a fancy helper GUI that allow you to assemble pipelines of arbitrary NLP components. Generality is the name of the game, the idea in both frameworks being that you should be able to suck into them any new tools that come along, and also to mix and match them at will. There also seems to be a fair amount of interoperability between the two frameworks, the two teams having cooperated to make it so that UIMA can run GATE components and vice-versa.

Since GATE appears to be the simpler (at least older) of the two frameworks, I’ll start by examining this one in a bit more detail. GATE was originally based on the TIPSTER architecture (Grishman 97), but its core model was slightly revised in version 2 (one span per annotation, annotations as graphs). It’s this later version of the model that I’ll describe below. Note that I’ll be using Haskell as a sketching tool as usual, but it should be possible to ignore it outright.

Features and feature maps

At the heart of GATE annotations is the notion of a feature map. Features in GATE are attribute-value pairs, with strings for the attributes, and arbitrary Java objects for values. GATE provides a map/dictionary interface to feature maps, so we can assume keys to be unique (as would be standard).

{-# LANGUAGE ExistentialQuantification #-}
import Data.Map (Map)

-- any Java object
data Java = forall a . Java a

type FeatureMap = Map String Java

Feature maps appear in at least three places, as data associated with:

individual annotations
documents as a whole
corpus of documents

Annotation DAGs

Building up from the notion of a feature is that of an annotation. The GATE documentation describes the annotations on a document as forming a directed acyclic graph. The nodes of the graph would be offsets into the document (between characters), and the edges would be annotations of all sorts, each annotation being labelled by an identifier, a type (eg. token, sentence), and a feature map adding additional information.

data Node = Node
    { nodeId     :: Int
    , nodeOffset :: Int -- between characters
    }

data Annotation = Annotation
    { annoId       :: Int
    , annoType     :: String -- eg. 'token', 'sentence'
    , annoStart    :: Node
    , annoEnd      :: Node
    , annoFeatures :: FeatureMap
    }

newtype AnnoSet = AnnoSet (Set Annotation)

For illustration, the graph below shows what the annotation DAG for the sentence “Cyndi savored the soup.” might look like. Note that the text of the sentence is not, properly speaking, part of the graph (standoff annotation); it is shown on the diagram only as a visual hint.
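
The same example can be spelt out in code. The Python sketch below mirrors the Haskell records above and builds token and sentence annotations over shared offset nodes. The crude regex tokeniser and the `string` feature are assumptions made for illustration, not something taken from the GATE API.

```python
import re
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Node:
    node_id: int
    offset: int  # position between characters

@dataclass
class Annotation:
    anno_id: int
    anno_type: str  # eg. 'token', 'sentence'
    start: Node
    end: Node
    features: dict = field(default_factory=dict)

def annotate(text):
    """Token and sentence annotations (edges) over shared offset nodes."""
    spans = [m.span() for m in re.finditer(r"\w+|[^\w\s]", text)]
    offsets = sorted({o for span in spans for o in span} | {0, len(text)})
    nodes = {o: Node(i, o) for i, o in enumerate(offsets)}
    annos = [Annotation(i, "token", nodes[s], nodes[e], {"string": text[s:e]})
             for i, (s, e) in enumerate(spans)]
    annos.append(Annotation(len(annos), "sentence", nodes[0], nodes[len(text)]))
    return nodes, annos

nodes, annos = annotate("Cyndi savored the soup.")
```

Every edge runs from a smaller offset to a larger one, which is what keeps the graph acyclic; the sentence annotation is simply a longer edge spanning the same nodes as the tokens.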

Documents and corpora

Now that we have a representation of features and annotations, the rest of the GATE model is a matter of packaging: a document associates text with annotations and metadata (eg. author), and a corpus is a set of documents plus additional metadata (eg. institution that harvested the corpus).

Looking a bit closer into the documents, it’s worth noting how annotations are not necessarily all lumped together but grouped into multiple annotation sets. I’m not sure what the implications of this are from a modelling standpoint, but it would make sense if we want to explicitly model the fact that annotations come from different sources. For example, you might associate each annotation set with metadata like the tool or annotator that produced it, or its annotation schema.

data Document = Document
    { docContent     :: Text
    , docFeatures    :: FeatureMap
    , docAnnotations :: NonEmptySet AnnoSet
    }

data Corpus = Corpus
    { corpusDocs     :: NonEmptySet Document
    , corpusFeatures :: FeatureMap
    }

-- we can imagine a variant of the Set data structures
-- that enforces there being at least one member
type NonEmptySet = Set

Summary

Taking a very broad view of the GATE model as I understand it, we can say that it makes the following implicit choices:

substrate : text/characters
typology : annotations on text (like Glozz units)
spans : contiguous, one per annotation
features : attribute-value dictionary; rich values

Drilling down a little bit, we can understand the model as having four layers:

corpus: set of documents, features
document: text, (set of) set of annotations, features
annotation: type, span, features
features: rich attribute-value pairs

Notes and comments

It’s worth noting the use of arbitrary value types on features (not necessarily atomic). I’d be curious to see what sorts of values GATE features tend to take in practice. So far, I’ve seen strings and lists of strings, but I wonder if it ever gets more sophisticated than that. Do you have trees? Maybe some sort of recursive feature structure scheme?

Also, there is as far as I can tell no explicit notion of annotations on other annotations; however, perhaps there is the possibility of encoding one in the rich feature structures? The GATE docs provide the example of using a constituents feature in which the annotation that corresponds to some parent node in a tree could point at the annotations for its daughters. Maybe using pointers in feature maps is good enough for our needs? (Also the notion of a text span may not be so useful on such annotations, but could maybe be worked around with a nonsense span, or by computing one from the leaves.)

Finally the GATE docs seem to say that their model is “largely compatible” (what does the largely mean? where is it not compatible?) with the Bird and Liberman 1999 annograph stuff, which I hope to look into somewhere along the line.
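
As a sketch of the “compute a span from the leaves” workaround, here is some Python over a toy annotation table. The dictionary layout and the helper name `span_from_leaves` are made up; only the idea of a `constituents` feature holding daughter identifiers comes from the GATE docs.

```python
def span_from_leaves(anno_id, annotations):
    """Span of a tree annotation, computed from the leaves it points at."""
    anno = annotations[anno_id]
    daughters = anno["features"].get("constituents", [])
    if not daughters:
        return anno["span"]  # a leaf carries its own span
    spans = [span_from_leaves(d, annotations) for d in daughters]
    return (min(start for start, _ in spans), max(end for _, end in spans))

# "Cyndi savored the soup." with two made-up tree annotations on top
annotations = {
    0: {"type": "token", "span": (0, 5), "features": {}},    # Cyndi
    1: {"type": "token", "span": (6, 13), "features": {}},   # savored
    2: {"type": "token", "span": (14, 17), "features": {}},  # the
    3: {"type": "token", "span": (18, 22), "features": {}},  # soup
    4: {"type": "NP", "features": {"constituents": [2, 3]}},
    5: {"type": "VP", "features": {"constituents": [1, 4]}},
}
```

Here the NP comes out as (14, 22) and the VP as (6, 22), without either tree annotation having to store a span of its own.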

Humility without bullshit

2013-05-16T00:00:00Z

Humility without bullshit

Posted on 16 May 2013

Tags: self, zen

WARNING This blog post contains American levels of earnestness. Consult your local for advice on suppressing the gag reflex

I think one thing I like to aspire to is a sort of Humility without the Bullshit. It’s hard, something you can catch yourself failing at over and over again, and failing in two different ways.

The first form of failure is the bullshit part. What I would consider to be True Humility would have no traces of “humility theatre”, the professions, gestures, and outward behaviours that serve only to say “Hey look at me, I am Humble™”. And that you have to be on the guard for. Shit, did I really mean that? For example, when I weigh in on the Zen Reddit with my “just a dude that goes to dojo” disclaimer, is that genuinely meant to avoid accidentally conveying authority, or was I merely trying to cover my ass with a little Humility Theatre?

The second form is I think more subtle, which is that what I would consider to be True Humility is something that manifests NOT in what you say, but in the very way that you go about things. Your every action, your every decision, your every assessment of every situation. I imagine this True Humility to be a sort of Pervasive Ever Vigilant Sense of Not Knowing, and for it to be real, it has to be unstated. No theatre, just choices. Say you are a programmer: what kinds of technical decisions do you make? How do you go about approaching a new skill or unknown language? How do you communicate your ideas to fellow programmers, or to your non-programmer colleagues? How sharp do you keep your awareness of the kinds of expertise they may have that you do not? How do you receive the criticism that is offered to you? True Humility is implicit in your every action.

And so you fail. You fall down over and over again. But hopefully with the aspiration in mind, you do better the next time. And when is this next time? Now. It’s a useful phrasing which I think comes out of contact with Zen communities: aspire to X. We do things without a discrete goal/endpoint in mind of some imagined better version of ourselves in some imagined future; yet we nevertheless aspire to X. In the Zen context, we might say that we aspire to be aware… which to me means that we strive to be awake and aware now, not as part of some hypothetical future enlightened self but as the person who is sitting on that cushion now, who is walking down the street now, who is designing an API for discourse structure processing now. I think the same sort of language of aspiration applies to other contexts. We aspire to be truly humble, we aspire to see things from others’ perspectives, we aspire to take better care of each other… Now.

It’s something that irritates me about Be Here Now language you might find in people conveying a sort of New Agey blend of generic vaguely Eastern spirituality. To my grumpy atheistic ears, this sort of language conveys a sort of vacant bliss… lalala not thinking about filing my taxes… lalala smell the flowers whoopsie daisy how carefree I am. Yeah shut up. Forgive my lack of humility here. As you can see, I know very little about Generic Vaguely Eastern New Age Spirituality (and yet! I permit myself to comment!). If my portrayal is right, fans of the Spiritual Curmudgeon could do well to take back the language of Here and Now. Here and Now isn’t some kind of spiritual high for you to enjoy, but also your place of responsibility. Here and Now, brush your teeth, make your bed, do what is needed. And Here and Now, approach the world with a little more humility.

logbook: Glozzy graphs

2013-04-05T00:00:00Z

logbook: Glozzy graphs

Posted on 5 April 2013

Tags: stac, glozz, logbook, python, educe

Worked in a few different areas this week:

polished off the job of outputting Glozz XML
intake pipeline
studied a couple of annotation models
started playing around with building a graph representation of annotations (scroll down for a picture)

Glozz XML

It turns out the trouble I was having (making Glozz crash) was basically due to the fact that I wasn’t associating my annotations with any metadata (author, creation date, etc).

Working on this problem has also given me a chance to think a little bit about how to deal with the problem of generating identifiers for automatically derived annotations in Glozz, which in turn forced me to get to know the Glozz annotation model a bit better. The basic story is that timestamps are not a great basis for creating identifiers, nor is a simple counter-based strategy sufficient to guarantee uniqueness within the corpus (see glozz model notes for thoughts why). With these thoughts about global uniqueness in mind, I’ve decided on the following identifier scheme for our particular corpus:

tool-name
document
subdocument
a counter

The counter is just some integer based on the number of existing annotations in the document. I take the biggest existing number and bump up a couple powers of ten for readability, then start counting.
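
In Python, the scheme might look something like the sketch below. The underscore separator, the function names, and the exact rounding rule for “a couple powers of ten” are all assumptions made for illustration rather than what educe actually does.

```python
def first_free_counter(existing, bump=2):
    """Start counting a couple of powers of ten above the biggest existing number."""
    biggest = max(existing, default=0)
    return 10 ** (len(str(biggest)) - 1 + bump)

def make_ids(tool, doc, subdoc, existing, how_many):
    """Identifiers of the form tool_document_subdocument_counter (hypothetical format)."""
    start = first_free_counter(existing)
    return ["%s_%s_%s_%d" % (tool, doc, subdoc, start + i) for i in range(how_many)]

# counters already used by annotations in the document (made up)
existing = [3, 12, 37]
new_ids = make_ids("segmenter", "doc1", "01", existing, 2)
```

With 37 as the biggest existing number, the new annotations start at 1000, so it is easy to tell at a glance which identifiers were generated automatically.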

Also, it appears that Glozz relies on the creation date of annotations (annotations that have the same creation date as some existing annotation simply don’t show up in the GUI). Other tools in the pipeline use the convention of negative numbers as timestamps (where timestamping is not relevant I suppose). I just use the same counter (the one for identifiers) to generate dates.

Intake pipeline

At the moment, preparing incoming chat files for annotations requires an interleaved mixture of scripts and manual work. I decided it would be a good idea to set some duct tape to the task by stringing together as many of the automatic bits as possible into a Bash script that we can run. This should allow for some extra repeatability and make it easier to fold in enhancements to our intake pipeline, for example if we find ways to automate some of the manual prep work.

This was a surprisingly useful thing for me to do, not so much for the pipeline in itself (although I think it’ll serve us) but for learning my way around our existing tools, around the Glozz model, and the annotation task in general.

Annotation models

So far I have looked at Glozz and also at the annotation graph stuff in A Formal Framework for Linguistic Annotation by Bird and Liberman (2000). Hope to have my notes for the latter out by next week some time.

Graphs

One of the bigger pieces of work I’ll be doing for educe is to provide a navigable higher-level representation of the annotated corpus. What educe currently gives is as low-level as we can get, a document being some text and sets of annotations that sit on top of these. There are a couple of more useful structures we can derive from this. One of these would be a representation of the discourse graph (probably not the right terminology).

The good news for me is that Somebody Else has developed a nice-looking library called python-graph. It seems to do hypergraphs, which will be useful for us when dealing with complex discourse units. One handy utility any graph library should come with (and which this one does) is an option to output to graphviz.

Here’s a sample graph showing a small chunk of our annotated conversation (with dot file too):


It’s worth noting that this is a plain old non-hyper-graph. A better representation would be for the CDU to be treated as a subgraph rather than a node pointing to other discourse units. Also minor note, in the current abstract graph, I’ve got relations being represented as nodes rather than edges (ie. you have edges going into and out of the relation). This is just to deal with annotation files that have relations pointing to relations (something Glozz allows). It’s there for visualisation, but for a more abstract representation, we’re going to want something a bit nicer.
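
The relations-as-nodes trick can be sketched without any graph library, by writing graphviz text directly. The Python below is illustrative only: it does not use the python-graph API, the input layout and names are made up, and labels are not escaped.

```python
def to_dot(units, relations):
    """Graphviz text for a graph in which relations are nodes, not edges.

    `relations` maps a relation id to (label, source id, target id);
    source and target may themselves be relation ids.
    """
    lines = ["digraph discourse {"]
    for uid, text in units.items():
        lines.append('  "%s" [label="%s", shape=box];' % (uid, text))
    for rid, (label, source, target) in relations.items():
        lines.append('  "%s" [label="%s", shape=ellipse];' % (rid, label))
        lines.append('  "%s" -> "%s";' % (source, rid))  # edge into the relation
        lines.append('  "%s" -> "%s";' % (rid, target))  # edge out of the relation
    lines.append("}")
    return "\n".join(lines)

units = {"u1": "anyone have wood?", "u2": "nope, sorry"}
relations = {
    "r1": ("Question-answer pair", "u1", "u2"),
    "r2": ("Comment", "u2", "r1"),  # a relation pointing at a relation
}
dot = to_dot(units, relations)
```

Because each relation is an ordinary node, the relation-on-relation case (`r2` pointing at `r1`) needs no special treatment; it is just one more pair of edges.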

Next week will most likely involve me exploring the hypergraph portion of the python-graph API (via the concrete task of getting gv subgraphs) and some more study of annotation models.

Glozz annotation model

2013-04-03T00:00:00Z

Glozz annotation model

Posted on 3 April 2013

Tags: stac, glozz, corpus

About

To help build the educe package, I’m doing a small survey of the annotation models used by other tools. The hope is to find something I can just reuse, saying “here, we use the Foo model” (and format), thereby saving work and gaining potential interoperability with bits of the research world.

Barring that, I at least want to build some awareness of what kinds of decisions I’ll be making (sometimes without knowing I’m making them). If I can make them on some principled grounds, it increases my chances of future-proofing the work, not painting myself into a corner with some fatally inflexible decision, or conversely creating a conceptual mess through insufficient constraints.

I’ll probably have to revise pieces of this survey as I go along, hopefully getting a clearer idea of what sorts of questions I should be asking myself about models.

I have four things in mind: Glozz, GATE, UIMA, and a paper by Bird and Liberman that NLTK mentions. Since we’re using Glozz as our annotation tool, I’ll start with that.

Structure

At the very bottom of the Glozz model, we have text (presumably a sequence of Unicode characters). On top of that, you have three kinds of annotation:

units: a contiguous span of text; overlapping, covering, and same-span units are OK
relations: a link between two annotations of any type (ie. unit, relations, schemas)
schemas: a link across an arbitrary set of annotations

It’s worth noting the flexibility you get by virtue of relations and schemas being able to point to annotations of any type, making it so that you can have eg. relations between schemas and units, relations and relations, etc. This sort of thing sounds like it might be useful in the context of SDRT (Glozz was developed in an SDRT annotation context, if I understand correctly), where you have graphs formed of relations on utterances; but also relations on graphs themselves. By default, I’ll assume flexibility (in a model) is a good thing and that it will be up to applications to impose sensible constraints appropriate to their needs.

Payload

Of course, if you’re annotating text you don’t just want to have connected bits of highlighted text, but the ability to say things about those bits.

All Glozz annotations (unit, relations, schemas) are labelled with a set of attribute-value pairs. Features are not recursive (so we can think of the values as just strings, for example). Attributes can be constrained to a particular set of values if wanted (eg. colour ∈ { red, blue, green }), but can also be free form.

One thing not discussed is whether an attribute may be associated with more than one value (e.g. indicating disjunction: (‘color’,‘red’), (‘color’,‘green’)). I’m guessing not since they’re called “features” (as in “feature structures”?) and also from some of the stuff we’re doing in practice.

Code sketch

This is a rough sketch in Haskell of what I think the Glozz annotation model looks like. I’m not that great at modelling, and hope that I’m not mixing up core issues with more implementation-level issues. For now I think it should work as a snapshot of my understanding:

-- | A document has some text and a mapping of identifiers
--   to annotations
data Document = Document Text (Map Guid Annotation)

-- | Annotations can be either units, relations, or schemas
--   All annotations are associated with some labels ('AnnoData')
data Annotation =
    Unit     TextSpan       AnnoData
  | Relation (Guid, Guid)   AnnoData
  | Schema   (Set Guid)     AnnoData

-- | Character offsets within a text
type TextSpan = (Int, Int)

-- | Every annotation has these in common
data AnnoData = Anno Guid FeatureSet Metadata

-- | The Glozz manual says that identifiers used are unique to
--   the world (more on this below)
type Guid = Text

-- | I'm assuming you can't have repeat attributes
type FeatureSet = Map Text Text

-- | Not really mentioned in the doc
newtype Metadata = Metadata FeatureSet

Identifiers

One detail that may be of interest is the notion of identifiers. By identifier, I think I mean a name, a tiny bit of data that allows us to distinguish bigger blobs of data from each other. In the Glozz model, all annotations are associated with a globally unique identifier. The manual provides Glozz’s approach to generating identifiers (which I trust are opaque, ie. not parsed). Glozz-generated IDs are the tuple of annotator id and Java timestamp (ms since 1970-01-01T00:00Z), eg. ymathet_1290167040405, with global uniqueness seemingly based on

centralised annotator names: the Glozz tool assumes a centralised procedure for handing out annotator names
millisecond resolution: human annotators don’t annotate faster than 1 ms at a time

This approach may be problematic from the standpoint of automatically derived annotations:

A central registry of recognised tool names would not be flexible enough in practice (would need to be amended every time you want to introduce a new tool, does not lend itself to small variants between say different parametrisations of a tool). Perhaps highly hierarchical naming conventions à la Java package would do the trick.
Automatically derived annotations can be made at sub-ms resolution.
It would be really nice if identifiers had a stability property, ie. that if you run the same tool on the same data, you should get the same exact results. Spurious trivial differences mean that you won’t be able to easily tell if two sets of data are the same, at least not with simple generic tools like diff.
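
The sub-millisecond problem is easy to see in a small Python sketch. The clock readings are made up, and `glozz_style_id` is a hypothetical helper that just follows the annotator-plus-timestamp shape of the ymathet_1290167040405 example.

```python
def glozz_style_id(author, millis):
    """Annotator name plus a millisecond timestamp, eg. ymathet_1290167040405."""
    return "%s_%d" % (author, millis)

def duplicates(ids):
    """Identifiers that occur more than once."""
    seen, dupes = set(), set()
    for ident in ids:
        (dupes if ident in seen else seen).add(ident)
    return dupes

# made-up clock readings for three annotations emitted in quick succession
readings = [1290167040405, 1290167040405, 1290167040406]
ids = [glozz_style_id("mytool", ms) for ms in readings]
clashes = duplicates(ids)
```

Two of the three annotations land in the same millisecond and end up with the same identifier. And since the readings depend on when the tool was run, a second run over the same data would produce different identifiers, which is the stability problem.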

Stepping back from this, I guess it’s important not to take the “globally” too literally, and that what you’re working with is an inherently restricted, task-based scope. You need only be as global as the set of objects that will ever co-exist. So within a single corpus with multiple documents, yeah you want the annotations within these documents to be unique to the corpus; but you don’t care if they are globally unique so long as you never merge with other corpora. For automatic tools, the trick is to have a mechanism for generating ids that avoids timestamps (I use a counter), but which also provides some assurance of corpus-wide uniqueness (I use the document name).

Summary

Recapping the Glozz model as I currently understand it:

substrate : text/characters
typology : anno = unit (text), relations (anno, anno), schema (anno*)
spans : contiguous, one per annotation
features : set of atomic attribute-value pairs; presumably unique attrs

logbook: spans and Glozz XML

2013-03-29T00:00:00Z

logbook: spans and Glozz XML

Posted on 29 March 2013

Tags: stac, glozz, logbook, python, educe

When I was working on my thesis, and to a lesser extent, when I was a postdoc, I had the habit of near-daily blogging as a way of thinking aloud. Now that I’m doing a bit more academic work, I thought I might see if I could build a similar habit, albeit on a less frequent basis.

Three bits of progress today:

NLTK `span_tokenize`

NLTK is a pretty handy thing.

I’m working on a toy discourse unit segmenter for our corpus, not so much because we’re interested in the segmentation per se (we are… but first things first). As an extremely crude first pass, I’m trying something fairly stupid, like “whatever NLTK considers to be a separate sentence, I will treat as a discourse unit”. Using the segmenter is easy enough; just call nltk.tokenize.sent_tokenize on a string and voilà, you have a list of sentences.

Great, you have a list of segments! Now try folding them back into your corpus…

Trouble arises because this segmentation output loses information, particularly the arbitrarily long bits of text (whitespace) between each sentence. We have the usual broader problem of wanting to integrate different layers of annotation in our text, be they human-authored or the output of third party tools like POS taggers. In short, standoff-annotation is our friend, but to be a good friend to standoff-annotation, we need to be able to grab text spans for all our annotations.

Luckily in the case of the NLTK segmenter, this is not the sort of information we have to reconstitute; it was there all along, but dropped in the easy bit of the API. I did a tiny bit of digging into the API and found that I could use the span_tokenize method. It’s a bit more fiddly; you have to set up the tokenizer yourself (gist):

import nltk.data

text = "Hello, I am a bit of corpus. Why don't you segment me?"
tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')

# span_tokenize yields (start, end) character offsets into text
for start, end in tokenizer.span_tokenize(text):
    print("%d\t%d\t%s" % (start, end, text[start:end]))

# 0	28	Hello, I am a bit of corpus.
# 29	54	Why don't you segment me?

And with spans at our disposal, standoff annotation is just a little bit of interval arithmetic away.
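
To give a flavour of that interval arithmetic, here is a small Python sketch that works on plain (start, end) pairs, so it does not need NLTK. The helper names `gaps` and `enclosed` are made up for illustration.

```python
def gaps(spans, length):
    """Stretches of text not covered by any span, eg. inter-sentence whitespace."""
    result, cursor = [], 0
    for start, end in sorted(spans):
        if start > cursor:
            result.append((cursor, start))
        cursor = max(cursor, end)
    if cursor < length:
        result.append((cursor, length))
    return result

def enclosed(outer, spans):
    """Spans that sit entirely inside `outer`."""
    return [(s, e) for s, e in spans if outer[0] <= s and e <= outer[1]]

text = "Hello, I am a bit of corpus. Why don't you segment me?"
sentences = [(0, 28), (29, 54)]  # the spans printed in the example
lost = gaps(sentences, len(text))
```

Here `lost` is `[(28, 29)]`, the single space that the plain list-of-sentences output throws away, and `enclosed` is the sort of check you would use to attach other annotations (tokens, say) to the sentence they fall in.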

Glozz

I managed to run the Glozz annotation platform on our corpus data, something anybody who’s been working on the project for a while should know how to do. Now that I know my way around the corpus a bit and am not scared off by the demand for a login (just click “anonymous”, silly), I do too.

Mac users may be interested in my Glozz Mac helper script. Just a silly thing. It might be good to patch their code to allow for better native UI integration, something I’ll get around to basically never.

Anyway, this is useful because it lets me check my work for the main bit of progress:

Glozz XML

I’ve extended my educe library to write Glozz XML files from its annotation tree data structures. The code isn’t very nice, but hopefully will improve as I grow more familiar with Python. I’ve verified that reading into educe data structures and writing back into XML is a round-trip. It’s diff-clean with some minor exceptions. It took extending the model, of course, to account for things like the metadata associated with each annotation.

Next week, I’m going to see if I can emit my NLTK-derived segments in Glozz XML. My first attempt crashed Glozz, so probably a good few things missing here. One of the problems I’m going to need to think about in particular is generating identifiers for things…

Glozz on MacOS X

2013-03-29T00:00:00Z

Glozz on MacOS X

Posted on 29 March 2013

Tags: stac, glozz, mac

Users of the Glozz annotation platform on MacOS X may prefer to launch it with this script, which tweaks the startup parameters to make Glozz behave a tiny bit more like a Mac app. In its current form, this means menu bar on top with “Glozz” rather than the package name as the application name. I’ve saved it as a gist on GitHub in case I make further modifications.

#!/bin/bash
pushd "$(dirname "$0")" > /dev/null
SCRIPT_DIR=$PWD
popd > /dev/null

java\
    -Dapple.laf.useScreenMenuBar=true\
    -Dcom.apple.mrj.application.apple.menu.about.name=Glozz\
    -jar "$SCRIPT_DIR"/glozz-platform.jar