I’m just back from a few weeks’ holiday visiting my parents’ place in Florida. Between playing with cute toddlers and hanging out with family and friends, I managed to do a bit of work on the surface realiser GenI.
The problem I wanted to address had been nagging me since my PhD: multiple adjunctions can cause a combinatorial search explosion and crash the realiser. This is actually a solved problem, both in the TAG parsing sense (since the 1980s) and in the NLG sense (other people have solutions for similar issues with their non-TAG realisers).
If I understand correctly, the solution at heart is to use some sort of packing strategy, which, by providing a compact representation of similar paths through the search space, has the effect of steering you away from unproductively similar routes. You just cover the key points and can unfold any variations on those key points at your leisure. Algorithms like CKY/Earley do this by making it so that, for the purposes of parsing, different intermediary paths to the same indices are treated as the same (if you’re not familiar with any of this, just think dynamic programming). Unfortunately, by the time I’d wrapped my head around the existing solutions, refactored GenI to allow different algorithms from its original (anything else would have been a total rewrite), and made a first bad implementation of Earley, I’d already moved on. Thesis defended, postdoc started, GenI on the back burner. Itch frustratingly unscratched.
And still… not… properly… scratched. This thing likely isn’t going away until I manage to sit down properly and bang out a version of GenI with packing built in. On the other hand, what I did manage to do over the holiday was to provide, hopefully, a bit of low-cost practical relief. We have three basic issues for the surface realiser to contend with: lexical explodiness (having to choose from multiple lexical items to realise part of the input), permutative explodiness (multiple orderings of these items), and combinational explodiness (multiple subsets of these items).
The two-penny optimisations I managed to implement assume a slight relaxation in requirements: you don’t want all the surface realisation results GenI could offer, just some result. They provide no relief from combinatorial explosion; they merely reorder the surface realiser’s exploration of the search space. If you use the --maxresults flag to cap the number of results returned by the realiser, these optimisations should hopefully lead you to those first N results sooner than it would otherwise [1]; but if not, it will still go kaboom.
Suppose you have some way to group chart items into a “path”. Each path represents a set of items which “fit together” to form a potential solution. All solutions must exist in some path or another (and a path may contain more than one solution), although some paths may be duds (which is also OK). If you are familiar with the polarity filtering optimisation in GenI, a path corresponds to a path through the polarity automaton.
Guided realisation is basically the idea of using paths to steer the realiser towards producing a solution as early as possible, by making it focus on one path at a time. If good paths are relatively few and we don’t have a way to sort them, we could in the worst case laboriously step through a bunch of bad paths before we focus on a good one (although note that paths can share items, so work isn’t lost, just delayed). If paths tend to contain solutions, then guided realisation should help things go a bit faster by focusing our attention in a more depth-first manner.
Implementation-wise, guided realisation is easy and non-invasive. Our chart generator already has a notion of an agenda, so all we have to do is impose an ordering on the paths (NB: we need to be able to enumerate the paths for this, so they must be known beforehand), and explore the agenda in that path-order. (New chart items that are produced and added to the agenda fit into this scheme just fine because they belong to whatever paths their parents did.)
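To make this a bit more concrete, here is a minimal sketch in Haskell. The types and names here (Item, itemPaths, pathRank) are invented for illustration and don’t reflect GenI’s actual internals; the point is just that ranking the agenda by lowest path index gives you the path-by-path exploration described above.

```haskell
import Data.List (sortBy)
import Data.Ord (comparing)
import qualified Data.IntSet as IntSet

-- Hypothetical chart item: each item records which paths it can
-- participate in (assumed to be a non-empty set of path indices).
data Item = Item
  { itemPaths :: IntSet.IntSet
  }

-- Rank an item by the lowest-numbered path it belongs to, so the
-- realiser exhausts path 0 before turning to path 1, and so on.
pathRank :: Item -> Int
pathRank = IntSet.findMin . itemPaths

-- Explore the agenda in path order.
sortAgenda :: [Item] -> [Item]
sortAgenda = sortBy (comparing pathRank)

-- New items inherit their paths from their parents: an item built by
-- combining two others can only belong to the paths both share.
combinePaths :: Item -> Item -> IntSet.IntSet
combinePaths a b = itemPaths a `IntSet.intersection` itemPaths b
```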
Guided realisation can be used to save time by guiding us to a solution, paths permitting. We can also use the same information to save a little bit of space. Chart items that correspond only to already-explored paths don’t contribute anything towards our quest for a(nother) solution, so we can just delete them. Doing so will at least cap any growth in memory that comes from chart items unique to a path. Note that if there is a lot of overlap among paths, this will have little effect, as the shared items would not be deleted. I call this safe pruning because we are only deleting items that we already know have no more contribution to make whatsoever.
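Continuing the sketch above (same invented Item type, not GenI’s real one), safe pruning is then just a filter: an item is deletable once every path it belongs to has been explored.

```haskell
import qualified Data.IntSet as IntSet

-- Same hypothetical chart item type as in the previous sketch.
data Item = Item { itemPaths :: IntSet.IntSet }

-- Safe pruning: drop any chart item all of whose paths have already
-- been explored; it can no longer contribute to a new solution.
-- Items shared with a still-open path survive, which is why heavy
-- overlap among paths limits the memory savings.
pruneSafe :: IntSet.IntSet  -- paths already fully explored
          -> [Item] -> [Item]
pruneSafe explored = filter stillUseful
  where
    stillUseful item =
      not (itemPaths item `IntSet.isSubsetOf` explored)
```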
Nothing earth shattering here, but hopefully something to keep a little bit of a lid on the memory growth.
Guided realisation (plus safe pruning) can be helpful if your adjunction pain is partly lexical. If on the other hand the pain comes from a permutation/combination issue (such as intersective modifiers), they are of no use. I have an example in the GenI distribution of a test case that blows up with just multiple instances of the same auxiliary tree (covering different parts of the semantics) given free rein in the order of adjunction. What kills us here is a proliferation both of valid solutions (based on different permutations of the items) and of intermediary chart items along the way.
(An interesting variation on this, one which also introduces lots of spurious explodiness — a proliferation of dud chart items — is when you have multiple adjunctions… to multiple sites, ouch!)
What kills us here seems to be the possibility of exploring basically the same thing over and over again in different orders, generating lots of chart items and blowing up our memory usage along the way. So why not steer the realiser along just one of these permutations at a time? This is what novelty sorting tries to accomplish. First, we sort the agenda according to the completeness of each item’s semantics. The more literals you cover, the sooner we want to look at you. If there is a tie, we also sort by the number of times we’ve seen an item bearing that semantics. We want the biggest, newest semantics first, hopefully items which represent a path away from boring permutations of the same thing.
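Here is roughly what that comparison might look like, again with invented names (Semantics as a plain list of literals, a seen-count map) rather than GenI’s actual machinery:

```haskell
import Data.List (sortBy)
import Data.Ord (Down (..), comparing)
import qualified Data.Map as Map

-- Stand-in for GenI's semantics: a list of literals.
type Semantics = [String]

data Item = Item { itemSem :: Semantics }

-- Novelty sorting: rank agenda items first by how much of the input
-- semantics they cover (bigger first, hence Down), breaking ties by
-- how often we have already seen that semantics (least seen first).
noveltySort :: Map.Map Semantics Int  -- times each semantics was seen
            -> [Item] -> [Item]
noveltySort seen = sortBy (comparing key)
  where
    key item =
      ( Down (length (itemSem item))                -- most literals first
      , Map.findWithDefault 0 (itemSem item) seen   -- newest first
      )
```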
There may be cases where this sort of thing might go horribly wrong. Maybe it’s possible to have lots of really promising long partial solutions that end up being dead ends, and by steering things towards more complete solutions first we may just have hill-climbed our way to nowhere. But I’m hoping that in practice the length + novelty sorting is the sort of thing that will break you out of a permutational bind.
That’s all I have, perhaps until my next holiday. Maybe next time I’ll actually get to write that packed algorithm I’ve always been hoping for. I’m thinking maybe we can define the equivalence classes according to the list of remaining adjunction-capable nodes (the idea being to pack different paths that get to that same list…), or maybe I’ll just roll up my sleeves and push some Earley dots around. But I hope these lightweight additions to GenI will make life a bit nicer for users in the interim.
[1] On the other hand, it could perhaps cause the realiser to explore dead ends sooner in a pathological case :-(