\documentclass[11pt,nofrench]{thloria}
%\documentclass{foils}

%\usepackage{koweydoc}
\usepackage{url}
\usepackage{graphicx}
\usepackage{algorithm}
\usepackage{algpseudocode}
\usepackage{color}

\usepackage[latin1]{inputenc}
%\usepackage{koweyslide}
\newcommand{\jargon}{\textbf}
\newcommand{\tuple}[1]{\langle #1 \rangle}
\newcommand{\set}[1]{\{#1\}}
\newcommand{\semexpr}{\texttt}
\newcommand{\semexprset}[1]{\semexpr{\set{#1}}}
\newcommand{\natlang}{\textit}
\newcommand{\tautree}[1] {$\tau_{#1}$}
\newcommand{\koweytree}{\texttt}
\newcommand{\FIXME}[1]{\textbf{\textit{(FIXME:#1)}}}
\newcommand{\CallFunction}[2]{\textproc{#1}({#2})}
\newcommand{\PolAutomaton}{\tuple{States,InitialState,FinalStates,Transitions}}
\setcounter{secnumdepth}{3}

% we use a more standard notation for comments
\renewcommand\algorithmiccomment[1]{\hfill\(\#\) #1}%

\begin{document}
\ThesisTitle{Optimising a Surface Realiser}
\ThesisAuthor{Eric Kow}
\ThesisNancyI
\ThesisDate{22 juin 2004}

\def\blanc{\hspace*{1cm}}
\NewJuryCategory{Encadrants}{\it Encadrante:}
                            {\it Encadrants:}

\Encadrants  = {C. Gardent}
\Rapporteurs  = {N. Carbonell\\
                 D. Méry\\
                 D. Galmiche\\
                 J.Y. Marion\\
                 J. Souquières}
%
\MakeThesisTitlePage

\tableofcontents

\begin{ThesisAcknowledgments}
I have had an immense amount of help on this DEA.
\\

\noindent
For starters, my supervisor Claire Gardent has gone out of her way
for this little first step of a thesis.  Working with Claire has been a
fun and instructive experience, as she has this uncanny ability to
always ask the right questions.  This results in an endless supply of
good ideas that are invariably of the simple and practical variety.  
\\

\noindent
Carlos Areces has also contributed his fair share of elegant ideas.  I
would like to thank him as well for help with Haskell, the surface
realiser GenI, and a million essential concepts. Carlos speaks with such
pedagogical clarity that it almost seems like a bad habit; I wonder if
he would even be capable of explaining anything poorly if he wanted to.
\\

\noindent
Many many thanks to Patrick Blackburn and Hélène Manuelian and 
especially Claire and Carlos for reading the many awkward drafts of 
this thesis and red-penning its way into passability.  I hope to 
someday master the basic respect and consideration they have for
their reader.
\\

\noindent
The good strangers who frequent the channel \#haskell on
irc.freenode.net deserve at least a grateful nod for their kindness to
newbies.  
\\

\noindent
Finally, the following people have helped in a more lateral fashion:
Hélène with her fine Franglais-French translation of my DEA letter of
motivation; office mates Yannick and Joseph for puttting up with much
whining, and the occasional, inexpliquable ``moo!''.  Thank-you all for
your friendship.  And now I restrain myself from tearful speeches of
friends and family, and hope to do these people justice at a later time.
\end{ThesisAcknowledgments}

\mainmatter

% -------------------------------------------------------------------------
\chapter{Introduction}
% -------------------------------------------------------------------------

The surface realisation task consists in using a grammar to produce the
sentence(s) associated by this grammar with some input meaning
representation.  The meaning representation is written using some
logical language. For example, a surface realiser might receive an
\jargon{input semantics} of \semexprset{loves(l,john,mary)}, for which
it outputs the \jargon{surface realisation} \natlang{John loves Mary}.  

Surface realisation, also referred to as \jargon{tactical generation},
is similar to the better known task of natural language parsing.  When
doing generation, we construct syntactic trees which are associated with
the input semantics by the language's grammar, and trivially read the
tree leaves to reconstruct the output string. Polynomial-time algorithms
for parsing have long been developed, so it would be tempting to assert
that generation is parsing in reverse and simply reuse these algorithms.
But \cite{koller02gen} demonstrates that the Hamiltonian Path problem
can be reduced to a surface realisation task and so the problem is
NP-complete.  The intuitions for this can be found in \cite{kay96chart}.
Both parsing and generation have the potential to be computationally
expensive because natural language has a great deal of lexical
ambiguity; however, parsers have the luxury of string positions.  Using
string positions, one can pack an arbitrary number of alternative
structures into a small and fixed number of edges, so that parsing can
be done in cubic time no matter what the ambiguity. The
situation for surface realisers is more delicate because the input
semantics have no natural order, and so there are as many potential
edges as there are subsets of literals. This number is exponential with
respect to input length, and as a result, we are forced to deal with the
full effect of lexical ambiguity.

This thesis explores the problem of optimisation for a TAG-based surface
realiser.  We begin in this chapter with the essentials of surface
realisation and continue in section \ref{sec:geni} with a description of
the realiser developed within the INRIA ARC GenI by Carlos Areces and
used here as an implementation platform \cite{areces03geni}.  In chapter
\ref{chp:optimisations}, we propose a set of optimisations, implement
them into the GenI realiser, and discuss their effectiveness.  We
conclude in chapter \ref{chp:conclusion} with proposals for further
optimisation and pointers for further research.

\section{Background}
\label{sec:background}

Building a surface realiser requires three ingredients: a semantic
representation, a formal grammar for natural language syntax augmented
with a semantic dimension, and an algorithm for building syntactic
structures. The GenI generator uses respectively a flat semantics, the
Tree Adjoining Grammar (TAG) formalism, and a bottom-up/head-driven
algorithm.

\subsection{Flat semantics}
\label{sec:flat_semantics}

The semantics of a natural language expression can be approximated by a
logical form.  In this thesis, we write the semantics as a conjunction
of \jargon{literals}, where each literal is a predicate with some
arguments.  The arguments may not themselves be literals, so there is no
recursion, and we call this a \jargon{flat semantic representation}
\cite{copestake95mrs}. For example, to represent the meaning of the
sentence \natlang{John loves Mary intensely} we could write
\semexprset{loves(l, j, m), intense(l), name(j, john), name(m, mary)},
meaning that there is an act of loving \semexpr{l} between the
individuals named \natlang{John} and \semexpr{Mary} and that this act is
intense.  For a more detailed introduction to flat semantics, we refer
the reader to \cite{copestake95mrs}.

The various approaches to surface realisation work by successively
combining expressions which verbalise different parts of the input
semantics.  This aspect is particularly well served by a flat
representation.  In particular, note that in order to compare recursive
semantic representations, something like higher order unification is
required, whereas by using a flat semantics, this comparison can be
carried out using set intersection. Similarly, we can combine two
semantic expressions by taking their union.

\subsection{Tree-Adjoining Grammar}
\label{sec:tag}

Tree-Adjoining Grammar (TAG) is a formalism that generates the ``mildly
context-sensitive'' class of languages.  As defined in
\cite{joshi97tag}, a tree-adjoining grammar consists of a quintuple
$\tuple{\Sigma,NT,I,A,S}$ where 

\begin{enumerate}
\item $\Sigma$ is a finite set of terminal symbols.
\item $NT$ is a finite set of non-terminal symbols: $\Sigma \cap NT =
\emptyset$.
\item $S$ is a distinguished non-terminal symbol: $S \in NT$ 
\item $I$ is a finite set of finite trees, called \jargon{initial
trees} where
\begin{itemize}
\item interior nodes are labeled by non-terminal symbols;
\item frontier nodes are labeled by terminals or non-terminals;
      non-terminal symbols on the frontier are called 
      \jargon{substitution sites} and are marked for substitution,
      by convention, annotated with a down arrow ($\downarrow$);
\end{itemize}
\item $A$ is a finite set of finite trees, called \jargon{auxiliary
trees} where
\begin{itemize}
\item interior nodes are labeled by non-terminal symbols;
\item frontier nodes are labeled by terminal or non-terminal symbols;
      non-terminal symbols on the frontier are marked for substitution 
      except for one node, called the \jargon{foot node}, by convention,
      annotated with an asterisk (*); the label of the foot node must
      be identical to the label of the root node.
\end{itemize}
\end{enumerate}

The trees in $I \cup A$ are called \jargon{elementary trees} and 
describe the syntactic structure of the basic components of
a language, namely words or collocations.  These trees can be
composed pairwise to build more complex structures, called
\jargon{derived trees}, through the operations of substitution and 
adjunction.  
 
The \jargon{substitution} operation (figure \ref{fig:substitution})
replaces one substitution site of one tree by the tree to be
substituted.  The tree to be substituted must be derived from an initial
tree. When a tree does not have substitution sites, we say that it is
\jargon{closed}.

\begin{figure}
\begin{center}
\includegraphics[scale=0.65]{images/substitution.pdf}
\caption{Substitution}
\label{fig:substitution}
\end{center}
\end{figure}

The \jargon{adjunction} operation can be understood as splicing
an auxiliary tree into another tree (which can be of any type, initial,
auxiliary or derived.)  Let $\alpha$ be a tree containing a
non-substitution node labeled by $X$, and $\beta$ be an auxiliary tree
whose root node is also labeled by $X$.  Adjoining $\beta$ into $\alpha$
is built by (1) excising the sub-tree of $\alpha$ dominated by $n$ (call
it $t$) (2) replacing the foot node of $\beta$ with $t$ to produce an
intermediary structure $\beta'$ and (3) replacing the excised tree in
$\alpha$ with the augmented auxiliary tree $\beta'$.  Nodes on which
adjunction may be performed are called \jargon{adjunction sites}.

\begin{figure}
\begin{center}
\includegraphics[scale=0.65]{images/adjunction.pdf}
\caption{Adjunction}
\label{fig:adjunction}
\end{center}
\end{figure}

\subsubsection{TAG with flat semantics}
\label{sec:tag_semantics}

We augment the TAG formalism by associating each tree in the grammar
with a semantics using the flat representation.  Whenever two trees are
combined (via substitution or adjunction), the semantics of the resulting
tree is calculated by taking the set union of the source tree semantics.
In figure \ref{fig:semantics}, we show the result of the substitution
and adjunction from figures \ref{fig:substitution} and
\ref{fig:adjunction} once semantics is taken into account.

\begin{figure}
\begin{center}
\includegraphics[scale=0.65]{images/semantics.pdf}
\caption{TAG with flat semantics}
\label{fig:semantics}
\end{center}
\end{figure}

Note that it is not realistically feasible to implement the
syntax-semantics interface without introducing a notion of variable
unification.  Without such a mechanism, we would have to create a
separate entry for every possible combination of tree semantic indices,
one for \semexpr{smells(s,j)}, for \semexpr{smells(s,k)}, for
\semexpr{smells(s,l)} and so forth.  Realistic grammars have tree
semantic literals where the arguments are not necessarily constants,
but variables, which simultaneously occur in the nodes of the tree.
Before they are used for generation, trees have to be
\jargon{instantiated}, meaning that unification must be
performed between these variables and the arguments contained in the
input semantics representation. Variable unification on node variables
also occurs during the substitution and adjunction operations, which
prevents incorrect realisations like \natlang{John hates Mary} when the
intended output was \natlang{Mary hates John}.  

Consider a grammar which has a tree \natlang{smells} with the semantics
\semexpr{smells(V,X)} (we indicate variables with capital letters).
Given the input semantics \semexprset{smells(s,j), name(j,john),
funny(s)}, we must first instantiate \natlang{smells} by unifying
\semexpr{V} with \semexpr{s} and \semexpr{X} with \semexpr{j}.  If the
generator were to use a different input semantics, like
\semexprset{smells(s,m), name(m,mary), nice(s)}, then \natlang{smells}
would be instantiated differently.

\subsection{Generation algorithms}
\label{sec:generation_algorithms}

We can see generation as the process of constructing a syntactic tree
\cite{shieber90shd}.  There are two basic approaches to this process,
building the tree from the top-down, or from the bottom-up.  There is
also a hybrid head-driven approach which combines the two basic
strategies.  

These approaches were originally developed in the framework of context
free grammars with recursive semantics.  For consistency with this
thesis, we will maintain a TAG perspective: context free rules are
viewed as shallow TAG trees, and non-terminal child nodes are
substitution sites.  The approaches also assume a recursive semantics,
but rather than risking errors by converting them to a flat
semantics, we will momentarily adopt a recursive semantics.  The
relationship between syntax and semantics is somewhat different here.
Instead of using the union operation to combine tree semantics, we allow
the use of complex variables and let unification do the work.  Grammars
in this section are presented as Prolog DCGs with the lexicon omitted.

\subsubsection{Top-down generation}
\label{sec:td_generation}

The top-down approach consists in working from the root to the leaves.
We select any tree from the grammar with root category \koweytree{s}
(for sentences) and perform unification between its and the input
semantics.  Then we loop around; if the tree has any substitution sites,
we select a tree with the same root as the site, perform substitution
(and unification).  Hopefully doing this closes off all the substitution
sites, but it may very well not, and worse, it may introduce its own
share of sites.  The loop continues until the tree is closed.

\paragraph{Non-termination} This approach can fail to terminate if it is
applied on the wrong grammar, namely, a left-recursive one.  For
instance, the following grammar should be able to produce \natlang{John
smells funny} as a realisation for the input semantics
\semexpr{funny(smell(john))}.  \footnote{In this grammar, categories
encode both syntactic and semantic information.  Specifically, a term of
the form \texttt{Synt(Sem)} denotes a category with syntactic category 
\texttt{Synt} and semantic information \texttt{Sem}.  Shared variables
are used to adequately bind semantic arguments.}

\begin{verbatim} 
s(S)                 --> np(NP), vp(NP,S).     % t1
vp(NP,S)             --> vp(NP,VP), adv(VP,S). % t2
vp(NP,S)             --> v(NP,S).              % t3
v(NP,smell(NP))      --> smells.               % t4 
adv(VP,funny(VP))    --> funny.                % t5
adv(VP,nice(VP))     --> nice.                 % t6
np(john)             --> john.                 % t7
\end{verbatim}

But top-down generation on this grammar might not terminate:

\begin{enumerate}
\item We begin with t1.  Ignoring semantic information, execution of t1
      yields a tree \koweytree{s(np$\downarrow$, vp$\downarrow$)} where,
      as noted above, the rule is represented by a local tree and the
      non-terminals by substitution sites labeled with the rule's
      right hand side syntactic categories.
\item We substitute t7 to produce \koweytree{s(np(john), vp$\downarrow$)}
\item The derived tree still has a substitution site to fill, this
      being \koweytree{vp$\downarrow$}, so we process t2.  
\item t2 introduces its own share of substitution sites.  If we attempt
      to fill the first one (\koweytree{vp$\downarrow$}) by substituting t2
      into it, this brings us back to the same problem and traps us
      in an infinite loop.
\end{enumerate}

\subsubsection{Bottom-up generation}
\label{sec:bu_generation}
\label{sec:intersective_modifiers}

Bottom-up generation builds trees from the lexical items to the root.
First, the \jargon{lexical selection step} selects from the grammar
the trees whose semantics subsumes part of the input semantics, and 
places them in a pool. Then the generator loops: it arbitrarily selects
a pair of trees from the pool and attempts substitution between the two
trees.  If the resulting derived tree is closed and if its semantics
matches the input semantics, then it is added to the output.  Otherwise,
it is put into the pool for future consideration. The loop continues in
this fashion until there are no new pairs to select.

Given the grammar from the previous section (\ref{sec:td_generation}),
a bottom-up generator would find the surface realisation \natlang{John
smells funny} from the input semantics \semexpr{funny(smell(john))}
as follows:

\begin{enumerate}
\item Every tree in the grammar except t5 is selected, resulting in
      a pool that consists of t1 \koweytree{s(np$\downarrow$,
      vp$\downarrow$)}, t2 \koweytree{vp(vp$\downarrow$, adv$\downarrow$)},
      t3 \koweytree{vp(v$\downarrow$)}, 
      t4 \koweytree{v(smells)}, 
      t5 \koweytree{adv(funny)} and t7 \koweytree{np(john)}. 
\item t4 is substituted into t3 and the resulting
      \koweytree{vp(v(smells))} is added to the pool.
\item This is substituted into t2, and the resulting
      \koweytree{vp(vp(v(smells)), adv$\downarrow$)} is added to the
      pool.
\item t5 is then substituted into this, and the resulting
      \koweytree{vp(vp(v(smells)), adv(funny))} is added to the pool.  
\item This is substituted into t1 to produce
      \koweytree{s(np$\downarrow$, vp(vp(v(smells)), adv(funny)))}
\item t7 is substituted into this and the resulting
      \koweytree{s(np(John), vp(vp(v(smells)), adv(funny)))}
      is added to the output. 
\item This tree is syntactically complete and matches the input
      semantics. If we read the leaves of this tree, we extract the
      output sentence \natlang{John smells funny} 
\end{enumerate}

\paragraph{Termination and non-determinism}

A useful constraint for bottom-up algorithms is to only allow trees to
be combined if they do not have any lexically selected trees in common.
This prevents infinitely long strings like \natlang{smells funny funny
funny...} from being generated.  

Once this filter is in place, bottom-up algorithms are guaranteed to
terminate, because the lexical selection step gives them a finite number
of trees to start with.  We can think of the algorithm loop as always
consuming a lexically selected tree until it eventually runs out and
has to stop.

This guarantee of termination comes at the price of highly
non-deterministic behaviour.  The generation loop selects at every
iteration, a pairs of trees to be combined.  If these choices do not
lead to a result, then the generator must make a new set of selections,
either by backtracking, or by having simultaneously considering the
different choices that it can make.

\paragraph{Semantic monotonicity}

One potential disadvantage of this approach is that it requires grammars
to be semantically monotonic.  A \jargon{semantically monotonic} grammar
only has rules in which the semantics of the daughter nodes is contained
in the semantics of the mother node.  For example, any grammar which
includes the following fragment is not semantically monotonic:

\begin{verbatim}
vp(X,Y,callup(X,Y)) --> v(X,Y,callup(X,Y)), np(Y), pp(up). 
v(X,Y,callup(X,Y))  --> call.
pp(up)              --> up.
\end{verbatim}

The intention behind these rules is to generate sentences like
\natlang{John calls Mary up} from the input semantics
\semexpr{callup(john,mary)}.  A top-down generator would simply
instantiate the rule, ignoring the semantics for \natlang{up}.  But a
bottom-up generator could never use this rule, because \verb$up$ is not
in the input semantics, so that there is no reason to select the 
\verb$pp(up)$ rule.

\paragraph{Intersective modifiers} 

Bottom-up generators are also affected with the non-determinism
triggered by \jargon{intersective modifiers}.  Consider the semantics
\semexpr{dirty(smelly(unpleasant(man(x))))}, whose predicates all modify
the individual \semexpr{x}.  This should be realised with the following
grammar as the fragment \natlang{dirty smelly unpleasant man}.  

\begin{verbatim}
n(N,dirty(N))       --> dirty, n(N). 
n(N,smelly(N))      --> smelly, n(N). 
n(N,unpleasant(N))  --> unpleasant, n(N). 
n(man)              --> man.
\end{verbatim}

To get the desired result, however, the generator will try out each and
every one of the following combinations:

\begin{quotation}
 \noindent
 \natlang{dirty smelly unpleasant man},\\
 \natlang{dirty smelly man}, 
 \natlang{dirty unpleasant man}, 
 \natlang{smelly unpleasant man}, \\
 \natlang{dirty man}, 
 \natlang{smelly man},
 \natlang{unpleasant man}, \\
 \natlang{man}
\end{quotation}

That is, it creates $n!$ possible sequences (where $n$ is the number of
modifiers) just to find a single correct answer.  We could even be
generous and impose a sort of word order constraint, so that
\natlang{smelly dirty man} would not be allowed, whereas \natlang{dirty
smelly man} would.  In this case, the number of possible substrings
drops to $2^n$.

\subsubsection{Head-driven generation}

Head-driven generators are based on the observation that every syntactic
tree necessarily has a \jargon{semantic head}, the lowest possible node 
whose semantics is that of the tree.  If we revisit the example in
section \ref{sec:td_generation} (repeated below), we can identify the
verb \natlang{smells} as the semantic head for the verb phrase
\natlang{smells funny} and consequently for the sentence \natlang{John
smells funny}.  The semantic head can be used to guide the generation
process.  In \cite{shieber90shd}, the grammar is preprocessed so that
trees are divided among \jargon{chain rules} whose semantic head is a
child node, and \jargon{non-chain rules} whose semantic head is its root
node. In the example, only t1, t2 and t3 are chain rules.  

\begin{verbatim} 
s(S)                 --> np(NP), vp(NP,S).     % t1
vp(NP,S)             --> vp(NP,VP), adv(VP,S). % t2
vp(NP,S)             --> v(NP,S).              % t3
v(NP,smell(NP))      --> smells.               % t4 
adv(VP,funny(VP))    --> funny.                % t5
adv(VP,nice(VP))     --> nice.                 % t6
np(john)             --> john.                 % t7
\end{verbatim}
% Note: we include t3 and t4 because we need to show how the bu
% part of the head-driven generator works when it encounters a
% chain rule in the middle... or something like that.

The generator works with what is called a \jargon{target node}.  The
first target node has category \koweytree{s} (for sentence) and the
semantics equal to the input semantics. The basic approach is to use
top-down processing of the non-chain rules to identify the semantic head
of the target and use bottom-up processing of the chain rules to connect
the two nodes. Any new substitution sites encountered along the way are
recursively processed as targets.  We can think of the generator as
cycling between three behaviours: leaping (top-down), connecting
(bottom-up) and recursion (top-down).  When tracing the algorithm, it is
useful to imagine a stack of target nodes.  The algorithm does not
actually use such a stack because it is implicit in the recursive
behaviour.

Given the example grammar above and the input semantics
\semexpr{funny(smell(john))}:

\begin{enumerate}
\item We push the desired result \semexpr{s(funny(smell(john)))} on
      the stack.
\item (td) We leap down to t5 \koweytree{adv(funny)} because its
      semantics unifies with \semexpr{funny(smell(john))}. There are no
      child nodes, so we try to connect upwards. 
\item (bu) t5 can serve as the semantic head of chain-rule t2,
      so we perform substitution to produce
      \koweytree{vp(vp$\downarrow$, adv(funny))}.
      We push the child node \semexpr{vp(NP,smell(john))} on the stack.
\begin{enumerate}
\item[a] (td) We leap down to t4 \koweytree{v(smells)} because its
         semantics unifies with \semexpr{smell(john)}.  There are no
         child nodes, so we try to connect upwards.
\item[b] (bu) t4 can serve as the semantic head of chain-rule t3,
         so we perform substitution to produce 
         \koweytree{vp(v(smells))}.  There are no child nodes to fill,
         so we try to connect upwards.
\item[c] (bu) The root node of the above tree unifies with the target
         \semexpr{vp(NP,smell(john))}, so we perform substitution 
         to produce \koweytree{vp(vp(v(smells)), adv(funny))}, and 
         bottom-up processing is complete.  We pop the stack and try to
         connect upwards to the new target.
\end{enumerate}
\item (bu) \koweytree{vp(vp(v(smells)), adv(funny))} can serve as the
      semantic head of chain-rule t1, so we perform substitution to
      produce \koweytree{s(np$\downarrow$, vp(vp(v(smells)),
      adv(funny)))}. We push the child node \semexpr{np(john)} on the
      stack.
\begin{enumerate}
\item[a] (td) we leap down to t7 because its semantics unifies
         with \semexpr{john}.  There are no child nodes, so we now try
         to connect upwards.
\item[b] (bu) The t7 root unifies with the target, so we perform
         substitution to produce \koweytree{s(np(John),
         vp(vp(v(smells)), adv(funny)))}.  We pop the stack and try to
         connect upwards.
\end{enumerate}
\item (bu) The t1 root unifies with the target, so bottom-up processing
      is complete. The stack is empty and we have a semantically
      complete result, from which we extract the sentence \natlang{John
      smells funny}.
\end{enumerate}

The head-driven approach combines the most attractive properties of
bottom-up and top-down processing, namely termination and efficiency.
Like bottom-up algorithms, the head driven algorithms lexically select
and then consume a finite set of trees on the basis of their semantics.
The main difference is that the lexical selection (leaping) is
interspersed with the generation loop as opposed to being set in the
very beginning, but the essential property of using the semantic
information is retained.  The efficiency of these approaches comes from
the top-down ability to guide the selection items by syntactic criteria.
This is most evident with intersective modifiers; given the semantics
\semexpr{dirty(smelly(unpleasant(man(x))))} and the grammar fragment in
section \ref{sec:intersective_modifiers}, the head-driven algorithm
would zip down the rules, \natlang{dirty}, \natlang{smelly},
\natlang{unpleasant} in a similar fashion to a top-down algorithm,
except that this zipping would consist of trivial leaping and connecting
steps.

\section{The GenI surface realiser}
\label{sec:geni}

Within the INRIA ARC GenI, Carlos Areces developed a TAG-based surface
realiser in Haskell. This is simultaneously a bottom-up and head-driven
generator, bottom-up in its basic tree-building strategy and head-driven
by virtue of TAG and the use of entire trees as basic linguistic units.  

The TAG formalism enjoys what \cite{gardent01gen} call an ``extended
domain of locality'' in that elementary trees can be of arbitrary depth.
This makes the requirement of semantic monotonicity much more palatable.
In the case of \natlang{John calls Mary up}, the word \natlang{up} would
simply be incorporated into an elementary tree
\koweytree{s(np$\downarrow$, v(calls), np$\downarrow$, pp(up))}.
Furthermore the trees with their extra depth and complexity contain
syntactic information that would normally be built in the top-down phase
of head-driven generation.  Using TAG thus allows us to imbue a
bottom-up generator with head-driven characteristics.

The GenI generator also includes a number of optimisations, including
chart generation, ordered substitution and delayed adjunction, which we
now discuss.

\subsection{Chart generation}
\label{sec:chart}

Some algorithms for natural language parsing use a tabulation method to
improve efficiency.  These \jargon{chart algorithms} can also be
adapted or generalised to do generation \cite{shieber88uap,kay96chart}.
We examine chart methods from the perspective of our thesis, namely,
with bottom-up processing and with trees as basic linguistic units.

\subsubsection{Basic architecture}

The essence of bottom-up processing is a loop that compares two trees
and when possible, combines them to make a new tree, until it runs out
of trees to compare.  Chart algorithms will modify this process by 
avoiding recomputation of redundant intermediate structures.

We introduce two data structures called an \jargon{agenda} and a
\jargon{chart}.  The agenda is a form of to-do list and it contains the
trees that need to be processed. The chart contains the trees that have
already been processed.  In the beginning of the processing, the agenda
is initialised with those trees whose semantics subsumes the input
semantics.  During the course of the processing, an item is removed from
the agenda, placed on the chart and combined with items from the chart.
Any tree that results from the combination is placed on the agenda,
unless it is syntactically complete (closed) and its semantics
exactly matches the input semantics, in which case it is added to the
results.  We loop in this fashion until the agenda is completely empty,
which means that all comparisons have been made.

\subsubsection{An example}
\label{sec:chart_example}

\begin{figure}
\begin{center}
\begin{tabular}{|ll|ll|} 
\hline
initial trees & & 
auxiliary trees & \\ 
\hline
\koweytree{np(John)}    & \semexprset{name(X,john)}
& \koweytree{v(v*,adv(funny))} & \semexprset{funny(X)} \\
\koweytree{s(np$\downarrow$, v(smells))} & \semexprset{smell(S,X)} && \\
\hline
\end{tabular}
\end{center}
\caption{TAG grammar for \natlang{John smells funny}}
\label{fig:smells_funny}
\end{figure}

Given the grammar in figure \ref{fig:smells_funny} and the input
semantics \semexprset{smell(s,x), funny(s), name(x,john)}:

\begin{enumerate}
\item We initialise the agenda with the trees \tautree{smells}~,
      \tautree{funny}~, and \tautree{John}~, as in figure
      \ref{fig:semantics} (items on agenda: 3, chart: 0).
\item Item \tautree{smells}~ is removed from the agenda.  There are no
      items to compare it with, so it is simply placed on the chart
      (items on agenda: 2, chart: 1).
\item Item \tautree{funny}~ is removed from the agenda.  It is compared with
      \tautree{smells}~ which results (by adjunction) in the tree
      \tautree{smells~funny}.  Item \tautree{funny} is placed on the
      chart, and \tautree{smells~funny} is placed on the agenda.  (items
      on agenda: 2, chart: 2).  
\item Item \tautree{smells~funny} is removed from the agenda.  It is
      unsuccessfully compared with \tautree{smells}~ and
      \tautree{funny}~ and is placed on the chart.  (items on agenda: 1,
      chart: 3).  
\item Item \tautree{John}~ is removed from the agenda.  It is
      unsuccessfully compared with \tautree{funny}~ and successfully
      compared with \tautree{smell}~ and with \tautree{smells~funny}.
      \tautree{John}~ is placed on the chart.  The result
      \tautree{John~smells}~ is placed on the agenda; on the other hand
      the other result \tautree{John~smells~funny}~ is both syntactically
      and semantically complete, so it is added to the output.
      (items on agenda: 1, chart: 4).  
\item Item \tautree{John~smells}~ is removed from the agenda.  It is
      unsuccessfully compared with \tautree{John}~, \tautree{smells}~,
      and \tautree{smells~funny}~; and successfully compared with
      \tautree{funny}. The resulting tree \tautree{John~smells~funny}~
      is syntactically and semantically complete, so it is added to the
      output.  Item \tautree{John~smells}~ is placed on the chart.
      (items on agenda: 0, chart: 5).  
\item Since the agenda is empty, the processing ends.
\end{enumerate}

We can observe that there are two distinct derivations for the derived
tree \tautree{John~smells~funny}~, one which begins by creating
\tautree{smells~funny}~ and the other by creating
\tautree{John~smells}~.  This is not an artifact of chart processing but
of using a naive bottom-up strategy. The GenI surface realiser employs a
number of mitigating tactics which we will discuss in sections
\ref{sec:ordered_substitution} and \ref{sec:delayed_adjunction}.  

\subsubsection{Intermediate storage}

The chart data structure acts as a form of storage for intermediate data
structures and thus eliminates a great number of redundant
recalculations.  For example, the trees for \natlang{unpleasant smelly
man} and \natlang{dirty smelly man} share the subtree for
\natlang{smelly man}. Using a chart algorithm means that the
latter subtree is only computed once and reused as many times as needed.

\subsubsection{Chart indexing}

The algorithm as we have presented it does not yet exploit the full
potential of tabulation.  Doing so would depend on the availability of a
good \jargon{indexing scheme} which would allow us to reduce the 
search space.

Indexing schemes are more evident for natural language parsing than for
generation; we assign every tree an index for its initial and final
string position. So, to parse \natlang{John saw Mary with Peter} the
agenda would be initialised with (0,1,\tautree{John}),
(1,2,\tautree{saw}), (2,3,\tautree{mary}), (3,4,\tautree{with}),
(4,5,\tautree{Peter}).  This prevents some clearly unnecessary
comparisons from taking place. The tree for (1,2,\tautree{saw}) is
compared with that for (2,3,\tautree{Mary}) because they articulate
around the common index 2.  But it is not compared with
(4,5,\tautree{Peter}) because their indices have nothing to do with each
other.  The use of indices partitions the search space between parts
that are clearly avoidable and those that need to be explored.  

%The latter can also be further consolidated to avoid redundant work
%from equivalent comparisons.  Trees which are equivalent for the
%purposes of comparison are grouped together into \jargon{edges}.  In
%chart algorithms, it is not the trees that get compared but their
%representative edges.  Note that the example above can be read in two
%ways: either John was with Peter when he saw Mary, or John saw Mary
%when she was with Peter.  This means that the verb phrase
%(1,5,\tautree{saw~mary~with~peter}) can either be formed by combining
%(1,3,saw mary) and \tautree{3,5,peter) or by combining (1,2,saw) with
%(2,5,mary with peter).  If we consider trees with the same indices and
%root category as belonging to the same edge, then we only need to make
%a single comparison between the edges for (1,5,saw mary with peter) and
%that for (0,1, john).  

Numerical indices serve as the ``natural points of articulation'' for a
parsing algorithm \cite{kay96chart}, but they cannot be used for
generation because linear order of literals is not natural to semantic
expression.  The semantics \semexprset{name(j,john),  smells(s,j)} is
equivalent to \semexprset{smells(s,j), name(j,john)}, whereas the
English phrases ``John smells'' and ``smells John'' are not.  Put
another way, the surface realisation task is equivalent to that of
parsing a completely free word order language. GenI does not implement
an indexing scheme, but we will propose one in section
\ref{sec:distinguished_chart}.

\subsection{Ordered substitution}
\label{sec:ordered_substitution}

\begin{figure}
\begin{center}
\begin{tabular}{|ll|}
\hline
initial trees & \\
\hline
\koweytree{s(np$\downarrow$, v(hates), np$\downarrow$)} &
\semexprset{hate(X,Y,Z)} \\
\koweytree{np(mary)} & \semexprset{name(X,mary)} \\ 
\koweytree{np(john)} & \semexprset{name(X,john)} \\
\hline
\end{tabular}
\end{center}
\caption{TAG grammar for \natlang{Mary hates John}}.
\label{fig:mary_hates_john}
\end{figure}

We observed in section \ref{sec:chart_example} that it is possible to
produce the same derived tree in different ways.  In the GenI generator,
some of these redundant derivations are prevented by using ordered
substitution.  Using the grammar in figure \ref{fig:mary_hates_john},
the derived tree \koweytree{s(np(Mary), v(hates), np(John))} can be built
either by first substituting \koweytree{np(Mary)} into the tree
\koweytree{s(np$\downarrow$, v(hates), np$\downarrow$)} to get
\koweytree{s(np(Mary), v(hates), np$\downarrow$)} and then substituting
\koweytree{np(John)} into that tree.  Alternatively, it can be built by
first substituting \koweytree{np(John)} to get
\koweytree{s(np$\downarrow$,v(hates), np(John))}, into which
\koweytree{np(Mary)} is substituted.  If we perform substitution in a
fixed order, for example left-to-right, then the second derivation is
eliminated.  Since the trees \koweytree{np(Mary)} and \koweytree{np(John)}
are inserted into two different substitution sites, we know that they
do not interact, and that rejecting the redundant derivation will not
prevent any legitimate results from being produced.

Ordered substitution has the secondary benefit of preventing certain
intermediary structures from being built.  Specifically, no derived tree
can be built in which a substitution site is filled whose predecessor
(in the chosen substitution order) is not. For instance assuming a left to
right substitution order, the tree \koweytree{s(np$\downarrow$,v(hates),
np(John))} in the above example will not be produced. 

\subsection{Delayed adjunction}
\label{sec:delayed_adjunction}

\begin{figure}
\begin{center}
\begin{tabular}{|ll|ll|} 
\hline
initial trees & & 
auxiliary trees & \\ 
\hline
\koweytree{np(man)} & \semexprset{man(X)} 
& \koweytree{np(det(the),np*)} & \semexprset{def(X)}\\          
\koweytree{s(np$\downarrow$, v(rejects), np$\downarrow$)} &
\semexprset{reject(X,Y,Z)} 
&
\koweytree{np(adj(beautiful),np*)} & \semexprset{beautiful(X)} \\
\koweytree{np(woman)} & \semexprset{woman(X)} &  
\koweytree{np(adj(heartless),np*)} & \semexprset{heartless(X)} \\    
&& \koweytree{np(adj(smelly),np*)}  & \semexprset{smelly(X)} \\
&& \koweytree{np(adj(dirty),np*)}   & \semexprset{dirty(X)} \\
\hline
\end{tabular}
\end{center}
\caption{TAG grammar for \natlang{the beautiful heartless woman rejects
the dirty smelly man}}
\label{fig:woman_rejects}
\end{figure}


TAG adjunction provides us with means for coping with the
problem of intersective modifiers (section
\ref{sec:intersective_modifiers}).  Following \cite{carroll99eff}, we
break generation into two phases.  In the first phase, we perform
generation but using only TAG substitution.  Then we discard any trees
which have substitution nodes that remain to be filled, because we know
that these could never lead to a result. Then we switch to the second
phase, which consists of generation using only adjunction.

We can examine the procedure in more detail in an example.  Consider the
target semantics \semexprset{def(m), dirty(m), smelly(m), man(m),
def(w), woman(w), beautiful(w), heartless(w), reject(r,w,m)}.  Using the
grammar in figure \ref{fig:woman_rejects}, this should be realised as
the sentence \natlang{the beautiful heartless woman rejects the dirty
smelly man}.  

\begin{enumerate}
\item In the first phase, TAG substitution is performed between
      \tautree{rejects} and \tautree{man} resulting in 
      \tautree{rejects~man}.  This derived tree is then substituted
      with \tautree{woman} to get the result
      \tautree{woman~rejects~man}.
\item Now trees that still have substitution sites, i.e.
      \tautree{rejects} are discarded. This leaves us with the trees
      \tautree{man}, \tautree{woman}, \tautree{woman~rejects~man}; and a
      chart containing \tautree{the}, \tautree{dirty}, \tautree{smelly},
      \tautree{beautiful} and \tautree{heartless}.
\item In the second phase, TAG adjunction is performed between the
      auxiliary and the non-auxiliary trees.  This eventually produces a
      derived tree that covers the input semantics, the leaf nodes of
      which can be read to produce the string \natlang{the beautiful
      heartless woman rejects the dirty smelly man}.
\end{enumerate}

Delayed adjunction does not solve the problem of intersective modifiers
but largely contains the number of intermediate structures being built.
The above example still produces an exponential number of redundant
substrings (such as \natlang{the beautiful woman rejects the man}).
However by getting substitution over with in the beginning,  we prevent
a large number of useless interactions such as trying to substitute
\tautree{dirty~man} into \tautree{woman~rejects}, or into
\tautree{the~woman~rejects}, or \tautree{beautiful~woman~rejects}, and
so forth.

\subsection{The GenI algorithm}

\begin{figure}
\begin{center}
\includegraphics[scale=1]{images/gen.pdf}
\caption{The GenI algorithm}
\label{fig:geni_algo}
\end{center}
\end{figure}

We present the basic GenI generator in figure \ref{fig:geni_algo}, as
well as algorithms \ref{alg:geni} and \ref{alg:geni_helper}.  This
includes a lexical selection step (section \ref{sec:bu_generation}),
delayed adjunction via separate substitution and adjunction phases, a
chart-processing algorithm (section \ref{sec:chart}) with a trivial
implementation of the chart data structure.  We assume that the
functions \textproc{Substitution} and \textproc{Adjunction} have already
been defined and that they return a set of trees (an empty set if the
operation fails).  Ordered substitution is assumed to be implemented in
the \textproc{Substitution} function.  For simplicity's sake, we
completely ignore variable unification.

Throughout this thesis, we assume a language and/or libraries with the
following features:

\begin{itemize}
\item complex assignments: $\tuple{foo,bar} \leftarrow baz$ which
      assumes that $baz$ is a tuple with compatible type, and $foo$ gets 
      assigned to the first value of the tuple, and $bar$ to the second.
\item lists: ``$:$'' indicates list concatenation, and ``$[]$''
      indicates the empty list. 
\item complex types with named components (indicated by the operator
      ``$.$''): for example, $tree.semantics$ should return the
      component $semantics$ of the complex object $tree$
\item associative arrays: If $arr$ is an associative array, then
      $arr[k]$ should retrieve the item associated with key $k$ or 
      return the value $NULL$ if there is none.  $arr.keys$ should
      always return the set of keys used in $arr$.  $\emptyset$ should
      be overloaded to also mean the empty associative array, and $foo
      \in bar$ to mean the item $foo$ in array $bar$, regardless of its
      key.
\end{itemize}

\begin{algorithm}
\caption{Basic GenI algorithm}
\label{alg:geni}
\begin{algorithmic}[1]
\Function{Generator}{$Grammar$,$InputSem$}
\State $Agenda \leftarrow$ \CallFunction{LexSelection}{$Grammar$,$InputSem$}
\State $Chart \leftarrow \emptyset$
\State $output \leftarrow \emptyset$
\State $step \leftarrow $ substitution
\While {$step \neq$ done}
  \State $aTree \leftarrow$ any tree $\in Agenda$
  \State $Agenda \leftarrow Agenda \setminus \set{aTree}$
  \State $Chart \leftarrow$ \CallFunction{ChartInsertion}{$Chart$,$aTree$}
  \ForAll {$cTree \in $ \CallFunction{ChartRetrieval}{$Chart$,$aTree$}}
    \If {$step =$ substitution}
      \State $derived \leftarrow$ \CallFunction{Substitution}{$cTree,aTree$} 
                           $\cup$ \CallFunction{Substitution}{$aTree,cTree$}
    \Else 
      \State $derived \leftarrow$
                   \CallFunction{Adjunction}{$cTree,aTree$} 
    \EndIf
    \State $\tuple{complete,incomplete} \leftarrow$
           \CallFunction{SuccessCheck}{$derived$,$InputSem$} 
    \State $Agenda \leftarrow Agenda \cup incomplete$
    \State $output \leftarrow output \cup complete$
  \EndFor \Comment{end $cTree$ loop}
  \If {$Agenda = \emptyset$}
     \If {$step =$ substitution}
       \State $step \leftarrow$ adjunction
       \State $\tuple{Chart,Agenda} \leftarrow$
              \CallFunction{PartitionTrees}{$Chart$}
     \Else
       \State $step \leftarrow$ done 
     \EndIf
  \EndIf
\EndWhile \Comment{end $step$ loop}
\State \Return $output$
\EndFunction
\end{algorithmic}
\end{algorithm}

\begin{algorithm}
\caption{Helper functions for GenI algorithm}
\label{alg:geni_helper}
\begin{algorithmic}[1]
\Function{LexSelection}{$Grammar$,$InputSem$}
\State $selected \leftarrow \emptyset$
\ForAll {$tree \in Grammar$}
  \If {$tree.semantics \subseteq InputSem$} 
      \State $selected \leftarrow selected \cup \set{tree}$
  \EndIf
\EndFor 
\State \Return ${selected}$
\EndFunction
\Statex
\Function{SuccessCheck}{$results$,$InputSem$}
\State $complete   \leftarrow \emptyset$
\State $incomplete \leftarrow \emptyset$
\ForAll {$tree \in results$}
  \If {$tree.substitutionSites = \emptyset$ and 
       $tree.semantics = InputSem$}
      \State $complete \leftarrow complete \cup \set{tree}$
  \Else
      \State $incomplete \leftarrow incomplete \cup \set{tree}$
  \EndIf
\EndFor 
\State \Return $\tuple{complete,incomplete}$
\EndFunction
\Statex
\Function{PartitionTrees}{$Chart$}
\State $initial   \leftarrow \emptyset$
\State $auxiliary \leftarrow \emptyset$
\ForAll {$tree \in Chart$}
\If {$tree.substitutionSites = \emptyset$}
  \Comment { only take complete trees }
  \If {$tree$ is an auxiliary tree}
    \State $auxiliary \leftarrow initial \cup \set{tree}$
  \Else
    \State $initial   \leftarrow initial \cup \set{tree}$
  \EndIf
\EndIf
\EndFor
\State \Return $\tuple{initial,auxiliary}$ 
\EndFunction
\Statex
\Function{ChartRetrieval}{$Chart$,$tree$}\Comment{these will be 
reimplemented in section \ref{sec:chart_sharing}}
\State \Return $Chart$ 
\EndFunction
\Statex
\Function{ChartInsertion}{$Chart$,$tree$}
\State \Return $Chart \cup \set{tree}$
\EndFunction
\end{algorithmic}
\end{algorithm}

% -------------------------------------------------------------------------
\chapter{Optimising the GenI surface realiser}
\label{chp:optimisations}
% -------------------------------------------------------------------------

This chapter presents the work carried out during the DEA stage.  In
section \ref{sec:optimisations} we describe four optimisations we
introduced in the GenI generator.  We detail in \ref{sec:implementation}
the implementation of these optimisations and various other extensions
to the generator.  Finally, section \ref{sec:results} discusses the
effect of the optimisations on the overall performance of the generator.

\section{Optimisations}
\label{sec:optimisations}

This thesis is mainly about four optimisations for the surface realiser
that we designed and implemented.  The two simplest deal with the
combinatorics of adjunction.  \jargon{Ordered adjunction} works
similarly to ordered substitution (section
\ref{sec:ordered_substitution}) and imposes an order on the evaluation
of adjunction sites in the tree.  \jargon{Semantic filtering} on the
other hand improves on the delayed adjunction mechanism by detecting and
removing intermediary structures that cannot lead to a semantically
complete result.  The next optimisation, \jargon{polarity automata},
represents the bulk of this thesis.  We use linguistic information
(which has to be added to grammars) to identify and filter out the
sequences of lexical items which cannot lead to a successful realisation.
Finally, we use \jargon{chart sharing} to refine the polarity automata,
thereby reducing the overhead it introduces.

\subsection{Ordered adjunction}
\label{sec:ordered_adjunction}

This optimisation has a similar effect to ordered substitution (section
\ref{sec:ordered_substitution}), that is, it prevents the generation of
certain redundant derivations and intermediate structures for the same
tree.  As with ordered substitution, the adjunction sites of a given
derived tree are tried in a given order.  Note that in contrast to
substitution sites, adjunction sites do not necessarily need to be used.
Therefore, once all auxiliary trees have been considered for a
given site, the next site will be considered for adjunction.  Another
difference with ordered substitution is that adjunction sites may
be used more than once.  To account for this, we repeat the adjunction 
attempts until no further adjunctions are possible.

\subsection{Semantic filtering}
\label{sec:sfilter}

One possible way to improve the delayed adjunction mechanism is to make
better use of the semantics.  As discussed in section
\ref{sec:delayed_adjunction}, delayed adjunction means using separate
substitution and adjunction phases, and only letting the trees without
any remaining substitution sites into the adjunction phase.  This
filtering criterion could benefit from further tightening. To see this,
consider again the example discussed in section
\ref{sec:delayed_adjunction}.  In this example, at the stage where
adjunction starts to apply, the trees available to the generator
are given in figure \ref{fig:semantic_filtering}.  Now note that of the
three non-auxiliary trees, only one can yield a sentence covering the
entire input semantics when combined with the auxiliary trees.
The optimisation we now introduce aims at
filtering the other two trees away; since their use cannot
possibly lead to a successful realisation, such trees can be safely ignored. 
Concretely, we modify the function \textproc{PartitionTrees}
in algorithm \ref{alg:geni_helper} to calculate the union of all
auxiliary tree semantics, which we call $\phi_A$, and delete from the
agenda, any derived tree with semantics $\phi_s$ if $\phi_A \cup \phi_s
\neq \phi_i$, where $\phi_i$ is the input semantics.  This leaves behind
only the trees which have a chance of fulfilling the input semantics via
adjunction, in this case \koweytree{s(np(woman), vp(rejects), np(man))},
as we see in figure \ref{fig:semantic_filtering2}.  When combined with
the auxiliary trees, this will indeed produce the sentence \natlang{the
beautiful heartless woman rejects the dirty smelly man}.


\begin{figure}
\begin{center}
\begin{tabular}{|l|l|} 
\hline
non-auxiliary trees & auxiliary trees \\
\hline
\koweytree{np(man)}   & \koweytree{np(det(the),np*)}           \\
\koweytree{np(woman)} & \koweytree{np(adj(beautiful),np*)}     \\ 
\koweytree{s(np(woman), v(rejects), np(man))}  
& \koweytree{np(adj(heartless),np*)} \\ 
& \koweytree{np(adj(smelly),np*)}    \\      
& \koweytree{np(adj(dirty),np*)}     \\    
\hline
\end{tabular}
\end{center}
\caption{After the substitution phase in delayed adjunction}
\label{fig:semantic_filtering}
\end{figure}

\begin{figure}
\begin{center}
\begin{tabular}{|l|l|} 
\hline
non-auxiliary trees & auxiliary trees \\
\hline
\koweytree{s(np(woman), v(rejects), np(man))}  
& \koweytree{np(det(the),np*)}           \\
& \koweytree{np(adj(beautiful),np*)}     \\ 
& \koweytree{np(adj(heartless),np*)} \\ 
& \koweytree{np(adj(smelly),np*)}    \\      
& \koweytree{np(adj(dirty),np*)}     \\    
\hline
\end{tabular}
\end{center}
\caption{Delayed adjunction with semantic filtering}
\label{fig:semantic_filtering2}
\end{figure}

\subsection{Polarities}
\label{sec:polarity_overview}

Natural language often permits many different ways to express the same
idea. For example, we can write in the following table some possible
expressions for the literals \semexpr{gift(g), cost(g,x)} and
\semexpr{high(x)}. \footnote{For simplicity reasons we here make the
simplifying assumption that determiner and noun form a single lexical
item.  Note also that lexical items in a TAG can have multiple anchors
so that for instance the initial tree for the predicative noun
\natlang{costs} is anchored with both the noun \natlang{costs} and the
preposition \natlang{of}.}

\begin{center}
\begin{tabular}{|c|c|c|}
\hline
\semexpr{gift(g)}  & \semexpr{cost(g,x)}    & \semexpr{high(x)} \\
\hline
\tautree{the~gift}    & \tautree{the~cost~of}  & \tautree{is~high} \\
\tautree{the~present} & \tautree{costs}        & \tautree{much}    \\
\hline
\end{tabular}\\
\end{center}

The table above shows some \jargon{lexical ambiguity} for each literal.
As in the table, ambiguity can come from the availability of different
lexical items to express the same literal, but it can also come from the
many uses of a single word.  In fact in a average-sized TAG, lexical
ambiguity is usually very high.  For instance, the core TAG for French
developed by Anne Abeille in Paris 7 counts an average of 150 lexical
items for verbs.

We can think of generation partly as selecting a combination of lexical
items, which when taken together, verbalise the entire input semantics.
The number of possible combinations is $\prod_{1 \leq i \leq n}{a_i}$
with $a_i$ the degree of lexical ambiguity for the $i$th literal, and
$n$, the number of literals in the input semantics.  So in the above
example, we need to explore $2 \times 2 \times 2=8$ combinations.  In
other words, the size of the search space is exponential with respect to
the number of literals.

The ``polarity optimisation'' we are now presenting is based on the
observation that a large number of the initially possible combinations
are syntactically invalid, i.e. cannot possibly lead to a successful
realisation.  For instance, although combinations like \natlang{the
gift}, \natlang{the cost of}, \natlang{a lot} cover the entire
semantics, they cannot be stitched together to form a complete
expression.  This leaves a considerable margin for an optimisation to
detect and filter out such combinations ahead of time.  Indeed, such
an optimisation has been successfully put to work in a parser developed
by Guillaume Bonfante, Bruno Guillaume and Guy Perrier for
interaction grammars \cite{perrier02ig,bonfante03leopar}.  The work we
present here investigates how this idea can be transposed to the
TAG-based surface realisation task.

The optimisation involves the addition to the grammar of some additional
linguistic knowledge, the so called ``polarities''; the construction and
the pruning of a finite state automaton on the basis of this
information; and the use of this automaton to filter out the necessarily 
invalid combinations.  We now discuss each of these in turn.

\subsubsection{Polarity keys}
\label{sec:polarity_keys}

We augment our grammars with some explicit combinatoric knowledge.  Each
lexical item is associated with a set of \jargon{polarity keys} where
each key is a label, and a positive or negative integer which we call a
\jargon{charge}.  Polarity keys allow trees to provide hints about which
other trees they may combine with.  Below, for instance, we assign to
the trees \tautree{costs} and \tautree{is~high} the keys $\set{-1np}$
meaning that these verb phrases ``require'' an $np$ tree.  To balance
these, we assign $\set{+1np}$ to the noun phrase trees.  

Any combination of lexical items which does not have a net charge of
zero is necessarily syntactically invalid. For instance, \natlang{the
cost of the gift much} has a charge of $+1np$, so it is clearly not a
solution.  In contrast, \natlang{the cost of the gift is high} has a
charge of $0np$, which does not \emph{guarantee} syntactic compatibility
but suggests that the combination is worth exploring.

\begin{center}
\begin{tabular}{|c|c|c|}
\hline
\semexpr{gift(g)}  & \semexpr{cost(g,x)}    & \semexpr{high(x)} \\
\hline
\tautree{the~gift} \color{blue}{+1np} & 
\tautree{the~cost~of}  &
\tautree{is~high} \color{red}{-1np}\\
%
\tautree{the~present} \color{blue}{+1np} & 
\tautree{costs} \color{red}{-1np}      & 
\tautree{much}   \\
\hline
\end{tabular}\\
\end{center}

\subsubsection{Automaton construction}

\paragraph{Construction for a single polarity key}

The construction and use of the automata fits in between the lexical
selection and the generation step.  It requires a table like that in
section \ref{sec:polarity_keys} where columns are indexed by the
literals of the input semantics, and rows contain the lexical items
whose semantics includes the column index. Given such a table, a 
combination of lexical items covering the input semantics corresponds to
a path across the table where we select exactly one row for every
column. The finite-state automata we build is a way to represent the
set of possible combinations and to associate with each of these
combinations the polarity charge assigned to it by the grammar.
We refer to these automata as \jargon{polarity automata}.

Algorithm \ref{alg:simpleaut} presents a simple version of the
automaton construction, in which we assume that we only have to deal
with a single polarity key.  Each state $\tuple{L,C}$ in the automaton
represents the set of paths up to the $L$th literal that have a net
charge of $C$.  Any transitions in the automaton are from some state
$\tuple{L,C_1}$ to $\tuple{L+1,C_2}$ and has as label, the name of one
of the lexical items verbalising the literal $L$.

We initialise the automaton with a state $\tuple{0,0}$ that does not
cover any literal.  Then we augment the automaton one literal at
time (lines \algref{alg:simpleaut}{alg:simpleaut:loopstart} to
\algref{alg:simpleaut}{alg:simpleaut:loopend}) until reaching the last
literal $N$.  If we declare the final state of the automaton to be
$\tuple{N,0}$, and prune it (algorithm \ref{alg:prune}), we are left
with a minimised automaton \cite{hopcroft79intro} that only represents
the combinations with zero net charge. The final state automaton for the
example discussed above (section \ref{sec:polarity_keys}) is given in
figure \ref{fig:polarity_automaton_pruned}.

\begin{figure}
\begin{center}
\includegraphics[scale=0.5]{images/basicaut-gpruned.pdf}
\caption{Simple polarity automaton (pruned)}
\label{fig:polarity_automaton_pruned}
\end{center}
\end{figure}

\begin{figure}
\begin{tabular}{|l|p{12.5cm}|} 
\hline
\textbf{name} & \textbf{type and use} \\
\hline
$N$ & the number of literals in the input semantics\\
$Labels$ & the set of integers from $1$ to $N$\\
$LE$ & ($Labels \rightarrow Lex$) a function mapping elements of 
$Labels$ to sets of lexical entries\\
$Pol'$ & ($Lex \rightarrow Integers$) a function mapping a
lexical entry to a charge\\
$Automaton$ & a polarity automaton $\tuple{
States,InitialState,FinalStates,Transitions }$\\
\hline
$PolKeys$ & a set of polarity keys\\
$Pol$ & ($Lex \times PolKeys \rightarrow Integers$) a function mapping a
lexical entry and a given polarity key to a charge \\
$LSem$ & ($Lex \rightarrow \set{Integers}$) a function mapping a
lexical entry to a subset of $Labels$\\
\hline
\end{tabular}
\caption{Special variables used in polarity automaton algorithms}
\label{fig:polarity_vars}
\end{figure}

\begin{algorithm}
\caption{Construction of a simple polarity automaton}
\label{alg:simpleaut}
\begin{algorithmic}[1]
\Function{SimpleAut}{$Key$,$LE$,$Pol$,$N$}
\Comment {$Key \in PolKeys$}
\State $InitialState \leftarrow \tuple{ 0,0 }$
\State $FinalStates   \leftarrow \set{\tuple{ N,0 }}$ 
\State $States \leftarrow \emptyset$
\State $curStates \leftarrow \set{InitialState}$
\For {$i,0 < i < N$}
  \label{alg:simpleaut:loopstart}
  \State $newStates \leftarrow \emptyset$
  \Comment{recently created items}
  \ForAll {$st \in newStates$}
    \State $\tuple{\_,\_,c} \leftarrow st$
    \ForAll {$lex \in LE(i)$} 
    \Comment{see figure \ref{fig:polarity_vars}}
      \State $newSt \leftarrow \tuple{ i,c + Pol'(lex) }$
      \State $newStates \leftarrow newStates \cup \set{newSt}$
      \State $States \leftarrow States \cup {newSt}$
      \State $Transitions \leftarrow Transitions \cup 
             \set{\tuple{ st,lex,newState }}$
    \EndFor
  \EndFor
  \State $curStates \leftarrow newStates$ 
  \label{alg:simpleaut:loopend}
\EndFor
\State $Automaton \leftarrow \PolAutomaton$
\State \Return \CallFunction{PruneAut}{$Automaton$} 
\EndFunction
\end{algorithmic}
\end{algorithm}

\begin{algorithm}
\caption{Automaton pruning}
\label{alg:prune}
\begin{algorithmic}[1]
\Function{PruneAut}{$Automaton$}
\State $\PolAutomaton \leftarrow Automaton$ 
\State $nonfinal \leftarrow States \setminus FinalStates$
\State $blacklist \leftarrow \{st \in nonfinal$ 
       s.t. \CallFunction{HasNoTrans}{$st$}$\}$
\While {$blacklist \neq \emptyset$}
  \State $newBlacklist \leftarrow \emptyset$
  \ForAll {$s \in blacklist$}
    \ForAll {$\tuple{ p,l,s } \in Transitions$}
      \State $States \leftarrow States \setminus {s}$
      \State $Transitions \leftarrow Transitions \setminus \set{\tuple{ p,l,s }}$
      \If {\textproc{HasNoTrans}(p)}
         \State $newBlacklist \leftarrow newBlacklist \cup \set{p}$
      \EndIf
    \EndFor
  \EndFor
  \State $blacklist \leftarrow newBlacklist$
\EndWhile
\State \Return $\tuple{ States,InitialState,FinalStates,Transitions }$
\EndFunction
\Statex
\Function{HasNoTrans}{st}
\State $trans \leftarrow \set{\tuple{ st,\_,\_ } \in Transitions}$
\State \Return $trans = \emptyset$
\EndFunction
\end{algorithmic}
\end{algorithm}

\paragraph{Generalisation for multiple polarity keys}

We can generalise the optimisation to the use of several polarity keys.
Each key acts as a linguistic filter, so the more keys we take into
account, the more filtering we accomplish.  One way to do this is to
construct a polarity automaton for each polarity key and then intersect
them.  For greater efficiency, we instead perform the intersection
incrementally, that is, we build each successive polarity automaton
using the previous one as a skeleton.

This requires us to change the definition of states to a tuple
$\tuple{L,C}$ where $C$ is no longer a single charge, but a list of
them.  (We actually augment it to the three-tuple $\tuple{L,V,C}$ but
this extra item can be ignored until section \ref{sec:multisem}).  The
general architecture (algorithms \ref{alg:polaut} and \ref{alg:seedaut})
begins with a \jargon{seed automaton} that does not account for any
polarity keys.  We then iterate through the polarity keys, each time
building an intersected automaton that accounts for every key up to the
current one.  Each iteration also includes a pruning step, which we saw
in algorithm \ref{alg:prune}.

\begin{algorithm}
\caption{Polarity automaton construction with incremental intersection}
\label{alg:polaut}
\begin{algorithmic}[1]
\Function{PolAut}{$PolKeys$,$Lex$,$Pol$,$N$,$LE$}
\State $Automaton \leftarrow$ \CallFunction{SeedAut}{$Lex$,$N$,$LE$}
\ForAll {$key \in PolKeys$}
  \State $Automaton \leftarrow$ 
         \CallFunction{BuildAut}{$Automaton$,$key$,$Pol$}
  \State $Automaton \leftarrow$ 
         \CallFunction{PruneAut}{$Automaton$}
\EndFor
\State \Return $Automaton$
\EndFunction
\Statex
\Function{BuildAut}{$Automaton$,$Key$,$Pol$}
\Comment {$Key \in PolKeys$}
\State ${\tuple{ \_,oldInitial,oldFinal,oldTrans }} \leftarrow Automaton$ 
\State ${\tuple{ \_,\_,p }} \leftarrow oldInitial$
\State $InitialState \leftarrow {\tuple{ 0,\emptyset,0:p }}$
\Comment {`:' is a list concatenation operator}
\State $FinalStates   \leftarrow \emptyset$
\ForAll {$oldst \in oldFinal$}
  \State $\tuple{ n,v,p } \leftarrow oldst$
  \State $FinalStates \leftarrow FinalStates \cup \set{\tuple{
  n,v,0:p }}$
\EndFor
\State $cross \leftarrow \set{\tuple{ InitialState,oldInitial,0 }}$
\While {$cross \neq \emptyset$}
  \State $newCross \leftarrow \emptyset$
  \Comment{recently created items}
  \ForAll {$\tuple{ ost_1,nst_1,np_1 } \in cross$}
    \ForAll {$\tuple{ ost_1,lex,ost_2 } \in oldTrans$}
      \State ${\tuple{ i,v,op_2 }} \leftarrow ost_2$
      \State $np_2 \leftarrow np_1 + Pol(lex,Key)$
      \Comment {calculate the polarity of the next state}
      \State $nst_2 \leftarrow {\tuple{ i,v,np_2:op_2 }}$
      \State $newCross \leftarrow newCross \cup 
             \set{ \tuple{ ost_2,nst_2,np_2 } }$
      \State $States \leftarrow States \cup \set{nst_2}$
      \State $Transitions \leftarrow Transitions \cup \set{\tuple{
      nst_1,lex,nst_2 }}$
    \EndFor \Comment{$oldTrans$ loop}
  \EndFor \Comment{$cross$ loop}
  \State $cross \leftarrow newCross$
\EndWhile
\State \Return $\PolAutomaton$
\EndFunction
\end{algorithmic}
\end{algorithm}

\begin{algorithm}
\caption{Seed automaton construction algorithm}
\label{alg:seedaut}
\begin{algorithmic}[1]
\Function{SeedAut}{$Lex$,$N$,$LE$,$LSem$}
\State $InitialState \leftarrow {\tuple{ 0,\emptyset,[] }}$
\Comment {`$[]$' indicates the empty list}
\State $FinalStates   \leftarrow \set{ {\tuple{ N,\emptyset,[] }} }$
\State $States \leftarrow \set{InitialState}$
\State $Transitions \leftarrow \emptyset$
\State $created \leftarrow \set{InitialState}$
\ForAll {$lit, 1 < lit < N$}
  \State $newCreated \leftarrow \emptyset$
  \ForAll {$current \in created$}
     \State ${\tuple{ newCreated,newTrans }} \leftarrow$ 
            \CallFunction{SeedAutHelper}{$current$,$Lex$,$LE$,$LSem$}
     \State $States \leftarrow States \cup newCreated$
     \State $Transitions \leftarrow Transitions \cup newTrans$
  \EndFor \Comment{$current$ loop}
  \State $created \leftarrow newCreated$
\EndFor \Comment{$lit$ loop}
\State \Return $\PolAutomaton$
\EndFunction
\Statex
\Function{SeedAutHelper}{$current$,$Lex$,$LE$,$LSem$} \Comment{Simple version}
\State ${\tuple{ lit_{cur},\_,\_ }} \leftarrow current$
\State $lit \leftarrow 1+lit_{cur}$
\State $next \leftarrow {\tuple{ lit,\emptyset,[] }}$
\State $newCreated \leftarrow \set{next}$
\State $newTrans \leftarrow \emptyset$
\ForAll {$lex \in LE(lit)$}
  \State $newTrans \leftarrow newTrans \cup 
           \set{\tuple{ current,lex,next }}$
\EndFor \Comment{$lex$ loop}
\State \Return {$\tuple{ newCreated,newTrans}$}
\EndFunction
\end{algorithmic}
\end{algorithm}

\subsubsection{Multiple literals semantics}
\label{sec:multisem}

One potential complication is that trees in realistic grammars may 
span more than one literal (consider \tautree{is~expensive}, which
covers both \natlang{cost(g,x)} and \natlang{high(x)}).  Since these
trees do not fit neatly into a single column of the automaton
construction table, we repeat them for all the columns of the semantics
they represent.  But this introduces its own share of problems. Once we
use a \jargon{multiliteral tree} for a literal, we must make sure that
we use the same tree for the other literals that it covers, and make
sure that its polarity charge is only taken into account once.

We do this by augmenting the states in the polarity automaton to be a
three-tuple $\tuple{L,V,C}$ where $V$ contains any extra semantics
information which is contributed by a multiliteral tree.  Whenever we
encounter a literal which is in the extra semantics of the state, we can
remove that literal from the extra semantics and make an empty
transition $\epsilon$.  This modification only requires a change to the
construction of the seed automaton (algorithm \ref{alg:seedaut_multi})
and lookup function which assigns no polarity keys to the empty
transition.

\begin{algorithm}
\caption{Modified seed automaton construction algorithm}
\label{alg:seedaut_multi}
\begin{algorithmic}[1]
\Function{SeedAutHelper}{$current$,$Lex$,$LE$,$LSem$} \Comment{Enhanced version}
\State $newCreated \leftarrow \emptyset$
\State $newTrans \leftarrow \emptyset$
\State ${\tuple{ lit_{cur},v_{cur},\_ }} \leftarrow current$ 
\State $lit \leftarrow 1 + lit_{cur}$
\ForAll {$lex \in LE(lit)$}
  \State $v_{lex} \leftarrow LSem(lex)$
  \State $v \leftarrow v_{lex} \cup v_{cur}$
  \If {$lit \in v_{lex}$} \Comment{check extra semantics}
    \State $v \leftarrow v \setminus \set{lit}$
    \State $lex \leftarrow \epsilon$
  \EndIf
  \State $next \leftarrow {\tuple{ lit,v,[] }}$
  \State $newTrans \leftarrow newTrans \cup 
         \set{\tuple{ current,lex,next }}$
  \State $newCreated \leftarrow newCreated \cup \set{next}$
\EndFor 
\State \Return {$\tuple{ newCreated,newTrans}$}
\EndFunction
\end{algorithmic}
\end{algorithm}

\subsection{Chart sharing}
\label{sec:chart_sharing}

The polarity automaton thus yields a compact representation of those
combinations of lexical items that potentially lead to a successful
realisation.  Next it must be used to guide the surface realisation
process. We explored two different ways to do this, which we now
discuss.

The simplest approach is to collect a set of paths by walking the
automaton, and to perform chart generation separately for each path, using
the lexical items along that path as a basis for generation.  The
motivation behind the separation is to prevent mutually incompatible
items from being compared with each other. However, this simple approach
does not account for the fact that different paths may often have
lexical items in common. 

The second, more sophisticated approach consists in factorising these
lexical items that are common to several paths.  To this end, we
annotate each tree with the set of paths on which it appears.  During
generation, we only compare trees if they have some paths in common,
that is if the intersection of their paths is non-empty.  Any resulting
derived tree is annotated with this intersection of paths.  Since the
number of paths is known, we can express these sets as bit vectors,
using the bitwise-and operation to calculate the intersection.  We can
further improve on this concept by introducing the bit vector as a chart
indexing scheme, where the chart is an associative array mapping a bit
vector to a set of trees (algorithm \ref{alg:polarity_chart}).
Retrieving trees from the chart consists of selecting the subset of keys
which intersect a given tree's path set, and taking the union over the
resulting sets of trees.

\begin{algorithm}
\caption{Chart operations with automaton path indexing}
\label{alg:polarity_chart}
\begin{algorithmic}[1]
\Function{ChartRetrieval}{$Chart$,$tree$}\Comment{returns a set of trees}
\State $selection \leftarrow \emptyset$
\ForAll{$k \in Chart.keys$} \Comment{$Chart$ is an associative array}
  \If {$k \cap tree.paths \neq \emptyset$}
     \State $selection \leftarrow selection \cup Chart[k]$
  \EndIf
\EndFor
\State \Return $selection$
\EndFunction
\Statex
\Function{ChartInsertion}{$Chart$,$tree$}\Comment{returns a chart}
\State $samePath \leftarrow Chart[tree.path]$
\If {$samePath = NULL$}
  \State $samePath \leftarrow \emptyset$
\EndIf
\State $Chart[tree.path] \leftarrow samePath \cup \set{tree}$
\State \Return $Chart$
\EndFunction
\end{algorithmic}
\end{algorithm}

\section{Implementation}
\label{sec:implementation}

The optimisations just described were implemented in Haskell.  Since we
did not know the potential effect of optimisations, we had to pay
particular attention to modularity, so that any or all of the
optimisations could be enabled, with special care to account for all
possible combinations of optimisations.  In addition to the
optimisations, we also implemented a few supporting features:

\begin{enumerate}
\item Corrections to the generator as bugs were revealed. This 
      included minor subtleties (havoc caused by neglecting to sort a
      list) and conceptual oversights (for example,
      it was not originally foreseen that auxiliary trees could have
      substitution sites, or that adjunction nodes could be used
      multiple times).
\item A rewritten graphical user interface (figures \ref{fig:geni_gui}
      and \ref{fig:geni_polaut}).  This was mainly motivated by a need
      to test the generator and to visualise the results (automata,
      trees, etc).  The visualisation of trees and automata was made
      possible by the open source Graphviz tool.
\item A graphical debugger (figure \ref{fig:geni_debugger}).  As
      grammars grew and optimisation work continued, it became apparent
      that some means of inspecting the intermediary structures built
      during generation would be necessary.  The graphical debugger
      allows the user to step through the generator, gaining an
      understanding of trees being built and their movement through the
      agenda, chart and results.
\item A batch-processing system.  This made it possible to run a set of
      examples over all possible combinations of optimisations and
      automatically tabulate the results.  
\end{enumerate}

\section{Results}
\label{sec:results}

\subsection{Methodology}

The criteria we used to measure performance were

\begin{description}
\item[agenda size] - the number of items that get added to agenda; note
                     that we do not decrement this counter when a tree
                     is removed from the agenda
\item[chart size]  - the number of items that get added to the chart
\item[comparisons] - the number of times any two trees are
                     compared for substitution or adjunction 
\item[time] - the time needed for generation, excluding the lexical
              selection phase but including the automaton construction
              time.  
\end{description}

The generator was compiled using the Glasgow Haskell Compiler (6.2) with 
-O2 level optimisation.  It was run on a single G4 866 Mhz processor
with Mac OS X as the operating system.  Timing results are averaged over
100 iterations of the generator.

The optimisations evaluated are abbreviated as follows: \jargon{oadj} -
orderded adjunction (section \ref{sec:ordered_adjunction});
\jargon{sfilt} - semantic filtering (section \ref{sec:sfilter});
\jargon{pol} - polarity automata (section \ref{sec:polarity_overview});
\jargon{c-shr} - chart sharing (section \ref{sec:chart_sharing}).  If
the polarity optimisation is used without chart sharing, we sum the
results over each generation subtask.

\subsection{Tests performed}

We evaluated the performance of the generator on two input semantics
with a different grammar for each input.  

\subsubsection{Lexical ambiguity}

The \jargon{promettre} example tests the usefulness of polarity automata
on a grammar with high lexical ambiguity.  The input semantics is
\semexprset{promise(e1, a, b, e2), jean(a), marie(b), convince(e2, a, d,
e3), give(e3, d, c, e), indef(c), book(c), paul(d), claire(e)}, which is
realised in the sentence \natlang{jean promettre a marie de persuader
paul de donner un livre a claire} (note: the generator does not perform
inflection, hence the infinitive \natlang{promettre} instead of
\natlang{promets}).  
%
For this example, the degree of lexical ambiguity in the grammar is as
follows: 14 lexical items for \natlang{promettre}, 6 for
\natlang{persuader}, and 8 for \natlang{donner}.  Thus, the number of 
possible combinations is 672.  With the polarity optimisation, the
number of combinations actually explored by the generator is brought
down to 3.

\begin{center}
\begin{tabular}{|r|r|r|r|r|}
\hline
   optimisations & agenda size & chart size & comparisons & time (ms) \\
\hline
\hline
                 none  &   \colorbox{green}{66} & \colorbox{yellow}{51} &     \colorbox{cyan}{2639} &    20.0\\
\hline
                  pol  &   \colorbox{green}{69} & \colorbox{yellow}{39} &      \colorbox{cyan}{585} &    30.0\\
            pol c-shr  &   \colorbox{green}{37} & \colorbox{yellow}{23} &      \colorbox{cyan}{447} &    30.0\\
\hline
                sfilt  &   53 &      51 &     2569 &    20.0\\
            sfilt pol  &   43 &      39 &      493 &    30.0\\
      sfilt pol c-shr  &   25 &      23 &      385 &    30.0\\
\hline
                 oadj  &  137 &      51 &     2635 &    20.0\\
             pol oadj  &  158 &      39 &      590 &    40.0\\
       pol c-shr oadj  &  100 &      23 &      442 &    30.0\\
\hline
           sfilt oadj  &   72 &      51 &     2570 &    20.0\\
       sfilt pol oadj  &   62 &      39 &      494 &    30.0\\
 sfilt pol c-shr oadj  &   44 &      23 &      386 &    30.0\\
\hline
\end{tabular}\\
\end{center}

Since this example is focused on lexical ambiguity, we only comment here
on the effect of the polarity and chart sharing optimisations.  Using
the polarity optimisation reduces the number of tree comparisons by
three quarters and the chart size by a quarter. Without chart
sharing, however, it increases the agenda size due to the lack of
factorisation.  This is remedied by adding chart sharing, in which case,
the agenda size also decreases by more than a quarter.

\subsubsection{Intersective modifiers}

The \jargon{chatnoir} example examines the generator's performance
against the problem of intersective modifiers.  It has an input
semantics \semexpr{\set{chase(e1,a,b), cat(a), black(a), fierce(a),
def(a), def(b), mouse(b)}} which is realised in the corresponding
grammar by the sentence \natlang{le mechant chat noir chasser le
souris}. 

In the grammar used for this example, the lexical items \natlang{le},
\natlang{noir}, \natlang{mechant} are all auxiliary trees adjoining on
to the trees they modify.   

\begin{center}
\begin{tabular}{|r|r|r|r|r|}
\hline
optimisations & agenda size & chart size & comparisons & time (ms) \\
\hline
\hline
                 none  &         \colorbox{magenta}{121} &      13 & \colorbox{green}{3660} &    50.0\\
\hline
                  pol  &          \colorbox{cyan}{87} &       5 & \colorbox{yellow}{2572} &    40.0\\
            pol c-shr  &          87 &       5 &     2572 &    40.0\\
\hline
                sfilt  &          \colorbox{magenta}{93} &      13 & \colorbox{red}{3076} &    40.0\\
            sfilt pol  &          \colorbox{cyan}{61} &       5 &     2020 &    30.0\\
      sfilt pol c-shr  &          61 &       5 &     2020 &    30.0\\
\hline
                 oadj  &         \colorbox{magenta}{472} &      13 & \colorbox{green}{1988} &    40.0\\
             pol oadj  &         \colorbox{cyan}{326} &       5 & \colorbox{yellow}{1300} &    30.0\\
       pol c-shr oadj  &         326 &       5 &     1300 &    40.0\\
\hline
           sfilt oadj  &         332 &      13 & \colorbox{red}{1428} &    50.0\\
       sfilt pol oadj  &         196 &       5 &      780 &    20.0\\
 sfilt pol c-shr oadj  &         196 &       5 &      780 &    20.0\\
\hline
\end{tabular}\\
\end{center}

This table shows the following results. 

First, we observe that semantic filtering 
has a strong decreasing impact on agenda size, chart size and
comparisons.  Specifically, it reduces agenda size and
chart size by a quarter with respect to the null optimisation baseline.
When combined with the polarity optimisation, it decreases the chart
size by 40\% with respect to the polarity optimisation baseline, and
when combined with polarity and chart sharing, it decreases it by the
same amount.  

Second, note that without ordered adjunction, there are 48 possible
derivations for the result \natlang{le mechant chat noir chasser le
souris}.  Ordered adjunction removes many of these redundant derivations
by bringing this number down to 10  (The 10 that remain represent
different orders for the auxiliary trees \tautree{mechant}, \tautree{le}
and \tautree{noir} to be adjoined at the node \natlang{chat}).

A secondary benefit of this reduction in spurious realisation is an
important decrease in the number of comparisons made.  In general, the
decreases is of approximately 50\% with respect to the corresponding
baseline (54\% with respect to the null optimisation baseline, 50\% with
respect to the polarity baseline and 46\% with respect to the semantic
filtering baseline).  Note further that combining all three
optimisations reduces the number of comparisons by 80\%.

Finally note that the increase in agenda size is misleadingly inflated
because we increment the counter for every adjunction site instead of
every tree. Future work on evaluation will include a better evaluation
criterion which accounts for this factor.

\chapter{Conclusion and future work}
\label{chp:conclusion}

We presented a series of optimisations for a natural language surface
realiser using a Tree Adjoining Grammar and a flat semantics.  These
optimisations dealt with two major sources of complexity for the
generation problem.  With polarity automata and chart sharing, we
provided a means for coping with lexical ambiguity.  With semantic
filtering and ordered adjunction, we reduced the number of intermediary
and final structures produced via adjunction.  A more general judgement
of how these optimisations affect the efficiency of the generator would
require a large-scale evaluation based on the use of a reasonable sized
TAG.

\section{Further optimisation}

\subsection{Polarity signatures}
\label{sec:polarity_signatures}

Constructing and walking a polarity automaton can involve a great deal
of overhead, especially considering that a realistic automaton would be
handling hundreds of trees at any time. One possible solution for
reducing the overhead would be to pre-process the lexically selected
trees, and group together trees with identical semantics and polarity
keys.  The groups are labeled with a tuple of $\tuple{S,K}$ where $S$ is
the tree semantics and $K$ is the set of polarity keys that the tree
holds.  These tuples, which we call \jargon{polarity signatures}, could
then be used instead of trees as the labels of the polarity automaton, the
reasoning being that though there maybe hundreds of trees for a given
lexical item, the number of distinct polarity signatures is likely to be
smaller. 

\subsection{Grouped adjunction}
\label{sec:grouped_adjunction}

Despite delayed adjunction, semantic filtering, and ordered adjunction,
the use of auxiliary trees remains a challenge for generation.  In the
\jargon{chatnoir} test, for example, the sentence \natlang{le mechant
chat noir chasser le souris} was derived in 10 different ways, each of
these representing different orders for the adjunction of auxiliary
trees.  

A possible option to deal directly with the intersective modifiers
problem is to identify auxiliary trees that adjoin to the same site,
then we can group them into sets and instead of performing adjunction
with these trees, we would use a placeholder that represents the entire
group.  When generation is complete, we can post-process the results to
properly insert the modifiers.  Even if the post-processing were to use
a bottom-up algorithm (like the rest of the generation), this
modification would still be useful because it would isolate the
generation of intermediate structures to a single item. In the case of
\natlang{The beautiful heartless woman rejects the dirty smelly man},
this would mean that \natlang{beautiful heartless woman} would be dealt
with separately from \natlang{dirty smelly man}.  Instead of $4! = 24$
intermediary structures, we would only be dealing with $2!  \times 2! =
4$.  It might be possible to use a post-processing strategy that
incorporates word-order constraints, so that \natlang{dirty smelly man}
is generated, but not \natlang{smelly dirty man} which is awkward to
native English speakers.  This would further reduce the number of
intermediate structures, and as an added bonus, produce higher quality
sentences.

\subsection{Distinguished chart indexing}
\label{sec:distinguished_chart}

\cite{kay96chart} proposes a scheme for using semantic indices as chart
indices.  Under this scheme, a \jargon{distinguished index} is
arbitrarily selected for every tree among the semantic indices of that
tree.  The distinguished index is propagated during generation: whenever
a tree is substituted or adjoined into another, the distinguished index
of the derived tree is that of latter tree.  If we consider a tree
\natlang{John} with distinguished index \semexpr{j} for semantics
\semexprset{name(j,john)} and a tree \natlang{smells} with the
distinguished index \semexpr{s} for semantics \semexprset{smells(s,j)}.
Substituting \natlang{John} into \natlang{smells} would result in the
derived tree \natlang{John smells} with the distinguished index
\semexpr{s}.

Kay uses the distinguished index as a chart index.  We had mentioned
in section \ref{sec:tag_semantics} that tree nodes are also associated
variables which are unified against the arguments in the input
semantics.  We call these variables (after unification is done)
\jargon{missing indices}.  Given an agenda tree with a substitution site
to fill, or an adjunction site to test, we only consider for combination
the chart trees whose distinguished index is the same as the missing
index of the agenda tree.  This idea is not yet integrated into the GenI
generator and would definitely be worth exploring.

\section{Large scale testing}
\label{sec:large_testing}

The generator has so far been tested on a small number of linguistic
phenomena with a small grammar.  For linguistic, applicative and
evaluation purposes, it would be necessary to link the generator with a
a full-sized grammar for French. Such a grammar can be produced with
the help of the metagrammar compiler developed by Denys Duchier, Benoit
Crabbe, Joseph Leroux and Yannick Parmentier. This metagrammar
facilitates the semi-automated modular development of grammars.  At
present it is used to develop a core grammar for French and a grammar is
already available which has a linguistic coverage comparable to Anne
Abeille's TAG.

To integrate the GenI algorithm with the metagrammar compiler, we have
developed a basic interface that reads the compiler output, but several
open issues still need to be resolved, so as to fully support generation
from the resulting grammars.  In particular, the unification mechanism
needs to be revised. 

Once this grammar can be used by the GenI generator, full scale
evaluation becomes possible.  This evaluation can be achieved on the
basis of a test suite, which can be written with the program
\textit{Annotator} developed by Marilisa Amoia.  \textit{Annotator}
provides an environment in which semantic expressions may be associated
with a set of natural language expressions.  These can then be saved in
a file, which could eventually be parsed and fed into the batch
processing system for the surface realiser.  This will require some
consideration into the most appropriate means of displaying the
test-suite results.  For example, does a test fail if the generator does
not produce all of the natural language expressions for an input
semantics?  What if it produces too many?  Another issue for
consideration is that the generator should identify the source of test
failures whenever possible, for example, that a certain word is not in
the grammar. Since GenI does not produce inflected forms, it would also
be necessary to either get test-writers in the habit of writing
uninflected expressions (\natlang{Marie detester Jean}), or more
usefully, to pre-process and de-inflect the Annotator files, or better
still for the generator to produce inflected output (\natlang{Marie
deteste Jean}).

\section{Generation system}

Once the generator reaches a satisfactory level of completeness and
efficiency, it would become interesting to integrate it into a wider
generation system. Such a system would perform not only surface
realisation, but a fuller range of generation tasks from content
planning (determining what the surface realiser should say), to applying
the morphological finishing touches.  

This system could itself be used and validated into user applications
such as question-answering systems and interactive dialogue systems, 
the integration of which brings up a host of questions on its own.  One
topic of interest is how a generator should behave when producing a full
text or a dialogue with the user. A sophisticated generator should be
able to produce referring expressions in order to create more economical
and natural sentences.  In other words, it should be able to produce
\natlang{Mary hates John. He smells funny.} instead of \natlang{Mary
hates John. John smells funny.}

In short the tactical generation task that we explored in this thesis 
is only a small part of a vast pool of topics to be explored in natural
language generation.

\Annex{GenI generator screenshots}

\begin{figure}
\begin{center}
\includegraphics[scale=0.50]{images/geni_gui.pdf}
\caption{Graphical interface for the GenI generator}
\label{fig:geni_gui}
\end{center}
\end{figure}

\begin{figure}
\begin{center}
\includegraphics[scale=0.50]{images/geni_polaut.pdf}
\caption{Automaton visualisation}
\label{fig:geni_polaut}
\end{center}
\end{figure}

\begin{figure}
\begin{center}
\includegraphics[scale=0.50]{images/geni_debugger.pdf}
\caption{Graphical debugger}
\label{fig:geni_debugger}
\end{center}
\end{figure}


\bibliographystyle{alpha}
\bibliography{deakowey}

\begin{ThesisAbstract}
\begin{EnglishAbstract}
The surface realisation task consists in using a grammar to produce the
sentence(s) associated by this grammar with some input meaning
representation.  This task is known to be NP-complete, so that
optimisation is an important aspect of any practical implementation.
Starting from an existing surface realiser, this thesis proposes and
implements four optimisation techniques, evaluates their effects and
identifies directions for further improvement.
\end{EnglishAbstract}
\end{ThesisAbstract}

\end{document}