If you’re working in machine translation or natural language generation, you are likely familiar with a couple of the automated metrics used to evaluate the quality of system output, notably the BiLingual Evaluation Understudy (BLEU) and its NIST variant. To use these metrics well, it helps to understand what’s behind them, both the theory and maths, and also the nitty-gritty implementation details around their use.
The former is discussed in the papers, and to some extent in their Wikipedia articles. Details on the latter, however, seem to be somewhat lacking, so this post is written to fill the gap. We focus on one specific implementation 1 of the script, version 13 of the NIST mt-eval script 2. The information here is mostly gathered from reading the source code (plus a bit of Wikipedia and the papers for the theory), so there may be mistakes and misunderstandings on my part. This post may be less useful if you’re using a different implementation, although hopefully it at least provides a navigation guide to some of the concrete questions you will find yourself asking down the line.
BLEU is about n-gram precision at its very heart. We want to know the proportion of n-grams in the candidate text that also appear in a reference text. There are a couple of twists to account for:
clipped precision - to deal with overgeneration of common words, BLEU uses a “clipped” notion of precision that only accepts as many instances of a word as actually appear in some reference text. 3 If your text says “frog” 7 times, but that word only appears twice in the reference text, the clipped precision would be 2/7 instead of 7/7.
combined n-gram sizes - BLEU is about n-gram precision, but for what n? BLEU computes an individual score for each n from 1 up to a maximum (by default 4), and combines these via the geometric mean, computed with some logarithm magic (exp $ sum [ w * log s | s <- scores ])
brevity penalty - to game a precision metric, you might get away with saying as little as possible, so that the little you do say is right. For example, you might have a text that consists entirely of the word “the”, and its precision would be 100%. Recall would be a way to capture these sorts of issues, but according to the paper it is problematic when you have multiple reference texts (recall-everything-soup). Instead, BLEU uses a brevity penalty, which punishes you for having sentences that are shorter than the reference. The penalty allows for wiggle room: first, if you have multiple reference texts, by choosing the best matching text for each sentence; and second, by working over the corpus as a whole, rather than averaging over individual sentences. This way, your shorter sentences can effectively borrow a little length from their longer counterparts.
Basically: take n-gram precision, clip it, geometric-average it over different sizes of n-gram, punish extreme brevity, and you’ve got BLEU. The scores go from 0 to 1, as you might expect from a precision/recall type score.
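To make that recipe concrete, here is a small single-reference sketch in Haskell (the function names and structure are mine, not the script’s): clipped n-gram precision, the geometric mean over n-gram sizes, and the brevity penalty. Corpus-level aggregation and multiple references are left out.

import qualified Data.Map.Strict as M

type Token = String

-- contiguous runs of n tokens
ngrams :: Int -> [Token] -> [[Token]]
ngrams n xs
  | length xs < n = []
  | otherwise     = take n xs : ngrams n (tail xs)

counts :: [[Token]] -> M.Map [Token] Int
counts = M.fromListWith (+) . map (\g -> (g, 1))

-- each candidate n-gram is only credited up to the number of times
-- it occurs in the reference
clippedPrecision :: Int -> [Token] -> [Token] -> Double
clippedPrecision n cand ref = fromIntegral matches / fromIntegral total
  where
    candCounts = counts (ngrams n cand)
    refCounts  = counts (ngrams n ref)
    matches    = sum [ min k (M.findWithDefault 0 g refCounts)
                     | (g, k) <- M.toList candCounts ]
    total      = sum (M.elems candCounts)

-- no penalty if the candidate is at least as long as the reference
brevityPenalty :: Int -> Int -> Double
brevityPenalty candLen refLen
  | candLen > refLen = 1
  | otherwise        = exp (1 - fromIntegral refLen / fromIntegral candLen)

-- eg. bleu 4 (words "the cat sat on the mat") (words "the cat sat on a mat")
bleu :: Int -> [Token] -> [Token] -> Double
bleu maxN cand ref = bp * exp (sum [ w * log p | p <- precisions ])
  where
    precisions = [ clippedPrecision n cand ref | n <- [1 .. maxN] ]
    w          = 1 / fromIntegral maxN
    bp         = brevityPenalty (length cand) (length ref)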
The NIST metric is a variation on BLEU which (according to the paper) provides better stability and reliability than its older counterpart. The scores look rather different (no longer from 0 to 1), and it uses 5-grams by default (instead of 4), but otherwise it keeps to the same spirit as BLEU. The main differences in NIST are:
n-gram information value - the main addition in the NIST metric is to give more weight to more informative (read: infrequently occurring) n-grams, which apparently helps the metric to resist gaming. For a given n-gram, the informativeness is based on the ratio between occurrences of the (n-1)-gram leading up to it and occurrences of the n-gram itself (so to compute the informativeness of “my yellow pet duck”, we count the number of times “my yellow pet” occurs, divide by the number of times the full n-gram occurs, and take the log)
changed brevity penalty - the penalty was tuned to “minimize the impact on the score of small variations in the length of the translations”, basically keeping the spirit of protecting the metric against gaming, but with less wobble.
If I understand correctly, the core idea of using combined clipped n-gram precisions is the same, the brevity penalty is a bit different, and the change in scale comes from an additional weighting that rewards you when more of your matching n-grams are the rarer ones with respect to their (n-1)-gram predecessors.
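As a sketch of how I read the information weight (again my own code, not the script’s): given a table of n-gram counts taken over the reference data, the weight of an n-gram is the log, base 2, of how often its (n-1)-gram prefix occurs relative to the full n-gram.

import qualified Data.Map.Strict as M

type Token = String

-- refCounts is assumed to map n-grams of every size (including the
-- empty n-gram [], standing for the total number of words, so that
-- unigrams work too) to their counts over the reference data
infoWeight :: M.Map [Token] Int -> [Token] -> Double
infoWeight refCounts ngram
  | fullCount == 0 = 0   -- my choice here; unseen n-grams carry no weight
  | otherwise      = logBase 2 (fromIntegral prefixCount / fromIntegral fullCount)
  where
    prefixCount = M.findWithDefault 0 (init ngram) refCounts
    fullCount   = M.findWithDefault 0 ngram refCounts

-- eg. if "my yellow pet" occurs 10 times and "my yellow pet duck" twice,
-- the weight is logBase 2 (10 / 2), roughly 2.3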
The script is used to evaluate the output generated by a set of systems. Systems can be people or software. Variants of some software (for example different parameterisations) can be treated as separate systems. When running these evaluations, I like to treat the reference texts as yet another system and feed them through the scoring pipeline.
System texts can be broken down into documents, which are further broken down into segments (eg. sentences), which in turn are broken down into n-grams.
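In code, you could picture the hierarchy with something like the following types (my own, purely illustrative, not the script’s internal representation):

type Token    = String
type Segment  = [Token]      -- eg. a sentence, as a list of tokens
data Document = Document { docId :: String, docSegments :: [Segment] }
data System   = System   { sysId :: String, sysDocuments :: [Document] }

-- the n-grams of a segment are then its contiguous runs of n tokens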
Text is assumed to be in Unicode with the UTF-8 encoding.
Tokenisation seems to be done by splitting on whitespace. Prior to tokenisation the following pre-processing is done:
end-of-line hyphens are suppressed (line 2), eg in the word “representatives” below
We, therefore, the represen-
tatives of the United States
of America.
<skipped> markers are deleted
ASCII punctuation characters except for ',-. (apostrophe, comma, hyphen, and full stop) are tokenised, ie. surrounded by spaces. This is expressed as a series of ranges (line 17), which I’ve expanded by looking at the Wikipedia page on ASCII: {|}~ (curly braces, pipe, tilde), [\]^_` (square brackets, backslash, caret, underscore, backtick), ␠!"#$%& (space, bang, double quote, hash, dollar, percent, ampersand), ()*+ (brackets, star, plus), :;<=>?@ (colon, semicolon, angle brackets, equals, question mark, at), / (slash)
dashes preceded by a digit are tokenised, so 42- becomes 42 and - (line 21)
whitespace compression (no leading/trailing space, at most one space between tokens, lines 21 to 23)
There is also an “international” tokenisation option which is off by default. I haven’t looked into it much, but I can say it uses the perl lc function for lower-casing, whereas the standard just dumbly folds A-Z to a-z.
(The script’s 26-line text normalisation routine was listed here; the line numbers above refer to it.)
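Put together, the default pre-processing amounts to something like the following Haskell sketch. This is my own reading of the script, not a drop-in replacement; in particular, the digit-sensitive rules (such as 42- becoming 42 and -) are left out.

import Data.Char (toLower)

-- ASCII punctuation that gets surrounded by spaces (the ranges above,
-- minus apostrophe, comma, hyphen and full stop)
isTokenisedPunct :: Char -> Bool
isTokenisedPunct c = c `elem` "{|}~[\\]^_`!\"#$%&()*+:;<=>?@/"

normalise :: String -> String
normalise = unwords . words        -- whitespace compression
          . concatMap splitPunct   -- tokenise most ASCII punctuation
          . map toLower            -- lower-casing (the script folds A-Z only)
          . dehyphenate            -- suppress end-of-line hyphens
          . stripSkipped           -- delete <skipped> markers
  where
    splitPunct c
      | isTokenisedPunct c = [' ', c, ' ']
      | otherwise          = [c]

stripSkipped :: String -> String
stripSkipped [] = []
stripSkipped s@(c:cs)
  | take 9 s == "<skipped>" = stripSkipped (drop 9 s)
  | otherwise               = c : stripSkipped cs

dehyphenate :: String -> String
dehyphenate ('-':'\n':cs) = dehyphenate cs
dehyphenate (c:cs)        = c : dehyphenate cs
dehyphenate []            = []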
The mt-eval script computes segment-level, document-level, and system-level scores (reporting system-level scores by default). If I understand correctly, the scores on larger units are not averages of their smaller counterparts, but aggregates. The core idea here is to compute the counts that go behind the scores separately for each sentence, but then lump them all together for the whole corpus to compute the score. So if you have a set of fractions, rather than taking the average, you compute something more like this:
n_1 + n_2 + .. n_x
------------------
d_1 + d_2 + .. d_x
Digging a bit into the implementation, the script runs segment by segment, populating a map from n-gram sizes (ie. 1 to 4) to various counts (eg. number of matching n-grams, number of reference n-grams). The scoring algorithm takes these maps as inputs, and the difference between the segment-level, document-level and system-level scores is that the counts on the system level are the sum of the counts on the document level, and the counts on the document level are in turn the sum of the counts on the segment level.
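A sketch of that aggregation idea (the type and field names are mine, not the script’s): counts are summed upwards, and scores are only ever computed from the summed counts.

import qualified Data.Map.Strict as M

data Counts = Counts { matching :: Int, total :: Int }

instance Semigroup Counts where
  Counts m1 t1 <> Counts m2 t2 = Counts (m1 + m2) (t1 + t2)

instance Monoid Counts where
  mempty = Counts 0 0

-- keyed by n-gram size, eg. 1 to 4
type CountsByN = M.Map Int Counts

-- a document's counts are the sum of its segments' counts, and a
-- system's counts are the sum of its documents' counts
aggregate :: [CountsByN] -> CountsByN
aggregate = M.unionsWith (<>)

precision :: Counts -> Double
precision (Counts m t) = fromIntegral m / fromIntegral t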
Missing segments will cause version 13 of the mt-eval script to crash (divide by zero). This is probably better than something silently happening behind the scenes that you’d have to dig through some documentation to find out about. There are a couple of ways to deal with this, for example, counting a missing segment as a zero (which makes me nervous). Our approach has been to score quality separately from coverage and simply omit the missing segments from each side. Note that this means generating a separate reference text for each system, with potentially different segments missing.
The script distinguishes between a “cumulative” and an “individual” score. As far as I can tell, for a given n-gram length: whereas the individual 4-gram score would just be based on counting occurrences of 4-token sequences, the cumulative score would also include trigrams, bigrams, and unigrams (it would be the geometric mean of these scores)
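So, given the individual scores, the cumulative score for size n would (on this reading) just be their geometric mean; something like:

-- individual: maps an n-gram size to its individual score
cumulative :: (Int -> Double) -> Int -> Double
cumulative individual n =
  exp (sum [ log (individual k) / fromIntegral n | k <- [1 .. n] ])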
As an aside, there are at least two stances you can adopt towards the use of evaluation software: first, that it is safer to use the NIST script as a sort of conservative default (this way you’re really using the same metric as everybody else); or second, that we’re better off seeing different implementations of the same metric in the wild, which may help flush out unspecified idiosyncrasies, bugs in one version, and so forth. Despite finding compelling the argument that we should resist letting a particular implementation become the de-facto definition of a standard, I have chosen the “conservative” default of going with the NIST implementation.↩
Mt-eval does not seem to have a homepage of its own, but seems to be inherited by evaluation campaigns from one year to the next, for example OpenMT 2009.↩
If you have more than one reference text, the max count over the texts is used. In other words, the clipping becomes a bit more forgiving, or a little less pronounced, although presumably not by a whole lot… This seems to apply to other aspects of BLEU as a rule of thumb: when there are multiple reference texts, choose the text which gives the best score for this particular count.↩