Feeling BLEU

Posted on 8 July 2013

If you’re working in machine translation or natural language generation, you are likely familiar with a couple of automated metrics for evaluating the quality of system output, notably the BiLingual Evaluation Understudy (BLEU) and its NIST variant. To use these metrics well, it helps to understand both the theory and maths behind them and the nitty-gritty implementation details around their use.

The former is discussed in the papers and, to some extent, in their Wikipedia articles. Details on the latter, however, seem to be somewhat lacking, so this post is written to fill the gap. We focus on one specific implementation [1] of the metrics, version 13 of the NIST mt-eval script [2]. The information here is mostly gathered from reading the source code (plus a bit of Wikipedia and the papers for the theory), so there may be mistakes and misunderstandings on my part. This post may be less useful if you’re using a different implementation, although hopefully it at least provides a navigation guide to some of the concrete questions you find yourself asking down the line.

Rough idea

BLEU (n-gram precision with a twist)

BLEU is about n-gram precision at its very heart. We want to know the proportion of n-grams in the candidate text that also appear in a reference text. There are a couple of twists to account for:

Basically: take n-gram precision, clip it, geometric-average it over different sizes of n-gram, and punish extreme brevity, and you’ve got BLEU. The scores go from 0 to 1, as you might expect in a precision/recall type score.
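To make the recipe concrete, here is a minimal Python sketch of it. This is my own illustration and not the mt-eval code: the function names are made up, the input is assumed to be already tokenised, and the brevity penalty compares against the reference closest in length to the candidate, which is one convention among several.

```python
import math
from collections import Counter

def ngram_counter(tokens, n):
    """Count the n-grams (as tuples) in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def clipped_precision(candidate, references, n):
    """Clipped n-gram precision of a candidate against one or more references."""
    cand_counts = ngram_counter(candidate, n)
    # For each n-gram, the largest count seen in any single reference.
    max_ref = Counter()
    for ref in references:
        for gram, count in ngram_counter(ref, n).items():
            max_ref[gram] = max(max_ref[gram], count)
    # Clip: a candidate n-gram is credited at most max_ref[gram] times.
    matched = sum(min(count, max_ref[gram]) for gram, count in cand_counts.items())
    total = sum(cand_counts.values())
    return matched / total if total else 0.0

def bleu(candidate, references, max_n=4):
    """Brevity penalty times the geometric mean of clipped precisions."""
    precisions = [clipped_precision(candidate, references, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0  # the geometric mean collapses if any precision is zero
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    # Assumption: use the reference closest in length (ties go to the shorter).
    ref_len = min((len(r) for r in references),
                  key=lambda length: (abs(length - len(candidate)), length))
    brevity_penalty = min(1.0, math.exp(1 - ref_len / len(candidate)))
    return brevity_penalty * geo_mean

reference = "the cat is on the mat".split()
print(clipped_precision("the the the the the the the".split(), [reference], 1))
print(bleu(reference, [reference]))
```

The first call shows what clipping buys you: without it, a candidate consisting of "the" repeated seven times would get a perfect unigram precision, whereas with clipping only two of the seven are credited.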


The NIST metric is a variation on BLEU which (according to the paper) provides better stability and reliability than its older counterpart. The scores look rather different (no longer from 0 to 1), and they use 5-grams by default (instead of 4), but otherwise keep to the same spirit as BLEU. The main differences in NIST are

If I understand correctly, the core idea of combining clipped n-gram precisions is the same and the brevity penalty is a bit different. The change in scale comes from an additional weight that is multiplied in: you get better scores when more of your matching n-grams are ones that are rare with respect to their (n-1)-gram predecessors.
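As I read the NIST paper, the rarity weight for an n-gram is the base-2 log of how often its (n-1)-gram prefix occurs in the reference data divided by how often the full n-gram occurs (for unigrams, the numerator is the total word count). The sketch below illustrates just that weighting. It is my own reading and not a transcription of the mt-eval code, and the function names are made up.

```python
import math
from collections import Counter

def count_ngrams(segments, max_n):
    """Count every n-gram of size 1..max_n over a list of token lists."""
    counts = Counter()
    for tokens in segments:
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                counts[tuple(tokens[i:i + n])] += 1
    return counts

def info_weights(reference_segments, max_n=5):
    """Map each reference n-gram to log2(count of its prefix / its own count)."""
    counts = count_ngrams(reference_segments, max_n)
    total_words = sum(c for gram, c in counts.items() if len(gram) == 1)
    weights = {}
    for gram, count in counts.items():
        # A unigram has no prefix, so compare against the total word count.
        prefix_count = total_words if len(gram) == 1 else counts[gram[:-1]]
        weights[gram] = math.log2(prefix_count / count)
    return weights

references = [["the", "cat", "sat"], ["the", "dog", "sat"]]
weights = info_weights(references, max_n=3)
print(weights[("the",)], weights[("cat",)], weights[("the", "cat")])
```

In this toy data "cat" is rarer than "the", so it carries more weight, and "the cat" gets some weight because "the" is not always followed by "cat".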


Systems, documents, segments, n-grams

The script is used to evaluate the output generated by a set of systems. Systems can be people or software. Variants of some software (for example different parameterisations) can be treated as separate systems. When running these campaigns, I like to treat the reference texts as yet another system and feed them through the scoring pipeline.

System texts can be broken down into documents, which are further broken down into segments (e.g. sentences), which in turn are broken down into n-grams.
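As a rough picture of that hierarchy, here is a small Python sketch. The nested dictionary layout, the identifiers and the whitespace splitting are purely illustrative assumptions; the script itself reads SGML/XML files and uses its own tokeniser.

```python
def ngram_list(tokens, n):
    """List the n-grams (as tuples) of a token list, in order."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# system -> document -> ordered list of segments (all names are made up)
systems = {
    "system-a": {
        "doc-1": ["the cat sat on the mat", "it purred"],
    },
}

for system_id, documents in systems.items():
    for doc_id, segments in documents.items():
        for seg_index, segment in enumerate(segments, start=1):
            tokens = segment.split()
            print(system_id, doc_id, seg_index, ngram_list(tokens, 2))
```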

Tokenisation and normalisation

Text is assumed to be in Unicode with the UTF-8 encoding.

Tokenisation seems to be done by splitting on whitespace. Prior to tokenisation the following pre-processing is done:

There is also an “international” tokenisation option which is off by default. I haven’t looked into it much, but I can say it uses the Perl lc function for lower-casing, whereas the standard tokenisation just dumbly folds ASCII A-Z to a-z.

sub tokenization
{
	my ($norm_text) = @_;

# language-independent part:
	$norm_text =~ s/<skipped>//g; # strip "skipped" tags
	$norm_text =~ s/-\n//g; # strip end-of-line hyphenation and join lines
	$norm_text =~ s/\n/ /g; # join lines
	$norm_text =~ s/&quot;/"/g;  # convert SGML tag for quote to "
	$norm_text =~ s/&amp;/&/g;   # convert SGML tag for ampersand to &
	$norm_text =~ s/&lt;/</g;    # convert SGML tag for less-than to <
	$norm_text =~ s/&gt;/>/g;    # convert SGML tag for greater-than to >

# language-dependent part (assuming Western languages):
	$norm_text = " $norm_text ";
	$norm_text =~ tr/[A-Z]/[a-z]/ unless $preserve_case;
	$norm_text =~ s/([\{-\~\[-\` -\&\(-\+\:-\@\/])/ $1 /g;   # tokenize punctuation
	$norm_text =~ s/([^0-9])([\.,])/$1 $2 /g; # tokenize period and comma unless preceded by a digit
	$norm_text =~ s/([\.,])([^0-9])/ $1 $2/g; # tokenize period and comma unless followed by a digit
	$norm_text =~ s/([0-9])(-)/$1 $2 /g; # tokenize dash when preceded by a digit
	$norm_text =~ s/\s+/ /g; # one space only between words
	$norm_text =~ s/^\s+//;  # no leading space
	$norm_text =~ s/\s+$//;  # no trailing space

	return $norm_text;
}
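If you want to poke at this behaviour outside Perl, here is a rough Python port of the routine above. It is my own translation and not part of mt-eval; the function name and the preserve_case parameter (standing in for the script's $preserve_case variable) are assumptions, and Python's \s may treat some Unicode whitespace differently from Perl's.

```python
import re
import string

# Fold only ASCII A-Z, mirroring the tr/// in the Perl routine.
_ASCII_LOWER = str.maketrans(string.ascii_uppercase, string.ascii_lowercase)

def tokenize(text, preserve_case=False):
    """Approximate port of the default (non-international) tokenisation."""
    # Language-independent part.
    text = text.replace("<skipped>", "")  # strip "skipped" tags
    text = text.replace("-\n", "")        # undo end-of-line hyphenation
    text = text.replace("\n", " ")        # join lines
    text = text.replace("&quot;", '"').replace("&amp;", "&")
    text = text.replace("&lt;", "<").replace("&gt;", ">")

    # Language-dependent part (assuming Western languages).
    text = f" {text} "
    if not preserve_case:
        text = text.translate(_ASCII_LOWER)
    # Put spaces around most ASCII punctuation (not ' - . or ,).
    text = re.sub(r"([\{-\~\[-\` -\&\(-\+\:-\@\/])", r" \1 ", text)
    # Split off period and comma unless preceded by a digit...
    text = re.sub(r"([^0-9])([\.,])", r"\1 \2 ", text)
    # ...or unless followed by a digit.
    text = re.sub(r"([\.,])([^0-9])", r" \1 \2", text)
    # Split off a dash when preceded by a digit.
    text = re.sub(r"([0-9])(-)", r"\1 \2 ", text)
    # Collapse runs of whitespace and trim both ends.
    return re.sub(r"\s+", " ", text).strip()

print(tokenize("Hello, world! It costs 1,000.50 dollars."))
# hello , world ! it costs 1,000.50 dollars .
print(tokenize("AT&amp;T's 1990-2000 report"))
# at & t's 1990 - 2000 report
```

Note how periods and commas inside numbers survive, apostrophes and word-internal hyphens are left alone, and a dash is only split off when it follows a digit.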


Blocks of text

The mt-eval script computes segment-level, document-level, and system-level scores (reporting system-level scores by default). If I understand correctly, the scores on larger units are not averages of the scores on their smaller counterparts, but aggregates. The core idea here is to compute the counts that go behind the scores separately for each segment, but then lump them all together for the whole corpus to compute the score. So if you have a set of fractions, rather than taking the average, you compute something more like this:

 n_1 + n_2 + ... + n_x
 ---------------------
 d_1 + d_2 + ... + d_x
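A tiny numeric illustration of the difference, using made-up counts for two segments (a short one that matches poorly and a longer one that matches well):

```python
# (matching n-grams, candidate n-grams) per segment; the numbers are invented.
segments = [(1, 2), (9, 10)]

# Averaging the per-segment fractions treats both segments equally.
average_of_ratios = sum(n / d for n, d in segments) / len(segments)

# Pooling the counts first lets the longer segment weigh more.
ratio_of_sums = sum(n for n, _ in segments) / sum(d for _, d in segments)

print(round(average_of_ratios, 4))  # 0.7
print(round(ratio_of_sums, 4))      # 0.8333
```

The pooled version is what I understand the script to be doing, so long segments influence the document and system scores more than short ones.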

Digging a bit into the implementation, the script runs segment by segment, populating a map from n-gram sizes (i.e. 1 to 4) to various counts (e.g. number of matching n-grams, number of reference n-grams). The scoring algorithm takes these maps as inputs, and the only difference between the segment-level, document-level and system-level scores is that the counts on the system level are the sums of the counts on the document level, and the counts on the document level are the sums of the counts on the segment level.
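The roll-up might be pictured like this in Python. The layout (one map from n-gram size to a count per segment) and all the numbers are my own simplification; the script keeps several such maps, one per kind of count.

```python
from collections import Counter

def add_counts(count_maps):
    """Sum a list of {ngram_size: count} maps into a single map."""
    total = Counter()
    for counts in count_maps:
        total.update(counts)  # Counter.update adds counts key by key
    return total

# Hypothetical matching-n-gram counts per segment, keyed by n-gram size.
doc1_segments = [Counter({1: 5, 2: 3}), Counter({1: 4, 2: 2})]
doc2_segments = [Counter({1: 6, 2: 4})]

document_level = [add_counts(doc1_segments), add_counts(doc2_segments)]
system_level = add_counts(document_level)

print(dict(system_level))  # {1: 15, 2: 9}
```

The same scoring function can then be applied to a segment map, a document map or the system map without knowing which level it is looking at.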

Missing segments

Missing segments will cause version 13 of the mt-eval script to crash (divide by zero). This is probably better than something silently happening behind the scenes that you’d have to dig through some documentation to find out about. There are a couple of ways to deal with this, for example, counting a missing segment as a zero (which makes me nervous). Our approach has been to score quality separately from coverage and basically omit the missing segments from each side. Note that this means generating a separate reference text for each system, with potentially different segments missing.
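A minimal sketch of that filtering step, assuming segments are held in dictionaries keyed by a segment identifier (the function name and the data layout are hypothetical, not something mt-eval provides):

```python
def drop_missing(reference, system_output):
    """Keep only the segments the system actually produced.

    Returns the filtered reference, the filtered system output, and the
    coverage (the fraction of reference segments the system produced).
    """
    kept = [seg_id for seg_id in reference
            if (system_output.get(seg_id) or "").strip()]
    filtered_ref = {seg_id: reference[seg_id] for seg_id in kept}
    filtered_out = {seg_id: system_output[seg_id] for seg_id in kept}
    coverage = len(kept) / len(reference) if reference else 0.0
    return filtered_ref, filtered_out, coverage

reference = {"s1": "the cat sat", "s2": "it purred", "s3": "then it left"}
system_a = {"s1": "a cat sat", "s3": ""}  # s2 absent, s3 empty

ref_a, out_a, coverage_a = drop_missing(reference, system_a)
print(sorted(ref_a), round(coverage_a, 2))  # ['s1'] 0.33
```

Each system gets its own filtered reference this way, and coverage is reported alongside the quality score instead of being folded into it.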

Cumulative vs individual

The script distinguishes between a “cumulative” and an “individual” score for a given n-gram length. As far as I can tell, whereas the individual 4-gram score would just be based on counting occurrences of 4-token sequences, the cumulative score would also include trigrams, bigrams, and unigrams (it would be the geometric mean of these scores).
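In code, my understanding of the relationship looks roughly like this. The sketch leaves out the brevity penalty and any smoothing, the input precisions are made up, and the function name is mine.

```python
import math

def cumulative_scores(individual):
    """Running geometric means of per-size precisions.

    individual[i] is the precision for (i+1)-grams; the result at position i
    is the geometric mean of the precisions for sizes 1 up to i+1.
    """
    scores = []
    log_sum = 0.0
    zero_seen = False
    for n, precision in enumerate(individual, start=1):
        if precision <= 0:
            zero_seen = True
        if zero_seen:
            scores.append(0.0)  # a zero anywhere zeroes the geometric mean
            continue
        log_sum += math.log(precision)
        scores.append(math.exp(log_sum / n))
    return scores

individual = [0.8, 0.5, 0.25, 0.1]  # invented 1- to 4-gram precisions
for n, (ind, cum) in enumerate(zip(individual, cumulative_scores(individual)), 1):
    print(n, ind, round(cum, 4))
```

The cumulative 1-gram score equals the individual one, and each later cumulative score is pulled down by the (typically lower) precision of the longer n-grams.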

  1. As an aside, there are at least two stances you can adopt to the use of evaluation software. The first is that it is safer to use the NIST script as a sort of conservative default (this way you’re really using the same metric as everybody else). The second is that we’re better off seeing different implementations of the same metric in the wild, which may help flush out unspecified idiosyncrasies, bugs in one version, and so forth. Despite finding compelling the argument that we should resist letting a particular implementation become the de facto definition of a standard, I have chosen the “conservative” default of going with the NIST implementation.

  2. Mt-eval does not seem to have a homepage of its own, but seems to be inherited by evaluation campaigns from one year to the next, for example OpenMT 2009.

  3. If you have more than one reference text, we take the maximum count over the texts. In other words, the clipping becomes a bit more forgiving, or a little less pronounced, although presumably not by a whole lot… This seems to apply to other aspects of BLEU as a rule of thumb: when there are multiple reference texts, choose the text which gives the best score for this particular count.