* This document is a memo researched and formatted with the help of ChatGPT/Gemini in order to organize my own understanding.

A survey of surface matching metrics in natural language processing

1. Overview

1.1 Scope of evaluation: text-generation evaluation based on surface matching

When evaluating text-generation tasks in natural language processing, it is common to measure performance by comparing the hypothesis sentence output by the system with the reference sentence that serves as the ground truth. The evaluation metrics discussed in this article (BLEU, chrF, chrF++, ROUGE, and METEOR) all score generated text by its string-level agreement with the reference.

"Surface matching" here does not refer to the approach, mainstream in recent years, of embedding meaning representations into a vector space with a trained model and comparing them. It refers to methods that compute a score from the overlap of words (tokens), characters, and subsequences, or from the agreement of their correspondences (alignments).

1.2 Classification and qualitative comparison of representative metrics

The representative surface-matching metrics widely used in text-generation evaluation each adopt a different basic unit.

BLEU: A classic benchmark that uses word N-grams as its basic unit and is used mainly for machine translation evaluation.
chrF / chrF++: Uses character N-grams as its basic unit (chrF++ additionally adds shorter word N-grams). It is robust to inflection and orthographic variation and has low dependence on tokenization differences.
ROUGE: Not a single metric but a family of metrics. It evaluates sequence structure using not only word N-grams but also the longest common subsequence (LCS) and skip-bigrams. It is used mainly for summarization tasks.
METEOR: Strictly speaking, it goes beyond pure surface agreement, offering extensibility to use lexical resources such as stemming, synonyms, and paraphrases. It is therefore positioned as a metric capable of a "lexically softer" evaluation one step beyond BLEU and ROUGE. Performing explicit word alignment is another major characteristic.

1.3 Differences in design philosophy: precision vs. recall

One of the most essential differences among these metrics is which side supplies the denominator of the computation, that is, whether more weight is placed on "precision" or on "recall."

Precision-oriented: BLEU is a precision-type metric whose denominator is the N-gram count of the system output. It is designed to measure "how reference-like the system output is (whether it contains spurious generation)."
Recall-oriented: ROUGE is designed as a recall-type metric whose denominator is the N-gram count of the reference. It emphasizes "how much of the information contained in the reference the system output covers (recovers)," reflecting the nature of summarization tasks, where information loss is undesirable.

In addition, chrF and METEOR adopt an F-score that combines precision and recall, but each metric incorporates its own design philosophy: METEOR computes an \(F_{mean}\) that weights recall heavily, and chrF can adjust the recall side via a parameter (\(\beta\)).

2. Common definitions and assumptions

2.1 Data structures and notation

In text-generation evaluation, the "hypothesis sentence" output by the system is compared with the "reference sentence" that serves as the ground truth. This article takes sentence-level evaluation as the basis and uses the following notation in the formulas.

System output / hypothesis: \(h\)
Reference: \(r\)
Multiple references: When multiple ground-truth answers are prepared for a single hypothesis, we define the set as \(\mathcal{R} = \{r^{(1)}, \dots, r^{(K)}\}\).

When the evaluation target is an entire corpus (a dataset of multiple sentences), we express the set of sentence pairs as follows.

\[ \{(h_s, \mathcal{R}_s)\}_{s=1}^S \]

2.2 N-gram multisets and clipping

Many surface-matching metrics are computed from the occurrence counts of N-grams (sequences of \(n\) consecutive elements) extracted from the text.

Let the multiset of N-grams extracted from some sequence \(x\) be \(G_n(x)\), and write the occurrence count of a specific N-gram \(g\) as follows.

\[ \operatorname{Count}_x(g) \]

When we write \(\sum_g\) in a formula, it means the sum over all distinct N-gram types that can occur at the relevant order (in practice, it suffices to scan only the N-grams that exist).

The number of N-gram overlaps (matches) between two multisets is computed by taking the minimum of their respective occurrence counts.

\[ \operatorname{Match}(g; x, y) = \min\bigl(\operatorname{Count}_x(g), \operatorname{Count}_y(g)\bigr) \]

Furthermore, an important concept used in metrics such as BLEU is "clipping." This prevents the system from unfairly boosting its score by over-generating the same word (e.g., an output like "the the the ..."). When multiple references exist, for each N-gram we find the maximum occurrence count across the reference set.

\[ \operatorname{Count}^{\max}_{\mathcal{R}}(g) = \max_{k=1}^{K} \operatorname{Count}_{r^{(k)}}(g) \]

Then, we adopt as the final match count the occurrence count in the hypothesis capped (clipped) by this maximum reference count.

\[ \operatorname{ClipMatch}(g; h, \mathcal{R}) = \min\bigl(\operatorname{Count}_h(g), \operatorname{Count}^{\max}_{\mathcal{R}}(g)\bigr) \]
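
The counting and clipping above can be sketched in a few lines of Python. This is a minimal illustration that assumes whitespace-tokenized input; the helper names `ngram_counts` and `clip_match_total` are hypothetical and not taken from any library.

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Multiset G_n(x): maps each n-gram (as a tuple) to its occurrence count."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def clip_match_total(hyp, refs, n):
    """Sum over g of ClipMatch(g; h, R) for a tokenized hypothesis and references."""
    hyp_counts = ngram_counts(hyp, n)
    max_ref = Counter()  # Count^max_R(g): maximum count over all references
    for ref in refs:
        for g, count in ngram_counts(ref, n).items():
            max_ref[g] = max(max_ref[g], count)
    # Cap each hypothesis count by the maximum reference count.
    return sum(min(count, max_ref[g]) for g, count in hyp_counts.items())

hyp = "the the the the the the the".split()
refs = ["the cat is on the mat".split(), "there is a cat on the mat".split()]
print(clip_match_total(hyp, refs, 1))  # 2
```

Although the hypothesis contains "the" seven times, only two occurrences are credited, because no single reference contains it more than twice.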

2.3 Dependence on preprocessing (normalization and tokenization)

Even for a metric rigorously defined by a formula, the final computed score varies greatly depending on differences in preprocessing of the input text. To ensure the reproducibility of the evaluation and a fair comparison with other papers and systems, it is essential to fix at least the following conditions during implementation and to state them clearly in the report.

Case sensitivity: whether or not case is normalized and treated as identical
Treatment of punctuation: whether it is split as an independent token or removed
Choice of tokenizer: which algorithm or library is used
Granularity of evaluation: whether the string is treated as a sequence of words (tokens) or as a sequence of characters
Treatment of whitespace: whether whitespace is included in the count as a single character

The degree of dependence on this preprocessing differs by the nature of the metric. In particular, the word-based BLEU is extremely strongly affected by the tokenization method. On the other hand, chrF, which is designed to be character-based, has the advantage of relatively small dependence on tokenization; but if one adopts a metric that incorporates word N-grams into the evaluation, like chrF++, then the design of word segmentation once again intervenes in the score.
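
To make the effect of tokenization concrete, the following sketch compares unigram overlap under two tokenizers. Both tokenizers and the function names are illustrative assumptions, not the behavior of any particular evaluation tool.

```python
import re
from collections import Counter

def whitespace_tokens(text):
    """Split on whitespace only; punctuation stays attached to words."""
    return text.split()

def punct_split_tokens(text):
    """Split words and single punctuation marks into separate tokens."""
    return re.findall(r"\w+|[^\w\s]", text)

def unigram_overlap(hyp_tokens, ref_tokens):
    """Number of matching unigrams (minimum of the occurrence counts)."""
    return sum((Counter(hyp_tokens) & Counter(ref_tokens)).values())

hyp, ref = "Hello, world!", "Hello world"
print(unigram_overlap(whitespace_tokens(hyp), whitespace_tokens(ref)))    # 0
print(unigram_overlap(punct_split_tokens(hyp), punct_split_tokens(ref)))  # 2
```

The same sentence pair yields zero or two matches depending only on how punctuation is tokenized, which is why the tokenizer has to be fixed and reported.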

3. BLEU: corpus-level evaluation based on N-gram precision

BLEU is a representative metric used mainly for machine translation evaluation. It is precision-oriented: it asks what fraction of the word \(n\)-grams in the hypothesis sentence (system output) also appear in the reference sentence. It takes the geometric mean of the precisions across multiple orders rather than a single order, and combines this with a penalty for outputs that are too short (brevity penalty).

3.1 Formulation of modified N-gram precision

The foundation of BLEU is the modified precision, which counts how many \(n\)-grams of the hypothesis sentence also occur in the reference sentence. Simply counting matches would make the score unfairly high when the system over-generates the same word (e.g., a hypothesis such as "the the the the ..."). To prevent this, the count on the hypothesis side is clipped (capped) by the maximum occurrence count on the reference side.

The modified precision \(p_n(h, \mathcal{R})\) of order \(n\) for a sentence \(h\) and multiple references \(\mathcal{R}\) is defined as follows.

\[ p_n(h, \mathcal{R}) = \frac{\sum_g \operatorname{ClipMatch}(g; h, \mathcal{R})}{\sum_g \operatorname{Count}_h(g)} \]

Here, \(\operatorname{ClipMatch}\) is a function that adopts the smaller of the \(n\)-gram occurrence count on the hypothesis side and the maximum occurrence count on the reference side.

3.2 Corpus-level aggregation and the final BLEU score

The standard computation of BLEU does not average per-sentence precisions; instead, it sums the match counts and the total counts over the entire corpus and takes their ratio. The corpus-level \(p_n\) for a set of sentence pairs \(\{(h_s, \mathcal{R}_s)\}_{s=1}^S\) is obtained by the following formula.

\[ p_n = \frac{\sum_{s=1}^{S} \sum_g \operatorname{ClipMatch}(g; h_s, \mathcal{R}_s)}{\sum_{s=1}^{S} \sum_g \operatorname{Count}_{h_s}(g)} \]

The final BLEU score is computed by multiplying the geometric mean of \(p_n\) up to the maximum order \(N\) by the brevity penalty (\(BP\)) described later.

\[ \operatorname{BLEU}_N = BP \cdot \exp\left(\sum_{n=1}^{N} w_n \log p_n\right) \]

The weights \(w_1, \dots, w_N\) are usually uniform, and a typical setting uses \(N = 4, w_n = 1/4\).

3.3 Length correction via the brevity penalty

Because BLEU is a metric biased toward precision, it tends to earn a high score when the system outputs only the words it is confident about, keeping the output short. To address this problem, a brevity penalty (\(BP\)) is introduced that attenuates the score when the length of the hypothesis sentence is too short relative to the length of the reference sentence.

Define the total length of the hypothesis corpus as \(c = \sum_{s=1}^{S} |h_s|\). For each sentence \(s\), let \(r_s^*\) be the reference in \(\mathcal{R}_s\) whose length is closest to the hypothesis length \(|h_s|\), and define the effective reference length as \(r = \sum_{s=1}^{S} |r_s^*|\). In practice, when two references are equally close in length, it is common to choose the shorter one.

\[ BP = \begin{cases} 1 & (c > r) \\ \exp(1-r/c) & (c \le r) \end{cases} \]

The mechanism is such that when the hypothesis corpus length exceeds the effective reference length no penalty is applied (\(BP=1\)), and only when it is shorter does the score decrease exponentially.
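
Putting Sections 3.1 to 3.3 together, a corpus-level BLEU can be sketched as follows. This is a minimal illustration under stated assumptions: inputs are already tokenized, weights are uniform, no smoothing is applied, and the function name `corpus_bleu` is chosen here for illustration (it is not the API of any specific library).

```python
import math
from collections import Counter

def ngram_counts(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def corpus_bleu(hyps, refs_list, max_n=4):
    """Corpus-level BLEU with uniform weights w_n = 1 / max_n."""
    matches = [0] * max_n  # numerators of p_n, summed over sentences
    totals = [0] * max_n   # denominators of p_n, summed over sentences
    c = r = 0              # hypothesis length and effective reference length
    for hyp, refs in zip(hyps, refs_list):
        c += len(hyp)
        # Closest reference length; ties are broken toward the shorter one.
        r += min((abs(len(ref) - len(hyp)), len(ref)) for ref in refs)[1]
        for n in range(1, max_n + 1):
            hyp_counts = ngram_counts(hyp, n)
            max_ref = Counter()
            for ref in refs:
                for g, cnt in ngram_counts(ref, n).items():
                    max_ref[g] = max(max_ref[g], cnt)
            matches[n - 1] += sum(min(cnt, max_ref[g]) for g, cnt in hyp_counts.items())
            totals[n - 1] += sum(hyp_counts.values())
    if c == 0 or min(matches) == 0:
        return 0.0  # without smoothing, any zero precision gives BLEU = 0
    log_mean = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    bp = 1.0 if c > r else math.exp(1 - r / c)
    return bp * math.exp(log_mean)

ref = "the cat sat on the mat".split()
print(corpus_bleu([ref], [[ref]]))      # 1.0
print(corpus_bleu([ref[:5]], [[ref]]))  # all p_n = 1, so the score equals BP = exp(1 - 6/5)
```

The second call shows the brevity penalty in isolation: every n-gram of the truncated hypothesis matches, yet the score drops below 1 because the output is shorter than the reference.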

3.4 Sentence-level BLEU and smoothing methods

BLEU was originally designed for use at the corpus level. When it is applied to a single sentence, a match count of 0 at any order gives \(\log p_n = -\infty\), so the geometric mean, and with it the sentence-level BLEU score, collapses to 0 regardless of the other orders.

For this reason, smoothing is standard in sentence-level evaluation. One frequently used method is the NIST-derived exponential smoothing compared by Chen & Cherry (2014). Whenever an order \(n\) has a match count of 0, its precision is replaced with a small positive value according to the following formula.

\[ p_n \leftarrow \frac{1}{2^k \cdot \text{total}_n} \]

Here, \(\text{total}_n\) is the number of \(n\)-grams in the hypothesis at order \(n\), and \(k\) is a counter that starts at 1 and increases by 1 each time a further zero-match order is encountered. This keeps a partial mismatch from zeroing the whole score and makes sentence-level evaluation more stable.
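
As a rough sketch of this smoothing rule, the function below takes per-order match counts and total counts and returns the smoothed precisions. It assumes every order has at least one hypothesis n-gram (all totals positive); how a library treats orders with a zero total is not covered here, and the function name is hypothetical.

```python
def exp_smoothed_precisions(matches, totals):
    """Replace zero-match precisions with 1 / (2^k * total_n), k = 1, 2, ..."""
    k = 0
    precisions = []
    for m, t in zip(matches, totals):
        if m == 0:
            k += 1  # one more zero-match order has been seen
            precisions.append(1.0 / (2 ** k * t))
        else:
            precisions.append(m / t)
    return precisions

# Orders 3 and 4 have no matches: they become 1/(2*3) and 1/(4*2).
print(exp_smoothed_precisions([4, 2, 0, 0], [5, 4, 3, 2]))
```

Each additional zero-match order halves the substituted numerator, so higher orders with no matches are penalized more strongly but never drive the score to exactly 0.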

4. chrF / chrF++: extension to character-level N-grams and morphological robustness

Word-based metrics like BLEU have the problem of being fragile to inflection, morphological change, and differences in compound-word segmentation. chrF and its extended version chrF++ are metrics that alleviate this problem and respond more flexibly to morphological changes.

4.1 chrF: F-score computation of character N-grams

chrF is an F-score based on the agreement of character-level \(n\)-grams. Because it depends little on word boundaries, it is robust for highly inflected languages and to minor orthographic variation.

For a hypothesis sentence \(h\) and a reference sentence \(r\), define the occurrence counts of character \(n\)-grams as \(\operatorname{Count}^{(c,n)}_h(g)\) and \(\operatorname{Count}^{(c,n)}_r(g)\), respectively. The precision and recall at order \(n\) are formulated as follows.

\[ P_n^{(c)} = \frac{\sum_g \min\bigl(\operatorname{Count}^{(c,n)}_h(g), \operatorname{Count}^{(c,n)}_r(g)\bigr)}{\sum_g \operatorname{Count}^{(c,n)}_h(g)} \]

\[ R_n^{(c)} = \frac{\sum_g \min\bigl(\operatorname{Count}^{(c,n)}_h(g), \operatorname{Count}^{(c,n)}_r(g)\bigr)}{\sum_g \operatorname{Count}^{(c,n)}_r(g)} \]

These are arithmetically averaged over character orders \(1\) to \(N_c\) (in a typical setting, \(N_c = 6\)) to give \(\operatorname{chrP}\) and \(\operatorname{chrR}\), respectively. The final chrF score is computed by the following formula.

\[ \operatorname{chrF}_{\beta} = \frac{(1+\beta^2)\,\operatorname{chrP}\,\operatorname{chrR}}{\beta^2\operatorname{chrP}+\operatorname{chrR}} \]

The parameter \(\beta\) adjusts the balance between precision and recall: recall is treated as \(\beta\) times as important as precision, so setting \(\beta > 1\) favors recall. Empirically, chrF2 (\(\beta=2\)) often correlates well with human judgments and is widely used. Note that, as a design choice, whitespace is generally excluded from the computation and not counted as a character.
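
A sentence-level chrF for a single reference can be sketched as below. Two choices in this sketch are assumptions rather than part of the definition: whitespace is removed before extracting character n-grams, and orders for which either string has no n-grams are skipped when averaging.

```python
from collections import Counter

def char_ngrams(text, n):
    chars = text.replace(" ", "")  # assumption: whitespace is not a character
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))

def chrf(hyp, ref, max_n=6, beta=2.0):
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        h_cnt, r_cnt = char_ngrams(hyp, n), char_ngrams(ref, n)
        if not h_cnt or not r_cnt:
            continue  # assumption: skip orders longer than either string
        overlap = sum((h_cnt & r_cnt).values())  # Counter & takes per-key minimum
        precisions.append(overlap / sum(h_cnt.values()))
        recalls.append(overlap / sum(r_cnt.values()))
    if not precisions:
        return 0.0
    chr_p = sum(precisions) / len(precisions)  # arithmetic mean over orders
    chr_r = sum(recalls) / len(recalls)
    if chr_p + chr_r == 0:
        return 0.0
    return (1 + beta**2) * chr_p * chr_r / (beta**2 * chr_p + chr_r)

print(chrf("the cat", "the cat"))  # 1.0
print(chrf("walks", "walked"))     # partial credit for the shared stem "walk"
```

A word-level unigram match would give "walks" versus "walked" a score of 0, whereas the character n-grams still credit the shared stem.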

4.2 chrF++: introducing word-order constraints by integrating word N-grams

While chrF is strong against morphological change, it has the drawback of being weakly sensitive to disruptions in word order. To compensate for this weakness, chrF++ is a metric that maintains the flexibility of the character level while adding short word \(n\)-grams to incorporate the evaluation of local word order and lexical exactness.

In addition to the character orders \(N_c\) (usually \(N_c=6\)), it computes the precision \(P_m^{(w)}\) and recall \(R_m^{(w)}\) at word orders \(m\) (usually up to \(N_w=2\)), and averages the character and word orders together.

\[ P = \frac{\sum_{n=1}^{N_c} P_n^{(c)} + \sum_{m=1}^{N_w} P_m^{(w)}}{N_c + N_w} \]

\[ R = \frac{\sum_{n=1}^{N_c} R_n^{(c)} + \sum_{m=1}^{N_w} R_m^{(w)}}{N_c + N_w} \]

The final \(\operatorname{chrF++}_{\beta}\) is computed using the \(P\) and \(R\) obtained above.

\[ \operatorname{chrF++}_{\beta} = \frac{(1+\beta^2)PR}{\beta^2P + R} \]

The character-based chrF has very low dependence on tokenization, but note that chrF++ introduces word \(n\)-grams, so it comes to depend on the design specifications of word segmentation, such as whitespace splitting, separation of punctuation, and case normalization.
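
The joint averaging over character and word orders can be sketched as follows. The sketch assumes whitespace-split words, whitespace-free character streams, and skipping of orders that have no n-grams (in which case the divisor is smaller than \(N_c + N_w\)); the function name `chrf_plus_plus` is illustrative only.

```python
from collections import Counter

def ngram_counts(seq, n):
    return Counter(tuple(seq[i:i + n]) for i in range(len(seq) - n + 1))

def chrf_plus_plus(hyp, ref, n_char=6, n_word=2, beta=2.0):
    # Character stream without whitespace, plus whitespace-split words
    # (both are simplifying assumptions of this sketch).
    streams = [
        (list(hyp.replace(" ", "")), list(ref.replace(" ", "")), n_char),
        (hyp.split(), ref.split(), n_word),
    ]
    precisions, recalls = [], []
    for hyp_seq, ref_seq, max_order in streams:
        for n in range(1, max_order + 1):
            h_cnt, r_cnt = ngram_counts(hyp_seq, n), ngram_counts(ref_seq, n)
            if not h_cnt or not r_cnt:
                continue  # assumption: orders with no n-grams are skipped
            overlap = sum((h_cnt & r_cnt).values())
            precisions.append(overlap / sum(h_cnt.values()))
            recalls.append(overlap / sum(r_cnt.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)  # character and word orders averaged together
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    return (1 + beta**2) * p * r / (beta**2 * p + r)

# Swapped words: character and word unigrams match, both bigram orders do not.
print(chrf_plus_plus("a b", "b a", n_char=2, n_word=2))  # 0.5
```

Setting `n_word=0` in this sketch reduces it to plain chrF, which makes it easy to compare the two on the same input.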

4.3 Aggregation rules for multiple references (paper-style vs. sacreBLEU-style)

When aggregating the score over an entire corpus, subtle differences arise depending on the implementation library in the handling of multiple references and the smoothing of empty orders. To ensure the reproducibility of the evaluation, it is desirable to state clearly which of the following aggregation styles is used.

Paper-style: The form of the original paper, which computes the precision and recall of each order against a single reference and averages them. When multiple references exist, a score is computed for each reference, and the maximum or the mean is taken.
sacreBLEU-style: For each sentence, chrF is computed against each reference, the "best reference" showing the highest score is selected, and its sufficient statistics are summed over the entire corpus. The best reference index \(k_s^*\) is determined as follows.

\[ k_s^* = \arg\max_k \operatorname{chrF}_{\beta}(h_s, r_s^{(k)}) \]

When comparing scores across different tools, one must take into account the effects of these implementation differences.

5. ROUGE: a family of metrics based on recall and the longest common subsequence

5.1 ROUGE as a metric family and its evaluation orientation (coverage in summarization tasks)

ROUGE is not the name of a single metric but a family of metrics encompassing multiple evaluation methods. In the original paper, variations such as ROUGE-N, ROUGE-L, ROUGE-W, and ROUGE-S are mainly defined.

Whereas BLEU emphasizes precision, with the system output in the denominator, the defining characteristic of ROUGE is its emphasis on recall. It is therefore the standard choice for evaluating text summarization, where the question is how well the system output covers the information elements contained in the reference data.

5.2 ROUGE-N: word N-gram recall

ROUGE-N is an evaluation metric based on the overlap of word \(n\)-grams. In contrast to the precision-type BLEU, ROUGE-N takes a recall-type computation that places the reference side in the denominator.

The word \(n\)-gram recall of a hypothesis (system output) \(h\) against a single reference \(r\) is defined as follows.

\[ \operatorname{ROUGE\text{-}N}(r,h) = \frac{\sum_g \min\bigl(\operatorname{Count}_r(g), \operatorname{Count}_h(g)\bigr)}{\sum_g \operatorname{Count}_r(g)} \]
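
The recall formula above translates almost directly into code. The following is a minimal single-reference sketch on pre-tokenized input; the function name `rouge_n` is chosen for illustration.

```python
from collections import Counter

def rouge_n(ref_tokens, hyp_tokens, n=2):
    """Word n-gram recall: matched n-grams divided by reference n-grams."""
    def grams(tokens):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    ref_counts, hyp_counts = grams(ref_tokens), grams(hyp_tokens)
    total = sum(ref_counts.values())
    if total == 0:
        return 0.0  # reference shorter than n tokens
    return sum((ref_counts & hyp_counts).values()) / total

ref = "the cat sat on the mat".split()
hyp = "the cat lay on the mat".split()
print(rouge_n(ref, hyp, n=1))  # 5 of 6 reference unigrams are covered
print(rouge_n(ref, hyp, n=2))  # 3 of 5 reference bigrams are covered
```

Compared with the BLEU precision, only the denominator changes: it counts the n-grams of the reference rather than those of the system output.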

Handling of multiple references and implementation notes

When multiple references exist, the original ROUGE package describes an implementation that computes the ROUGE score pairwise against each reference and takes the maximum, or a procedure that also uses jackknifing. However, since some recent evaluation libraries also have implementations that take the mean, when evaluating with multiple references it is important for reproducibility to always state clearly which aggregation rule was adopted.

5.3 ROUGE-L / W / S: evaluation of sequence structure (LCS and skip-bigram)

To evaluate the structure as a sequence—such as sentence order and phrase groupings in the summary text—the ROUGE family also introduces methods other than simple \(n\)-grams.

ROUGE-L (longest common subsequence)

A metric that uses the longest common subsequence (LCS). The elements need not be consecutive, but it evaluates whether the order of word occurrence is preserved. Let the reference sequence be \(X=(x_1,\dots,x_m)\) and the candidate sequence be \(Y=(y_1,\dots,y_n)\), and let the LCS length be \(L = \operatorname{LCS}(X,Y)\); then the sentence-level ROUGE-L is computed as follows.

\[ R_{lcs} = \frac{L}{m}, \qquad P_{lcs} = \frac{L}{n} \]

\[ F_{lcs} = \frac{(1+\beta^2)R_{lcs}P_{lcs}}{R_{lcs}+\beta^2P_{lcs}} \]

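For a summary consisting of multiple sentences, the original paper defines a summary-level variant based on the union LCS: each reference sentence is compared with every candidate sentence, and the union of the resulting LCS matches is counted.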
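
The sentence-level formulas can be sketched with a standard dynamic-programming LCS. This covers only the sentence-level case on pre-tokenized input, not the summary-level variant; the function names are illustrative.

```python
def lcs_length(x, y):
    """Length of the longest common subsequence, using O(len(y)) memory."""
    prev = [0] * (len(y) + 1)
    for xi in x:
        cur = [0]
        for j, yj in enumerate(y, start=1):
            if xi == yj:
                cur.append(prev[j - 1] + 1)
            else:
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]

def rouge_l(ref, hyp, beta=1.0):
    """Sentence-level ROUGE-L F-score for tokenized reference and hypothesis."""
    lcs = lcs_length(ref, hyp)
    if lcs == 0:
        return 0.0
    r_lcs, p_lcs = lcs / len(ref), lcs / len(hyp)
    return (1 + beta**2) * r_lcs * p_lcs / (r_lcs + beta**2 * p_lcs)

ref = "police killed the gunman".split()
print(rouge_l(ref, "police kill the gunman".split()))  # 0.75 (LCS: police, the, gunman)
print(rouge_l(ref, "the gunman kill police".split()))  # 0.5  (LCS: the, gunman)
```

Both hypotheses share three unigrams with the reference, but the second scores lower because only two of them appear in the reference order.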

ROUGE-W (weighted LCS)

While the LCS used in ROUGE-L captures word order, it does not distinguish consecutive matches from matches scattered across the sentence. ROUGE-W is a weighted LCS that addresses this by giving a larger reward to subsequences that match consecutively over a long span. Specifically, it uses a weight function \(f\) satisfying \(f(x+y) > f(x) + f(y)\) for positive integers \(x, y\), for example \(f(k) = k^2\), and is computed by dynamic programming (DP).
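
A sketch of the weighted-LCS dynamic program with \(f(k) = k^2\) is shown below. The table layout follows the usual description of the algorithm (one table for scores, one for the length of the current consecutive run); the recall helper assumes \(f^{-1}(x) = \sqrt{x}\), which holds only for this particular choice of \(f\).

```python
import math

def wlcs(x, y, f=lambda k: k * k):
    """Weighted LCS score: consecutive matches of length k earn f(k)."""
    m, n = len(x), len(y)
    c = [[0] * (n + 1) for _ in range(m + 1)]    # weighted scores
    run = [[0] * (n + 1) for _ in range(m + 1)]  # consecutive-match length ending at (i, j)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if x[i - 1] == y[j - 1]:
                k = run[i - 1][j - 1]
                c[i][j] = c[i - 1][j - 1] + f(k + 1) - f(k)
                run[i][j] = k + 1
            else:
                c[i][j] = max(c[i - 1][j], c[i][j - 1])  # run length resets to 0
    return c[m][n]

def rouge_w_recall(ref, hyp):
    """Recall for f(k) = k^2, using the inverse f^{-1}(x) = sqrt(x)."""
    return math.sqrt(wlcs(ref, hyp) / len(ref) ** 2)

ref = list("ABCDEFG")
print(wlcs(ref, list("ABCDHIK")))  # 16: four consecutive matches
print(wlcs(ref, list("AHBKCID")))  # 4: four scattered matches
```

Both candidates have an ordinary LCS of length 4 with the reference, so ROUGE-L cannot separate them, while the weighted score does.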

ROUGE-S (skip-bigram)

A metric that evaluates matching using ordered pairs of two words (skip-bigrams) as the unit. Any word intervening between the two words is allowed. Letting the number of matches be \(SKIP2(X,Y)\), the recall and precision are obtained as follows.

\[ R_{skip2} = \frac{SKIP2(X,Y)}{\binom{m}{2}}, \qquad P_{skip2} = \frac{SKIP2(X,Y)}{\binom{n}{2}} \]

Also, to prevent unlimited skips, there is a variation that imposes a maximum constraint (\(d_{skip}\)) on the distance between words. In that case, the denominator is also replaced with the total number of skip-bigrams that can be generated within that distance constraint.
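
Without a distance limit, the skip-bigram counts can be sketched with `itertools.combinations`, which yields ordered pairs in sentence order. The sketch below does not implement the \(d_{skip}\) constraint, and the function names are illustrative.

```python
from collections import Counter
from itertools import combinations
from math import comb

def skip_bigrams(tokens):
    """All in-order word pairs (any gap allowed), as a multiset."""
    return Counter(combinations(tokens, 2))

def rouge_s(ref, hyp):
    """Return (recall, precision) based on skip-bigram overlap."""
    if len(ref) < 2 or len(hyp) < 2:
        return 0.0, 0.0  # no skip-bigrams can be formed
    overlap = sum((skip_bigrams(ref) & skip_bigrams(hyp)).values())
    return overlap / comb(len(ref), 2), overlap / comb(len(hyp), 2)

ref = "police killed the gunman".split()
print(rouge_s(ref, "police kill the gunman".split()))  # (0.5, 0.5): 3 of 6 pairs match
print(rouge_s(ref, "the gunman kill police".split()))  # only ("the", "gunman") matches
```

Because every in-order pair is counted, the number of skip-bigrams grows quadratically with sentence length, which is one practical motivation for the distance limit.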

6. METEOR: a generalization using explicit alignment and lexical resources

METEOR is an evaluation metric designed in response to BLEU's "bias toward precision." Unlike N-gram-based metrics such as BLEU, it computes an explicit word alignment and obtains the score by applying a fragmentation penalty to an F-score that emphasizes recall. Its most distinctive feature is that it can go beyond strict surface agreement by using external lexical resources such as stemming, synonyms, and paraphrases.

6.1 1-to-1 alignment and hierarchical matching modules

The foundation of METEOR's computation is the construction of a one-to-one alignment between the hypothesis sentence and the reference sentence. For the hypothesis-side word sequence \(t_1,\dots,t_m\) and the reference-side word sequence \(r_1,\dots,r_n\), we define an alignment set \(A \subseteq \{1,\dots,m\} \times \{1,\dots,n\}\) in which each word corresponds at most once.

In cases where multiple alignment candidates exist, METEOR first prioritizes the alignment that "maximizes the number of corresponding words." If candidates still cannot be narrowed down, it selects the alignment with "the fewest crossings." Two correspondences \((t_i,r_j)\) and \((t_k,r_l)\) cross when their positions satisfy \((i-k)(j-l) < 0\).

The basic matching process in the original paper is applied in the following hierarchical module order.

Exact match
Stem match
WordNet synonym match

Because each later module is applied only to words not yet matched by the earlier processing, comprehensive matching combining multiple modules is achieved.

6.2 Introduction of the chunk penalty

Letting the number of words matched by the alignment be \(M = |A|\), the precision and recall are \(P = M/|h|\) and \(R = M/|r|\), respectively. In the basic design of the 2005 version, recall is weighted far more heavily than precision, and a recall-weighted harmonic mean (\(F_{mean}\)) is computed as follows.

\[ F_{mean} = \frac{10PR}{R+9P} \]

In addition, METEOR incorporates disruptions of word order and losses of fluency into the evaluation as a "chunk penalty." A chunk is a run of matched words that are adjacent and in the same order in both the hypothesis and the reference, and the matched words are grouped into the fewest possible chunks. Letting this number of chunks be \(ch\), the penalty is computed by the following formula.

\[ Penalty = 0.5\left(\frac{ch}{M}\right)^3 \]

If the word arrangement matches completely, the number of chunks is 1 and the penalty is minimal; the more the matched words become fragmented and dispersed, the more the number of chunks increases and the larger the penalty becomes.

The final METEOR score is obtained by discounting the computed F-score by the penalty, as follows.

\[ METEOR = F_{mean}(1-Penalty) \]

Note that in tasks where multiple references exist, a score is computed individually against each reference, and the maximum is adopted as the evaluation value for that sentence.
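
The 2005 formulas can be illustrated with a deliberately restricted sketch. It assumes exact matching only (no stemming or synonyms) and rejects sentences with repeated words, so that the one-to-one alignment is unique and no search over alignments is needed. The function name `meteor_exact` is hypothetical.

```python
def meteor_exact(hyp, ref):
    """METEOR (2005 formulas) with exact matches and a single reference."""
    if len(set(hyp)) != len(hyp) or len(set(ref)) != len(ref):
        raise ValueError("this sketch assumes no repeated words")
    ref_pos = {word: j for j, word in enumerate(ref)}
    # Alignment as (hypothesis index, reference index) pairs, in hypothesis order.
    align = [(i, ref_pos[word]) for i, word in enumerate(hyp) if word in ref_pos]
    m = len(align)
    if m == 0:
        return 0.0
    p, r = m / len(hyp), m / len(ref)
    f_mean = 10 * p * r / (r + 9 * p)
    # A new chunk starts whenever two successive matches are not adjacent
    # in both the hypothesis and the reference.
    chunks = 1 + sum(
        1 for (i1, j1), (i2, j2) in zip(align, align[1:])
        if i2 != i1 + 1 or j2 != j1 + 1
    )
    penalty = 0.5 * (chunks / m) ** 3
    return f_mean * (1 - penalty)

ref = "police killed the gunman".split()
print(meteor_exact("police killed the gunman".split(), ref))  # 0.9921875 (1 chunk)
print(meteor_exact("the gunman police killed".split(), ref))  # 0.9375 (2 chunks)
```

Even an identical hypothesis does not reach 1.0 under these formulas, because a single chunk still incurs the penalty \(0.5\,(1/M)^3\).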

6.3 Generalization of parameters by METEOR Universal

In subsequent research such as "METEOR Universal," considering application to diverse languages and tasks, parameters are generalized, such as the weighting of each matching module and the distinction between content words and function words.

Letting the weight of module \(i\) be \(w_i\) and the evaluation weight of content words be \(\delta\), and classifying and computing the words of the hypothesis and reference respectively, the computation of precision and recall is extended as follows.

\[ P = \frac{\sum_i w_i\bigl(\delta m_i(h_c) + (1-\delta)m_i(h_f)\bigr)}{\delta |h_c| + (1-\delta)|h_f|} \]

\[ R = \frac{\sum_i w_i\bigl(\delta m_i(r_c) + (1-\delta)m_i(r_f)\bigr)}{\delta |r_c| + (1-\delta)|r_f|} \]

Here, \(h_c, h_f\) denote the content words and function words of the hypothesis sentence, \(r_c, r_f\) denote those of the reference sentence, \(|\cdot|\) denotes their number, and \(m_i(\cdot)\) denotes the number of those words matched by module \(i\).

Furthermore, the computation of \(F_{mean}\) and the penalty is also generalized by introducing hyperparameters \(\alpha, \beta, \gamma\) as follows.

\[ F_{mean} = \frac{PR}{\alpha P + (1-\alpha)R} \]

\[ Pen = \gamma \left(\frac{ch}{m}\right)^{\beta} \]

\[ Score = (1-Pen)F_{mean} \]

Here, \(m\) is the average of the numbers of matched (covered) words on the hypothesis and reference sides; it is unrelated to the hypothesis length \(m\) used in Section 6.1.

The initial 2005-version METEOR can be interpreted as a special case of this general form. Specifically, with all module weights \(w_i = 1\) and \(\delta = 0.5\) (so that content and function words count equally, giving \(P = M/|h|\) and \(R = M/|r|\)), setting \(\alpha = 0.9\), \(\gamma = 0.5\), \(\beta = 3\) (with \(m = M\)) reproduces the initial formula. Because METEOR allows this flexibility in alignment structure and parameter tuning, it supports finer-grained evaluation than simple surface matching.
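
The special-case relationship can be checked numerically. The sketch below implements the generalized \(F_{mean}\) and penalty as plain functions (the names are illustrative) and compares them with the 2005 formulas for arbitrary example values of \(P\), \(R\), \(ch\), and \(m\).

```python
def general_score(p, r, ch, m, alpha, beta, gamma):
    """Generalized form: (1 - gamma * (ch/m)^beta) * PR / (alpha*P + (1-alpha)*R)."""
    f_mean = p * r / (alpha * p + (1 - alpha) * r)
    pen = gamma * (ch / m) ** beta
    return (1 - pen) * f_mean

def score_2005(p, r, ch, m):
    """Original form: F_mean = 10PR / (R + 9P), Penalty = 0.5 * (ch/M)^3."""
    f_mean = 10 * p * r / (r + 9 * p)
    return f_mean * (1 - 0.5 * (ch / m) ** 3)

# Example values chosen arbitrarily for the check.
p, r, ch, m = 0.8, 0.6, 2, 6
a = general_score(p, r, ch, m, alpha=0.9, beta=3, gamma=0.5)
b = score_2005(p, r, ch, m)
print(abs(a - b) < 1e-12)  # True
```

Multiplying the numerator and denominator of the generalized \(F_{mean}\) by 10 with \(\alpha = 0.9\) gives \(10PR/(9P + R)\), which is why the two agree.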

7. Conclusion

This article organized the mathematical definitions and design philosophies of representative surface-matching evaluation metrics for text-generation tasks. To summarize the characteristics of each metric:

BLEU: A classic metric for machine translation that multiplies the geometric mean of word N-gram precisions by a brevity penalty (a penalty for outputs that are too short). Its scores are comparable only when tokenization and other settings are fixed and reported.
chrF / chrF++: chrF is computed based on the F-score of character N-grams and is highly robust to morphological change and differences in tokenization method. chrF++ is an extended version that adds short word N-grams to this, compensating for sensitivity to word order.
ROUGE: A family of metrics designed to emphasize recall; in summarization tasks, ROUGE-N and ROUGE-L are mainly used as standards.
METEOR: A metric with a more expressive design than BLEU, combining explicit word alignment, weighting toward recall, a chunk penalty that penalizes word-order disruption, and external lexical resources.

These surface-matching metrics cannot completely evaluate the "semantic equivalence itself" between the system output and the reference sentence. Nevertheless, because they are computationally light and easy to implement, allow easy quantitative comparison on a reference basis, and have high compatibility with past benchmarks, they still play an extremely important role in the experimental frameworks of model evaluation.

As a best practice for system evaluation in practice, it is recommended not to rely on a single metric but to report multiple metrics together and evaluate from multiple angles. For example, by combining metrics such as "BLEU + chrF" for machine translation tasks and "ROUGE-N + ROUGE-L" for summarization tasks, one can alleviate the biases of specific preprocessing or metrics and enable safer, more reliable performance measurement.

8. References

The formula definitions and algorithm explanations in this article are based on the following original papers and implementation standards.

Papineni, K., Roukos, S., Ward, T., & Zhu, W. J. (2002). "BLEU: a Method for Automatic Evaluation of Machine Translation". ACL 2002.
Chen, B., & Cherry, C. (2014). "A Systematic Comparison of Smoothing Techniques for Sentence-Level BLEU". WMT 2014.
Post, M. (2018). "A Call for Clarity in Reporting BLEU Scores". WMT 2018.
Popović, M. (2015). "chrF: character n-gram F-score for automatic MT evaluation". WMT 2015.
Popović, M. (2016). "chrF deconstructed: beta parameters and n-gram weights". WMT 2016.
Popović, M. (2017). "chrF++: words helping character n-grams". WMT 2017.
Lin, C.-Y. (2004). "ROUGE: A Package for Automatic Evaluation of Summaries". Workshop on Text Summarization Branches Out.
Banerjee, S., & Lavie, A. (2005). "METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments". ACL Workshop 2005.
Denkowski, M., & Lavie, A. (2014). "Meteor Universal: Language Specific Translation Evaluation for Any Target Language". WMT 2014.
sacreBLEU project (GitHub). A library that provides a comparable, standardized implementation of BLEU, chrF, and TER. It is widely used as a standard for handling multiple references and managing signatures in practice.

Last Modified: April 17, 2026