charsim: A character n-gram similarity robust to parameter choices

This article introduces charsim, a surface-level string similarity metric proposed by imos. By measuring character n-gram overlaps, charsim quantifies how closely a candidate string matches a set of reference sentences. It can be formulated in two primary ways:

Base form (charsim_base): Computes the minimum of precision and recall, $\min(P,R)$, for each reference individually, and then averages these scores across all references (§2).
Main definition (mean-length version): Treats the entire reference set as a single averaged profile. It first averages the overlap and the reference length separately, and then normalizes the mean overlap by $\max(L_t,\bar L_r)$ (§3). Because the reference set is aggregated into a single average target, evaluating against a mixed set of references with widely varying lengths or surface forms will result in a lower score. A major advantage of this version is that the reference side can be precomputed, enabling significantly faster evaluation.

In this article, charsim alone refers to the mean-length version. When distinguishing it from the base form, we explicitly call the latter charsim_base.

The main definition is given by the following expression:

\[ \boxed{\;\operatorname{charsim}(t;\mathcal R) = \frac{\bar M}{\max\bigl(L_t,\ \bar L_r\bigr)} = \min\Bigl(\underbrace{\tfrac{\bar M}{L_t}}_{\text{precision}},\ \underbrace{\tfrac{\bar M}{\bar L_r}}_{\text{recall}}\Bigr)\;} \]

Here, $\bar M$, $L_t$, and $\bar L_r$ are obtained by computing the overlap, the candidate n-gram count, and the reference n-gram count for each order $n$. These values are multiplied by a weight of $1/n$ and summed from $n=1$ to $N$: \[ \bar M = \sum_{n=1}^{N}\frac{1}{n}\,\bar m_n,\qquad L_t = \sum_{n=1}^{N}\frac{1}{n}\,\ell_{t,n},\qquad \bar L_r = \sum_{n=1}^{N}\frac{1}{n}\,\bar \ell_{r,n} \] where $\bar m_n$ is the mean overlap with the reference set for order-$n$ n-grams, $\ell_{t,n}$ is the number of order-$n$ n-grams in the candidate, and $\bar \ell_{r,n}$ is the mean number of order-$n$ n-grams across the reference set. Dividing $\bar M$ by the candidate length $L_t$ gives precision, dividing it by the mean reference length $\bar L_r$ gives recall, and dividing it by the larger of the two yields $\min(P,R)$. Note that a "character" refers to a single Unicode code point (see §1.1 for details on normalization and whitespace handling).

Motivation and Design Goals

When evaluating a generative model's output against reference sentences, we need to quantify how closely a candidate string matches a small set of references. A suitable metric for this task should satisfy the following four properties:

It assigns higher scores to candidates that more closely match the surface form of the references.
It penalizes candidates that are too short relative to the references (indicating missing information).
It penalizes candidates that are too long (indicating needless padding).
It does not overrate candidates that fail to capture the surface features shared by the majority of the reference set.

While existing surface-level metrics like BLEU[1] and chrF[2] address these properties to some extent, they face challenges such as awkward length penalties, discontinuous score drops when high-order n-gram matches fall to zero, and ambiguous handling of multiple references (§5). charsim integrates all these considerations naturally into a single mathematical expression. Its core idea—embodied in its base form—is to calculate the minimum of precision and recall for each reference individually, and then average these values across all references. In this context, precision and recall refer to the n-gram overlap divided by the candidate length and the reference length, respectively (§2.3). Taking the minimum of the two automatically penalizes both excessive length and information shortfall. In the rest of this article, we will first detail the base form, introduce the mean-length version (which is better suited for efficient implementation) as the main definition, and finally compare charsim with BLEU and chrF.

1. Notation

1.1 Input

Let $t$ denote the candidate string being evaluated, $r$ a reference sentence, and $\mathcal{R} = (r^{(1)}, \dots, r^{(K)})$ a sequence (multiset) of multiple reference sentences.

Unless otherwise noted, a "character" refers to a single Unicode code point. To ensure consistency across different implementations, charsim does not apply any intrinsic text normalization. Any necessary normalization should be handled during task-specific preprocessing. For rigorous numerical comparisons, it is essential to report the exact normalization steps applied, including how whitespace and letter casing were handled, as these directly affect the evaluation conditions. In the following sections, we treat $t$ and $\mathcal R$ as fixed, and therefore omit them from the arguments of quantities that depend on them (e.g., the overlap $m_{i,n}$, the weighted overlap $M_i$).

1.2 Character n-grams

For any string $x$, we extract character n-grams of order $n = 1, 2, \dots$. Although there is no theoretical upper limit to the order, in practice, we cap it at a sufficiently large constant $N$ (§2.5).

To capture boundary information, we pad the string with one boundary symbol on each side before extracting n-grams. For instance, if we denote the start boundary as ^ and the end boundary as $, the string cat becomes ^cat$. Its 3-grams are ^ca, cat, and at$, and its 2-grams are ^c, ca, at, and t$. Regardless of the n-gram order, exactly one boundary symbol is added to each end. Each extracted n-gram is counted individually. (Note: ^ and $ are used here for illustration; an actual implementation must use distinct boundary markers that do not collide with any characters in the input text.)

Because every string receives exactly one boundary on each side, counting boundary-only n-grams would cause all strings to spuriously match on their boundaries. To prevent this, we exclude any n-gram composed exclusively of boundary symbols. For an empty string $x=\epsilon$, its padded form ^$ consists entirely of boundary symbols and is thus excluded, resulting in an n-gram count of 0 across all orders. Consequently, two strings that share no actual characters will yield a score of exactly 0.

The number of order-$n$ n-grams is given by the following formula, which derives from the padded string length $|x|+2$ minus the excluded boundary-only portion:

\[ \ell_{x,n} = \begin{cases} 0 & (|x| = 0)\\[2pt] |x| & (n = 1)\\[2pt] |x| + 3 - n & (2 \le n \le |x| + 2)\\[2pt] 0 & (n > |x| + 2) \end{cases} \]

For $n=1$, the n-grams represent the multiset of individual characters, so $\ell_{x,1}=|x|$. The number of n-grams decreases as the order $n$ increases. Once $n > |x|+2$, no further n-grams can be formed, and the count drops to 0. Since a shorter string naturally exhausts its n-grams at a lower order, this property establishes a natural upper bound for $n$ (§2.5).

1.3 n-gram frequency and overlap

Let $c_x(g)$ denote the occurrence count of a character n-gram $g$ in string $x$. While aggregated order-dependent variables like $m_{i,n}$ include a subscript $n$, we do not explicitly mark the order of $g$, as it is inherently defined by its length. From here on, $\sum_g$ denotes the summation over all n-grams of the current order, excluding those composed entirely of boundary symbols.

When the same n-gram $g$ appears multiple times within a sentence, charsim counts each occurrence. The order-$n$ overlap between two sentences $x$ and $y$ is defined as the number of elements they share as multisets:

\[ \sum_g \min\bigl(c_x(g),\, c_y(g)\bigr) \]

This expression applies no weighting. Even if a fragment repeats many times in one sentence, its overlap is strictly capped by its occurrence count in the other sentence. This core design principle is shared with both BLEU and chrF.

2. Base form

2.1 Overlap with each reference

The order-$n$ overlap with a reference $r^{(i)}$ is defined below. This value measures how much the character n-gram multisets of the candidate and the reference have in common.

\[ m_{i,n} = \sum_g \min\bigl(c_{r^{(i)}}(g),\, c_t(g)\bigr) \]

2.2 Weight by 1/|s| and sum across orders

Next, each n-gram $s$ is weighted by $1/|s| = 1/n$, and these weighted values are summed together. Since order $n$ contains $\ell_{x,n}$ n-grams, their total weighted sum equals $\ell_{x,n}/n$. By summing up to an upper bound $N$ (§2.5), the overlap, the candidate n-gram count, and the reference n-gram count are each consolidated into a single weighted scalar:

\[ M_i = \sum_{n=1}^{N}\frac{1}{n}\,m_{i,n},\qquad L_t = \sum_{n=1}^{N}\frac{1}{n}\,\ell_{t,n},\qquad L_i = \sum_{n=1}^{N}\frac{1}{n}\,\ell_{i,n} \]

2.3 Normalizing by length to obtain min(precision, recall)

The score for reference $i$ is defined by normalizing the weighted overlap by the larger of the two weighted lengths:

\[ s_i = \frac{M_i}{\max\bigl(L_t,\, L_i\bigr)} \]

Because $m_{i,n} \le \ell_{t,n}$ and $m_{i,n} \le \ell_{i,n}$ hold for every order, summing them with $1/n$ weights guarantees that $M_i \le L_t$ and $M_i \le L_i$. This yields the following relationship:

\[ M_i \le \min\bigl(L_t, L_i\bigr) \le \max\bigl(L_t, L_i\bigr) \]

This relation guarantees that $0 \le s_i \le 1$ always holds.

We can define precision and recall as follows:

\[ P_i = \frac{M_i}{L_t}, \qquad R_i = \frac{M_i}{L_i} \]

By choosing the larger value for the denominator, the resulting fraction is minimized. Therefore:

\[ s_i = \frac{M_i}{\max(L_t, L_i)} = \min\bigl(P_i,\, R_i\bigr) \]

This is the core concept behind charsim. Crucially, $\min(P,R)$ is not computed order-by-order, but rather on the weighted quantities after they have been aggregated across all orders. If the candidate is too long, the denominator becomes $L_t$, naturally applying a precision-based penalty. Conversely, if the candidate is too short, the denominator becomes $L_i$, applying a recall-based penalty.

2.4 Combining multiple references

The final multi-reference score is calculated by averaging the scores of the individual references:

\[ \operatorname{charsim}_{\mathrm{base}}(t;\mathcal R) = \frac{1}{K}\sum_{i=1}^{K} s_i = \frac{1}{K}\sum_{i=1}^{K} \frac{M_i}{\max\bigl(L_t, L_i\bigr)} = \frac{1}{K}\sum_{i=1}^{K} \min\bigl(P_i, R_i\bigr) \]

This formulation mathematically represents the expected value of $\min(P,R)$ when a single reference is drawn at random (§5.3).

2.5 Order truncation and exact matches

Since n-grams are exhausted by order $n \le |x|+2$ even with boundary symbols (§1.2), the summation naturally terminates. The maximum possible order in any given comparison is:

\[ N' = \max\bigl(|t|,\ \max_i |r^{(i)}|\bigr) + 2 \]

For any order beyond $N'$, $\ell_{x,n}=0$, so those terms do not affect the score. Thus, in theory, an explicit cap on the order is unnecessary. However, generating and matching n-grams up to $N'$ for very long inputs is computationally expensive (§7.5). Therefore, practical implementations cap the order at a constant value. The only strict requirement is that this cap should be significantly larger than chrF's default (character 6-grams). Because the match rate drops at higher orders and the $1/n$ weighting further diminishes their impact (unlike uniform weighting; see §5.7 and §6), raising the cap beyond a certain point barely changes the score for typical natural-language text while merely increasing computation time (§7.5). Since finding a "perfect" cap offers little practical benefit, this article adopts a default cap of 32:

\[ N = \min\bigl(N',\ 32\bigr) \]

For short strings, $N=N'$, so the cap is never reached (as demonstrated in the numerical example in §3.4). For longer inputs, $N=32$. However, if evaluating texts where high-order matches are highly meaningful—such as set phrases, repetitive text, or source code—the cap can be reasonably increased.

An exact match automatically results in a score of 1. If $t=r$, then $m_n = \ell_{t,n} = \ell_{r,n}$ across all orders, making $M = L_t = L_r$ and thus $s = 1$. Because the identical $1/n$-weighted sums in the numerator and denominator cancel each other out, an exact match always scores exactly 1, regardless of where the cutoff order $N$ is set.

It is also worth noting that a shorter reference will exhaust its n-grams at $n > |r^{(i)}|+2$, making $\ell_{i,n}=0$ beyond that point (§1.2). As a result, only the longer candidate will continue to accumulate weight at higher orders. This drives $L_t$ higher and reduces precision, effectively serving as an additional penalty against needless padding (§4.3).

2.6 Properties of the base form

Length correction is applied precisely on a per-reference basis, since the denominator is calculated individually for each reference.
The aggregated score can be directly interpreted as an expected $\min(P, R)$.
An exact match against any single reference yields a score of exactly 1.

Its primary drawback is computational. Because the denominator $\max(L_t, L_i)$ changes depending on the specific reference, the base form does not allow the reference set to be efficiently precomputed for rapidly scoring many different candidates. For this reason, the mean-length version—detailed in the next section—is adopted as the main practical definition.

3. Main definition (mean-length version)

3.1 Averaging the numerator and denominator separately

While the base form computes the scores for individual references and then averages them, the mean-length version first averages the numerator (overlap) and the denominator (length) separately, performing the division only at the end. Therefore, it does not measure the average of the per-reference normalized scores, but rather the candidate's match against the reference set's "average profile"—specifically, the mean overlap normalized by the mean reference length. This is a distinct approach rather than just an approximation of the base form; it shifts the unit of aggregation from individual references to the overall average of the set (see §3.3 for their detailed relationship and assumptions). The numerator, $\bar M$, represents the mean overlap, calculated by summing the $1/n$-weighted overlaps:

\[ \bar m_n = \frac{1}{K}\sum_{i=1}^{K} m_{i,n},\qquad \bar M = \sum_{n=1}^{N}\frac{1}{n}\,\bar m_n \]

The denominator, $\bar L_r$, is the mean order-$n$ n-gram count across the reference set, $\bar \ell_{r,n}$, similarly summed with the $1/n$ weights:

\[ \bar \ell_{r,n} = \frac{1}{K}\sum_{i=1}^{K} \ell_{i,n},\qquad \bar L_r = \sum_{n=1}^{N}\frac{1}{n}\,\bar \ell_{r,n} \]

Accordingly, the score is defined as follows:

\[ \boxed{\;\operatorname{charsim}(t;\mathcal R) = \frac{\bar M}{\max\bigl(L_t,\, \bar L_r\bigr)} = \min\bigl(P, R\bigr),\qquad P = \frac{\bar M}{L_t},\ R = \frac{\bar M}{\bar L_r}\;} \]

Since $m_{i,n} \le \ell_{t,n}$ and $m_{i,n} \le \ell_{i,n}$ for each reference, averaging ensures that $\bar M \le L_t$ and $\bar M \le \bar L_r$, thereby guaranteeing $0 \le \operatorname{charsim} \le 1$. Expanded form:

\[ \operatorname{charsim}(t;\mathcal R) = \frac{\displaystyle\sum_{n=1}^{N}\frac{1}{n}\cdot\frac{1}{K}\sum_{i=1}^{K}\sum_g \min\bigl(c_{r^{(i)}}(g),\,c_t(g)\bigr)}{\max\Bigl(\displaystyle\sum_{n=1}^{N}\frac{1}{n}\ell_{t,n},\ \sum_{n=1}^{N}\frac{1}{n}\bar \ell_{r,n}\Bigr)} \]

3.2 Why use the mean length

We adopt the mean as the representative reference length. No matter which representative value is chosen, $m_{i,n}\le \ell_{t,n}$ implies $\bar M\le L_t$. Because the denominator is always at least $L_t$, $\operatorname{charsim}\le1$ is universally guaranteed. The mean is specifically chosen because the numerator $\bar M$ already represents an average of the per-reference overlaps. Aligning the denominator with the mean length ensures that $\bar M\le\bar L_r$ also holds, allowing the recall component to be clearly interpreted as "the mean overlap relative to the mean reference length." Furthermore, when the reference set is static, $\bar L_r$ can be precomputed, simplifying implementation (§7). While one could conceive of a variant using the median, mixing different aggregation methods for the numerator (mean) and denominator (median) makes the resulting formula harder to interpret (§7.3).

3.3 Relationship to the base form and conditions for application

By altering the order of aggregation, the mean-length version is not merely an approximation of the base form. While the base form follows a "divide then average" approach, the mean-length version uses an "average then divide" methodology.

The former quantifies the expected value of $\min(P,R)$ when randomly selecting a single reference, whereas the latter measures how well the candidate matches the average profile of the entire reference set. We adopt the mean-length version as the main definition for three key reasons:

The denominator collapses to a single value, making the mathematical expression concise.
The reference-side quantities can be precomputed, which makes evaluation easy to speed up (§7).
Its length-dependent penalties (acting like recall for short candidates and precision for long candidates) behave identically to those in the base form.

The two scores diverge most noticeably when reference lengths vary widely. For example, evaluating the candidate a against the reference set ["a", "abcdefghij"] (one short, one long) yields approximately 0.528 in the base form, but only 0.134 in the mean-length version. This happens because the mean-length version heavily penalizes any candidate that deviates from the average profile; thus, mixing references of vastly different lengths drives the score down, even if the candidate perfectly matches one of them. Whether this divergence is practically harmful depends on the task. However, if reference lengths vary substantially, it is safer to avoid the mean-length version and instead use the base form or cluster the references by length.

For "matching the average profile" to be a meaningful concept, three conditions should ideally be met: (a) reference lengths should be of comparable magnitude; (b) the mean should be a reasonable representative value (i.e., the distribution shouldn't be multimodal or dominated by outliers); and (c) drastically different correct answers shouldn't be lumped into the same set. If these conditions are violated, you should consider splitting the reference set, applying weights, or reverting to the base form (§5.3).

3.4 A small numerical example

Let's consider a single reference $\mathcal R = [\texttt{cat}]$ (since $K=1$, the base form and mean-length version are identical). We will vary the candidate $t$ across four cases—cat (exact match), ca (one character short), cats (one character over), and catcat (doubled padding)—and compare their $\operatorname{charsim}$ scores.

Let's examine the too-short candidate ca in detail. The n-grams for cat (padded as ^cat$) extend up to $n=5$, making $N'=\max(2,3)+2=5\le 32$, so $N=5$. The table below shows the order-by-order overlap $m_n$ (§2.1), the n-gram counts $\ell_{t,n}$ and $\ell_{r,n}$ (§1.2), and the result of multiplying each row by $1/n$. Note that ca is evaluated as ^ca$ and cat as ^cat$. Because $K=1$, the barred quantities equal their single-reference counterparts ($\bar M = M$, $\bar L_r = L_r$), but we retain the barred notation for consistency with the main definition.

Order $n$	1	2	3	4	5
$m_n$	2	2	1	0	0
$\tfrac1n m_n$	2	1	1/3	0	0
$\ell_{t,n}$	2	3	2	1	0
$\tfrac1n \ell_{t,n}$	2	3/2	2/3	1/4	0
$\ell_{r,n}$	3	4	3	2	1
$\tfrac1n \ell_{r,n}$	3	2	1	1/2	1/5

Summing the weighted rows across all $n$ yields $\bar M = 2+1+\tfrac13 = \tfrac{10}{3}$, $L_t = \tfrac{53}{12}$, and $\bar L_r = \tfrac{67}{10}$. Because the candidate is short ($L_t < \bar L_r$), the denominator becomes $\bar L_r$, and the recall component dictates the penalty:

\[ P = \frac{10/3}{53/12} = \frac{40}{53} \approx 0.755\\ R = \frac{10/3}{67/10} = \frac{100}{201} \approx 0.498\\ \operatorname{charsim}(\texttt{ca};[\texttt{cat}]) = \min(P,R) \approx 0.498 \]

Performing the same computation for all four candidates yields the results below. For the excessively long candidates cats and catcat, $L_t > \bar L_r$, so the precision-side penalty is triggered (§4.3).

Candidate $t$	Description	Active side	$N'$	$\operatorname{charsim}$
`cat`	exact match	$P=R=1$	5	1.000
`cats`	one character over	precision	6	0.592
`ca`	one character short	recall	5	0.498
`catcat`	double-length padding	precision	8	0.449

Only an exact match achieves a score of exactly 1; any deviation in length—whether short or long—automatically lowers the score via $\min(P,R)$. Notably, even though catcat fully contains the reference, its padding triggers a significant precision penalty. Furthermore, because the reference exhausts its n-grams at high orders and cannot contribute further overlap (§2.5), the multiset overlap is hard-capped by the reference's occurrence counts (§4.6), causing the score to drop substantially.

4. Behavior of the mean-length version

This section details how the main definition, $\operatorname{charsim} = \bar M / \max(L_t, \bar L_r) = \min(P,R)$, behaves across various scenarios.

4.1 When length and content are both close

If $L_t \approx \bar L_r$ and the overlap is near its maximum, then $\bar M \approx L_t \approx \bar L_r$, leading to $\operatorname{charsim} \approx 1$. As intended, higher similarity yields a score closer to 1.

4.2 When the candidate is too short

When $L_t < \bar L_r$, the denominator becomes $\bar L_r$, meaning $\operatorname{charsim} = \bar M / \bar L_r$. Even if the candidate is a perfect prefix of the reference, its weighted overlap is bounded by $L_t$, leading to:

\[ \operatorname{charsim} \lesssim \frac{L_t}{\bar L_r} \]

Thus, missing length directly translates into a recall-based penalty.

4.3 When the candidate is too long

When $L_t > \bar L_r$, the denominator becomes $L_t$, making $\operatorname{charsim} = \bar M / L_t$. Even if the candidate fully contains the reference but includes extraneous trailing text, its overlap saturates near the reference length, resulting in:

\[ \operatorname{charsim} \lesssim \frac{\bar L_r}{L_t} \]

Consequently, needless padding triggers a precision-based penalty. Moreover, if the reference is short, its n-grams are exhausted at higher orders, so only the longer candidate continues to contribute weight. This further inflates $L_t$, strengthening the penalty (§2.5).

4.4 How much the score rises as matches increase

If we treat length as fixed, the denominator $\max(L_t, \bar L_r)$ becomes a constant $D$. Under this condition, $\operatorname{charsim} = \bar M / D$ scales linearly with the weighted overlap $\bar M$. By decomposing $\bar M = \sum_n \frac1n \bar m_n$ by n-gram order, we get:

\[ \operatorname{charsim} = \sum_{n=1}^{N} \frac{1}{n}\cdot\frac{\bar m_n}{D} \]

Thus, each order's contribution, $\frac{\bar m_n}{nD}$, is summed linearly. For a fixed length and fixed denominator $D$, every new order-$n$ overlap match adds exactly $\frac{1}{nD}$ to the score, allowing us to easily trace the impact of each order. (Note that in reality, modifying a single character changes n-grams across multiple orders simultaneously, and changing the candidate's length also alters $D$.)

4.5 Exact matches and their neighborhood

For a single reference $\mathcal R = [r]$, if $t = r$, then $m_n = \ell_{r,n}$ for all orders, which means $\bar M = L_t = \bar L_r$ and $\operatorname{charsim}(r; [r]) = 1$. If we begin introducing single-character errors, only the n-grams spanning those specific errors are lost. Because this happens order-by-order, the score does not abruptly crash to 0; instead, it decreases smoothly in proportion to the lost overlap. (However, a single-character edit can still have a large relative impact on very short strings).

4.6 Repeated fragments

Because overlap is computed using $\min(c_r(g), c_t(g))$, a repeated fragment is counted as many times as it appears—but strictly capped at its total occurrences in the reference. Repetitions that exist in the reference are properly credited, while artificially repeating a fragment in the candidate side yields no extra points. A metric that multiplied raw occurrence counts could be inflated simply by repeating the most frequent n-gram, but taking the multiset overlap ($\min$) prevents this. This property isn't unique to charsim; it is shared by BLEU, which clips candidate counts against the reference, and chrF, which also uses $\min$ for overlap.

5. Differences from BLEU / chrF

charsim offers several advantages: what it measures is directly expressed in its formula, it does not collapse during sentence-level evaluation, and its n-gram order cap is easy to set (since the choice of cap has relatively little impact). This section compares charsim against the standard designs of BLEU[1], chrF[2], and chrF++[3].

5.1 Length handling is integrated into a single expression

At its core, BLEU is a precision-based metric that handles length shortfall using a separate "brevity penalty" term. chrF calculates precision and recall independently, then combines them into an F-score. In contrast, charsim uses the following normalization after aggregation:

\[ \frac{M_i}{\max(L_t, L_i)} = \min\left( \frac{M_i}{L_t},\, \frac{M_i}{L_i} \right) \]

This approach integrates both length excess and length shortfall directly into a single expression. A defining feature of charsim is that length consistency is intrinsically baked into its definition of "similarity".

5.2 It does not break down at the sentence level

In BLEU, if the match count for any high-order n-gram drops to 0, that order's precision becomes 0. Due to the geometric mean, this causes the entire sentence-level BLEU score to collapse to 0 (which is why smoothing is treated as all but essential for sentence-level evaluation). Conversely, charsim is a ratio of summed weighted overlaps to summed weighted lengths. If high-order matches drop to 0, that specific order simply contributes nothing to the numerator, preventing the overall score from exhibiting discontinuous drops.

5.3 The meaning of multiple references is mathematically clear

BLEU handles multiple references in a union-like manner by taking the maximum reference count per word, while chrF's handling often depends on the specific implementation (e.g., using the best reference or the mean). By contrast, the base form of charsim, $\frac{1}{K}\sum_i \min(P_i, R_i)$, mathematically represents the expected value of $\min(P, R)$ when a single reference is selected at random. Although the mean-length version averages first and does not exactly match this expectation, it similarly avoids overrating a candidate that coincidentally matches only one specific reference.

5.4 Each order's contribution is linear and easy to analyze

Because the final form is $\operatorname{charsim} = \frac1D\sum_n \frac{1}{n}\bar m_n$ (where $D$ is a length-determined constant), we can easily observe how much each order influences the score. charsim sums the per-order contributions linearly without rescaling each order into $[0,1]$, ensuring a clean decomposition of factors. In BLEU, the geometric mean causes a sharp score drop if even a single order scores low, while chrF's F-score obscures the breakdown by merging precision and recall at the end. charsim clearly exposes each order's contribution, making it simpler to analyze score changes.

5.5 Character-based and tokenizer-independent

Like chrF, charsim operates on character n-grams. This allows it to gracefully handle inflections, spelling variants, minor typos, and different compound-word splits without requiring a tokenizer. Its key distinction from chrF is its use of $\min(P,R)$. An F-score can still yield a reasonable value if only one of precision or recall is high, but $\min(P,R)$ is more conservative: it drags the score down to the lower of the two. Regardless of $\beta$, they relate as follows:

\[ \min(P, R) \le F_\beta \le \max(P, R) \]

While chrF uses F-score to balance precision and recall, charsim consistently adopts the lower bound. This is what we mean by "conservative." On the flip side, metrics that incorporate word n-grams (like chrF++) capture word order and word-level matches better than charsim, which only looks at characters. Consequently, if strict word-order accuracy is essential, charsim has limitations (§8).

5.6 The n-gram order cap is easy to set

Both BLEU and chrF require a human to manually set the maximum n-gram order. Typically, BLEU uses word 4-grams and chrF uses character 6-grams, aggregating them with uniform weights (BLEU via the geometric mean, chrF via the arithmetic mean before calculating the F-score). Neither uses a decaying $1/n$ weight like charsim. As a result, their cap $N$ acts as a hyperparameter that directly scales the score. This is problematic because the "ideal" n-gram order varies drastically across languages and writing systems:

Word segmentation: BLEU's word 4-grams make sense for English. But for Japanese, Chinese, Thai, and other languages without spaces between words, the definition of a "4-gram" depends heavily on tokenizer settings. While chrF switched to character-based evaluation to bypass this, the "character 6-gram" standard is itself unbalanced across different languages.
Morphological differences: The number of characters per morpheme varies by language. German words like Lebensversicherung (18 characters) and short English words require completely different orders $n$ to capture equivalent linguistic information.
Writing system density: Just as a single Kanji character can convey as much information as several Latin letters, information density per character varies. Therefore, a "character n-gram" of a fixed $n$ does not carry uniform meaning across languages.

Deploying BLEU or chrF to a new language or domain thus requires retuning the order cap every time, making cross-language score comparisons difficult. Because charsim is relatively insensitive to the choice of cap, it significantly reduces this tuning effort (§5.7). While the cap cannot be set to infinity, for typical natural-language text, increasing it beyond a certain point has negligible impact on the score. This article naturally sums up to the point where n-grams run out, $N'=\max(|t|,\max_i|r^{(i)}|)+2$, but imposes a cap of 32 to manage computational costs, yielding an effective cap of $N=\min(N',32)$ (§2.5). Note that this simply indicates low sensitivity to the choice of order; it does not mean the absolute score is perfectly comparable across different string lengths or languages (§5.7, §8).

5.7 Why it is robust to the choice of cap

charsim's aggregation method is robust to the choice of cap, allowing it to be set to a large value comparable to the input length. This does not imply that $N$ can be extended infinitely, but rather that for typical natural-language text, the score stabilizes once the cap is large enough. There are two reasons for this:

(1) The ratio form avoids collapse. Because the score is a ratio of summed weighted overlaps to summed weighted lengths, extending the cap until high-order matches reach 0 merely zeroes out that order's contribution to the numerator (while the denominator may grow slightly). As a result, the overall score changes smoothly. In contrast, BLEU's geometric mean collapses the entire score to 0 if even one high-order match is 0, which is ultimately why BLEU requires a smaller $N$ cutoff (§5.2).

(2) The $1/n$ weighting concentrates weight on low orders. The weight of a substring $s$, $1/|s|=1/n$, diminishes as the order increases. Consequently, high-order terms contribute only a small fraction to the weighted length $L_t=\sum_n \ell_{t,n}/n$. Assuming $\ell_n$ remains approximately constant, the proportion of weight from orders above $N_0$ is $1-H_{N_0}/H_N$ for charsim, compared to $1-N_0/N$ for the uniform weighting scheme used by BLEU and chrF (where $H_N=\sum_{n\le N}1/n$).

Weight occupied by orders > 4	cap 8	16	32	64
charsim ($1/n$ weight)	0.23	0.38	0.49	0.56
BLEU / chrF (uniform)	0.50	0.75	0.88	0.94

Under uniform weighting, extending the cap lets high-order terms dominate more and more ($1-N_0/N\to1$), so the cap directly scales the score. For charsim, the high-order share $1-H_{N_0}/H_N$ is smaller, but not negligible (0.56 at $N=64$ in the table above), so the $1/n$ weighting alone does not establish robustness to the cap. What reason (2) guarantees in theory is only that high orders count for less than under uniform weighting. Robustness to the cap in practice rests on an additional empirical fact: for most natural-language text, the high-order overlap $\bar m_n$ is nearly 0. Where that holds, moving the cutoff around does little to the aggregate score, so the cap does not need to be tuned precisely—which is the point of §5.6.

6. A length bias seen through a simplified model

In text-similarity evaluation, we would like the score—with content held fixed—to peak at an appropriate length and to fall off on both sides: too short signals lost information, too long signals needless padding. To study this, we simplify the content-match pattern, vary only the string length, and compare the length-penalty behavior of charsim, BLEU, and chrF. This is not an approximation of real scores but a simple experiment probing how each metric combines precision and recall; it does not establish that any metric is generally superior.

6.1 Setup: matchable n-grams and their overlap

The order-$n$ n-gram count is $\ell_{x,n}=|x|+3-n$ ($2\le n\le|x|+2$), which in the low-order range ($n\ll|x|$) is $\approx|x|$ (§1.2). Although the exact count per order is $|x|+3-n$, here we make a crude approximation—assuming low orders dominate ($n\ll|x|$)—and treat the total pool of matchable n-grams as the character count itself (0 for the empty string):

\[ \tilde L_x = |x| \]

Next, we simplify the overlap: whatever the candidate length, a fixed fraction $p$ of the matchable n-grams—the shorter of the two lengths, $\min(\tilde L_t,\tilde L_r)$—actually match.

\[ m_n = p\cdot\min\bigl(\tilde L_t,\ \tilde L_r\bigr) = p\cdot\min(\hat\lambda,\lambda), \qquad \tilde L_t=\hat\lambda,\ \ \tilde L_r=\lambda \]

Holding the match rate constant at $p$ whether the candidate is over- or under-length lets us isolate the pure effect of length. This treats excess and shortfall symmetrically and fits naturally with charsim's use of the longer string in the denominator. Note that charsim sums overlap and length with the $1/n$ weights, but if the match rate $p$ is constant across orders, a common factor $\sum_n 1/n$ appears in both the numerator and the denominator and cancels, leaving an expression in $\hat\lambda$ and $\lambda$ alone (shown below); BLEU and chrF, by contrast, work on the raw n-gram counts without such a correction. Realistically, the match probability drops for longer n-grams, so $p$ decays roughly as $q^n$ (with $q$ the per-character match rate). Accounting for this decay does not change the conclusion about the optimal length, but this high-order matching pattern is central to the discussion of "overrating from contiguous matches" in §6.4.

6.2 The peak lands at the reference length

charsim, BLEU, and chrF all reach their maximum when the candidate length equals the reference length ($\hat\lambda=\lambda$), and that maximum is the same for all three: $p$. For charsim, under the assumption that $p$ is constant, we have $\bar M = p\min(\hat\lambda,\lambda)\sum_n\frac1n$, $L_t=\hat\lambda\sum_n\frac1n$, and $\bar L_r=\lambda\sum_n\frac1n$, so $\sum_n 1/n$ cancels and the simplified-model expression becomes:

\[ \operatorname{charsim} = p\,\frac{\min(\hat\lambda,\lambda)}{\max(\hat\lambda,\lambda)} \]

At $\hat\lambda=\lambda$, precision and recall coincide at $p$, and the score falls off on either side of that reference length. BLEU suppresses shortfall through the brevity penalty and over-length through precision, and chrF combines precision and recall into an F-score, so their peaks also sit at $\hat\lambda=\lambda$. Since all three share both the position and the height of the peak, this feature alone cannot rank them. The difference emerges in which direction of error each is lenient toward as the length moves away from the peak.

6.3 Tolerance for too long vs. too short

When the candidate is too long, extra n-grams appear and precision falls while recall holds; when it is too short, information is lost and recall falls while precision holds. So whether a metric leans on precision or recall decides whether it is lenient toward excess or toward shortfall. Plotting the score against candidate length $\hat\lambda$ at reference length $\lambda=100$ and match rate $p=0.8$ gives the following. (The BLEU and chrF curves here are not the outputs of real implementations but model expressions for the length-penalty behavior under constant $p$: chrF with $\beta=2$, and BLEU as precision $\times$ brevity penalty. The order cap $n\le20$ is a cutoff on the BLEU/chrF side, which aggregate raw n-gram counts; under constant $p$, $\sum_n 1/n$ cancels, so it does not affect charsim's score.) The red dashed line is a symmetric guide descending linearly to 0 from the peak $(\lambda, p)$ on both sides.

Candidate length λ̂ and score (simplified model, λ=100, p=0.8)

chrF ($\beta=2$) is lenient toward over-length and strict toward shortfall. Weighting recall twice as heavily as precision, it penalizes too-short output strongly but is lenient toward too-long output, where only precision falls. Since lengthening the string keeps recall at $p$, the score holds up. In the right (over-length) region of the graph, only chrF (green) stays high: at 30 characters off the reference length, the score is 0.76 on the too-long side versus 0.60 on the too-short side—a bias toward favoring long output.

BLEU is strict toward shortfall, giving a shape asymmetric about the peak. Its core, precision, penalizes over-length only gently, while the separate brevity-penalty term punishes shortfall sharply, so only the too-short side drops steeply (0.015 at $\hat\lambda=20$, 0.078 at $30$). Because the evaluation core and the correction penalty are separate structures, the curve behaves differently on the left and right; the mechanism differs from chrF's, but the two share the trait of heavily penalizing too-short output, which likewise ends up rewarding longer output.

charsim is based on $\min(P,R)$, so it never favors precision or recall when combining them. Taking the lower of the two directly, the precision lost to over-length and the recall lost to shortfall act with equal weight. Because it measures excess and shortfall in the same expression and on the same scale, the curves to the left and right of the peak join smoothly, and in this simplified model the shape is nearly symmetric. (In the actual definition, boundary handling, the exhaustion of high-order n-grams, the cap of 32, and the $1/n$ weighting all have effects, so it is not exactly symmetric.) Here $\operatorname{charsim}=p\,\min(\hat\lambda,\lambda)/\max(\hat\lambda,\lambda)$, so the too-short side ($\hat\lambda\le\lambda$) lies exactly on the red dashed line, and the only asymmetry is on the too-long side, where it is ever so slightly more lenient.

This difference shows up in the "centroid" $\bar\mu$, the score-weighted mean of the candidate length over the neighborhood $[50,150]$ of the correct value (with the reference length fixed at 100, the centroid's reference value is 100). The figures below are all computed from this section's model expressions at $\lambda=100$, $p=0.8$, $n\le20$, rounded to one decimal place.

\[ \bar\mu_{\text{chrF}}=103.3,\quad \bar\mu_{\text{BLEU}}=102.7,\quad \bar\mu_{\text{charsim}}=101.5 \]

Against the reference of 100, chrF leans furthest toward the long side, while charsim comes closest to 100. This does not settle any general ranking, but under this simplified model's conditions, chrF and BLEU tend to rate long output relatively higher, and charsim has the smallest length bias.

6.4 Even for the same match amount, it suppresses overrating from contiguous matches more than uniform weighting

Another evaluation concern is how the matched positions are distributed within the string, even at the same total match amount. If a coincidentally long contiguous match is present (such as a verbatim copy of a set phrase), that element alone might inflate—overrate—the score. To isolate this effect, we set both the candidate and reference lengths to $\lambda$, fix the number of matched characters (the 1-gram overlap count) to be identical, and vary only the match pattern, in two ways.

Diffuse: the matched positions are spread across the whole string. If each character matches independently with probability $q$, the match rate at order $n$ falls geometrically as $\approx q^n$, so $m_n\approx q^n\lambda$.
Contiguous block: only a single contiguous block of length $b=q\lambda$ matches exactly, and the rest does not match at all. Because the n-grams inside the block keep matching even at high orders, $m_n=\max(b-n+1,\,0)$, and the match count falls off only gently.

The two patterns share the same 1-gram match count, so they are indistinguishable at low orders; the difference emerges sharply at high orders. Since the candidate and reference lengths are equal, the denominator is common, and the overrating shows up as a difference in the weighted overlap $\bar M=\sum_n m_n/n$. Varying the cap $N$, we measure the overrating ratio (= score(contiguous block) / score(diffuse)). The closer this is to 1, the less the score is swayed by whether the match is a contiguous block or diffuse.

Overrating ratio from contiguous blocks (simplified model, λ=100, q=0.8). Uniform weighting rises to a high level, while charsim's 1/n weighting saturates low

With uniform weighting, extending the cap lets the contiguous block dominate more and more. When every order is summed with equal weight, as in chrF, the many high-order matches from a length-$b$ block feed directly into the score, so the overrating ratio climbs as $N$ grows and reaches a very high level (about 8.1× at $N=100$). A large cap under this weighting assigns an excessively high score to a coincidentally long match.

charsim's $1/n$ weighting curbs this overrating more strongly than uniform weighting does. As the weight shrinks with order, the high-order matches from a contiguous block are discounted, so at fixed length $\lambda$, however far you extend the cap $N$, the overrating ratio saturates early (about 2.0× at $\lambda=100$). Since a long contiguous match is itself somewhat desirable to reward, the fact that the ratio does not fully converge to 1—some gap remains—is reasonable behavior.

7. Implementation notes

7.1 Naive implementation

At each order $n$, do the following. (1) Build the candidate $t$'s character n-gram frequency table (pad the string with one boundary symbol on each side, and drop n-grams made only of boundary symbols). (2) Compute the overlap $m_{i,n}=\sum_g\min(c_x,c_y)$ against each reference $r^{(i)}$'s frequency table. (3) Count the n-gram totals $\ell_{t,n}$ and $\ell_{i,n}$. (4) Multiply each order by $1/n$ and accumulate into the weighted quantities $M_i$, $L_t$, $L_i$ (or $\bar M$, $L_t$, $\bar L_r$ for the mean-length version). Finally, apply the $\max$ normalization once to the aggregated values to obtain the score. (Set the cap to $N=\min(N',32)$; handle empty input and $K=0$ as in §7.4.)

7.2 The mean-length version is advantageous when the references are fixed

When a fixed reference set is used to score many candidate strings, the mean-length version is clearly advantageous. Each query needs only three variables: the weighted overlap $\bar M$, the weighted candidate length $L_t$, and the fixed reference-side value $\bar L_r$. Concretely, precompute the following function for each order $n$ and each n-gram $g$:

\[ F_g(m) = \frac{1}{K}\sum_{i=1}^{K} \min\bigl(c_{r^{(i)}}(g),\, m\bigr) \]

With these on hand, you can compute against the candidate-side count $c_t(g)$ via:

\[ \bar m_n = \sum_g F_g\bigl(c_t(g)\bigr),\qquad \bar M = \sum_{n=1}^{N}\frac{1}{n}\,\bar m_n \]

This way, the computation finishes by inspecting only the n-grams that actually occur in the candidate. Moreover, using $\min(c, m) = \sum_{a=1}^{m}\mathbf 1[c \ge a]$, you can represent $F_g$ compactly with a cumulative frequency table recording, for each $g$, how many references contain it at least once, at least twice, and so on. The $F_g(m)$ table need only cover up to the maximum reference-side occurrence count, and since most n-grams have $c=1$, it fits efficiently in a sparse structure. The reference-side weighted length $\bar L_r$ can likewise be precomputed to a constant. Holding all orders without limit, however, would push the reference-side computation and memory toward $O(KL^2)$, so capping the maximum order is essential for long text (§7.5).

7.3 The mean-length version vs. the median version

We adopt the mean-length version as the main definition, but if the reference lengths have extreme outliers and you want a more robust representative length, a variant with denominator $\max(L_t, \operatorname{median}_i L_i)$ is worth considering. Still, as long as the numerator is aggregated as a mean, making the denominator a median makes the formula harder to interpret than the main definition.

7.4 Edge cases

When implementing the empty-input convention of §3.1, add an up-front branch to avoid dividing by zero when the denominator $\max(L_t, \bar L_r)$ is 0—that is, when the candidate and every reference are empty ($|t| = 0$ and $|r^{(i)}| = 0$ for all $i$). A safe convention is: score 1 if both sides are empty, 0 if only one side is empty, and—when there is not a single reference ($K = 0$)—either throw (treating the result as undefined) or return a missing value. The following test cases serve as a minimal regression check at implementation time.

exact match = 1
no shared characters = 0
"" vs [""] = 1, "" vs ["a"] = 0 (verifying the empty-input convention above)
ca vs ["cat"] = 0.498 (the numerical example shown in §3.4)
a case where the base form and the mean-length version diverge greatly, with multiple reference sentences of differing characteristics (the verification in §3.3)

7.5 Computational complexity and truncation

With no constant cutoff—computing up to the order where n-grams run out, $N'=\max(|t|,\max_i|r^{(i)}|)+2$—a naive implementation needs $O(L^2)$ just to generate the n-grams of one sentence of length $L$. charsim is fundamentally meant for per-sentence or short-text evaluation, and for long text this $O(L^2)$ cost is a real constraint. The default constant cutoff (32, §2.5) also serves to hold n-gram generation to $O(32\,L)$. For an exact match, the $1/n$-weighted sums in the numerator and denominator cancel and the score is always 1, so the evaluation criterion is unaffected by whatever value the cutoff constant takes (§2.5). For ordinary natural-language text, the high-order overlap $\bar m_n$ tends to nearly 0, so the cutoff has little effect—but for input where high-order matches carry real meaning (set phrases, repeated expressions, program source code), the cutoff can affect the score.

8. Design limitations

charsim has several inherent limitations stemming from its core design. We explicitly outline these specific trade-offs below.

As a surface metric, it cannot grasp meaning or paraphrase: Because it judges solely based on character n-gram overlap, it naturally assigns low scores to paraphrased text that changes the surface wording, even if the underlying meaning is perfectly identical. This limitation is shared with BLEU and chrF; charsim is explicitly defined to measure strictly "surface similarity" and nothing more.
It is too punitive when any single correct answer is sufficient: Because charsim measures the candidate's average match against the entire reference set, mixing valid answers with vastly different surface forms into a single set will penalize the candidate, even if it perfectly matches one of them. In such cases, you should switch to the base form, use a "best-of" reference score, or cluster the references by length/content (§3.3).
Word order and long-distance dependencies are only captured within the n-gram window: The $1/n$ weighting scheme deliberately concentrates impact on low-order terms that lack global word-order information ($n=1$ is just the multiset of individual characters). Consequently, even if you completely shuffle the words of an English sentence, the low-order matches will still contribute to the score. This is the inescapable flip side of its core strength: avoiding score collapse by heavily discounting high orders (§5.7).
Extending the cap to $N'$ costs $O(L^2)$ on long texts: Because the number of candidate character n-grams grows as $O(L^2)$, capping the maximum order is essential when evaluating long documents (§7.5).

9. Summary

charsim is a surface-similarity metric that uses character n-gram overlap to measure how well a candidate matches the average profile of a set of reference sentences. Its main characteristics are as follows:

It pads strings with a single boundary symbol on each side and counts every n-gram individually (excluding those composed solely of boundary symbols).
It assigns each substring $s$ a weight of $1/|s|$, summing these across all orders to aggregate them into the weighted overlap $\bar M$ and the weighted lengths $L_t$ and $\bar L_r$.
In its base form, it calculates $\min(P, R)$ for each individual reference sentence and averages those scores.
In its main definition (the mean-length version), it defines $\operatorname{charsim} = \bar M / \max(L_t, \bar L_r) = \min(P,R)$, applying $\min(P,R)$ just once after aggregating each element.
When reference lengths are uniform, the main definition closely mimics the base form. When they vary widely, the two diverge, with the mean-length version acting as a distinct aggregation method that strictly measures the match against the reference set's overall average profile.

This specific design yields several highly practical properties: (1) the score rarely collapses, even during strict per-sentence evaluation; (2) penalties for both excessive length (too long) and missing content (too short) are cleanly handled by a single mathematical expression; (3) the theoretical meaning of averaging multiple references is mathematically sound and clear; (4) operating on characters makes it relatively robust to minor notational or typographical differences; and (5) the mean-length version's reference side can be precomputed, making it well suited to large-scale evaluation. While BLEU and chrF remain powerful and established standard metrics, charsim is a convenient alternative when you prioritize smooth sentence-level scoring, clear multi-reference averaging, and unified handling of length differences within a single expression. Note that this document strictly establishes charsim's mathematical definition and theoretical properties; empirically comparing its scores against BLEU and chrF on real-world datasets, and verifying its correlation with human judgments, is left for future work.

10. References

Bibliographic details for the standard surface-similarity metrics discussed in this article (BLEU, chrF, chrF++) and for sacreBLEU, the implementation used for our verification, are provided below.

[1] Kishore Papineni, Salim Roukos, Todd Ward, Wei-Jing Zhu, "BLEU: a Method for Automatic Evaluation of Machine Translation," ACL, 2002.
[2] Maja Popović, "chrF: character n-gram F-score for automatic MT evaluation," WMT, 2015.
[3] Maja Popović, "chrF++: words helping character n-grams," WMT, 2017.
[4] Matt Post, "A Call for Clarity in Reporting BLEU Scores," WMT, 2018. (sacreBLEU)

As precursors to this new metric, the following papers from our own group implemented and studied earlier iterations of character n-gram-based surface-matching evaluation.

[5] Kentaro Imajo, Masanori Hirano, Shuji Suzuki, Hiroaki Mikami, "pfgen-bench: A Text Generation Performance Evaluation Benchmark for Japanese Pretrained Models," Jxiv, 2024. An early evaluation implementation for a Q&A benchmark. It simply summed character n-gram frequencies over the reference-answer set and did not treat repeated occurrences of the same n-gram as a clipped multiset overlap, which the current $\min(c_r,c_t)$ formulation resolves.
[6] Kentaro Imajo, Masanori Hirano, "A Study of Automatic Evaluation Methods for High-Precision Translation Models," Jxiv, 2026. This work treated character n-gram counts as $f_{w,i}$ but designed the length correction as a separate penalty against the median length of the reference translations. This differs from the form used in this article, which aggregates overlap and length with $1/n$ weights and normalizes by $\max(L_t,\bar L_r)$.

Last Modified: July 2, 2026

Order \(n\)	1	2	3	4	5
\(m_n\)	2	2	1	0	0
\(\tfrac1n m_n\)	2	1	1/3	0	0
\(\ell_{t,n}\)	2	3	2	1	0
\(\tfrac1n \ell_{t,n}\)	2	3/2	2/3	1/4	0
\(\ell_{r,n}\)	3	4	3	2	1
\(\tfrac1n \ell_{r,n}\)	3	2	1	1/2	1/5

Candidate \(t\)	Description	Active side	\(N'\)	\(\operatorname{charsim}\)
`cat`	exact match	\(P=R=1\)	5	1.000
`cats`	one character over	precision	6	0.592
`ca`	one character short	recall	5	0.498
`catcat`	double-length padding	precision	8	0.449

Weight occupied by orders > 4	cap 8	16	32	64
charsim (\(1/n\) weight)	0.23	0.38	0.49	0.56
BLEU / chrF (uniform)	0.50	0.75	0.88	0.94