The BLEU (Bilingual Evaluation Understudy) algorithm has several limitations in machine translation evaluation:
N-gram Overemphasis: BLEU relies on modified n-gram precision (typically 1- to 4-grams), which measures surface overlap with the references rather than translation quality. A candidate that matches many reference n-grams can score well even when the sentence as a whole is incoherent (see the sketch after this list).
Lack of Semantic Understanding: BLEU matches surface strings and does not assess semantic correctness. A translation can use the right terminology in the right order yet fail to convey the intended meaning, for example by dropping a negation, and still receive a high BLEU score.
Reference Dependency: BLEU scores are only as reliable as the reference translations. With a single reference, or references that are not diverse or representative, many acceptable translations are penalized simply because they differ from the provided wording.
Penalizing Valid Paraphrases: BLEU penalizes any deviation from the references, even output that is grammatically correct and natural. A fluent, human-like rephrasing can score lower than a stilted translation that happens to copy the reference wording.
Short Sentence Instability: BLEU is unreliable on short sentences, where n-gram statistics are sparse: a few chance matches can inflate the score, while a single n-gram order with zero overlap (e.g., no 4-gram match) can zero it out entirely, as the sketch below demonstrates.
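These failure modes follow directly from how BLEU is computed. Below is a minimal, self-contained sketch of sentence-level BLEU (modified n-gram precision plus a brevity penalty, no smoothing); the example sentences are invented for illustration:

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def sentence_bleu(candidate, reference, max_n=4):
    """Sentence-level BLEU: geometric mean of modified n-gram precisions
    (orders 1..max_n) times a brevity penalty. No smoothing, so a single
    n-gram order with zero overlap zeroes the whole score."""
    cand, ref = candidate.split(), reference.split()
    precisions = []
    for n in range(1, max_n + 1):
        cand_counts = Counter(ngrams(cand, n))
        ref_counts = Counter(ngrams(ref, n))
        # "Modified" precision: clip each candidate n-gram count by
        # its count in the reference.
        overlap = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        precisions.append(overlap / max(sum(cand_counts.values()), 1))
    if min(precisions) == 0:
        return 0.0
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    # Brevity penalty punishes candidates shorter than the reference.
    bp = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * geo_mean


reference = "the cat sat on the mat"
copy = "the cat sat on the mat"            # echoes the reference wording
paraphrase = "a cat was sitting on a mat"  # same meaning, different words

print(sentence_bleu(copy, reference))        # 1.0
print(sentence_bleu(paraphrase, reference))  # 0.0 -- no bigram overlap at all
print(sentence_bleu("good", "good"))         # 0.0 -- exact match, but too
                                             # short to contain any bigrams
```

The second call shows the paraphrase penalty: an adequate rephrasing scores zero because no higher-order n-grams overlap. The last call shows the short-sentence pathology: even a verbatim one-word match scores zero because no bigrams exist. Toolkits such as NLTK and SacreBLEU offer smoothing that softens, but does not eliminate, these effects.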
Example:
In machine translation practice, tools like Tencent Cloud's Machine Translation (MT) service can complement BLEU by providing neural-based evaluations that take fluency and context into account, offering a more holistic assessment than n-gram overlap alone.
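As a concrete illustration of a neural, embedding-based metric, here is a short sketch using the open-source BERTScore package (chosen here as one representative learned metric, not Tencent Cloud's API). It recognizes the paraphrase that the BLEU sketch above scored 0.0 as close to the reference:

```python
# pip install bert-score -- BERTScore compares contextual embeddings
# rather than exact n-grams, so paraphrases are credited.
from bert_score import score

references = ["the cat sat on the mat"]
candidates = ["a cat was sitting on a mat"]  # scored 0.0 by BLEU above

P, R, F1 = score(candidates, references, lang="en")
print(f"BERTScore F1: {F1.mean().item():.3f}")  # far above the 0.0 from BLEU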