
What are the limitations of the BLEU algorithm in machine translation evaluation?

The BLEU (Bilingual Evaluation Understudy) algorithm has several limitations in machine translation evaluation:

  1. N-gram Overemphasis: BLEU primarily relies on n-gram precision (typically 1- to 4-grams), which may not accurately reflect translation quality. A translation with high n-gram matches but poor coherence or meaning can still score well.

  2. Lack of Semantic Understanding: BLEU measures surface overlap with the reference, not meaning. A translation might use the right terminology and word order yet misstate the intended meaning, and BLEU has no way to detect this, so it can still receive a high score.

  3. Reference Dependency: BLEU heavily depends on reference translations. If the provided references are not diverse or representative, the score may not reflect the true quality of the translation.

  4. Penalizing Valid Paraphrases: BLEU penalizes translations that deviate from the reference wording, even if they are grammatically correct and natural. For example, a fluent human-like rephrasing can score lower than a stilted translation that happens to match the reference.

  5. Short Sentence Bias: BLEU is unreliable on short sentences. A few chance n-gram matches can inflate the score, while a single missing higher-order n-gram can drive the geometric mean of precisions to zero.
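The mechanics behind these limitations (clipped n-gram precision, the geometric mean, and the brevity penalty) can be seen in a minimal sketch of the BLEU formula. This is a toy single-reference re-implementation for illustration, not a production scorer; the function names are my own:

```python
from collections import Counter
from math import exp, log

def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(reference, candidate, n):
    """Clipped n-gram precision: each candidate n-gram is counted at most
    as many times as it appears in the reference."""
    ref_counts = Counter(ngrams(reference, n))
    cand_counts = Counter(ngrams(candidate, n))
    clipped = sum(min(count, ref_counts[g]) for g, count in cand_counts.items())
    total = sum(cand_counts.values())
    return clipped / total if total else 0.0

def bleu(reference, candidate, max_n=4):
    """Toy BLEU: geometric mean of 1..max_n clipped precisions times
    the brevity penalty."""
    precisions = [modified_precision(reference, candidate, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        # Any zero precision zeroes the geometric mean -- this is exactly
        # the short-sentence instability described above.
        return 0.0
    geo_mean = exp(sum(log(p) for p in precisions) / max_n)
    # Brevity penalty: candidates shorter than the reference are discounted.
    bp = 1.0 if len(candidate) >= len(reference) else \
        exp(1 - len(reference) / len(candidate))
    return bp * geo_mean
```

Note how `modified_precision` only compares token sequences; nothing in the computation looks at meaning, which is why semantically faithful paraphrases receive no credit.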

Example:

  • Reference: "The cat is on the mat."
  • Translation A (Human-like): "The feline is lying on the rug." (Semantically correct but may score lower due to different wording.)
  • Translation B (Literal but awkward): "The cat on the mat is." (May score higher if it matches n-grams like "the cat" and "on the mat.")
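Running the two example translations through a toy BLEU-2 scorer (geometric mean of unigram and bigram clipped precision; the helper names are my own) makes the ranking concrete:

```python
from collections import Counter
from math import sqrt

REF = "the cat is on the mat".split()

def clipped_precision(ref, cand, n):
    """Clipped n-gram precision against a single reference."""
    ref_c = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
    cand_c = Counter(tuple(cand[i:i + n]) for i in range(len(cand) - n + 1))
    return sum(min(c, ref_c[g]) for g, c in cand_c.items()) / sum(cand_c.values())

def bleu2(cand):
    # Geometric mean of unigram and bigram precision. Both candidates here
    # are at least as long as the reference, so the brevity penalty is 1.
    return sqrt(clipped_precision(REF, cand, 1) * clipped_precision(REF, cand, 2))

score_a = bleu2("the feline is lying on the rug".split())  # semantically faithful
score_b = bleu2("the cat on the mat is".split())           # scrambled but n-gram-heavy
# score_b (~0.77) comes out well above score_a (~0.31)
```

Translation B wins on unigram precision (6/6) and bigram precision (3/5) purely by recycling the reference's n-grams, while the faithful paraphrase in Translation A is punished for its different wording.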

In machine translation, tools like Tencent Cloud's Machine Translation (MT) service can complement BLEU by providing neural-based evaluations that consider fluency and context, offering a more holistic assessment.