How is the accuracy of speech recognition measured?

The accuracy of speech recognition is typically measured by comparing the system's output (transcribed text) to a reference or ground truth transcript (the correct, human-annotated text). The most common metrics used are:

  1. Word Error Rate (WER) – The most widely used metric. It is the minimum number of word-level edits (insertions, deletions, and substitutions) needed to align the recognized text with the reference text, divided by the total number of words in the reference. Because insertions are counted against the reference length, WER can exceed 100%.

    • Formula: WER = (Insertions + Deletions + Substitutions) / Total Words in Reference
    • Example: If the reference is "hello world" and the recognition output is "hello word," there is 1 substitution (world → word). WER = 1/2 = 50%.
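  2. Character Error Rate (CER) – Similar to WER but measures errors at the character level. It is useful for languages without clear word boundaries (e.g., Chinese or Japanese), for languages with long compound or heavily inflected words, and for evaluating short phrases.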

    • Formula: CER = (Insertions + Deletions + Substitutions) / Total Characters in Reference
  3. Accuracy (or Match Rate) – The percentage of reference words or characters that were recognized correctly. Unlike WER, this ratio does not penalize insertions.

    • Formula: Accuracy = (Correctly Recognized Words / Total Words in Reference) × 100
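
The WER and CER formulas above can be sketched in a few lines of Python. This is a minimal illustration, not a production scorer: it assumes whitespace tokenization, does no text normalization (case, punctuation), and counts spaces as characters for CER. The function names `edit_distance`, `wer`, and `cer` are chosen here for illustration.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (words or characters)."""
    # prev[j] is the distance between the ref prefix processed so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference item missing from output
                curr[j - 1] + 1,     # insertion: extra item in output
                prev[j - 1] + cost,  # substitution, or a match when cost is 0
            ))
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word Error Rate as a fraction (0.5 means 50%)."""
    ref_words = reference.split()  # assumption: words are whitespace-separated
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character Error Rate as a fraction; spaces count as characters here."""
    if not reference:
        raise ValueError("reference must contain at least one character")
    return edit_distance(reference, hypothesis) / len(reference)


print(wer("hello world", "hello word"))           # 0.5
print(f"{cer('hello world', 'hello word'):.3f}")  # 0.091
```

The same "hello world" example gives a 50% WER but only about a 9% CER, because a single missing letter is one error out of 11 characters rather than one error out of 2 words.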

Example in Real Use:

  • A speech recognition system transcribes a meeting recording. The reference transcript has 1,000 words, but the system makes 10 insertions, 5 deletions, and 15 substitutions.
    • WER = (10 + 5 + 15) / 1,000 = 30/1000 = 3% (good accuracy).
    • If WER is high (e.g., 20%), the system may need improvements in noise handling or language modeling.
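
The arithmetic in this example can be checked with a short Python sketch. It assumes the error counts are already known (for instance, from an alignment tool); the helper names `wer_from_counts` and `word_correct_rate` are hypothetical. The second helper uses the common convention that correctly recognized words equal reference words minus deletions and substitutions.

```python
def wer_from_counts(insertions, deletions, substitutions, reference_words):
    """WER as a fraction, given error counts and the reference length."""
    if reference_words <= 0:
        raise ValueError("reference_words must be positive")
    return (insertions + deletions + substitutions) / reference_words


def word_correct_rate(deletions, substitutions, reference_words):
    """Fraction of reference words recognized correctly (insertions ignored)."""
    if reference_words <= 0:
        raise ValueError("reference_words must be positive")
    return (reference_words - deletions - substitutions) / reference_words


# Meeting example: 1,000 reference words, 10 insertions, 5 deletions, 15 substitutions.
rate = wer_from_counts(insertions=10, deletions=5, substitutions=15, reference_words=1000)
correct = word_correct_rate(deletions=5, substitutions=15, reference_words=1000)

print(f"WER = {rate:.1%}")               # WER = 3.0%
print(f"Words correct = {correct:.1%}")  # Words correct = 98.0%
```

Note that the two figures do not sum to 100%: the 10 insertions raise WER but do not reduce the count of correctly recognized reference words.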

For cloud-based speech recognition, services such as Tencent Cloud ASR (Automatic Speech Recognition) provide high-accuracy, low-WER transcription optimized for different industries (e.g., finance, healthcare). The service supports real-time and batch processing and offers metrics for monitoring performance.