Best Practices (Intermediate)
Getting accurate transcription in production requires more than just calling an API. This lesson covers audio preprocessing, custom vocabularies, post-processing pipelines, error handling, and deployment strategies that maximize accuracy and reliability.
Audio Preprocessing
The quality of your audio input is the single biggest factor affecting transcription accuracy. Apply these preprocessing steps before sending audio to any ASR system:
```python
from pydub import AudioSegment
import noisereduce as nr
import numpy as np

# Load the raw audio file
audio = AudioSegment.from_file("raw_audio.mp3")

# Convert to mono, 16 kHz (optimal for most ASR systems)
audio = audio.set_channels(1).set_frame_rate(16000)

# Normalize volume
audio = audio.normalize()

# Apply noise reduction
samples = np.array(audio.get_array_of_samples(), dtype=np.float32)
reduced = nr.reduce_noise(y=samples, sr=16000)

# Export clean audio
clean_audio = AudioSegment(
    reduced.astype(np.int16).tobytes(),
    frame_rate=16000,
    sample_width=2,
    channels=1,
)
clean_audio.export("clean_audio.wav", format="wav")
```
Audio Quality Checklist
| Factor | Recommendation | Impact on WER |
|---|---|---|
| Sample rate | 16kHz for speech (8kHz for telephony) | High |
| Channels | Mono (convert stereo to mono) | Medium |
| Noise | Apply noise reduction for noisy environments | High |
| Volume | Normalize to consistent levels | Medium |
| Format | WAV (lossless) or FLAC over MP3 | Low-Medium |
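The checklist above can be turned into a quick automated gate before audio is sent to the ASR system. A minimal sketch; the function name and the -25 dBFS volume floor are illustrative assumptions, not part of any standard (pydub exposes loudness as `AudioSegment.dBFS`):

```python
def check_audio_quality(frame_rate, channels, dbfs):
    """Check basic audio properties against the quality checklist.

    dbfs is the clip's loudness relative to full scale; the -25 dBFS
    floor is an assumed threshold for "loud enough to transcribe".
    """
    return {
        "sample_rate_ok": frame_rate in (8000, 16000),
        "mono": channels == 1,
        "volume_ok": dbfs > -25.0,
    }

# A 16 kHz mono clip at a healthy level passes every check
report = check_audio_quality(frame_rate=16000, channels=1, dbfs=-18.0)
print(report)
```

Files that fail any check can be routed through the preprocessing pipeline above before transcription.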
Post-Processing the Transcript
Raw ASR output often needs cleanup before it is usable:
```python
import re

def post_process_transcript(text):
    # Fix common ASR errors with domain-specific terms
    replacements = {
        "eye school": "AI School",
        "chat gee pee tee": "ChatGPT",
        "pie torch": "PyTorch",
    }
    for wrong, right in replacements.items():
        text = re.sub(wrong, right, text, flags=re.IGNORECASE)

    # Remove filler words (caution: r"\blike\b" also strips legitimate uses of "like")
    fillers = [r"\bum\b", r"\buh\b", r"\byou know\b", r"\blike\b"]
    for filler in fillers:
        text = re.sub(filler, "", text, flags=re.IGNORECASE)

    # Collapse the extra spaces left behind by the removals
    text = re.sub(r"\s+", " ", text).strip()
    return text
```
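One pitfall with plain-string replacements is that they can fire inside longer words or phrases. Wrapping each misheard phrase in word boundaries keeps corrections from matching mid-word; a small sketch with an assumed vocabulary (the `VOCAB` entries and `apply_vocab` helper are illustrative, not from any library):

```python
import re

# Hypothetical domain vocabulary: ASR mishearing -> correct term
VOCAB = {
    "pie torch": "PyTorch",
    "chat gee pee tee": "ChatGPT",
}

def apply_vocab(text, vocab):
    # re.escape guards special characters; \b anchors match whole phrases only
    for wrong, right in vocab.items():
        text = re.sub(rf"\b{re.escape(wrong)}\b", right, text, flags=re.IGNORECASE)
    return text

print(apply_vocab("We trained it with pie torch.", VOCAB))
# -> We trained it with PyTorch.
```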
Measuring Accuracy
```python
from jiwer import wer, cer

reference = "The quick brown fox jumps over the lazy dog"
hypothesis = "The quick brown box jumps over the lazy dog"

word_error_rate = wer(reference, hypothesis)
char_error_rate = cer(reference, hypothesis)

print(f"WER: {word_error_rate:.1%}")  # 1 of 9 words wrong: 11.1%
print(f"CER: {char_error_rate:.1%}")  # a single character error, so much lower
```
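WER is just word-level edit distance divided by the reference word count, so it can also be computed without jiwer when adding a dependency is not an option. A self-contained sketch (the function name is illustrative):

```python
def simple_wer(reference, hypothesis):
    """Word error rate via classic dynamic-programming edit distance."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between the first i reference words
    # and the first j hypothesis words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,      # deletion
                d[i][j - 1] + 1,      # insertion
                d[i - 1][j - 1] + cost,  # substitution (or match)
            )
    return d[len(ref)][len(hyp)] / len(ref)

print(simple_wer(
    "The quick brown fox jumps over the lazy dog",
    "The quick brown box jumps over the lazy dog",
))  # 1 error over 9 reference words
```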
Production Deployment Tips
Key Production Guidelines:
- Chunk long audio — Split files longer than 10 minutes into chunks to avoid timeouts and memory issues
- Implement retries — Cloud APIs can have transient failures; use exponential backoff
- Cache results — Store transcripts to avoid re-processing the same audio
- Monitor WER — Track word error rate over time to catch quality regressions
- Use async processing — For batch jobs, use task queues (Celery, SQS) to handle audio in parallel
- Consider privacy — For sensitive audio, use local Whisper instead of cloud APIs
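The retry guideline above can be sketched as a small wrapper with exponential backoff. Here `transcribe_fn` is a hypothetical placeholder for whatever API client you actually use, not a real library call:

```python
import time

def transcribe_with_retry(transcribe_fn, audio_chunk, max_attempts=4, base_delay=1.0):
    """Call a transcription function, retrying transient failures.

    Waits base_delay, 2*base_delay, 4*base_delay, ... between attempts,
    and re-raises the last error once attempts are exhausted.
    """
    for attempt in range(max_attempts):
        try:
            return transcribe_fn(audio_chunk)
        except Exception:
            if attempt == max_attempts - 1:
                raise
            time.sleep(base_delay * 2 ** attempt)
```

In production you would typically narrow the `except` clause to the transient error types your API client raises (timeouts, rate limits) so that permanent failures fail fast.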
Course Complete!
You now have the skills to build production-grade speech-to-text applications. Return to the course overview to review any lessons or explore other AI School courses.