Best Practices (Intermediate)
Getting accurate transcription in production requires more than just calling an API. This lesson covers audio preprocessing, custom vocabularies, post-processing pipelines, error handling, and deployment strategies that maximize accuracy and reliability.
Audio Preprocessing
The quality of your audio input is the single biggest factor affecting transcription accuracy. Apply these preprocessing steps before sending audio to any ASR system:
```python
from pydub import AudioSegment
import noisereduce as nr
import numpy as np

# Load the raw audio file
audio = AudioSegment.from_file("raw_audio.mp3")

# Convert to mono, 16 kHz (optimal for most ASR systems)
audio = audio.set_channels(1).set_frame_rate(16000)

# Normalize volume
audio = audio.normalize()

# Apply noise reduction
samples = np.array(audio.get_array_of_samples(), dtype=np.float32)
reduced = nr.reduce_noise(y=samples, sr=16000)

# Export clean audio
clean_audio = AudioSegment(
    reduced.astype(np.int16).tobytes(),
    frame_rate=16000,
    sample_width=2,
    channels=1,
)
clean_audio.export("clean_audio.wav", format="wav")
```
Audio Quality Checklist
| Factor | Recommendation | Impact on WER |
|---|---|---|
| Sample rate | 16kHz for speech (8kHz for telephony) | High |
| Channels | Mono (convert stereo to mono) | Medium |
| Noise | Apply noise reduction for noisy environments | High |
| Volume | Normalize to consistent levels | Medium |
| Format | WAV (lossless) or FLAC over MP3 | Low-Medium |
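The checklist above can be turned into a quick automated gate before audio is sent to the ASR system. A minimal sketch; the function name and the -25 dBFS volume floor are illustrative assumptions, not part of any standard (pydub exposes loudness as `AudioSegment.dBFS`):

```python
def check_audio_quality(frame_rate, channels, dbfs):
    """Check basic audio properties against the quality checklist.

    dbfs is the clip's loudness relative to full scale; the -25 dBFS
    floor is an assumed threshold for "loud enough to transcribe".
    """
    return {
        "sample_rate_ok": frame_rate in (8000, 16000),
        "mono": channels == 1,
        "volume_ok": dbfs > -25.0,
    }

# A 16 kHz mono clip at a healthy level passes every check
report = check_audio_quality(frame_rate=16000, channels=1, dbfs=-18.0)
print(report)
```

Files that fail any check can be routed through the preprocessing pipeline above before transcription.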
Post-Processing the Transcript
Raw ASR output often needs cleanup before it is usable:
```python
import re

def post_process_transcript(text):
    # Fix common ASR errors with domain-specific terms
    replacements = {
        "eye school": "AI School",
        "chat gee pee tee": "ChatGPT",
        "pie torch": "PyTorch",
    }
    for wrong, right in replacements.items():
        text = re.sub(wrong, right, text, flags=re.IGNORECASE)

    # Remove filler words (caution: r"\blike\b" also strips legitimate uses of "like")
    fillers = [r"\bum\b", r"\buh\b", r"\byou know\b", r"\blike\b"]
    for filler in fillers:
        text = re.sub(filler, "", text, flags=re.IGNORECASE)

    # Collapse the extra spaces left behind by the removals
    text = re.sub(r"\s+", " ", text).strip()
    return text
```
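One pitfall with plain-string replacements is that they can fire inside longer words or phrases. Wrapping each misheard phrase in word boundaries keeps corrections from matching mid-word; a small sketch with an assumed vocabulary (the `VOCAB` entries and `apply_vocab` helper are illustrative, not from any library):

```python
import re

# Hypothetical domain vocabulary: ASR mishearing -> correct term
VOCAB = {
    "pie torch": "PyTorch",
    "chat gee pee tee": "ChatGPT",
}

def apply_vocab(text, vocab):
    # re.escape guards special characters; \b anchors match whole phrases only
    for wrong, right in vocab.items():
        text = re.sub(rf"\b{re.escape(wrong)}\b", right, text, flags=re.IGNORECASE)
    return text

print(apply_vocab("We trained it with pie torch.", VOCAB))
# -> We trained it with PyTorch.
```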
Measuring Accuracy
```python
from jiwer import wer, cer

reference = "The quick brown fox jumps over the lazy dog"
hypothesis = "The quick brown box jumps over the lazy dog"

word_error_rate = wer(reference, hypothesis)
char_error_rate = cer(reference, hypothesis)

print(f"WER: {word_error_rate:.1%}")  # 1 of 9 words wrong: 11.1%
print(f"CER: {char_error_rate:.1%}")  # a single character error, so much lower
```
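WER is just word-level edit distance divided by the reference word count, so it can also be computed without jiwer when adding a dependency is not an option. A self-contained sketch (the function name is illustrative):

```python
def simple_wer(reference, hypothesis):
    """Word error rate via classic dynamic-programming edit distance."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between the first i reference words
    # and the first j hypothesis words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,      # deletion
                d[i][j - 1] + 1,      # insertion
                d[i - 1][j - 1] + cost,  # substitution (or match)
            )
    return d[len(ref)][len(hyp)] / len(ref)

print(simple_wer(
    "The quick brown fox jumps over the lazy dog",
    "The quick brown box jumps over the lazy dog",
))  # 1 error over 9 reference words
```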
Production Deployment Tips
Key Production Guidelines:
- Chunk long audio — Split files longer than 10 minutes into chunks to avoid timeouts and memory issues
- Implement retries — Cloud APIs can have transient failures; use exponential backoff
- Cache results — Store transcripts to avoid re-processing the same audio
- Monitor WER — Track word error rate over time to catch quality regressions
- Use async processing — For batch jobs, use task queues (Celery, SQS) to handle audio in parallel
- Consider privacy — For sensitive audio, use local Whisper instead of cloud APIs
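The retry guideline above can be sketched as a small wrapper with exponential backoff. Here `transcribe_fn` is a hypothetical placeholder for whatever API client you actually use, not a real library call:

```python
import time

def transcribe_with_retry(transcribe_fn, audio_chunk, max_attempts=4, base_delay=1.0):
    """Call a transcription function, retrying transient failures.

    Waits base_delay, 2*base_delay, 4*base_delay, ... between attempts,
    and re-raises the last error once attempts are exhausted.
    """
    for attempt in range(max_attempts):
        try:
            return transcribe_fn(audio_chunk)
        except Exception:
            if attempt == max_attempts - 1:
                raise
            time.sleep(base_delay * 2 ** attempt)
```

In production you would typically narrow the `except` clause to the transient error types your API client raises (timeouts, rate limits) so that permanent failures fail fast.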
Course Complete!
You now have the skills to build production-grade speech-to-text applications. Return to the course overview to review any lessons or explore other AI School courses.