Quality Assurance in BMAD
Build a comprehensive AI quality assurance framework covering testing, evaluation metrics, regression testing, human evaluation, bias detection, safety testing, and production monitoring.
AI Quality Assurance Framework
Traditional QA tests for correctness — the output either matches the expected result or it does not. AI QA tests for quality on a spectrum, where outputs may be acceptable, good, or excellent, and the same input can produce different outputs each time.
Testing AI Outputs
BMAD defines three layers of AI testing:
- Automated Evaluation: Run prompts against labeled test datasets and measure accuracy, completeness, and format compliance automatically. This is your first line of defense.
- LLM-as-Judge: Use a separate AI model to evaluate the quality of another model's output. Cost-effective for large test sets, though less reliable than human evaluation.
- Human Evaluation: Subject matter experts review a sample of AI outputs for quality, accuracy, and appropriateness. The gold standard, but expensive and slow.
```python
from statistics import mean

class AIEvaluator:
    def evaluate(self, prompt, test_dataset):
        """Score a prompt against a labeled test dataset."""
        results = []
        for case in test_dataset:
            output = llm.call(prompt, case.input)  # model client call
            score = {
                "accuracy": self.check_accuracy(output, case.expected),
                "format": self.check_format(output, case.schema),
                "latency": output.latency_ms,
                "tokens": output.total_tokens,
            }
            results.append(score)
        return {
            "accuracy": mean(r["accuracy"] for r in results),
            "format_compliance": mean(r["format"] for r in results),
            "avg_latency": mean(r["latency"] for r in results),
            "total_tests": len(results),
        }
```
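The LLM-as-judge layer can be sketched in the same style. This is an illustrative design, not BMAD's exact API: the prompt template, the `judge` callable (any function wrapping the judge model), and the stub judge are all assumptions for the example.

```python
import re

# Illustrative judge prompt; real rubrics are usually more detailed.
JUDGE_TEMPLATE = """Rate the following answer from 1 to 5 for accuracy
and completeness. Reply with only the number.

Question: {question}
Answer: {answer}
Score:"""

def judge_score(judge, question, answer):
    """Ask a judge model for a 1-5 score and parse it from the reply."""
    reply = judge(JUDGE_TEMPLATE.format(question=question, answer=answer))
    match = re.search(r"[1-5]", reply)
    if match is None:
        raise ValueError(f"unparseable judge reply: {reply!r}")
    return int(match.group())

# Stub judge standing in for a real model call:
stub_judge = lambda prompt: "4"
print(judge_score(stub_judge, "What is the capital of France?", "Paris"))  # 4
```

Parsing the score defensively matters in practice: judge models often wrap the number in extra text, and an unparseable reply should surface as an error rather than a silent zero.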
Evaluation Metrics
| Metric | What It Measures | Target Range |
|---|---|---|
| Accuracy | Percentage of outputs matching expected results | 85-99% depending on use case |
| Hallucination Rate | Percentage of outputs containing fabricated information | <5% for factual tasks |
| Latency (p50/p95/p99) | Response time at different percentiles | Varies by feature requirements |
| Format Compliance | Percentage of outputs matching expected structure | >98% |
| Cost per Request | Average API cost per inference call | Set per business requirements |
| Consistency | Similarity of outputs for the same input across runs | >90% for deterministic tasks |
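The consistency metric can be made concrete with a small helper. Character-level similarity via `difflib` is an assumption here; production systems often use embedding similarity instead:

```python
from difflib import SequenceMatcher
from itertools import combinations
from statistics import mean

def consistency(outputs):
    """Mean pairwise string similarity (0-1) across repeated runs.

    A simple proxy for the consistency metric: run the same input N
    times and compare every pair of outputs.
    """
    pairs = list(combinations(outputs, 2))
    if not pairs:
        return 1.0  # a single run is trivially consistent
    return mean(SequenceMatcher(None, a, b).ratio() for a, b in pairs)

runs = [
    "Paris is the capital of France.",
    "Paris is the capital of France.",
    "The capital of France is Paris.",
]
print(round(consistency(runs), 2))
```

A score of 1.0 means every run produced the identical string; for deterministic tasks this is what the >90% target checks.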
Regression Testing for Prompts
When you update a prompt, ensure the new version does not degrade quality on previously passing cases:
```yaml
name: Prompt Regression Test
on:
  pull_request:
    paths:
      - 'prompts/**'
jobs:
  regression-test:
    runs-on: ubuntu-latest
    steps:
      - name: Run evaluation suite
        run: python eval/run_tests.py --prompt $CHANGED_PROMPT
      - name: Compare with baseline
        # Fail if accuracy drops more than 2%
        run: python eval/compare.py --threshold 0.02
```
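A minimal sketch of what the comparison step might check, assuming the evaluation scripts emit summaries shaped like the dict returned by `AIEvaluator.evaluate` above (the exact JSON shape is hypothetical):

```python
def regression_gate(baseline, current, threshold=0.02):
    """Pass only if accuracy did not drop more than `threshold`.

    `baseline` and `current` are evaluation summaries with at least
    an "accuracy" field between 0 and 1.
    """
    drop = baseline["accuracy"] - current["accuracy"]
    return drop <= threshold

# A 1-point drop passes the 2% threshold; a 5-point drop fails it:
print(regression_gate({"accuracy": 0.90}, {"accuracy": 0.89}))  # True
print(regression_gate({"accuracy": 0.90}, {"accuracy": 0.85}))  # False
```

Note the gate only blocks regressions; improvements (negative drop) always pass.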
Human Evaluation Workflows
Structure human evaluation for consistency and efficiency:
Rating Rubrics
Define clear scoring criteria (1-5 scale) for each quality dimension. Train evaluators on the rubric before they begin.
Inter-Rater Agreement
Have multiple evaluators rate the same outputs. Measure agreement (Cohen's kappa) to ensure consistency.
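Cohen's kappa compares observed agreement to the agreement expected by chance. A sketch for two raters labeling the same outputs:

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa: agreement between two raters, corrected for chance."""
    n = len(rater_a)
    assert n == len(rater_b) and n > 0
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    # Chance agreement: probability both raters pick the same label at random.
    expected = sum(freq_a[label] * freq_b[label]
                   for label in set(freq_a) | set(freq_b)) / n ** 2
    if expected == 1:
        return 1.0  # both raters only ever use one shared label
    return (observed - expected) / (1 - expected)

# Two raters scoring four outputs on a rubric:
print(cohens_kappa([5, 3, 5, 3], [5, 3, 5, 3]))  # 1.0 (perfect agreement)
```

By convention, kappa above roughly 0.6 is read as substantial agreement; lower values suggest the rubric needs clarification or the evaluators need retraining.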
Sampling Strategy
Evaluate a representative sample (100-500 outputs) rather than every output. Stratify by input type and difficulty.
Continuous Sampling
In production, randomly sample outputs for ongoing human review. Set up alerts when quality scores trend downward.
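Both halves of continuous sampling, the random selection and the downward-trend alert, can be sketched as follows. The sampling rate, window size, and drop threshold are illustrative, not prescribed values:

```python
import random

def should_sample(rate=0.01, rng=random):
    """Randomly select ~1% of production outputs for human review."""
    return rng.random() < rate

def trending_down(scores, window=3, drop=0.05):
    """Alert when the recent mean quality score falls `drop` below
    the prior window's mean. Needs at least two full windows."""
    if len(scores) < 2 * window:
        return False
    recent = sum(scores[-window:]) / window
    prior = sum(scores[-2 * window:-window]) / window
    return prior - recent > drop
```

In a real pipeline, `trending_down` would run over the rolling history of human review scores and feed the alerting system described below.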
Bias Detection
Test your AI system for unfair bias across demographic groups, sensitive topics, and edge cases:
```python
from statistics import mean

def test_demographic_parity(prompt, test_pairs):
    """Test if outputs differ unfairly across groups."""
    results = {}
    for group, inputs in test_pairs.items():
        outputs = [llm.call(prompt, inp) for inp in inputs]
        results[group] = {
            "positive_rate": count_positive(outputs) / len(outputs),
            "avg_sentiment": mean_sentiment(outputs),
            "avg_length": mean(len(o) for o in outputs),
        }
    # Flag significant differences between groups
    max_diff = max_parity_difference(results)
    assert max_diff < 0.1, \
        f"Parity difference {max_diff} exceeds threshold"
```
Safety Testing
Ensure your AI system handles adversarial inputs and edge cases safely:
- Prompt injection testing: Verify the system resists attempts to override system prompts or instructions.
- Harmful content filtering: Test that the system refuses to generate harmful, illegal, or inappropriate content.
- Data leakage testing: Ensure the system does not reveal sensitive training data, API keys, or system prompts.
- Boundary testing: Test with extremely long inputs, empty inputs, special characters, and multiple languages.
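The boundary-testing item can be sketched as a small harness. `call_model` is a hypothetical wrapper standing in for the real inference call, and the cases and size limit are illustrative:

```python
# Illustrative boundary cases: empty, oversized, special characters,
# and mixed languages/scripts.
BOUNDARY_CASES = [
    "",
    "A" * 100_000,
    "'; DROP TABLE users;--",
    "¿Cómo estás? 你好 مرحبا",
]

def call_model(text, max_chars=50_000):
    """Stub model wrapper: reject oversized input, never crash otherwise."""
    if len(text) > max_chars:
        return {"error": "input too long"}
    return {"output": f"echo: {text[:20]}"}

def run_boundary_tests():
    """Return the cases that crashed or produced no structured result."""
    failures = []
    for case in BOUNDARY_CASES:
        try:
            result = call_model(case)
            if "output" not in result and "error" not in result:
                failures.append(case[:20])
        except Exception:
            failures.append(case[:20])  # any crash is a failed safety test
    return failures
```

The pass criterion is deliberately loose: every boundary input must yield either a valid output or a structured error, never an unhandled exception.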
Monitoring in Production
Set up dashboards and alerts for ongoing AI quality monitoring:
Real-Time Metrics:
- Request volume (requests/min)
- Error rate (% of failed calls)
- Latency (p50, p95, p99)
- Token usage (input + output)
- Cost accumulation ($/hour)

Quality Metrics (hourly):
- Format compliance rate
- Average output length
- User feedback scores (thumbs up/down)
- Escalation rate to humans

Alerts:
- Error rate > 5% for 5 minutes
- p95 latency > 10 seconds
- Daily cost exceeds budget by 20%
- User satisfaction drops below 80%
- Model provider reports degradation
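The error-rate alert rule can be sketched as a rolling-window check. The call-count window here is a stand-in for the 5-minute interval, and the class name is an assumption for the example:

```python
from collections import deque

class ErrorRateAlert:
    """Fire when the error rate over a rolling window exceeds 5%."""

    def __init__(self, window=100, threshold=0.05):
        self.calls = deque(maxlen=window)  # 1 = failed call, 0 = success
        self.threshold = threshold

    def record(self, failed):
        self.calls.append(1 if failed else 0)

    def firing(self):
        if not self.calls:
            return False
        return sum(self.calls) / len(self.calls) > self.threshold
```

Real deployments would key the window on timestamps rather than call counts and add hysteresis so the alert does not flap around the threshold.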