Quality Assurance in BMAD
Build a comprehensive AI quality assurance framework covering testing, evaluation metrics, regression testing, human evaluation, bias detection, safety testing, and production monitoring.
AI Quality Assurance Framework
Traditional QA tests for correctness — the output either matches the expected result or it does not. AI QA tests for quality on a spectrum, where outputs may be acceptable, good, or excellent, and the same input can produce different outputs each time.
Testing AI Outputs
BMAD defines three layers of AI testing:
- Automated Evaluation: Run prompts against labeled test datasets and measure accuracy, completeness, and format compliance automatically. This is your first line of defense.
- LLM-as-Judge: Use a separate AI model to evaluate the quality of another model's output. Cost-effective for large test sets, though less reliable than human evaluation.
- Human Evaluation: Subject matter experts review a sample of AI outputs for quality, accuracy, and appropriateness. The gold standard, but expensive and slow.
```python
from statistics import mean

class AIEvaluator:
    def evaluate(self, prompt, test_dataset):
        """Score a prompt against a labeled test dataset."""
        results = []
        for case in test_dataset:
            output = llm.call(prompt, case.input)  # model client call
            score = {
                "accuracy": self.check_accuracy(output, case.expected),
                "format": self.check_format(output, case.schema),
                "latency": output.latency_ms,
                "tokens": output.total_tokens,
            }
            results.append(score)
        return {
            "accuracy": mean(r["accuracy"] for r in results),
            "format_compliance": mean(r["format"] for r in results),
            "avg_latency": mean(r["latency"] for r in results),
            "total_tests": len(results),
        }
```
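The LLM-as-judge layer can be sketched in the same style. This is an illustrative design, not BMAD's exact API: the prompt template, the `judge` callable (any function wrapping the judge model), and the stub judge are all assumptions for the example.

```python
import re

# Illustrative judge prompt; real rubrics are usually more detailed.
JUDGE_TEMPLATE = """Rate the following answer from 1 to 5 for accuracy
and completeness. Reply with only the number.

Question: {question}
Answer: {answer}
Score:"""

def judge_score(judge, question, answer):
    """Ask a judge model for a 1-5 score and parse it from the reply."""
    reply = judge(JUDGE_TEMPLATE.format(question=question, answer=answer))
    match = re.search(r"[1-5]", reply)
    if match is None:
        raise ValueError(f"unparseable judge reply: {reply!r}")
    return int(match.group())

# Stub judge standing in for a real model call:
stub_judge = lambda prompt: "4"
print(judge_score(stub_judge, "What is the capital of France?", "Paris"))  # 4
```

Parsing the score defensively matters in practice: judge models often wrap the number in extra text, and an unparseable reply should surface as an error rather than a silent zero.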
Evaluation Metrics
| Metric | What It Measures | Target Range |
|---|---|---|
| Accuracy | Percentage of outputs matching expected results | 85-99% depending on use case |
| Hallucination Rate | Percentage of outputs containing fabricated information | <5% for factual tasks |
| Latency (p50/p95/p99) | Response time at different percentiles | Varies by feature requirements |
| Format Compliance | Percentage of outputs matching expected structure | >98% |
| Cost per Request | Average API cost per inference call | Set per business requirements |
| Consistency | Similarity of outputs for the same input across runs | >90% for deterministic tasks |
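The consistency metric can be made concrete with a small helper. Character-level similarity via `difflib` is an assumption here; production systems often use embedding similarity instead:

```python
from difflib import SequenceMatcher
from itertools import combinations
from statistics import mean

def consistency(outputs):
    """Mean pairwise string similarity (0-1) across repeated runs.

    A simple proxy for the consistency metric: run the same input N
    times and compare every pair of outputs.
    """
    pairs = list(combinations(outputs, 2))
    if not pairs:
        return 1.0  # a single run is trivially consistent
    return mean(SequenceMatcher(None, a, b).ratio() for a, b in pairs)

runs = [
    "Paris is the capital of France.",
    "Paris is the capital of France.",
    "The capital of France is Paris.",
]
print(round(consistency(runs), 2))
```

A score of 1.0 means every run produced the identical string; for deterministic tasks this is what the >90% target checks.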
Regression Testing for Prompts
When you update a prompt, ensure the new version does not degrade quality on previously passing cases:
```yaml
name: Prompt Regression Test
on:
  pull_request:
    paths:
      - 'prompts/**'
jobs:
  regression-test:
    runs-on: ubuntu-latest
    steps:
      - name: Run evaluation suite
        run: python eval/run_tests.py --prompt $CHANGED_PROMPT
      - name: Compare with baseline
        # Fail if accuracy drops more than 2%
        run: python eval/compare.py --threshold 0.02
```
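A minimal sketch of what the comparison step might check, assuming the evaluation scripts emit summaries shaped like the dict returned by `AIEvaluator.evaluate` above (the exact JSON shape is hypothetical):

```python
def regression_gate(baseline, current, threshold=0.02):
    """Pass only if accuracy did not drop more than `threshold`.

    `baseline` and `current` are evaluation summaries with at least
    an "accuracy" field between 0 and 1.
    """
    drop = baseline["accuracy"] - current["accuracy"]
    return drop <= threshold

# A 1-point drop passes the 2% threshold; a 5-point drop fails it:
print(regression_gate({"accuracy": 0.90}, {"accuracy": 0.89}))  # True
print(regression_gate({"accuracy": 0.90}, {"accuracy": 0.85}))  # False
```

Note the gate only blocks regressions; improvements (negative drop) always pass.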
Human Evaluation Workflows
Structure human evaluation for consistency and efficiency:
Rating Rubrics
Define clear scoring criteria (1-5 scale) for each quality dimension. Train evaluators on the rubric before they begin.
Inter-Rater Agreement
Have multiple evaluators rate the same outputs. Measure agreement (Cohen's kappa) to ensure consistency.
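Cohen's kappa compares observed agreement to the agreement expected by chance. A sketch for two raters labeling the same outputs:

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa: agreement between two raters, corrected for chance."""
    n = len(rater_a)
    assert n == len(rater_b) and n > 0
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    # Chance agreement: probability both raters pick the same label at random.
    expected = sum(freq_a[label] * freq_b[label]
                   for label in set(freq_a) | set(freq_b)) / n ** 2
    if expected == 1:
        return 1.0  # both raters only ever use one shared label
    return (observed - expected) / (1 - expected)

# Two raters scoring four outputs on a rubric:
print(cohens_kappa([5, 3, 5, 3], [5, 3, 5, 3]))  # 1.0 (perfect agreement)
```

By convention, kappa above roughly 0.6 is read as substantial agreement; lower values suggest the rubric needs clarification or the evaluators need retraining.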
Sampling Strategy
Evaluate a representative sample (100-500 outputs) rather than every output. Stratify by input type and difficulty.
Continuous Sampling
In production, randomly sample outputs for ongoing human review. Set up alerts when quality scores trend downward.
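Both halves of continuous sampling, the random selection and the downward-trend alert, can be sketched as follows. The sampling rate, window size, and drop threshold are illustrative, not prescribed values:

```python
import random

def should_sample(rate=0.01, rng=random):
    """Randomly select ~1% of production outputs for human review."""
    return rng.random() < rate

def trending_down(scores, window=3, drop=0.05):
    """Alert when the recent mean quality score falls `drop` below
    the prior window's mean. Needs at least two full windows."""
    if len(scores) < 2 * window:
        return False
    recent = sum(scores[-window:]) / window
    prior = sum(scores[-2 * window:-window]) / window
    return prior - recent > drop
```

In a real pipeline, `trending_down` would run over the rolling history of human review scores and feed the alerting system described below.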
Bias Detection
Test your AI system for unfair bias across demographic groups, sensitive topics, and edge cases:
```python
from statistics import mean

def test_demographic_parity(prompt, test_pairs):
    """Test if outputs differ unfairly across groups."""
    results = {}
    for group, inputs in test_pairs.items():
        outputs = [llm.call(prompt, inp) for inp in inputs]
        results[group] = {
            "positive_rate": count_positive(outputs) / len(outputs),
            "avg_sentiment": mean_sentiment(outputs),
            "avg_length": mean(len(o) for o in outputs),
        }
    # Flag significant differences between groups
    max_diff = max_parity_difference(results)
    assert max_diff < 0.1, \
        f"Parity difference {max_diff} exceeds threshold"
```
Safety Testing
Ensure your AI system handles adversarial inputs and edge cases safely:
- Prompt injection testing: Verify the system resists attempts to override system prompts or instructions.
- Harmful content filtering: Test that the system refuses to generate harmful, illegal, or inappropriate content.
- Data leakage testing: Ensure the system does not reveal sensitive training data, API keys, or system prompts.
- Boundary testing: Test with extremely long inputs, empty inputs, special characters, and multiple languages.
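The boundary-testing item can be sketched as a small harness. `call_model` is a hypothetical wrapper standing in for the real inference call, and the cases and size limit are illustrative:

```python
# Illustrative boundary cases: empty, oversized, special characters,
# and mixed languages/scripts.
BOUNDARY_CASES = [
    "",
    "A" * 100_000,
    "'; DROP TABLE users;--",
    "¿Cómo estás? 你好 مرحبا",
]

def call_model(text, max_chars=50_000):
    """Stub model wrapper: reject oversized input, never crash otherwise."""
    if len(text) > max_chars:
        return {"error": "input too long"}
    return {"output": f"echo: {text[:20]}"}

def run_boundary_tests():
    """Return the cases that crashed or produced no structured result."""
    failures = []
    for case in BOUNDARY_CASES:
        try:
            result = call_model(case)
            if "output" not in result and "error" not in result:
                failures.append(case[:20])
        except Exception:
            failures.append(case[:20])  # any crash is a failed safety test
    return failures
```

The pass criterion is deliberately loose: every boundary input must yield either a valid output or a structured error, never an unhandled exception.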
Monitoring in Production
Set up dashboards and alerts for ongoing AI quality monitoring:
Real-Time Metrics:
- Request volume (requests/min)
- Error rate (% of failed calls)
- Latency (p50, p95, p99)
- Token usage (input + output)
- Cost accumulation ($/hour)

Quality Metrics (hourly):
- Format compliance rate
- Average output length
- User feedback scores (thumbs up/down)
- Escalation rate to humans

Alerts:
- Error rate > 5% for 5 minutes
- p95 latency > 10 seconds
- Daily cost exceeds budget by 20%
- User satisfaction drops below 80%
- Model provider reports degradation
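The error-rate alert rule can be sketched as a rolling-window check. The call-count window here is a stand-in for the 5-minute interval, and the class name is an assumption for the example:

```python
from collections import deque

class ErrorRateAlert:
    """Fire when the error rate over a rolling window exceeds 5%."""

    def __init__(self, window=100, threshold=0.05):
        self.calls = deque(maxlen=window)  # 1 = failed call, 0 = success
        self.threshold = threshold

    def record(self, failed):
        self.calls.append(1 if failed else 0)

    def firing(self):
        if not self.calls:
            return False
        return sum(self.calls) / len(self.calls) > self.threshold
```

Real deployments would key the window on timestamps rather than call counts and add hysteresis so the alert does not flap around the threshold.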