LLM Evaluation Challenges
Why evaluating LLMs is uniquely hard. Part of the LLM Evaluation & Testing course at AI School by Lilly Tech Systems.
The LLM Evaluation Crisis
Evaluating large language models is one of the hardest unsolved problems in AI testing. Traditional ML metrics like accuracy, precision, and recall are insufficient for LLMs because the outputs are open-ended text. There is rarely a single "correct" answer. A response can be factually accurate but poorly written, or fluent yet riddled with subtle hallucinations. The evaluation space is multi-dimensional and often subjective.
This challenge has real consequences. Teams deploy LLMs to production without rigorous evaluation, leading to hallucinations reaching end users, subtle biases in generated content, inconsistent quality across different input types, and regression when prompts or models are updated. This course gives you practical tools and frameworks to address each of these challenges.
Why LLM Evaluation Is Uniquely Hard
LLM evaluation faces challenges that do not exist for traditional ML models:
- Open-ended outputs — There are multiple valid responses to most prompts, making automated comparison difficult
- Multi-dimensional quality — A response must be evaluated on accuracy, relevance, helpfulness, safety, tone, and more
- Context sensitivity — The same model can produce very different quality responses depending on the prompt phrasing
- Emergent behaviors — LLMs exhibit capabilities and failure modes that were not explicitly programmed or trained for
- Benchmark contamination — Popular benchmarks may have leaked into training data, inflating scores
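The open-ended-outputs problem is easy to see concretely: two answers can both be correct while sharing no exact string form, so exact-match scoring fails where a softer overlap measure does not. A minimal sketch (the `unigram_f1` helper is an illustrative toy metric, not a standard library function):

```python
# Two equally valid answers to "What is the capital of France?"
ref = "The capital of France is Paris."
cand = "Paris is the capital city of France."

# Exact string match fails even though both answers are correct.
exact = ref == cand  # False

def unigram_f1(a: str, b: str) -> float:
    """Token-overlap F1: a crude but fully automated similarity signal."""
    ta = set(a.lower().rstrip(".").split())
    tb = set(b.lower().rstrip(".").split())
    overlap = len(ta & tb)
    if overlap == 0:
        return 0.0
    precision = overlap / len(tb)
    recall = overlap / len(ta)
    return 2 * precision * recall / (precision + recall)
```

Here `unigram_f1(ref, cand)` scores above 0.9 while exact match scores 0, which is exactly why metric families like ROUGE and BERTScore exist.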
# Basic LLM evaluation framework
from typing import Dict, List, Optional

class LLMEvaluator:
    def __init__(self, model, eval_criteria: List[str]):
        self.model = model
        self.criteria = eval_criteria
        self.results = []

    def evaluate_response(self, prompt: str, response: str,
                          reference: Optional[str] = None) -> Dict:
        scores = {}
        # Relevance: does the response address the prompt?
        scores['relevance'] = self._score_relevance(prompt, response)
        # Faithfulness: is the response factually grounded in the reference?
        if reference is not None:
            scores['faithfulness'] = self._score_faithfulness(
                response, reference
            )
        # Coherence: is the response well-structured and logical?
        scores['coherence'] = self._score_coherence(response)
        # Safety: does the response contain harmful content?
        scores['safety'] = self._score_safety(response)

        result = {
            'prompt': prompt,
            'response': response,
            'scores': scores,
            'overall': sum(scores.values()) / len(scores),
        }
        self.results.append(result)
        return result

    def _score_relevance(self, prompt: str, response: str) -> float:
        # Implementation depends on your scoring method.
        # Options: embedding similarity, LLM-as-judge, keyword matching.
        raise NotImplementedError

    def _score_faithfulness(self, response: str, reference: str) -> float:
        # Check whether claims in the response are supported by the reference.
        raise NotImplementedError

    def _score_coherence(self, response: str) -> float:
        # Evaluate structural quality and logical flow.
        raise NotImplementedError

    def _score_safety(self, response: str) -> float:
        # Check for harmful, biased, or inappropriate content.
        raise NotImplementedError
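One way the scoring stubs above might be filled in, using deliberately simple lexical heuristics so the whole thing runs end to end. This is purely illustrative: the class name, heuristics, and blocklist below are assumptions for the sketch, and a production system would replace each scorer with embeddings, a classifier, or an LLM judge.

```python
from typing import Dict, Optional

class HeuristicEvaluator:
    """Toy evaluator: cheap lexical heuristics standing in for real scorers."""

    def evaluate_response(self, prompt: str, response: str,
                          reference: Optional[str] = None) -> Dict:
        scores = {
            'relevance': self._score_relevance(prompt, response),
            'coherence': self._score_coherence(response),
            'safety': self._score_safety(response),
        }
        if reference is not None:
            scores['faithfulness'] = self._score_faithfulness(response, reference)
        return {'scores': scores,
                'overall': sum(scores.values()) / len(scores)}

    def _score_relevance(self, prompt: str, response: str) -> float:
        # Fraction of the prompt's content words echoed in the response.
        p = {w for w in prompt.lower().split() if len(w) > 3}
        r = set(response.lower().split())
        return len(p & r) / len(p) if p else 0.0

    def _score_faithfulness(self, response: str, reference: str) -> float:
        # Fraction of response words that also appear in the reference.
        r = set(response.lower().split())
        return len(r & set(reference.lower().split())) / len(r) if r else 0.0

    def _score_coherence(self, response: str) -> float:
        # Penalise empty or one-word responses; otherwise assume acceptable.
        return 1.0 if len(response.split()) >= 3 else 0.0

    def _score_safety(self, response: str) -> float:
        # Naive blocklist check; real safety scoring needs a classifier.
        blocklist = {'weapon', 'exploit'}
        return 0.0 if blocklist & set(response.lower().split()) else 1.0

result = HeuristicEvaluator().evaluate_response(
    prompt="Explain what evaluation means for language models.",
    response="Evaluation means measuring language models against quality criteria.",
)
```

Even this toy version shows the shape of the framework: each dimension gets its own scorer, and the aggregate is only as trustworthy as its weakest component.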
The Evaluation Stack
A robust LLM evaluation strategy combines multiple approaches:
- Automated metrics — BLEU, ROUGE, BERTScore for quantitative baseline (fast, cheap, but limited)
- LLM-as-judge — Using a stronger LLM to evaluate the target model (scalable, reasonable quality)
- Human evaluation — Expert annotators scoring responses (gold standard, but expensive and slow)
- Behavioral testing — Checklist-style tests for specific capabilities and failure modes
- Red teaming — Adversarial testing to find safety and robustness issues
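The LLM-as-judge layer of the stack typically boils down to two pieces: a grading prompt sent to a strong model, and a parser that extracts a structured score from its free-text reply. A minimal sketch, where `call_judge_model` is a hypothetical wrapper around whatever LLM API you use (not shown here):

```python
import re

JUDGE_TEMPLATE = """You are grading an AI assistant's answer.
Question: {prompt}
Answer: {response}
Rate the answer's helpfulness from 1 to 5.
Reply in the form: SCORE: <number>"""

def parse_judge_score(judge_reply: str) -> int:
    """Extract the 1-5 score from the judge model's reply."""
    match = re.search(r"SCORE:\s*([1-5])", judge_reply)
    if match is None:
        raise ValueError(f"Unparseable judge reply: {judge_reply!r}")
    return int(match.group(1))

# In practice the reply would come from a strong model, e.g.
#   judge_reply = call_judge_model(JUDGE_TEMPLATE.format(prompt=p, response=r))
# Here we use a canned reply to show the parsing step.
judge_reply = "The answer is accurate and clear. SCORE: 4"
score = parse_judge_score(judge_reply)  # → 4
```

Constraining the judge to a fixed output format (and failing loudly on unparseable replies) is what makes this approach scalable; free-form judge commentary is useful for debugging but hard to aggregate.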
Evaluation Dimensions for Production LLMs
When evaluating LLMs for production deployment, assess these dimensions:
- Factual accuracy — Does the model generate correct information?
- Hallucination rate — How often does it state things that are not true?
- Instruction following — Does it follow the prompt's instructions precisely?
- Consistency — Does it give similar answers to similar questions?
- Robustness — Does quality degrade with slightly rephrased inputs?
- Safety — Does it refuse harmful requests and avoid generating harmful content?
- Latency — Does it respond within acceptable time limits?
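The consistency and robustness dimensions above can be probed automatically by asking the model paraphrase pairs of the same question and measuring how often its answers agree. A hedged sketch: `model_answer` is any callable mapping a prompt to a reply, and the canned dictionary below is a stub standing in for a real model call.

```python
from typing import Callable, List, Tuple

def consistency_rate(model_answer: Callable[[str], str],
                     paraphrase_pairs: List[Tuple[str, str]]) -> float:
    """Fraction of paraphrase pairs that get the same normalised answer:
    a crude probe of the consistency/robustness dimensions."""
    agree = 0
    for a, b in paraphrase_pairs:
        if model_answer(a).strip().lower() == model_answer(b).strip().lower():
            agree += 1
    return agree / len(paraphrase_pairs)

# Stub "model" for illustration; swap in a real model call in practice.
canned = {
    "What is 2 + 2?": "4",
    "What do you get when you add two and two?": "4",
    "Name the largest planet.": "Jupiter",
    "Which planet is the biggest?": "Saturn",   # deliberate inconsistency
}
pairs = [
    ("What is 2 + 2?", "What do you get when you add two and two?"),
    ("Name the largest planet.", "Which planet is the biggest?"),
]
rate = consistency_rate(canned.get, pairs)  # → 0.5
```

Exact string agreement is a strict criterion; for longer answers you would substitute a softer comparison (embedding similarity or an LLM judge) in place of the string equality check.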