LLM Evaluation — Measuring Output Quality
Evaluating LLMs is harder than evaluating traditional ML models: you can't simply compute accuracy when outputs are free-form text. Modern approaches include ROUGE/BLEU for summarization and translation, LLM-as-judge for quality assessment, RAGAS for RAG pipelines, and human evaluation as the ground truth.
LLM Evaluation Framework
# ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
# LLM EVALUATION METRICS & APPROACHES
# ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
# ── 1. ROUGE (for summarization) ──────────────────────
from rouge_score import rouge_scorer
scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
reference = "The cat sat on the mat and fell asleep."
generated = "The cat lay down on the mat and went to sleep."
scores = scorer.score(reference, generated)
for metric, score in scores.items():
print(f" {metric:8s}: P={score.precision:.3f} R={score.recall:.3f} F1={score.fmeasure:.3f}")
# ROUGE-1: unigram overlap (word-level)
# ROUGE-2: bigram overlap (phrase-level)
# ROUGE-L: longest common subsequence (sentence structure)
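# ── 1b. BLEU (for translation) ────────────────────────
# Minimal sketch using sacrebleu (pip install sacrebleu); the package choice
# is an assumption here, the nltk implementation works similarly.
import sacrebleu
bleu = sacrebleu.sentence_bleu(generated, [reference])
print(f"  BLEU: {bleu.score:.2f}")  # 0-100; modified n-gram precision with a brevity penalty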
# ── 2. LLM-AS-JUDGE — GPT-4o evaluates another model's output ──
import json
from openai import OpenAI
client = OpenAI()
def llm_judge(question: str, answer: str, criteria: list[str]) -> dict:
"""Use GPT-4o to evaluate quality of an answer."""
criteria_str = "\n".join(f"{i+1}. {c}" for i, c in enumerate(criteria))
response = client.chat.completions.create(
model="gpt-4o",
messages=[{
"role": "user",
"content": f"""Evaluate this answer on a scale of 1-10 for each criterion.
Question: {question}
Answer: {answer}
Criteria:
{criteria_str}
Respond as JSON: {{"scores": {{}}, "overall": 0, "reasoning": ""}}"""
}],
response_format={"type": "json_object"},
temperature=0,
)
import json
return json.loads(response.choices[0].message.content)
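# Example call (the question, answer, and criteria names below are illustrative):
# result = llm_judge(
#     question="What is the return policy?",
#     answer="Items can be returned within 30 days with a receipt.",
#     criteria=["accuracy", "completeness", "conciseness"],
# )
# print(result["overall"], result["reasoning"])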
# ── 3. RAGAS — Evaluation for RAG pipelines ─────────────
# pip install ragas
from ragas import evaluate
from ragas.metrics import (
    faithfulness,        # is the answer grounded in the retrieved context?
    answer_relevancy,    # is the answer relevant to the question?
    context_precision,   # did we retrieve the right documents?
    context_recall,      # did we retrieve ALL relevant documents?
)
# RAGAS eval dataset format:
eval_data = {
    "question": ["What is the return policy?"],
    "answer": ["Items can be returned within 30 days with a receipt."],
    "contexts": [["Our policy: items can be returned within 30 days of purchase with original receipt."]],
    "ground_truth": ["30-day return policy with receipt required."],
}
# from datasets import Dataset
# ragas_dataset = Dataset.from_dict(eval_data)
# results = evaluate(ragas_dataset, metrics=[faithfulness, answer_relevancy, context_precision, context_recall])
# print(results) # scores for each metric
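# Scores can also be inspected per sample, e.g. via results.to_pandas() in recent
# ragas versions (method availability is an assumption; check your installed version).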
# ── 4. HALLUCINATION DETECTION ─────────────────────────
def check_hallucination(question: str, answer: str, source_docs: list[str]) -> bool:
    """Use an LLM to detect whether the answer contains information not in the sources."""
    context = "\n".join(source_docs)
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": f"""Does this answer contain claims NOT supported by the context?
Context: {context}
Answer: {answer}
Respond only: HALLUCINATION or GROUNDED"""
        }],
        temperature=0,
        max_tokens=10,
    )
    return "HALLUCINATION" in response.choices[0].message.content.upper()
Tip
Practice these LLM evaluation techniques in small, isolated examples before integrating them into larger projects. Breaking concepts into small experiments builds genuine understanding faster than reading alone.
Practice Task
(1) Write a working LLM evaluation example from scratch without looking at notes. (2) Modify it to handle an edge case (empty input, null value, or error state). (3) Share your solution in the Priygop community for feedback.
Common Mistake
A common mistake with LLM evaluation is skipping edge-case testing: empty inputs, null values, and unexpected data types. Always validate boundary conditions to write robust, production-ready AI code.
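For example, a minimal guard around the llm_judge helper defined above might look like this; the exact checks and fallback values are illustrative, not a fixed API.
def safe_judge(question: str, answer: str, criteria: list[str]) -> dict:
    # Validate boundary conditions before spending an API call (illustrative checks)
    if not question.strip() or not answer.strip():
        return {"scores": {}, "overall": 0, "reasoning": "empty question or answer"}
    if not criteria:
        raise ValueError("at least one evaluation criterion is required")
    return llm_judge(question, answer, criteria)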