LLM Evaluation — Benchmarking Fine-Tuned Models
Evaluating a fine-tuned LLM requires domain-specific benchmarks and the EleutherAI Language Model Evaluation Harness (lm-eval), the standard tool used by most open-source model releases. For open-ended generation, LLM-as-judge provides complementary qualitative assessment.
Model Evaluation with lm-eval-harness
# ELEUTHERAI LM EVALUATION HARNESS
# pip install lm-eval
# Run from the command line:
# lm_eval --model hf --model_args pretrained=./llama3-medical-merged \
#   --tasks mmlu_clinical_knowledge,truthfulqa_mc1 \
#   --device cuda:0 --batch_size 8 --output_path ./eval_results

import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=./llama3-medical-merged",
    tasks=["mmlu_clinical_knowledge", "mmlu_medical_genetics", "truthfulqa_mc1"],
    batch_size=8,
    device="cuda:0",
)

# Each task reports its metrics under results["results"]; "acc,none" is the
# accuracy key used by these multiple-choice tasks.
for task_name, task_results in results["results"].items():
    print(f"{task_name}: acc={task_results['acc,none']:.2%}")
# DOMAIN-SPECIFIC EVALUATION
from openai import OpenAI  # used below to query the fine-tuned model via an OpenAI-compatible endpoint
from datasets import load_dataset

def evaluate_medical_qa(model_fn, n_samples: int = 100) -> dict:
    # Evaluate on the MedQA benchmark (USMLE-style multiple-choice questions).
    # model_fn takes a prompt string and returns the model's text response.
    medqa = load_dataset("GBaker/MedQA-USMLE-4-options", split="test").select(range(n_samples))
    correct = 0
    for example in medqa:
        question = example["question"]
        options = "\n".join(f"{k}: {v}" for k, v in example["options"].items())
        # "answer_idx" holds the correct option letter (A-D); "answer" is the full answer text.
        correct_letter = example["answer_idx"]
        model_answer = model_fn(f"{question}\n\nOptions:\n{options}\n\nAnswer (A/B/C/D):")
        if correct_letter.upper() in model_answer.upper()[:5]:
            correct += 1
    return {"medqa_accuracy": correct / n_samples, "n_samples": n_samples}
print("Expected results after fine-tuning on medical QA:")
print(" Base LLaMA-3-8B-Instruct: ~57% on MedQA")
print(" QLoRA fine-tuned (300 ex): ~64% on MedQA (+7% with 300 examples)")
print(" QLoRA fine-tuned (5K ex): ~71% on MedQA (+14% with 5K examples)")Tip
Tip
Practice LLM evaluation and benchmarking of fine-tuned models in small, isolated examples before integrating them into larger projects. Breaking concepts into small experiments builds genuine understanding faster than reading alone.
Practice Task
Note
(1) Write a working example of benchmarking a fine-tuned model from scratch without looking at notes. (2) Modify it to handle an edge case (empty input, null value, or error state). (3) Share your solution in the Priygop community for feedback.
Common Mistake
Warning
A common mistake when benchmarking fine-tuned models is skipping edge case testing: empty inputs, null values, and unexpected data types. Always validate boundary conditions to write robust, production-ready AI code.
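As a concrete illustration of those boundary conditions, the answer-matching step in evaluate_medical_qa can be guarded against empty or malformed model output. The helper below is a hypothetical addition, not part of the original benchmark code.

# Hypothetical guard for the answer-matching step: tolerate empty or
# malformed model output instead of crashing or silently mis-scoring.
from typing import Optional

def extract_choice(model_answer: str) -> Optional[str]:
    if not model_answer or not model_answer.strip():
        return None  # empty output counts as incorrect
    for char in model_answer.strip().upper():
        if char in "ABCD":
            return char  # first A-D letter found in the reply
    return None  # no recognizable choice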