LLM Evaluation — Benchmarking Fine-Tuned Models
Evaluating a fine-tuned LLM requires domain-specific benchmarks and the EleutherAI Language Model Evaluation Harness (lm-eval), the standard tool used by most open-source model releases. For open-ended generation, LLM-as-judge provides complementary qualitative assessment.
Model Evaluation with lm-eval-harness
# ELEUTHERAI LM EVALUATION HARNESS
# pip install lm-eval
# Run from the command line:
# lm_eval --model hf --model_args pretrained=./llama3-medical-merged \
#   --tasks mmlu_clinical_knowledge,truthfulqa_mc1 \
#   --device cuda:0 --batch_size 8 --output_path ./eval_results

import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=./llama3-medical-merged",
    tasks=["mmlu_clinical_knowledge", "mmlu_medical_genetics", "truthfulqa_mc1"],
    batch_size=8,
    device="cuda:0",
)

# Each task reports its metrics under results["results"]; "acc,none" is the
# accuracy key used by these multiple-choice tasks.
for task_name, task_results in results["results"].items():
    print(f"{task_name}: acc={task_results['acc,none']:.2%}")
# DOMAIN-SPECIFIC EVALUATION
from openai import OpenAI  # used below to query the fine-tuned model via an OpenAI-compatible endpoint
from datasets import load_dataset

def evaluate_medical_qa(model_fn, n_samples: int = 100) -> dict:
    # Evaluate on the MedQA benchmark (USMLE-style multiple-choice questions).
    # model_fn takes a prompt string and returns the model's text response.
    medqa = load_dataset("GBaker/MedQA-USMLE-4-options", split="test").select(range(n_samples))
    correct = 0
    for example in medqa:
        question = example["question"]
        options = "\n".join(f"{k}: {v}" for k, v in example["options"].items())
        # "answer_idx" holds the correct option letter (A-D); "answer" is the full answer text.
        correct_letter = example["answer_idx"]
        model_answer = model_fn(f"{question}\n\nOptions:\n{options}\n\nAnswer (A/B/C/D):")
        if correct_letter.upper() in model_answer.upper()[:5]:
            correct += 1
    return {"medqa_accuracy": correct / n_samples, "n_samples": n_samples}
print("Expected results after fine-tuning on medical QA:")
print(" Base LLaMA-3-8B-Instruct: ~57% on MedQA")
print(" QLoRA fine-tuned (300 ex): ~64% on MedQA (+7% with 300 examples)")
print(" QLoRA fine-tuned (5K ex): ~71% on MedQA (+14% with 5K examples)")Tip
Tip
Practice LLM evaluation and benchmarking of fine-tuned models in small, isolated examples before integrating them into larger projects. Breaking concepts into small experiments builds genuine understanding faster than reading alone.
Practice Task
Note
(1) Write a working example of benchmarking a fine-tuned model from scratch without looking at notes. (2) Modify it to handle an edge case (empty input, null value, or error state). (3) Share your solution in the Priygop community for feedback.
Common Mistake
Warning
A common mistake when benchmarking fine-tuned models is skipping edge case testing: empty inputs, null values, and unexpected data types. Always validate boundary conditions to write robust, production-ready AI code.
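As a concrete illustration of those boundary conditions, the answer-matching step in evaluate_medical_qa can be guarded against empty or malformed model output. The helper below is a hypothetical addition, not part of the original benchmark code.

# Hypothetical guard for the answer-matching step: tolerate empty or
# malformed model output instead of crashing or silently mis-scoring.
from typing import Optional

def extract_choice(model_answer: str) -> Optional[str]:
    if not model_answer or not model_answer.strip():
        return None  # empty output counts as incorrect
    for char in model_answer.strip().upper():
        if char in "ABCD":
            return char  # first A-D letter found in the reply
    return None  # no recognizable choice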