Chat Templates — Formatting Data for Instruction Tuning
Every chat model expects a specific token format for system/user/assistant turns. A mismatched format degrades performance regardless of data quality. The Hugging Face tokenizer's apply_chat_template method applies each model's own template automatically, so you never hand-write these tokens.
Chat Template Formatting
from transformers import AutoTokenizer
from datasets import Dataset
# Use apply_chat_template -- it applies the template shipped with the model's tokenizer
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")
messages = [
    {"role": "system", "content": "You are a medical information assistant. Be accurate."},
    {"role": "user", "content": "What are the symptoms of type 2 diabetes?"},
    {"role": "assistant", "content": "Common symptoms: frequent urination, increased thirst, fatigue, blurred vision."},
]
formatted = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=False,  # False for training data; True when prompting for a completion
)
print(formatted)
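Under the hood, a chat template is just a Jinja template stored with the tokenizer that renders the message list into a single string with model-specific delimiters. As an illustration of the idea only (the tokens below mimic the LLaMA 3 style, but every model family uses different ones, and the real template handles more cases), a minimal sketch:

```python
# Illustrative sketch of what a chat template does. The special tokens
# here imitate LLaMA 3's format but are for demonstration only --
# always use tokenizer.apply_chat_template for real training data.

def render_llama3_style(messages, add_generation_prompt=False):
    parts = ["<|begin_of_text|>"]
    for msg in messages:
        # Each turn: a role header, the content, and an end-of-turn token.
        parts.append(
            f"<|start_header_id|>{msg['role']}<|end_header_id|>\n\n"
            f"{msg['content']}<|eot_id|>"
        )
    if add_generation_prompt:
        # At inference time, open an assistant turn for the model to complete.
        parts.append("<|start_header_id|>assistant<|end_header_id|>\n\n")
    return "".join(parts)

demo = render_llama3_style(
    [{"role": "user", "content": "Hi"}],
    add_generation_prompt=True,
)
print(demo)
```

This also shows why add_generation_prompt matters: for training data the assistant's answer closes the final turn, while for inference you leave an open assistant header for the model to fill in.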
# DATASET FORMATTING FUNCTION
def format_medical_qa(example: dict, tok) -> dict:
    """Format a QA pair into LLaMA 3 chat format."""
    conversation = [
        {"role": "system", "content": "You are an expert medical information assistant."},
        {"role": "user", "content": example["question"]},
        {"role": "assistant", "content": example["answer"]},
    ]
    text = tok.apply_chat_template(conversation, tokenize=False)
    return {"text": text}
sample_data = [
{"question": "What is HbA1c?", "answer": "HbA1c measures average blood glucose over 2-3 months. Below 5.7% = normal."},
{"question": "How does insulin resistance develop?", "answer": "Insulin resistance occurs when cells don't respond normally to insulin."},
{"question": "Difference between Type 1 and Type 2 diabetes?", "answer": "Type 1: autoimmune. Type 2: lifestyle-related insulin resistance."},
] * 100
dataset = Dataset.from_list(sample_data)
formatted_ds = dataset.map(lambda x: format_medical_qa(x, tokenizer), remove_columns=dataset.column_names)
print(f"Training examples: {len(formatted_ds)}")
Tip
Practice chat-template formatting in small, isolated examples before integrating it into larger projects. Breaking concepts into small experiments builds genuine understanding faster than reading alone.
Practice Task
(1) Write a working example of chat-template formatting for instruction tuning from scratch, without looking at notes. (2) Modify it to handle an edge case (empty input, null value, or error state). (3) Share your solution in the Priygop community for feedback.
Common Mistake
A common mistake with chat-template formatting is skipping edge-case testing: empty inputs, null values, and unexpected data types. Always validate boundary conditions before training to produce robust, production-ready pipelines.
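As a concrete sketch of that advice, the helper below filters out records that would produce degenerate training text. The validate_example function and its rules are my own illustration, not part of any library:

```python
def validate_example(example: dict) -> bool:
    # Reject records with missing keys, None values,
    # or empty/whitespace-only strings.
    for key in ("question", "answer"):
        value = example.get(key)
        if not isinstance(value, str) or not value.strip():
            return False
    return True

raw = [
    {"question": "What is HbA1c?", "answer": "A measure of average glucose."},
    {"question": "", "answer": "Empty question should be dropped."},
    {"question": "No answer?", "answer": None},
    {"answer": "Missing question key."},
]
clean = [ex for ex in raw if validate_example(ex)]
print(f"Kept {len(clean)} of {len(raw)} examples")  # Kept 1 of 4
```

With Hugging Face datasets, the same check slots in as dataset.filter(validate_example) before the map step, so invalid rows never reach the formatting function.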