Chat Templates — Formatting Data for Instruction Tuning
Every chat model expects a specific token format for system/user/assistant turns. A mismatched format degrades performance regardless of data quality. The Hugging Face tokenizer's apply_chat_template method applies each model's own template automatically, so you never hand-write these tokens.
Chat Template Formatting
from transformers import AutoTokenizer
from datasets import Dataset
# Use apply_chat_template -- it applies the template shipped with the model's tokenizer
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")
messages = [
    {"role": "system", "content": "You are a medical information assistant. Be accurate."},
    {"role": "user", "content": "What are the symptoms of type 2 diabetes?"},
    {"role": "assistant", "content": "Common symptoms: frequent urination, increased thirst, fatigue, blurred vision."},
]
formatted = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=False,  # False for training data; True when prompting for a completion
)
print(formatted)
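Under the hood, a chat template is just a Jinja template stored with the tokenizer that renders the message list into a single string with model-specific delimiters. As an illustration of the idea only (the tokens below mimic the LLaMA 3 style, but every model family uses different ones, and the real template handles more cases), a minimal sketch:

```python
# Illustrative sketch of what a chat template does. The special tokens
# here imitate LLaMA 3's format but are for demonstration only --
# always use tokenizer.apply_chat_template for real training data.

def render_llama3_style(messages, add_generation_prompt=False):
    parts = ["<|begin_of_text|>"]
    for msg in messages:
        # Each turn: a role header, the content, and an end-of-turn token.
        parts.append(
            f"<|start_header_id|>{msg['role']}<|end_header_id|>\n\n"
            f"{msg['content']}<|eot_id|>"
        )
    if add_generation_prompt:
        # At inference time, open an assistant turn for the model to complete.
        parts.append("<|start_header_id|>assistant<|end_header_id|>\n\n")
    return "".join(parts)

demo = render_llama3_style(
    [{"role": "user", "content": "Hi"}],
    add_generation_prompt=True,
)
print(demo)
```

This also shows why add_generation_prompt matters: for training data the assistant's answer closes the final turn, while for inference you leave an open assistant header for the model to fill in.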
# DATASET FORMATTING FUNCTION
def format_medical_qa(example: dict, tok) -> dict:
    """Format a QA pair into LLaMA 3 chat format."""
    conversation = [
        {"role": "system", "content": "You are an expert medical information assistant."},
        {"role": "user", "content": example["question"]},
        {"role": "assistant", "content": example["answer"]},
    ]
    text = tok.apply_chat_template(conversation, tokenize=False)
    return {"text": text}
sample_data = [
{"question": "What is HbA1c?", "answer": "HbA1c measures average blood glucose over 2-3 months. Below 5.7% = normal."},
{"question": "How does insulin resistance develop?", "answer": "Insulin resistance occurs when cells don't respond normally to insulin."},
{"question": "Difference between Type 1 and Type 2 diabetes?", "answer": "Type 1: autoimmune. Type 2: lifestyle-related insulin resistance."},
] * 100
dataset = Dataset.from_list(sample_data)
formatted_ds = dataset.map(lambda x: format_medical_qa(x, tokenizer), remove_columns=dataset.column_names)
print(f"Training examples: {len(formatted_ds)}")
Tip
Practice chat-template formatting in small, isolated examples before integrating it into larger projects. Breaking concepts into small experiments builds genuine understanding faster than reading alone.
Practice Task
(1) Write a working example of chat-template formatting for instruction tuning from scratch, without looking at notes. (2) Modify it to handle an edge case (empty input, null value, or error state). (3) Share your solution in the Priygop community for feedback.
Common Mistake
A common mistake with chat-template formatting is skipping edge-case testing: empty inputs, null values, and unexpected data types. Always validate boundary conditions before training to produce robust, production-ready pipelines.
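As a concrete sketch of that advice, the helper below filters out records that would produce degenerate training text. The validate_example function and its rules are my own illustration, not part of any library:

```python
def validate_example(example: dict) -> bool:
    # Reject records with missing keys, None values,
    # or empty/whitespace-only strings.
    for key in ("question", "answer"):
        value = example.get(key)
        if not isinstance(value, str) or not value.strip():
            return False
    return True

raw = [
    {"question": "What is HbA1c?", "answer": "A measure of average glucose."},
    {"question": "", "answer": "Empty question should be dropped."},
    {"question": "No answer?", "answer": None},
    {"answer": "Missing question key."},
]
clean = [ex for ex in raw if validate_example(ex)]
print(f"Kept {len(clean)} of {len(raw)} examples")  # Kept 1 of 4
```

With Hugging Face datasets, the same check slots in as dataset.filter(validate_example) before the map step, so invalid rows never reach the formatting function.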