Quantization & GGUF — Running LLMs on CPU
GGUF quantization lets you run 8B LLMs on a laptop CPU at 5-10 tokens/second without any GPU. Ollama makes this one command. GPTQ and AWQ provide GPU quantization with 4-8x memory reduction and minimal quality loss.
Quantization with GGUF and Ollama
# QUANTIZATION FORMATS
# GGUF Q4_K_M: 4-bit K-quant, best quality/size tradeoff -- the default choice for CPU inference
# GPTQ: post-training 4-bit GPU quantization that needs a calibration dataset
# AWQ: Activation-aware Weight Quantization -- typically slightly better quality than GPTQ at the same bit width
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig
gptq_config = GPTQConfig(
    bits=4,          # 4-bit weights
    dataset="c4",    # calibration data used to compute quantization scales
    tokenizer=AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B"),
)
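# A minimal sketch of applying the config (assumes the optimum and auto-gptq
# packages are installed and a CUDA GPU is available -- GPTQ quantizes on the GPU):
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B",
    quantization_config=gptq_config,  # weights are quantized on the fly while loading
    device_map="auto",
)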
# Quality comparison (perplexity on WikiText-2, lower=better):
quant_quality = {
    "Full fp16 (baseline)": {"perplexity": 7.72, "size_GB": 16.0},
    "AWQ Q4": {"perplexity": 7.84, "size_GB": 4.2},
    "GPTQ Q4": {"perplexity": 7.90, "size_GB": 4.2},
    "GGUF Q4_K_M": {"perplexity": 7.95, "size_GB": 4.9},
    "GGUF Q2_K": {"perplexity": 8.80, "size_GB": 2.8},
}
for name, stats in quant_quality.items():
    print(f" {name:25s}: ppl={stats['perplexity']} | size={stats['size_GB']}GB")
# OLLAMA -- one-command local LLM inference
# Install: https://ollama.ai
# ollama pull llama3.2
import ollama
response = ollama.chat(
    model="llama3.2",
    messages=[
        {"role": "system", "content": "You are a Python expert."},
        {"role": "user", "content": "Write a function to find the longest common subsequence."},
    ],
)
print(response["message"]["content"])
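# Sampling options can be passed per call (a sketch -- option names such as
# temperature and num_ctx follow Ollama's Modelfile parameters; check your installed version):
response = ollama.chat(
    model="llama3.2",
    messages=[{"role": "user", "content": "Summarize GGUF quantization in one sentence."}],
    options={"temperature": 0.2, "num_ctx": 4096},
)
print(response["message"]["content"])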
# Streaming
for chunk in ollama.chat(
    model="llama3.2",
    messages=[{"role": "user", "content": "Count to 5"}],
    stream=True,
):
    print(chunk["message"]["content"], end="", flush=True)
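Ollama model tags usually encode the GGUF quantization level, so you can trade quality for memory explicitly. A minimal sketch follows; the exact tag below is an assumption, so check `ollama list` or the Ollama model library for the tags that actually exist:
# ollama pull llama3.2:3b-instruct-q8_0    # hypothetical higher-precision 8-bit tag
response = ollama.chat(
    model="llama3.2:3b-instruct-q8_0",  # assumed tag name
    messages=[{"role": "user", "content": "Why does Q4_K_M beat Q2_K on perplexity?"}],
)
print(response["message"]["content"])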
Tip
Practice GGUF quantization and Ollama in small, isolated examples before integrating them into larger projects. Breaking concepts into small experiments builds genuine understanding faster than reading alone.
Practice Task
(1) Write a working example of GGUF quantization with Ollama from scratch without looking at your notes. (2) Modify it to handle an edge case (empty input, null value, or error state). (3) Share your solution in the Priygop community for feedback.
Common Mistake
A common mistake when running quantized GGUF models is skipping edge-case testing: empty inputs, null values, and unexpected data types. Always validate boundary conditions to write robust, production-ready AI code.