Quantization & GGUF — Running LLMs on CPU
GGUF quantization lets you run 8B LLMs on a laptop CPU at 5-10 tokens/second without any GPU. Ollama makes this one command. GPTQ and AWQ provide GPU quantization with 4-8x memory reduction and minimal quality loss.
Quantization with GGUF and Ollama
# QUANTIZATION FORMATS
# GGUF Q4_K_M: 4-bit K-quant, best quality/size tradeoff -- the default choice for CPU inference
# GPTQ: post-training 4-bit GPU quantization that needs a calibration dataset
# AWQ: Activation-aware Weight Quantization -- typically slightly better quality than GPTQ at the same bit width
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig
gptq_config = GPTQConfig(
    bits=4,          # 4-bit weights
    dataset="c4",    # calibration data used to compute quantization scales
    tokenizer=AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B"),
)
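# A minimal sketch of applying the config (assumes the optimum and auto-gptq
# packages are installed and a CUDA GPU is available -- GPTQ quantizes on the GPU):
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B",
    quantization_config=gptq_config,  # weights are quantized on the fly while loading
    device_map="auto",
)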
# Quality comparison (perplexity on WikiText-2, lower=better):
quant_quality = {
    "Full fp16 (baseline)": {"perplexity": 7.72, "size_GB": 16.0},
    "AWQ Q4": {"perplexity": 7.84, "size_GB": 4.2},
    "GPTQ Q4": {"perplexity": 7.90, "size_GB": 4.2},
    "GGUF Q4_K_M": {"perplexity": 7.95, "size_GB": 4.9},
    "GGUF Q2_K": {"perplexity": 8.80, "size_GB": 2.8},
}
for name, stats in quant_quality.items():
    print(f" {name:25s}: ppl={stats['perplexity']} | size={stats['size_GB']}GB")
# OLLAMA -- one-command local LLM inference
# Install: https://ollama.ai
# ollama pull llama3.2
import ollama
response = ollama.chat(
    model="llama3.2",
    messages=[
        {"role": "system", "content": "You are a Python expert."},
        {"role": "user", "content": "Write a function to find the longest common subsequence."},
    ],
)
print(response["message"]["content"])
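# Sampling options can be passed per call (a sketch -- option names such as
# temperature and num_ctx follow Ollama's Modelfile parameters; check your installed version):
response = ollama.chat(
    model="llama3.2",
    messages=[{"role": "user", "content": "Summarize GGUF quantization in one sentence."}],
    options={"temperature": 0.2, "num_ctx": 4096},
)
print(response["message"]["content"])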
# Streaming
for chunk in ollama.chat(
    model="llama3.2",
    messages=[{"role": "user", "content": "Count to 5"}],
    stream=True,
):
    print(chunk["message"]["content"], end="", flush=True)
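Ollama model tags usually encode the GGUF quantization level, so you can trade quality for memory explicitly. A minimal sketch follows; the exact tag below is an assumption, so check `ollama list` or the Ollama model library for the tags that actually exist:
# ollama pull llama3.2:3b-instruct-q8_0    # hypothetical higher-precision 8-bit tag
response = ollama.chat(
    model="llama3.2:3b-instruct-q8_0",  # assumed tag name
    messages=[{"role": "user", "content": "Why does Q4_K_M beat Q2_K on perplexity?"}],
)
print(response["message"]["content"])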
Tip
Practice GGUF quantization and Ollama in small, isolated examples before integrating them into larger projects. Breaking concepts into small experiments builds genuine understanding faster than reading alone.
Practice Task
(1) Write a working example of GGUF quantization with Ollama from scratch without looking at your notes. (2) Modify it to handle an edge case (empty input, null value, or error state). (3) Share your solution in the Priygop community for feedback.
Common Mistake
A common mistake when running quantized GGUF models is skipping edge-case testing: empty inputs, null values, and unexpected data types. Always validate boundary conditions to write robust, production-ready AI code.