Optimized LLM Serving — vLLM and TGI
Standard HuggingFace inference is a poor fit for LLM serving: generate() processes one static batch at a time, pre-allocates KV-cache memory for the full maximum length of every request, and has no scheduler for handling many concurrent requests. vLLM and TGI (Text Generation Inference) solve this with PagedAttention, continuous batching, and speculative decoding.
vLLM for High-Throughput LLM Serving
# ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
# vLLM -- high-throughput LLM serving
# pip install vllm
# ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
from vllm import LLM, SamplingParams
# THREE KEY INNOVATIONS OF vLLM:
vllm_innovations = {
    "PagedAttention": '''KV cache fragmentation is the #1 LLM serving memory issue.
    HuggingFace: allocates a max_length KV cache upfront per request (wasteful).
    vLLM: manages the KV cache like OS virtual memory -- allocated on demand in pages.
    Result: 2-3x more concurrent requests fit in GPU memory.''',

    "Continuous Batching": '''Static batching: wait for a batch to fill, process it together.
    Problem: short sequences finish early but long ones continue -- GPU slots sit idle.
    Continuous batching: as soon as one sequence finishes, add a new one to the batch.
    Result: GPU utilization of 80-95% (vs 30-50% with static batching).''',

    "Speculative Decoding": '''Speed up generation by drafting multiple tokens with a small model,
    then verifying them with the large model in one forward pass.
    If the speculation is correct: multiple tokens accepted per step.
    Net effect: 2-3x speedup for outputs with predictable patterns (code, tables).''',
}

for innovation, explanation in vllm_innovations.items():
    print(f"{innovation}:")
    print(f"  {explanation[:100]}...")
# OFFLINE INFERENCE (batch jobs)
llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    tensor_parallel_size=2,       # split the model across 2 GPUs
    max_model_len=8192,
    gpu_memory_utilization=0.9,   # let vLLM use up to 90% of GPU memory (weights + KV cache)
    dtype="bfloat16",
    # quantization="awq",         # or "gptq" -- requires a matching pre-quantized checkpoint
)
sampling = SamplingParams(
    temperature=0.7,
    top_p=0.9,
    max_tokens=512,
    stop=["<|eot_id|>"],  # LLaMA 3 end-of-turn token
)
# vLLM handles batching automatically -- submit all prompts at once and it schedules them
prompts = [f"Question {i}: What is AI?" for i in range(100)]
outputs = llm.generate(prompts, sampling)

for output in outputs[:3]:
    print(f"  Prompt:   {output.prompt[:50]}...")
    print(f"  Response: {output.outputs[0].text[:100]}...")
    print()
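# Speculative decoding (sketch, left commented out): pair the target model with a small
# draft model. NOTE: the exact arguments here are an assumption and differ across vLLM
# releases (older versions used speculative_model / num_speculative_tokens kwargs, newer
# ones a speculative_config dict) -- check the docs for your installed version.
# spec_llm = LLM(
#     model="meta-llama/Meta-Llama-3-8B-Instruct",
#     speculative_config={
#         "model": "meta-llama/Llama-3.2-1B-Instruct",   # hypothetical draft model
#         "num_speculative_tokens": 5,                    # tokens drafted per step
#     },
# )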
# ONLINE SERVING (API server)
# vllm serve meta-llama/Meta-Llama-3-8B-Instruct --tensor-parallel-size 2 --port 8000
# This creates an OpenAI-compatible API:
from openai import OpenAI
vllm_client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
response = vllm_client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    messages=[{"role": "user", "content": "Explain transformer attention"}],
    temperature=0.7,
    max_tokens=300,
)
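# Streaming works through the same OpenAI-compatible endpoint (this assumes the
# "vllm serve" process above is already running on localhost:8000).
stream = vllm_client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    messages=[{"role": "user", "content": "Explain transformer attention"}],
    temperature=0.7,
    max_tokens=300,
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
print()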
# Throughput comparison:
print("\nvLLM vs HuggingFace Throughput Comparison (A100, LLaMA-3-8B):")
serving_comparison = {
    "HuggingFace naive": "50-100 tokens/sec",
    "HuggingFace + AMP": "150-200 tokens/sec",
    "vLLM basic": "800-1200 tokens/sec",
    "vLLM + AWQ INT4": "2000-3000 tokens/sec",
}
for system, throughput in serving_comparison.items():
    print(f"  {system:30s}: {throughput}")
Tip
Practice optimized LLM serving with vLLM and TGI in small, isolated examples before integrating it into larger projects. Breaking the concepts into small experiments builds genuine understanding faster than reading alone.
Practice Task
(1) Write a working vLLM serving example from scratch without looking at notes. (2) Modify it to handle an edge case (an empty prompt list, an over-long prompt, or a server error). (3) Share your solution in the Priygop community for feedback.
Common Mistake
A common mistake with optimized LLM serving is skipping edge-case testing: empty prompts, over-long inputs, and malformed requests. Always validate boundary conditions to write robust, production-ready AI code.