Mini Project: Production ML API with Monitoring
Build a complete production-grade ML API: a sentiment classifier served with FastAPI, with ONNX optimization, Redis caching, Prometheus monitoring, a Grafana dashboard, automated drift detection, and Docker Compose deployment: an end-to-end production system.
Complete Production ML System
# docker-compose.yml -- complete production stack
compose_config = '''
version: '3.8'
services:
  ai-api:
    build: .
    ports:
      - "8000:8000"
    environment:
      - REDIS_URL=redis://redis:6379
      - MLFLOW_TRACKING_URI=http://mlflow:5000
    depends_on:
      - redis
      - prometheus
    deploy:  # optional GPU reservation; the API below runs ONNX on CPU, so this block can be removed
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
      interval: 30s
      start_period: 60s
  redis:
    image: redis:7-alpine
    ports: ["6379:6379"]
    command: redis-server --maxmemory 2gb --maxmemory-policy allkeys-lru
  prometheus:
    image: prom/prometheus:latest
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
    ports: ["9090:9090"]
  grafana:
    image: grafana/grafana:latest
    ports: ["3000:3000"]
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin
    volumes:
      - ./grafana/dashboards:/etc/grafana/provisioning/dashboards
  mlflow:
    image: ghcr.io/mlflow/mlflow:latest
    ports: ["5000:5000"]
    command: mlflow server --host 0.0.0.0 --port 5000
  drift-detector:
    build:
      context: .
      dockerfile: Dockerfile.monitoring
    environment:
      - API_URL=http://ai-api:8000
      - REDIS_URL=redis://redis:6379
    command: python drift_detector.py --interval 300  # check every 5 minutes
'''
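The compose file references two files that are not shown above: the Dockerfile behind build: . and the prometheus.yml mounted into the Prometheus container. Minimal sketches of both follow; the main.py module name and the requirements.txt file are assumptions about the project layout, not something the compose file specifies.
# Dockerfile -- minimal sketch (assumes the API code lives in main.py)
dockerfile = '''
FROM python:3.11-slim
# curl is required by the compose healthcheck above
RUN apt-get update && apt-get install -y --no-install-recommends curl && rm -rf /var/lib/apt/lists/*
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]
'''
# prometheus.yml -- scrapes the /metrics endpoint exposed by the instrumentator
prometheus_config = '''
global:
  scrape_interval: 15s
scrape_configs:
  - job_name: ai-api
    metrics_path: /metrics
    static_configs:
      - targets: ["ai-api:8000"]
'''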
# Production API with Redis caching
import hashlib
import json
import os

import onnxruntime as ort
import redis
import torch
from fastapi import FastAPI, Request
from prometheus_fastapi_instrumentator import Instrumentator
from transformers import AutoTokenizer

app = FastAPI()
Instrumentator().instrument(app).expose(app)  # exposes /metrics for Prometheus

# Resolve Redis from the environment so the same code runs locally and under Compose
cache = redis.Redis.from_url(os.environ.get("REDIS_URL", "redis://localhost:6379"), decode_responses=True)

# Load the quantized ONNX model (exported in the previous topic)
session = ort.InferenceSession("sentiment_model_int8.onnx", providers=["CPUExecutionProvider"])
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english")

def get_cache_key(text: str) -> str:
    return f"sentiment:{hashlib.sha256(text.encode()).hexdigest()[:16]}"

@app.get("/health")
async def health() -> dict:
    # Target of the Docker Compose healthcheck above
    return {"status": "ok"}

@app.post("/predict")
async def predict(request: Request) -> dict:
    body = await request.json()
    texts = body.get("texts", [])
    results = []
    cache_hits = 0
    for text in texts:
        cache_key = get_cache_key(text)
        cached = cache.get(cache_key)
        if cached:
            results.append(json.loads(cached))
            cache_hits += 1
            continue
        # ONNX inference
        enc = tokenizer(text, return_tensors="np", truncation=True, max_length=128, padding="max_length")
        logits = session.run(None, {"input_ids": enc["input_ids"], "attention_mask": enc["attention_mask"]})[0]
        probs = torch.softmax(torch.tensor(logits), dim=-1)[0].tolist()
        result = {"label": "positive" if probs[1] > probs[0] else "negative", "score": round(max(probs), 4)}
        results.append(result)
        cache.setex(cache_key, 3600, json.dumps(result))  # cache for 1 hour
    return {
        "results": results,
        "cache_hit_rate": cache_hits / max(len(texts), 1),
        "model": "distilbert-int8-onnx",
    }
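The drift-detector service runs drift_detector.py, which this section does not show. Below is a minimal sketch of one possible implementation: it treats the scores currently cached in Redis (each prediction above is cached for an hour, so they approximate recent traffic) as a sample and compares it to a frozen baseline with a two-sample Kolmogorov-Smirnov test from SciPy. The 30-sample minimum and the 0.05 p-value threshold are illustrative assumptions.
# drift_detector.py -- minimal sketch of score-distribution drift detection
import argparse
import json
import os
import time

import redis
from scipy.stats import ks_2samp

cache = redis.Redis.from_url(os.environ.get("REDIS_URL", "redis://localhost:6379"), decode_responses=True)

def sample_scores(limit: int = 1000) -> list[float]:
    # Cached predictions expire after an hour, so they reflect recent traffic
    scores = []
    for key in cache.scan_iter(match="sentiment:*", count=100):
        value = cache.get(key)
        if value:
            scores.append(json.loads(value)["score"])
        if len(scores) >= limit:
            break
    return scores

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--interval", type=int, default=300)
    args = parser.parse_args()
    baseline: list[float] = []
    while True:
        current = sample_scores()
        if len(baseline) < 30:
            baseline = current  # keep refreshing until the baseline is usable
        elif len(current) >= 30:
            stat, p_value = ks_2samp(baseline, current)
            status = "DRIFT WARNING" if p_value < 0.05 else "ok"
            print(f"{status}: KS={stat:.3f}, p={p_value:.4f}, n={len(current)}", flush=True)
        time.sleep(args.interval)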
# Start: docker compose up --build
# Access Grafana: http://localhost:3000 (admin/admin)
# View Prometheus metrics: http://localhost:9090
# MLflow experiments: http://localhost:5000
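Once the stack is up, a quick smoke test from Python confirms the whole path works (the example texts are arbitrary):
# Smoke test against the running stack
import requests

resp = requests.post(
    "http://localhost:8000/predict",
    json={"texts": ["This movie was fantastic!", "Worst purchase I have ever made."]},
    timeout=30,
)
resp.raise_for_status()
print(resp.json())  # {"results": [...], "cache_hit_rate": ..., "model": "distilbert-int8-onnx"}
# Run it twice: the second call should report a cache_hit_rate of 1.0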
Tip
Practice each piece of this project (caching, ONNX inference, metrics, drift checks) in small, isolated examples before integrating them into the full stack. Breaking concepts into small experiments builds genuine understanding faster than reading alone.
Practice Task
Note
(1) Rebuild this production ML API from scratch without looking at your notes. (2) Modify it to handle an edge case (an empty input, a null value, or an error state). (3) Share your solution in the Priygop community for feedback.
Common Mistake
Warning
A common mistake in this project is skipping edge-case testing: empty inputs, null values, and unexpected data types. Always validate boundary conditions to write robust, production-ready AI code.
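As a concrete example of that validation, here is a minimal sketch of a stricter /predict request model using a Pydantic v2 model instead of raw request parsing; the field limits and names are illustrative assumptions, not part of the API above.
# Sketch: input validation for /predict with Pydantic v2 (names illustrative)
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel, Field

class PredictRequest(BaseModel):
    texts: list[str] = Field(min_length=1, max_length=64)  # reject empty or oversized batches

app = FastAPI()

@app.post("/predict")
async def predict(req: PredictRequest) -> dict:
    cleaned = [t.strip() for t in req.texts if t.strip()]
    if not cleaned:
        # 422 mirrors FastAPI's own validation errors
        raise HTTPException(status_code=422, detail="all texts were empty")
    ...  # run tokenization, caching, and ONNX inference as in the main API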