Mini Project: Production ML API with Monitoring
Build a complete production-grade ML API: a sentiment classifier served with FastAPI, with ONNX optimization, Redis caching, Prometheus monitoring, a Grafana dashboard, automated drift detection, and Docker Compose deployment: an end-to-end production system.
Complete Production ML System
# docker-compose.yml -- complete production stack
compose_config = '''
version: '3.8'
services:
  ai-api:
    build: .
    ports:
      - "8000:8000"
    environment:
      - REDIS_URL=redis://redis:6379
      - MLFLOW_TRACKING_URI=http://mlflow:5000
    depends_on:
      - redis
      - prometheus
    deploy:  # optional GPU reservation; the API below runs ONNX on CPU, so this block can be removed
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
      interval: 30s
      start_period: 60s
  redis:
    image: redis:7-alpine
    ports: ["6379:6379"]
    command: redis-server --maxmemory 2gb --maxmemory-policy allkeys-lru
  prometheus:
    image: prom/prometheus:latest
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
    ports: ["9090:9090"]
  grafana:
    image: grafana/grafana:latest
    ports: ["3000:3000"]
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin
    volumes:
      - ./grafana/dashboards:/etc/grafana/provisioning/dashboards
  mlflow:
    image: ghcr.io/mlflow/mlflow:latest
    ports: ["5000:5000"]
    command: mlflow server --host 0.0.0.0 --port 5000
  drift-detector:
    build:
      context: .
      dockerfile: Dockerfile.monitoring
    environment:
      - API_URL=http://ai-api:8000
      - REDIS_URL=redis://redis:6379
    command: python drift_detector.py --interval 300  # check every 5 minutes
'''
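The compose file references two files that are not shown above: the Dockerfile behind build: . and the prometheus.yml mounted into the Prometheus container. Minimal sketches of both follow; the main.py module name and the requirements.txt file are assumptions about the project layout, not something the compose file specifies.
# Dockerfile -- minimal sketch (assumes the API code lives in main.py)
dockerfile = '''
FROM python:3.11-slim
# curl is required by the compose healthcheck above
RUN apt-get update && apt-get install -y --no-install-recommends curl && rm -rf /var/lib/apt/lists/*
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]
'''
# prometheus.yml -- scrapes the /metrics endpoint exposed by the instrumentator
prometheus_config = '''
global:
  scrape_interval: 15s
scrape_configs:
  - job_name: ai-api
    metrics_path: /metrics
    static_configs:
      - targets: ["ai-api:8000"]
'''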
# Production API with Redis caching
import hashlib
import json
import os

import onnxruntime as ort
import redis
import torch
from fastapi import FastAPI, Request
from prometheus_fastapi_instrumentator import Instrumentator
from transformers import AutoTokenizer

app = FastAPI()
Instrumentator().instrument(app).expose(app)  # exposes /metrics for Prometheus

# Resolve Redis from the environment so the same code runs locally and under Compose
cache = redis.Redis.from_url(os.environ.get("REDIS_URL", "redis://localhost:6379"), decode_responses=True)

# Load the quantized ONNX model (exported in the previous topic)
session = ort.InferenceSession("sentiment_model_int8.onnx", providers=["CPUExecutionProvider"])
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english")

def get_cache_key(text: str) -> str:
    return f"sentiment:{hashlib.sha256(text.encode()).hexdigest()[:16]}"

@app.get("/health")
async def health() -> dict:
    # Target of the Docker Compose healthcheck above
    return {"status": "ok"}

@app.post("/predict")
async def predict(request: Request) -> dict:
    body = await request.json()
    texts = body.get("texts", [])
    results = []
    cache_hits = 0
    for text in texts:
        cache_key = get_cache_key(text)
        cached = cache.get(cache_key)
        if cached:
            results.append(json.loads(cached))
            cache_hits += 1
            continue
        # ONNX inference
        enc = tokenizer(text, return_tensors="np", truncation=True, max_length=128, padding="max_length")
        logits = session.run(None, {"input_ids": enc["input_ids"], "attention_mask": enc["attention_mask"]})[0]
        probs = torch.softmax(torch.tensor(logits), dim=-1)[0].tolist()
        result = {"label": "positive" if probs[1] > probs[0] else "negative", "score": round(max(probs), 4)}
        results.append(result)
        cache.setex(cache_key, 3600, json.dumps(result))  # cache for 1 hour
    return {
        "results": results,
        "cache_hit_rate": cache_hits / max(len(texts), 1),
        "model": "distilbert-int8-onnx",
    }
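The drift-detector service runs drift_detector.py, which this section does not show. Below is a minimal sketch of one possible implementation: it treats the scores currently cached in Redis (each prediction above is cached for an hour, so they approximate recent traffic) as a sample and compares it to a frozen baseline with a two-sample Kolmogorov-Smirnov test from SciPy. The 30-sample minimum and the 0.05 p-value threshold are illustrative assumptions.
# drift_detector.py -- minimal sketch of score-distribution drift detection
import argparse
import json
import os
import time

import redis
from scipy.stats import ks_2samp

cache = redis.Redis.from_url(os.environ.get("REDIS_URL", "redis://localhost:6379"), decode_responses=True)

def sample_scores(limit: int = 1000) -> list[float]:
    # Cached predictions expire after an hour, so they reflect recent traffic
    scores = []
    for key in cache.scan_iter(match="sentiment:*", count=100):
        value = cache.get(key)
        if value:
            scores.append(json.loads(value)["score"])
        if len(scores) >= limit:
            break
    return scores

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--interval", type=int, default=300)
    args = parser.parse_args()
    baseline: list[float] = []
    while True:
        current = sample_scores()
        if len(baseline) < 30:
            baseline = current  # keep refreshing until the baseline is usable
        elif len(current) >= 30:
            stat, p_value = ks_2samp(baseline, current)
            status = "DRIFT WARNING" if p_value < 0.05 else "ok"
            print(f"{status}: KS={stat:.3f}, p={p_value:.4f}, n={len(current)}", flush=True)
        time.sleep(args.interval)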
# Start: docker compose up --build
# Access Grafana: http://localhost:3000 (admin/admin)
# View Prometheus metrics: http://localhost:9090
# MLflow experiments: http://localhost:5000
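Once the stack is up, a quick smoke test from Python confirms the whole path works (the example texts are arbitrary):
# Smoke test against the running stack
import requests

resp = requests.post(
    "http://localhost:8000/predict",
    json={"texts": ["This movie was fantastic!", "Worst purchase I have ever made."]},
    timeout=30,
)
resp.raise_for_status()
print(resp.json())  # {"results": [...], "cache_hit_rate": ..., "model": "distilbert-int8-onnx"}
# Run it twice: the second call should report a cache_hit_rate of 1.0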
Tip
Practice each piece of this project (caching, ONNX inference, metrics, drift checks) in small, isolated examples before integrating them into the full stack. Breaking concepts into small experiments builds genuine understanding faster than reading alone.
Practice Task
Note
(1) Rebuild this production ML API from scratch without looking at your notes. (2) Modify it to handle an edge case (an empty input, a null value, or an error state). (3) Share your solution in the Priygop community for feedback.
Common Mistake
Warning
A common mistake in this project is skipping edge-case testing: empty inputs, null values, and unexpected data types. Always validate boundary conditions to write robust, production-ready AI code.
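As a concrete example of that validation, here is a minimal sketch of a stricter /predict request model using a Pydantic v2 model instead of raw request parsing; the field limits and names are illustrative assumptions, not part of the API above.
# Sketch: input validation for /predict with Pydantic v2 (names illustrative)
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel, Field

class PredictRequest(BaseModel):
    texts: list[str] = Field(min_length=1, max_length=64)  # reject empty or oversized batches

app = FastAPI()

@app.post("/predict")
async def predict(req: PredictRequest) -> dict:
    cleaned = [t.strip() for t in req.texts if t.strip()]
    if not cleaned:
        # 422 mirrors FastAPI's own validation errors
        raise HTTPException(status_code=422, detail="all texts were empty")
    ...  # run tokenization, caching, and ONNX inference as in the main API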