Observability & Monitoring
Build comprehensive observability into your systems using the three pillars — metrics, logs, and traces — to understand system behavior and debug issues quickly.
55 min • By Priygop Team • Last updated: Feb 2026
The Three Pillars of Observability
- Metrics: Numerical measurements over time — CPU usage, request rate, error count, latency percentiles. Tools: Prometheus, Datadog, CloudWatch. Best for alerting and dashboards
- Logs: Timestamped text records of discrete events — 'User 123 failed login at 14:32:05'. Tools: ELK Stack, Loki, Splunk. Best for debugging specific issues
- Traces: End-to-end request paths across services — show which microservice took how long and where errors occurred. Tools: Jaeger, Zipkin, Tempo, Datadog APM
- Together: Metrics tell you SOMETHING is wrong (alert), logs tell you WHAT went wrong (debug), traces tell you WHERE in the request path it went wrong (pinpoint)
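- OpenTelemetry: Vendor-neutral standard for instrumentation. One set of APIs and SDKs produces metrics, logs, and traces, which can be exported to any compatible backend (commonly over the OTLP protocol). Widely supported by both commercial vendors and open-source tools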
- Structured Logging: Log in JSON format with consistent fields (timestamp, service, trace_id, user_id) — enables powerful querying and correlation across services
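As an illustration of the structured-logging point above, here is a minimal sketch that uses only Python's standard `logging` and `json` modules. The class name `JsonFormatter`, the helper `build_logger`, the service name `auth-service`, and the sample field values are all hypothetical; real projects often use a dedicated library or the OpenTelemetry SDK instead.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    # Assumed context fields, taken from the examples named in the text.
    CONTEXT_FIELDS = ("trace_id", "user_id")

    def __init__(self, service: str) -> None:
        super().__init__()
        self.service = service

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": datetime.fromtimestamp(
                record.created, tz=timezone.utc
            ).isoformat(),
            "level": record.levelname,
            "service": self.service,
            "message": record.getMessage(),
        }
        # Values passed via `extra=` become attributes on the record.
        for field in self.CONTEXT_FIELDS:
            if hasattr(record, field):
                payload[field] = getattr(record, field)
        return json.dumps(payload)


def build_logger(service: str) -> logging.Logger:
    """Return a logger that writes JSON lines to stderr (hypothetical helper)."""
    logger = logging.getLogger(service)
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter(service))
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


if __name__ == "__main__":
    log = build_logger("auth-service")
    # The same trace_id on every service's logs is what lets them be
    # correlated with each other and with the matching trace.
    log.warning("login failed", extra={"trace_id": "abc123", "user_id": 123})
```

Because every record carries the same keys, a log backend can filter on `trace_id` to pull together all events for one request across services.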
Alerting Best Practices
- Alert on symptoms, not causes: Alert on 'error rate > 1%' not 'CPU > 90%'. Users care about errors, not CPU. High CPU might be normal during peak traffic
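- Use multi-window burn rates: Alert when the error budget is being consumed abnormally fast. Burn rate is the observed error rate divided by the error rate the SLO allows, so a burn rate of 1 uses up the budget exactly at the end of the SLO period. A 1-hour window catches fast burns; a 6-hour window catches slow burns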
- Page only for urgent, user-impacting issues: Everything else goes to a ticket queue. Alert fatigue (too many non-actionable alerts) wears down on-call morale and makes real incidents easier to miss
- Include runbooks: Every alert should link to a runbook explaining what the alert means, potential causes, and step-by-step investigation/mitigation procedures
- Severity levels: P0 (customer-impacting, page immediately), P1 (degraded, page during business hours), P2 (concerning, ticket), P3 (informational, no action needed)
- Regularly prune alerts: Review alert history monthly and delete or retune alerts that never fire or always fire. A healthy alert set has a 90%+ actionable rate, meaning at least 9 in 10 firings lead to someone taking action
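The burn-rate idea from the list above can be sketched in a few lines. This is an illustrative calculation, not a production alerting rule: the `Window` type, the function names, the 99.9% SLO, the request counts, and the thresholds (14.4 for the 1-hour window, 6 for the 6-hour window, values commonly quoted for a 30-day SLO period) are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Window:
    name: str
    hours: float
    burn_rate_threshold: float  # assumed threshold, tune for your SLO period


def burn_rate(errors: int, requests: int, slo_target: float) -> float:
    """Observed error rate divided by the error rate the SLO allows.

    A value of 1.0 means the error budget runs out exactly at the end of
    the SLO period; higher values mean it runs out sooner.
    """
    if requests == 0:
        return 0.0
    error_budget = 1.0 - slo_target
    return (errors / requests) / error_budget


def firing_windows(counts, windows, slo_target):
    """Return the names of windows whose burn rate meets their threshold.

    `counts` maps a window name to an (errors, requests) pair for that window.
    """
    fired = []
    for window in windows:
        errors, requests = counts[window.name]
        if burn_rate(errors, requests, slo_target) >= window.burn_rate_threshold:
            fired.append(window.name)
    return fired


if __name__ == "__main__":
    windows = [
        Window("1h", hours=1, burn_rate_threshold=14.4),  # fast burn
        Window("6h", hours=6, burn_rate_threshold=6.0),   # slow burn
    ]
    # Hypothetical traffic: a sharp error spike in the last hour only.
    counts = {"1h": (200, 10_000), "6h": (300, 60_000)}
    print(firing_windows(counts, windows, slo_target=0.999))
```

With these sample numbers the 1-hour window sees a 2% error rate against a 0.1% budget (burn rate of about 20) and fires, while the 6-hour window sees 0.5% (burn rate of about 5) and stays quiet. In practice the counts would come from a metrics backend rather than a hard-coded dictionary.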