Master these 31 carefully curated interview questions to ace your next Cloud DevOps interview.
DevOps is a culture and set of practices that unifies development and operations, emphasizing automation, CI/CD, and collaboration.
Core principles: (1) Culture: shared responsibility, no silos. (2) Automation: CI/CD pipelines, IaC, automated testing. (3) Measurement: metrics, monitoring, feedback loops. (4) Sharing: knowledge sharing, blameless postmortems. Practices: continuous integration, continuous delivery, infrastructure as code, monitoring, microservices. Tools: Git, Jenkins/GitHub Actions, Docker, Kubernetes, Terraform, Prometheus. DORA metrics: deployment frequency, lead time, change failure rate, MTTR.
CI (Continuous Integration) auto-builds and tests code on every commit; CD keeps every change release-ready — Continuous Delivery with a manual release gate, Continuous Deployment shipping passing changes to production automatically.
CI: developers merge frequently → automated build → automated tests → fast feedback. CD (Delivery): code is always deployable, manual release approval. CD (Deployment): fully automated deployment to production. Pipeline: commit → build → unit tests → integration tests → staging → production. Tools: GitHub Actions, GitLab CI, Jenkins, CircleCI. Best practices: small commits, fast tests, trunk-based development, feature flags.
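The pipeline above can be sketched as a minimal GitHub Actions workflow. This is a hedged example assuming a hypothetical Node.js project with `test` and `build` npm scripts:

```yaml
# .github/workflows/ci.yml — minimal CI sketch (hypothetical project)
name: ci
on: [push, pull_request]
jobs:
  build-and-test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
      - run: npm ci        # install exact locked dependencies
      - run: npm test      # fast unit tests run on every push
      - run: npm run build # fail the pipeline if the build breaks
```

Every push and pull request triggers the same build-and-test job, giving the fast feedback loop CI depends on.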
Docker packages applications with their dependencies into containers — lightweight, portable, isolated units that run consistently anywhere.
Concepts: Image (blueprint), Container (running instance), Dockerfile (build instructions), Registry (Docker Hub). Benefits: consistent environments, fast startup, resource efficient (share OS kernel), portable. Dockerfile: FROM, RUN, COPY, CMD. docker build/run/push commands. Docker Compose: multi-container apps (docker-compose.yml). Best practices: small base images (Alpine), multi-stage builds, non-root user, .dockerignore.
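The Dockerfile keywords and best practices above (Alpine base, multi-stage build, non-root user) can be combined in one sketch — the service path and image names here are hypothetical:

```dockerfile
# Stage 1: compile in a full toolchain image
FROM golang:1.22-alpine AS build
WORKDIR /src
COPY go.mod go.sum ./
RUN go mod download          # cached layer: re-runs only when deps change
COPY . .
RUN CGO_ENABLED=0 go build -o /app ./cmd/server

# Stage 2: ship only the binary in a small runtime image
FROM alpine:3.19
RUN adduser -D app           # create and use a non-root user
USER app
COPY --from=build /app /app
CMD ["/app"]
```

Build and run with `docker build -t myapp .` then `docker run -p 8080:8080 myapp`; the final image contains the binary but none of the build toolchain.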
Cloud computing delivers IT resources on-demand. Models: IaaS (infrastructure), PaaS (platform), SaaS (software).
IaaS: raw compute, storage, networking. You manage: OS, runtime, apps. Examples: AWS EC2, Azure VMs, GCP Compute Engine. PaaS: managed platform, focus on code. You manage: apps and data. Examples: Heroku, AWS Elastic Beanstalk, Google App Engine. SaaS: complete software. You manage: nothing. Examples: Gmail, Salesforce, Slack. Also: FaaS/Serverless (AWS Lambda), CaaS (ECS, GKE). Deployment: public, private, hybrid, multi-cloud.
Kubernetes is an open-source container orchestration platform that automates deployment, scaling, and management of containerized applications.
Core concepts: Pod (smallest unit, one or more containers), Deployment (desired state), Service (networking/load balancing), Ingress (external access), ConfigMap/Secret (configuration), PersistentVolume (storage). Control plane: API server, scheduler, controller manager, etcd. Features: auto-scaling (HPA), self-healing, rolling updates, service discovery. kubectl: CLI tool. Managed: EKS (AWS), GKE (Google), AKS (Azure). Helm: package manager for K8s.
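A Deployment plus a Service — the two most common objects above — look like this in manifest form (image name is hypothetical):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
spec:
  replicas: 3                  # desired state: 3 pods
  selector:
    matchLabels: { app: web }
  template:
    metadata:
      labels: { app: web }
    spec:
      containers:
        - name: web
          image: example.com/web:1.0   # hypothetical image
          ports:
            - containerPort: 8080
---
apiVersion: v1
kind: Service
metadata:
  name: web
spec:
  selector: { app: web }       # load-balances across the matching pods
  ports:
    - port: 80
      targetPort: 8080
```

Apply with `kubectl apply -f web.yaml`; Kubernetes continuously reconciles reality toward the declared 3 replicas, restarting or rescheduling pods as needed.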
IaC manages infrastructure through code files instead of manual processes, enabling version control, automation, and reproducibility.
Tools: Terraform (multi-cloud, declarative), CloudFormation (AWS), Pulumi (general-purpose languages), Ansible (configuration management). Benefits: reproducible environments, version controlled, code review for infra changes, automated provisioning. Terraform workflow: write (.tf files) → plan (preview changes) → apply (create resources). State management: track what's deployed. Modules: reusable infrastructure components. GitOps: Git as single source of truth for infrastructure.
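The write → plan → apply workflow operates on files like this minimal Terraform sketch (bucket name and region are assumptions — S3 bucket names must be globally unique):

```hcl
terraform {
  required_providers {
    aws = { source = "hashicorp/aws", version = "~> 5.0" }
  }
}

provider "aws" {
  region = "us-east-1"
}

# Declarative: describe the bucket; terraform plan/apply makes it exist
resource "aws_s3_bucket" "artifacts" {
  bucket = "example-artifacts-bucket"   # hypothetical, globally unique
}
```

`terraform plan` previews the create/change/destroy actions, and `terraform apply` executes them and records the result in state.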
Monitoring tracks predefined metrics; observability enables understanding system behavior through logs, metrics, and traces.
Three pillars: (1) Metrics: numerical measurements (CPU, memory, request rate). Tools: Prometheus, Grafana, CloudWatch. (2) Logs: event records. Tools: ELK Stack (Elasticsearch, Logstash, Kibana), Loki. (3) Traces: request journey across services. Tools: Jaeger, Zipkin, OpenTelemetry. Alerting: PagerDuty, OpsGenie. SLIs (indicators), SLOs (objectives), SLAs (agreements). Key metrics: latency, error rate, throughput, saturation (RED/USE methods).
A service mesh manages service-to-service communication with features like load balancing, encryption, and observability without code changes.
Implementation: sidecar proxy (Envoy) deployed alongside each service. Features: (1) mTLS encryption between services. (2) Load balancing and retry logic. (3) Circuit breaking. (4) Traffic management (canary, blue-green). (5) Observability (distributed tracing). Tools: Istio, Linkerd, Consul Connect. Control plane manages configuration, data plane handles traffic. Use when: many microservices, complex traffic patterns, security requirements. Adds complexity and resource overhead.
Roll back immediately, communicate status, investigate root cause, implement fixes, and conduct blameless postmortem.
Response: (1) Detect: alerting triggers. (2) Assess severity. (3) Roll back: revert to last known good version. (4) Communicate: status page, stakeholder updates. (5) Investigate: check logs, metrics, recent changes. (6) Fix: apply targeted fix if rollback insufficient. (7) Verify: confirm system healthy. (8) Postmortem: timeline, root cause, contributing factors, action items. Prevention: canary deployments, feature flags, automated rollbacks, chaos engineering, pre-production testing.
Netflix uses Spinnaker for continuous delivery, canary deployments, chaos engineering, and immutable infrastructure on AWS.
Netflix's practices: (1) Spinnaker: open-source CD platform for multi-cloud. (2) Canary deployments: deploy to small percentage, compare metrics. (3) Chaos engineering: Chaos Monkey randomly kills instances. (4) Immutable infrastructure: never patch, always replace. (5) Microservices: hundreds of services deployed independently. (6) Full AWS deployment. (7) Custom tools: Zuul (routing), Eureka (service discovery), Hystrix (circuit breaker). Key lesson: build for failure, automate everything.
IaaS provides infrastructure (VMs, storage); PaaS provides platforms (managed databases, app hosting); SaaS provides applications.
IaaS: you manage OS, middleware, runtime, app, data. Provider manages networking, storage, servers, virtualization. Examples: AWS EC2, Azure VMs, GCP Compute Engine. PaaS: you manage app and data only. Provider manages everything below. Examples: Heroku, AWS Elastic Beanstalk, Azure App Service, Google App Engine. SaaS: provider manages everything. Examples: Gmail, Salesforce, Slack, Office 365. Serverless/FaaS: event-driven, pay-per-execution (AWS Lambda, Azure Functions). Container-as-a-Service: AWS ECS/Fargate, Azure Container Apps. Choose based on control needs vs management overhead.
Docker packages applications with dependencies into containers — lightweight, portable units that run consistently across environments.
Container: isolated process with own filesystem (image layers), networking, and resource limits. Not a VM — shares host kernel. Dockerfile: FROM, RUN, COPY, CMD — build recipe. docker build creates image, docker run creates container. Image layers: cached, shared between images — storage efficient. Docker Compose: multi-container apps (app + database + redis). Registry: Docker Hub, ECR, GHCR store images. Best practices: small base images (alpine), multi-stage builds, non-root user, .dockerignore, layer caching order (dependencies before code). Podman: daemonless alternative.
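The "app + database + redis" Compose setup mentioned above can be sketched like this (service names and credentials are hypothetical):

```yaml
# docker-compose.yml — multi-container dev stack; start with: docker compose up
services:
  app:
    build: .                   # built from the local Dockerfile
    ports:
      - "8000:8000"
    environment:
      DATABASE_URL: postgres://app:secret@db:5432/app   # hypothetical creds
    depends_on: [db, redis]
  db:
    image: postgres:16-alpine
    environment:
      POSTGRES_USER: app
      POSTGRES_PASSWORD: secret
  redis:
    image: redis:7-alpine
```

Each service gets its own container on a shared network, and services reach each other by name (`db`, `redis`).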
CI (Continuous Integration) automates testing on every commit; CD (Continuous Delivery/Deployment) automates releases to production.
CI: developers merge frequently (daily+), automated build and tests run on every push. Catches integration issues early. CD (Delivery): every change is deployable after automated pipeline — manual release decision. CD (Deployment): every change that passes pipeline deploys automatically to production. Pipeline stages: lint → build → unit tests → integration tests → security scan → deploy to staging → deploy to production. Tools: GitHub Actions, GitLab CI, Jenkins, CircleCI, Azure DevOps. Benefits: faster release cycles, reduced risk, higher quality, quick feedback.
Kubernetes orchestrates containerized applications — handling deployment, scaling, networking, and self-healing across clusters.
Architecture: Control Plane (API server, scheduler, controller manager, etcd) + Worker Nodes (kubelet, kube-proxy, container runtime). Core objects: Pod (smallest deployable unit, one or more containers), Deployment (manages replicas, rolling updates), Service (networking, load balancing), Ingress (HTTP routing), ConfigMap/Secret (configuration). Scaling: HPA (Horizontal Pod Autoscaler) based on CPU/custom metrics. Self-healing: restarts failed containers, replaces unresponsive pods. Ecosystem: Helm (package manager), Istio (service mesh), ArgoCD (GitOps). Managed: EKS, AKS, GKE.
IaC manages infrastructure through code files instead of manual configuration, enabling version control, automation, and consistency.
Declarative: describe desired state, tool ensures reality matches. Terraform (HCL language, multi-cloud), Pulumi (real programming languages), CloudFormation (AWS). Imperative: script specific steps. Ansible (YAML playbooks, agentless). Terraform workflow: init → plan → apply. State management: terraform.tfstate tracks current infrastructure. Modules: reusable infrastructure components. Drift detection: plan shows differences between state and reality. Best practices: remote state (S3 + DynamoDB locking), modular design, environment separation (workspaces), code review for infrastructure changes. GitOps: git is source of truth for infrastructure.
Microservices decompose applications into small, independent services that communicate via APIs, each deployable independently.
Principles: single responsibility, independently deployable, own database, lightweight communication (REST, gRPC, events). Benefits: technology diversity, independent scaling, team autonomy, failure isolation. Challenges: distributed system complexity, data consistency, network latency, monitoring/debugging, deployment coordination. Patterns: API Gateway, Service Discovery (Consul, Eureka), Circuit Breaker (Resilience4j), Saga Pattern (distributed transactions), Event Sourcing, CQRS. Communication: synchronous (REST, gRPC) vs asynchronous (message queues — RabbitMQ, Kafka). Start monolith, extract microservices when team/scale demands.
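The Circuit Breaker pattern named above can be sketched in a few lines of Python — a simplified illustration of the idea, not the Resilience4j implementation:

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker sketch: after max_failures consecutive
    failures the circuit opens and calls fail fast; after reset_timeout
    seconds one trial call is allowed through (half-open)."""

    def __init__(self, max_failures=3, reset_timeout=30.0):
        self.max_failures = max_failures
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None  # half-open: allow one trial call
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()  # trip the circuit
            raise
        self.failures = 0  # success resets the failure count
        return result
```

Failing fast while a downstream service is unhealthy prevents one slow dependency from exhausting threads and cascading the failure.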
Monitoring tracks known metrics; observability provides deep system understanding through logs, metrics, and traces (three pillars).
Three pillars: (1) Metrics: numerical measurements over time (CPU, memory, latency, error rate). Prometheus + Grafana. (2) Logs: discrete events with context. ELK Stack (Elasticsearch, Logstash, Kibana) or Loki. (3) Traces: request journey across services. Jaeger, Zipkin, OpenTelemetry. Alerting: PagerDuty, OpsGenie — alert on SLO breaches. SLI/SLO/SLA: Service Level Indicators (metrics), Objectives (targets), Agreements (contracts). RED method: Rate, Errors, Duration. USE method: Utilization, Saturation, Errors. OpenTelemetry: vendor-neutral telemetry collection standard.
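The RED method translates directly into PromQL. These queries assume a typical instrumented HTTP service exposing `http_requests_total` and a duration histogram — the metric names are assumptions, not universal:

```promql
# Rate: requests per second over the last 5 minutes
sum(rate(http_requests_total[5m]))

# Errors: fraction of requests returning 5xx
sum(rate(http_requests_total{status=~"5.."}[5m]))
  / sum(rate(http_requests_total[5m]))

# Duration: 95th-percentile latency from a histogram
histogram_quantile(0.95,
  sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
```

These three queries make natural Grafana panels and SLI definitions for alerting on SLO breaches.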
GitOps uses Git as the single source of truth for infrastructure and application deployments, with automated reconciliation.
Principles: (1) Declarative: describe desired state in Git. (2) Versioned: Git provides audit trail and rollback. (3) Automated: agents pull and apply changes continuously. (4) Software agents: ensure system matches Git state. Tools: ArgoCD (Kubernetes), Flux (Kubernetes). Workflow: developer commits manifest change → PR review → merge to main → ArgoCD detects change → applies to cluster → monitors for drift. Benefits: audit trail, easy rollback (git revert), consistent environments, developer-friendly. Push vs Pull: push-based CI/CD vs pull-based GitOps agents. Progressive delivery: Argo Rollouts for canary/blue-green.
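The pull-based workflow above is configured through an ArgoCD Application manifest — repo URL and paths here are hypothetical:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: web
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/example/infra.git  # hypothetical repo
    targetRevision: main
    path: apps/web               # directory of manifests to sync
  destination:
    server: https://kubernetes.default.svc
    namespace: web
  syncPolicy:
    automated:
      prune: true                # delete resources removed from Git
      selfHeal: true             # revert manual drift back to Git state
```

With `selfHeal` enabled, any manual `kubectl edit` is automatically reverted, enforcing Git as the single source of truth.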
A service mesh manages service-to-service communication with features like traffic management, mTLS, observability, without code changes.
How: sidecar proxy (Envoy) deployed alongside each service intercepts all network traffic. Features: (1) mTLS: automatic encryption between services. (2) Traffic management: canary routing, circuit breaking, retries, timeouts. (3) Observability: distributed tracing, metrics, access logs without code. (4) Policy: rate limiting, access control. Tools: Istio (most popular, complex), Linkerd (lightweight, simpler), Cilium (eBPF-based, no sidecar). When needed: many microservices (50+), security requirements, complex traffic routing. When NOT: few services, simple architecture — adds complexity and resource overhead.
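Canary routing in Istio is declared with weighted routes in a VirtualService. A minimal sketch, assuming `v1`/`v2` subsets are defined in a corresponding DestinationRule:

```yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: web
spec:
  hosts: [web]
  http:
    - route:
        - destination:
            host: web
            subset: v1           # stable version
          weight: 95
        - destination:
            host: web
            subset: v2           # canary version
          weight: 5              # 5% of traffic to the new release
```

Shifting the weights over time (and watching error rates) implements the canary rollout without any application code changes.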
Auto-scaling adjusts compute resources based on metrics — horizontal (more instances) or vertical (bigger instances) — to handle load.
Horizontal (scale out/in): add/remove instances. AWS Auto Scaling Groups, Kubernetes HPA. Based on: CPU, memory, custom metrics (queue depth, request rate). Vertical (scale up/down): change instance size. Kubernetes VPA. Usually requires restart. Predictive: ML-based scaling based on historical patterns (AWS Predictive Scaling). Serverless: automatic with Lambda/Cloud Functions. Kubernetes: (1) HPA for pods. (2) Cluster Autoscaler for nodes. (3) KEDA for event-driven scaling. Cool-down period: prevent flapping. Policies: target tracking (maintain 70% CPU), step scaling (different amounts at different thresholds).
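A Kubernetes HPA implementing the target-tracking policy described above (maintain ~70% CPU) looks like this, assuming a Deployment named `web`:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # add/remove pods to hold average CPU near 70%
```

The controller periodically compares observed CPU to the 70% target and adjusts the replica count between the min and max bounds.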
Detect and acknowledge, assess severity, communicate, mitigate, resolve root cause, communicate resolution, and post-mortem.
Process: (1) Detect: monitoring alert, user report. (2) Acknowledge: incident commander assigned (PagerDuty/OpsGenie). (3) Assess: severity level (SEV1-critical to SEV3-minor), affected systems and users. (4) Communicate: status page update (Statuspage.io), stakeholder notification. (5) Mitigate: immediate fix (rollback, scale up, failover, feature flag disable). (6) Investigate: logs, metrics, traces, recent changes. (7) Resolve: deploy fix. (8) Communicate: all-clear, customer communication. (9) Post-mortem: blameless, timeline, root cause, contributing factors, action items. (10) Follow-up: implement improvements, update runbooks.
Blue-green maintains two identical environments and switches traffic; canary gradually routes a small percentage to the new version.
Blue-green: two environments (blue=current, green=new). Deploy to green, test, switch load balancer/DNS. Instant rollback by switching back. Requires double infrastructure. Canary: deploy new version to small subset (1-5%), monitor metrics, gradually increase (10%, 25%, 50%, 100%). Rollback immediately if errors spike. Kubernetes: Argo Rollouts, Flagger. AWS: CodeDeploy with canary/linear strategies. Feature flags: LaunchDarkly for gradual feature rollout without deployment. A/B testing: compare metrics between versions. Progressive delivery: automated canary with metric analysis (Flagger + Prometheus).
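The staged canary percentages above map directly to Argo Rollouts steps. A sketch with hypothetical image and pause durations:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: web
spec:
  replicas: 5
  selector:
    matchLabels: { app: web }
  template:
    metadata:
      labels: { app: web }
    spec:
      containers:
        - name: web
          image: example.com/web:2.0   # hypothetical new version
  strategy:
    canary:
      steps:
        - setWeight: 5               # 5% of traffic to the new version
        - pause: { duration: 10m }   # watch metrics before proceeding
        - setWeight: 25
        - pause: { duration: 10m }
        - setWeight: 50
        - pause: { duration: 10m }   # then promote to 100%
```

Pairing the pauses with automated metric analysis (e.g. via Prometheus) turns this into progressive delivery with automatic rollback on error spikes.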
EC2/ECS for compute, RDS for database, S3 for storage, CloudFront for CDN, Route53 for DNS, and VPC for networking.
Compute: EC2 (VMs), ECS/Fargate (containers), Lambda (serverless). Database: RDS (managed SQL), DynamoDB (NoSQL), ElastiCache (Redis/Memcached). Storage: S3 (objects), EBS (block), EFS (file). Networking: VPC, ALB (load balancer), Route53 (DNS), CloudFront (CDN). Security: IAM (access), KMS (encryption), WAF (web firewall), Secrets Manager. Messaging: SQS (queue), SNS (pub/sub), EventBridge (events). Monitoring: CloudWatch (metrics/logs), X-Ray (tracing). Cost: Reserved Instances (up to 75% savings), Spot Instances (up to 90%), Savings Plans.
Terraform is a declarative IaC tool using HCL language to provision and manage cloud resources across multiple providers.
Workflow: write .tf files → terraform init (install providers) → plan (show changes) → apply (create/modify resources). State: terraform.tfstate maps config to real resources. Remote state: S3/GCS backend with state locking (DynamoDB). Modules: reusable infrastructure components (VPC module, EKS module). Providers: AWS, Azure, GCP, Kubernetes, GitHub — 3000+ providers. Import: bring existing infrastructure under Terraform management. Workspaces: environment separation (dev/staging/prod). Best practices: small targeted configurations, module registry, policy as code (Sentinel/OPA), PR-based changes.
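Remote state with locking and a reusable module — two of the practices above — can be sketched like this (bucket, table, and CIDR values are assumptions):

```hcl
terraform {
  backend "s3" {
    bucket         = "example-terraform-state"   # hypothetical bucket
    key            = "prod/network/terraform.tfstate"
    region         = "us-east-1"
    dynamodb_table = "terraform-locks"           # prevents concurrent applies
    encrypt        = true
  }
}

# Reusable module from the public registry: same VPC code per environment
module "vpc" {
  source = "terraform-aws-modules/vpc/aws"
  name   = "prod"
  cidr   = "10.0.0.0/16"
}
```

Storing state in S3 with a DynamoDB lock table lets a whole team run Terraform safely against the same infrastructure.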
Use rolling updates, blue-green deployments, database migrations with backward compatibility, and health checks.
Strategies: (1) Rolling update: replace instances one at a time. Kubernetes default with maxSurge/maxUnavailable. (2) Blue-green: swap entire environments. (3) Canary: gradual traffic shift. Requirements: (1) Health checks: readiness and liveness probes. (2) Backward-compatible database changes: add column → deploy code → migrate data → drop old column. (3) Graceful shutdown: stop accepting new requests, finish in-flight, then terminate. (4) Connection draining: load balancer waits for active connections. (5) Database migrations: separate from application deployment. (6) Feature flags: deploy code dark, enable via flag.
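The health-check and graceful-shutdown requirements translate into this pod-spec fragment (paths and ports are hypothetical):

```yaml
spec:
  terminationGracePeriodSeconds: 30   # time allowed to finish in-flight work
  containers:
    - name: web
      image: example.com/web:1.0
      readinessProbe:                 # gate traffic: only ready pods get requests
        httpGet: { path: /healthz/ready, port: 8080 }
        periodSeconds: 5
      livenessProbe:                  # restart the container if it hangs
        httpGet: { path: /healthz/live, port: 8080 }
        initialDelaySeconds: 10
        periodSeconds: 10
      lifecycle:
        preStop:                      # give the LB time to drain before SIGTERM
          exec: { command: ["sleep", "5"] }
```

During a rolling update, a new pod receives traffic only after its readiness probe passes, and an old pod drains connections before termination.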
12-factor app defines best practices for building cloud-native applications: config in environment, stateless processes, disposable instances.
Factors: (1) Codebase: one repo per app. (2) Dependencies: explicitly declared. (3) Config: in environment variables. (4) Backing services: treat as attached resources. (5) Build/release/run: strict separation. (6) Processes: stateless, share-nothing. (7) Port binding: self-contained, export via port. (8) Concurrency: scale via processes. (9) Disposability: fast startup, graceful shutdown. (10) Dev/prod parity: keep similar. (11) Logs: treat as event streams. (12) Admin processes: run as one-off tasks. Benefits: portability across cloud providers, scalability, maintainability. Most containers and serverless follow these principles.
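Factor 3 (config in the environment) in practice: read settings from environment variables with sensible dev defaults. A minimal Python sketch; the variable names and defaults are illustrative:

```python
import os

def load_config():
    """Build app configuration from the environment (12-factor, factor 3).
    The same code runs in dev, staging, and prod; only the env differs."""
    return {
        "database_url": os.environ.get("DATABASE_URL", "sqlite:///dev.db"),
        "port": int(os.environ.get("PORT", "8080")),
        "debug": os.environ.get("DEBUG", "false").lower() == "true",
    }
```

Because nothing environment-specific is hard-coded, the same build artifact can be promoted unchanged from staging to production.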
Apply least privilege IAM, encrypt data at rest and transit, network segmentation, monitoring, patching, and compliance frameworks.
IAM: least privilege, no root access, MFA, service roles over access keys, regular access reviews. Network: VPC, private subnets, security groups (least open), NACLs, WAF, DDoS protection (Shield). Data: encryption at rest (KMS), in transit (TLS), key rotation. Secrets: Secrets Manager/Vault, never in code/Git. Compliance: SOC2, ISO 27001, HIPAA with cloud compliance tools. Monitoring: CloudTrail audit logs, GuardDuty threat detection, Security Hub. Patching: automated OS updates, container image scanning. CSPM: Cloud Security Posture Management (Prisma Cloud, Wiz). Infrastructure scanning: Checkov, tfsec for Terraform.
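Least-privilege IAM means granting only the specific actions on the specific resources a workload needs. A sketch of a read-only policy for one hypothetical S3 bucket:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "ReadOnlyOneBucket",
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:ListBucket"],
      "Resource": [
        "arn:aws:s3:::example-app-assets",
        "arn:aws:s3:::example-app-assets/*"
      ]
    }
  ]
}
```

Note both ARNs are needed: `ListBucket` applies to the bucket itself, `GetObject` to the objects inside it; anything not explicitly allowed is denied.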
Define RPO/RTO, implement backup strategy, multi-region architecture, automated failover, and regular disaster recovery testing.
RPO (Recovery Point Objective): maximum acceptable data loss (how often to backup). RTO (Recovery Time Objective): maximum acceptable downtime. Strategies (cost increases): (1) Backup & Restore: backup to S3, restore when needed (hours RTO). (2) Pilot Light: minimal core running in DR region, scale up on failover (minutes). (3) Warm Standby: scaled-down duplicate running, scale up on failover (minutes). (4) Multi-site Active-Active: full duplicate in multiple regions, instant failover (seconds). Automation: CloudFormation/Terraform to recreate infrastructure. Testing: regular DR drills (chaos engineering). Database: cross-region replication, automated failover (RDS Multi-AZ).
Chaos engineering intentionally introduces failures in production to verify system resilience and identify weaknesses proactively.
Netflix pioneered with Chaos Monkey (randomly terminates instances). Principles: (1) Start with a hypothesis about steady state. (2) Vary real-world events (server failure, network partition, latency injection). (3) Run experiments in production (with controls). (4) Automate experiments in CI/CD. (5) Minimize blast radius. Tools: Chaos Monkey (instances), Latency Monkey (delays), Gremlin (commercial platform), LitmusChaos (Kubernetes), AWS FIS. GameDays: planned chaos exercises. Prerequisites: observability (know what normal looks like), circuit breakers, retry logic, fallbacks. Build resilience before running chaos experiments.
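Latency injection — one of the "real-world events" above — can be illustrated with a tiny fault-injection wrapper. This is a toy sketch of the idea, not a real chaos tool:

```python
import random
import time

def chaos(fn, failure_rate=0.1, max_delay=0.5, rng=random.random):
    """Wrap fn so that, with probability failure_rate, calls fail with an
    injected ConnectionError; otherwise a random delay up to max_delay
    seconds is added. Used to exercise the caller's retries and fallbacks."""
    def wrapped(*args, **kwargs):
        if rng() < failure_rate:
            raise ConnectionError("injected failure")
        time.sleep(rng() * max_delay)   # injected latency
        return fn(*args, **kwargs)
    return wrapped
```

Wrapping a service client this way in a test environment verifies the hypothesis that retries, timeouts, and circuit breakers actually keep the system at steady state under failure.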
FinOps is the practice of managing cloud costs collaboratively. Optimize with right-sizing, reserved instances, spot instances, and automation.
FinOps framework: Inform (visibility into costs), Optimize (reduce waste), Operate (continuous improvement). Strategies: (1) Right-sizing: match instance size to actual usage. (2) Reserved Instances/Savings Plans: 1-3 year commitment, 30-75% savings. (3) Spot/Preemptible instances: 60-90% savings for fault-tolerant workloads. (4) Auto-scaling: scale down during low usage. (5) Storage tiering: S3 lifecycle policies, cold storage. (6) Cleanup: terminate unused resources, delete old snapshots. (7) Serverless: pay per execution for variable workloads. Tools: AWS Cost Explorer, Spot.io, Kubecost (Kubernetes). Tagging: mandatory cost allocation tags. Budget alerts: proactive notifications.
Ready to master Cloud DevOps?
Start learning with our comprehensive course and practice these questions.