AWS Well-Architected Framework
The AWS Well-Architected Framework provides a structured set of best practices for building secure, high-performing, resilient, and efficient cloud infrastructure. Its six pillars serve as an architectural review checklist — useful not just for AWS but as a general framework for evaluating any cloud architecture's quality.
The Six Pillars
The Well-Architected Framework evaluates architectures across six dimensions, each representing a distinct aspect of system quality:
Operational Excellence: how well the system is operated and monitored — runbooks, post-mortems, automation of manual operations, event-driven operations. Key question: 'Can you operate this system effectively at 2am when you're not familiar with it?'
Security: how the system protects information and systems — IAM, data protection, incident response, infrastructure protection. Key question: 'What is the blast radius if any single component is compromised?'
Reliability: how the system recovers from failures — fault isolation, backup and recovery, change management, load scaling. Key question: 'How does the system behave when a component fails?'
Performance Efficiency: how efficiently the system uses resources — right instance types, serverless where appropriate, caching, database optimization. Key question: 'Are we using the most efficient technology for each workload?'
Cost Optimization: how the system avoids unnecessary costs — right-sizing, reserved instances, spot instances, cost allocation. Key question: 'Are we paying for resources we're not using?'
Sustainability (added 2021): the system's environmental impact, covering energy-efficient instances, serverless to reduce idle capacity, and Graviton processors. Key question: 'How do we minimize our carbon footprint while meeting requirements?'
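As a rough illustration of the Cost Optimization key question, the sketch below flags resources whose average CPU utilization falls under a threshold. The CSV layout (`resource_id,avg_cpu_percent`), the function name `flag_idle`, and the 5% default are all assumptions made for this example, not AWS guidance; real figures would typically come from CloudWatch metrics.

```shell
#!/usr/bin/env bash
# Sketch: flag likely-idle resources from utilization data.
# Assumed input on stdin: one "resource_id,avg_cpu_percent" pair per line.
# The default 5% threshold is an illustrative assumption, not an AWS recommendation.
flag_idle() {
  local threshold="${1:-5}"
  # Print the ID of every row whose numeric CPU value is below the threshold;
  # rows whose second field is not a plain number (e.g. a header) are skipped.
  awk -F',' -v t="$threshold" \
    'NF == 2 && $2 ~ /^[0-9]+(\.[0-9]+)?$/ && ($2 + 0) < t { print $1 }'
}

# Usage with made-up instance IDs:
printf '%s\n' 'i-aaa,2.1' 'i-bbb,47.0' 'i-ccc,0.4' | flag_idle 5
```

A flagged ID is only a candidate for review: low CPU alone does not prove a resource is unused, so memory, network, and ownership should be checked before anything is terminated.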
Shared responsibility: know what you manage versus what the cloud provider manages.
Pillar Review Questions
- Operational Excellence — Are all operations managed via code (no manual server changes)? Is observability complete (metrics, logs, traces)? Does every alert have a runbook? Is there a blameless post-mortem process?
- Security — Is IAM using least privilege? Are secrets managed via Vault/Secrets Manager (never hardcoded)? Is encryption enforced at rest and in transit? Is there a defined incident response plan?
- Reliability — Is the system deployed across multiple AZs? Is there automatic failover? Are backups tested (not just created)? Is there a load-tested capacity plan for peak traffic?
- Performance Efficiency — Are instance types right-sized to actual usage (not over-provisioned)? Is CDN used for static content? Are database queries optimized with appropriate indexes? Is caching implemented at appropriate layers?
- Cost Optimization — Are unused resources terminated automatically? Are reserved instances or savings plans used for predictable workloads? Are cost allocation tags applied to all resources? Is there a monthly cloud cost review process?
- Sustainability — Are Graviton (ARM) instances used where compatible (AWS cites up to 60% less energy for the same performance than comparable EC2 instances)? Is serverless used for variable workloads? Are test/dev environments shut down when not needed?
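One lightweight way to track answers to the review questions above is a plain-text scorecard. The sketch below assumes a hypothetical `pillar|answer` line format (answer is `yes` or `no`) and a made-up function name, `summarize_gaps`; it simply counts the `no` answers per pillar so the weakest areas stand out.

```shell
#!/usr/bin/env bash
# Sketch: count unmet review questions per pillar.
# Assumed input on stdin: "pillar|answer" lines, where answer is "yes" or "no".
summarize_gaps() {
  # Tally "no" answers per pillar, then sort for stable, readable output.
  awk -F'|' '$2 == "no" { gaps[$1]++ }
             END { for (p in gaps) print p ": " gaps[p] }' | sort
}

# Usage with made-up review answers:
printf '%s\n' \
  'Security|yes' 'Security|no' \
  'Reliability|no' 'Reliability|no' \
  'Cost Optimization|yes' | summarize_gaps
```

Pillars with no open gaps are omitted from the output, so an empty result means every recorded answer was `yes`.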
🛠️ Practical: SOC 2 Evidence Automation with AWS CloudTrail
# SOC 2 requires audit evidence collected continuously throughout the period
# CloudTrail is your primary source for access and change management evidence
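# 1. Create a multi-region trail (critical: a trail created with the CLI covers only one region by default)
# The S3 bucket must already exist with a bucket policy that lets CloudTrail write to it
aws cloudtrail create-trail --name prod-audit-trail --s3-bucket-name my-cloudtrail-audit-logs --is-multi-region-trail --enable-log-file-validation # log file validation detects tampering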
aws cloudtrail start-logging --name prod-audit-trail
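# 2. Enable S3 Object Lock on the audit bucket (immutable logs; the bucket must have versioning enabled)
# In COMPLIANCE mode no user, including root, can delete or overwrite a locked object version until retention expires
aws s3api put-object-lock-configuration --bucket my-cloudtrail-audit-logs --object-lock-configuration '{"ObjectLockEnabled":"Enabled","Rule":{"DefaultRetention":{"Mode":"COMPLIANCE","Years":1}}}'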
# 3. Query CloudTrail for SOC 2 evidence: who accessed what?
# lookup-events only searches the last 90 days of events; for older periods, query the S3 logs (e.g. with Athena) or use CloudTrail Lake
aws cloudtrail lookup-events --lookup-attributes AttributeKey=Username,AttributeValue=john.doe --start-time 2024-01-01 --end-time 2024-01-31 --query 'Events[*].{Time:EventTime,Event:EventName,Resource:Resources[0].ResourceName}'
# 4. Detect root account usage (routine root usage is a likely SOC 2 finding)
aws cloudtrail lookup-events --lookup-attributes AttributeKey=Username,AttributeValue=root --query 'Events[*].{Time:EventTime,Event:EventName}'
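To turn a lookup like the root-usage query above into a summary an auditor can skim, the events can be tallied by name. This sketch assumes the lookup is run with `--query 'Events[*].[EventTime,EventName]' --output text`, which prints one tab-separated row per event; `count_events_by_name` is a hypothetical helper name, and the sample rows are made up.

```shell
#!/usr/bin/env bash
# Sketch: tally CloudTrail events by event name.
# Assumed input on stdin: tab-separated "EventTime<TAB>EventName" rows.
count_events_by_name() {
  # Count rows per event name, then list the most frequent names first
  # (ties are broken alphabetically by name).
  awk -F'\t' 'NF >= 2 { n[$2]++ }
              END { for (e in n) print n[e] "\t" e }' | sort -k1,1nr -k2,2
}

# Usage with made-up rows (real rows would come from the lookup-events call):
printf '%s\t%s\n' \
  '2024-01-03T02:14:00Z' 'ConsoleLogin' \
  '2024-01-03T02:15:10Z' 'CreateAccessKey' \
  '2024-01-09T11:00:00Z' 'ConsoleLogin' | count_events_by_name
```

Each output row is a count followed by the event name, which makes unusual actions such as access-key creation easy to spot in a review.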
# 5. Config Rules — continuous compliance checking
# Auto-detect SOC 2 violations like unencrypted S3 buckets or open security groups
aws configservice put-config-rule --config-rule '{
"ConfigRuleName": "s3-bucket-server-side-encryption-enabled",
"Source": { "Owner": "AWS", "SourceIdentifier": "S3_BUCKET_SERVER_SIDE_ENCRYPTION_ENABLED" }
}'
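Once rules are in place, their status can be pulled as evidence. The sketch below assumes the output of `aws configservice describe-compliance-by-config-rule --query 'ComplianceByConfigRules[*].[ConfigRuleName,Compliance.ComplianceType]' --output text`, which yields tab-separated rule names and compliance types; `list_noncompliant_rules` is a hypothetical helper name, and the sample rows are illustrative only.

```shell
#!/usr/bin/env bash
# Sketch: list AWS Config rules that currently report NON_COMPLIANT.
# Assumed input on stdin: tab-separated "ConfigRuleName<TAB>ComplianceType" rows.
list_noncompliant_rules() {
  # Keep only rows whose compliance type is exactly NON_COMPLIANT.
  awk -F'\t' '$2 == "NON_COMPLIANT" { print $1 }' | sort
}

# Usage with illustrative rows (real rows would come from the describe call):
printf '%s\t%s\n' \
  's3-bucket-server-side-encryption-enabled' 'NON_COMPLIANT' \
  'restricted-ssh' 'COMPLIANT' | list_noncompliant_rules
```

Rules that report `INSUFFICIENT_DATA` or `NOT_APPLICABLE` are not listed by this filter, so they may need a separate check.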
# Non-compliant resources create automatic evidence of a control gap
🧠 Module Quiz
What is the difference between SOC 2 Type I and Type II?
Tip
Practice the AWS Well-Architected Framework in small, isolated examples before integrating it into larger projects. Breaking concepts into small experiments builds genuine understanding faster than reading alone.
Practice Task
Note
(1) Write a working example that applies the AWS Well-Architected Framework from scratch without looking at notes. (2) Modify it to handle an edge case (empty input, null value, or error state). (3) Share your solution in the Priygop community for feedback.
Common Mistake
Warning
A common mistake when applying the AWS Well-Architected Framework is skipping edge case testing: empty inputs, null values, and unexpected data types. Always validate boundary conditions to write robust, production-ready cloud code.
Key Takeaways
- The AWS Well-Architected Framework provides a structured set of best practices for building secure, high-performing, resilient, and efficient cloud infrastructure.
- Operational Excellence: manage all operations via code, keep observability complete (metrics, logs, traces), give every alert a runbook, and run blameless post-mortems.
- Security: apply least-privilege IAM, keep secrets in Vault or Secrets Manager (never hardcoded), enforce encryption at rest and in transit, and define an incident response plan.
- Reliability: deploy across multiple AZs with automatic failover, test backups rather than just creating them, and load-test the capacity plan for peak traffic.