When an e-commerce client came to me with a $45,000/month AWS bill that was growing 15% month-over-month, we knew something had to change. Over 8 weeks, we systematically analyzed every line item and brought the bill down to $27,000/month - a 40% reduction - without degrading performance or reliability. Here's exactly how we did it.
Step 1: AWS Cost Explorer Deep Dive
Before optimizing anything, you need to understand where the money is going. Cost Explorer is your starting point:
- Group by service to find the top cost drivers (usually EC2, RDS, and data transfer)
- Group by tag to identify which teams or projects are spending the most
- Enable hourly granularity to spot usage patterns (are dev environments running 24/7?)
- Check untagged resources - these are often forgotten workloads nobody owns
For this client, the breakdown was: EC2 (38%), RDS (22%), Data Transfer (15%), NAT Gateway (12%), S3 (8%), Other (5%).
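If you prefer the CLI to the console, the same breakdown can be pulled with `aws ce get-cost-and-usage` and ranked with jq. The sketch below runs against a trimmed sample response inlined via heredoc (the real call is in the comment); verify the field names against your own output.

```shell
# In practice, pipe the real response instead of the sample:
#   aws ce get-cost-and-usage \
#     --time-period Start=2024-01-01,End=2024-02-01 \
#     --granularity MONTHLY \
#     --metrics UnblendedCost \
#     --group-by Type=DIMENSION,Key=SERVICE
# Rank the top cost drivers by unblended cost:
jq -r '.ResultsByTime[0].Groups
       | sort_by(.Metrics.UnblendedCost.Amount | tonumber) | reverse
       | .[:5][]
       | "\(.Keys[0])\t$\(.Metrics.UnblendedCost.Amount)"' <<'EOF'
{"ResultsByTime":[{"Groups":[
  {"Keys":["Amazon Relational Database Service"],"Metrics":{"UnblendedCost":{"Amount":"9900.00","Unit":"USD"}}},
  {"Keys":["Amazon Elastic Compute Cloud - Compute"],"Metrics":{"UnblendedCost":{"Amount":"17100.00","Unit":"USD"}}},
  {"Keys":["Amazon Simple Storage Service"],"Metrics":{"UnblendedCost":{"Amount":"3600.00","Unit":"USD"}}}]}]}
EOF
```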
Step 2: Right-Sizing EC2 Instances
This is almost always the biggest win. AWS Compute Optimizer and CloudWatch metrics tell you exactly which instances are over-provisioned:
```shell
# Get right-sizing recommendations from Compute Optimizer
aws compute-optimizer get-ec2-instance-recommendations \
  --query 'instanceRecommendations[*].{
    InstanceId: instanceArn,
    Current: currentInstanceType,
    Recommended: recommendationOptions[0].instanceType,
    Savings: recommendationOptions[0].savingsOpportunity.estimatedMonthlySavings.value
  }' --output table
```
```shell
# Check actual CPU utilization over 14 days (date -d is GNU date)
aws cloudwatch get-metric-statistics \
  --namespace AWS/EC2 \
  --metric-name CPUUtilization \
  --dimensions Name=InstanceId,Value=i-1234567890abcdef0 \
  --start-time $(date -d '-14 days' -u +%Y-%m-%dT%H:%M:%S) \
  --end-time $(date -u +%Y-%m-%dT%H:%M:%S) \
  --period 3600 \
  --statistics Average Maximum \
  --output table
```
What we found: 60% of EC2 instances were using less than 20% of their allocated CPU. We moved from m5.2xlarge to m5.large for most application servers, saving $8,400/month.
Step 3: Reserved Instances and Savings Plans
For workloads with predictable usage (databases, baseline application servers), Savings Plans provide up to 72% discount:
- Compute Savings Plans - Most flexible, covers EC2, Fargate, and Lambda
- EC2 Instance Savings Plans - Higher discount but locked to an instance family in a Region
- Reserved Instances for RDS - 1-year No Upfront RI is my default recommendation
We purchased 1-year Compute Savings Plans covering 70% of baseline usage, saving $6,200/month on EC2 and Fargate combined.
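After purchasing, it's worth checking monthly that the commitment is actually being used. The sketch below extracts utilization and net savings with jq from a sample response trimmed to the fields used (run the real `aws ce get-savings-plans-utilization` call in the comment and confirm the field names against your output).

```shell
# Monthly sanity check after purchase: is the commitment being used?
# In practice: aws ce get-savings-plans-utilization \
#                --time-period Start=2024-01-01,End=2024-02-01
# (sample response trimmed to the fields used below)
jq -r '.Total | "Utilization: \(.Utilization.UtilizationPercentage)%  Net savings: $\(.Savings.NetSavings)"' <<'EOF'
{"Total":{"Utilization":{"UtilizationPercentage":"96.4"},"Savings":{"NetSavings":"6200.00"}}}
EOF
```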
Step 4: NAT Gateway - The Hidden Bill Killer
NAT Gateway charges $0.045/GB for data processing plus hourly charges. For data-heavy applications, this adds up fast. Our client was paying $5,400/month on NAT Gateways alone.
Solutions we implemented:
- VPC endpoints for S3 and DynamoDB - Free and keeps traffic off the NAT Gateway
- Interface endpoints for ECR, CloudWatch, STS - roughly $7/month each per AZ (plus a small per-GB charge) vs hundreds in NAT data-processing costs
- Moved batch processing to public subnets with security groups instead of NAT
```hcl
# Terraform: VPC endpoints to reduce NAT Gateway costs
resource "aws_vpc_endpoint" "s3" {
  vpc_id            = aws_vpc.main.id
  service_name      = "com.amazonaws.us-east-1.s3"
  vpc_endpoint_type = "Gateway"
  route_table_ids   = [aws_route_table.private.id]
}

resource "aws_vpc_endpoint" "dynamodb" {
  vpc_id            = aws_vpc.main.id
  service_name      = "com.amazonaws.us-east-1.dynamodb"
  vpc_endpoint_type = "Gateway"
  route_table_ids   = [aws_route_table.private.id]
}

resource "aws_vpc_endpoint" "ecr_api" {
  vpc_id              = aws_vpc.main.id
  service_name        = "com.amazonaws.us-east-1.ecr.api"
  vpc_endpoint_type   = "Interface"
  subnet_ids          = aws_subnet.private[*].id
  security_group_ids  = [aws_security_group.vpc_endpoints.id]
  private_dns_enabled = true
}

resource "aws_vpc_endpoint" "ecr_dkr" {
  vpc_id              = aws_vpc.main.id
  service_name        = "com.amazonaws.us-east-1.ecr.dkr"
  vpc_endpoint_type   = "Interface"
  subnet_ids          = aws_subnet.private[*].id
  security_group_ids  = [aws_security_group.vpc_endpoints.id]
  private_dns_enabled = true
}
```
Result: NAT Gateway costs dropped from $5,400 to $1,200/month.
Step 5: S3 Storage Optimization
The client had 15TB of S3 data, most of which hadn't been accessed in months. We implemented intelligent tiering and lifecycle policies:
```hcl
# Terraform: S3 lifecycle rules
resource "aws_s3_bucket_lifecycle_configuration" "app_data" {
  bucket = aws_s3_bucket.app_data.id

  rule {
    id     = "transition-to-ia"
    status = "Enabled"

    # Empty filter applies the rule to every object in the bucket
    # (AWS provider v4+ requires a filter or prefix on each rule)
    filter {}

    transition {
      days          = 30
      storage_class = "STANDARD_IA"
    }

    transition {
      days          = 90
      storage_class = "GLACIER_IR"
    }

    transition {
      days          = 365
      storage_class = "DEEP_ARCHIVE"
    }
  }

  rule {
    id     = "cleanup-incomplete-uploads"
    status = "Enabled"

    filter {}

    abort_incomplete_multipart_upload {
      days_after_initiation = 7
    }
  }

  # Only takes effect if versioning is enabled on the bucket
  rule {
    id     = "expire-old-versions"
    status = "Enabled"

    filter {}

    noncurrent_version_expiration {
      noncurrent_days = 30
    }
  }
}
```
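The lifecycle rules above suit data with predictable access patterns; for the buckets where access was unpredictable, the Intelligent-Tiering side of the setup looked roughly like this (bucket and configuration names are illustrative):

```hcl
# Terraform: Intelligent-Tiering for buckets with unpredictable access.
# S3 moves objects it sees as idle to cheaper tiers automatically; the
# optional archive tiers below kick in after 90/180 days without access.
resource "aws_s3_bucket_intelligent_tiering_configuration" "app_data" {
  bucket = aws_s3_bucket.app_data.id
  name   = "entire-bucket"

  tiering {
    access_tier = "ARCHIVE_ACCESS"
    days        = 90
  }

  tiering {
    access_tier = "DEEP_ARCHIVE_ACCESS"
    days        = 180
  }
}
```

Note that objects still have to be stored in the INTELLIGENT_TIERING storage class (set at upload, or via a lifecycle transition) for this configuration to apply.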
Step 6: CloudWatch Log Costs
CloudWatch Logs ingestion costs $0.50/GB. Our client was ingesting 200GB/day of logs, most of which were debug-level application logs nobody looked at.
- Reduced application log levels from DEBUG to INFO in production
- Set log retention to 30 days instead of "Never Expire"
- Moved long-term log storage to S3 via Kinesis Firehose at a fraction of the cost
Savings: $2,100/month on CloudWatch alone.
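The retention change is a one-liner per log group in Terraform; managing log groups this way also stops new ones from defaulting to "Never Expire" (the group name is illustrative):

```hcl
# Terraform: cap CloudWatch log retention instead of "Never Expire"
resource "aws_cloudwatch_log_group" "app" {
  name              = "/app/production"
  retention_in_days = 30
}
```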
Step 7: Automated Cost Alerts
Prevention is better than cure. We set up AWS Budgets to catch cost spikes early:
```hcl
# Terraform: AWS Budget alerts
resource "aws_budgets_budget" "monthly" {
  name         = "monthly-total-budget"
  budget_type  = "COST"
  limit_amount = "30000"
  limit_unit   = "USD"
  time_unit    = "MONTHLY"

  notification {
    comparison_operator        = "GREATER_THAN"
    threshold                  = 80
    threshold_type             = "PERCENTAGE"
    notification_type          = "FORECASTED"
    subscriber_email_addresses = ["platform-team@company.com"]
  }

  notification {
    comparison_operator        = "GREATER_THAN"
    threshold                  = 100
    threshold_type             = "PERCENTAGE"
    notification_type          = "ACTUAL"
    subscriber_email_addresses = ["cto@company.com", "platform-team@company.com"]
  }
}
```
Step 8: Dev/Staging Environment Scheduling
Dev and staging environments don't need to run 24/7. We implemented automatic shutdown schedules:
- Dev environments: Running 8am-8pm weekdays only (60% savings)
- Staging: Running 6am-10pm weekdays, 10am-6pm weekends (50% savings)
- Used AWS Instance Scheduler for EC2 and RDS
- Set ECS desired count to 0 during off-hours via EventBridge + Lambda
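An alternative to the EventBridge + Lambda approach for ECS is Application Auto Scaling's scheduled actions, which can drive the service's desired count on a cron directly. A sketch (cluster/service names and the UTC cron windows are illustrative):

```hcl
# Alternative to EventBridge + Lambda: scheduled scaling on the ECS service
resource "aws_appautoscaling_target" "staging_api" {
  service_namespace  = "ecs"
  resource_id        = "service/staging-cluster/api" # illustrative names
  scalable_dimension = "ecs:service:DesiredCount"
  min_capacity       = 0
  max_capacity       = 2
}

# Scale to zero at 8pm on weekdays (cron times are UTC)
resource "aws_appautoscaling_scheduled_action" "stop_nightly" {
  name               = "staging-stop"
  service_namespace  = aws_appautoscaling_target.staging_api.service_namespace
  resource_id        = aws_appautoscaling_target.staging_api.resource_id
  scalable_dimension = aws_appautoscaling_target.staging_api.scalable_dimension
  schedule           = "cron(0 20 ? * MON-FRI *)"

  scalable_target_action {
    min_capacity = 0
    max_capacity = 0
  }
}

# Scale back up at 8am on weekdays
resource "aws_appautoscaling_scheduled_action" "start_morning" {
  name               = "staging-start"
  service_namespace  = aws_appautoscaling_target.staging_api.service_namespace
  resource_id        = aws_appautoscaling_target.staging_api.resource_id
  scalable_dimension = aws_appautoscaling_target.staging_api.scalable_dimension
  schedule           = "cron(0 8 ? * MON-FRI *)"

  scalable_target_action {
    min_capacity = 1
    max_capacity = 2
  }
}
```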
Results Summary
Here's the final breakdown of savings:
- EC2 right-sizing: -$8,400/month
- Savings Plans: -$6,200/month
- NAT Gateway optimization: -$4,200/month
- CloudWatch log reduction: -$2,100/month
- S3 lifecycle policies: -$1,800/month
- Dev/staging scheduling: -$2,300/month
Total monthly savings: ~$25,000. The bill stood at $27K when the 8-week engagement wrapped up and settled near $20K over the following months as the Savings Plans and S3 lifecycle transitions took full effect.
Monthly Cost Review Checklist
I now run this checklist monthly with every client:
- Review Cost Explorer for anomalies and trends
- Check Compute Optimizer for new right-sizing recommendations
- Audit untagged resources and assign ownership
- Review Savings Plan utilization and coverage
- Check for idle resources (unused EBS volumes, unattached EIPs, empty load balancers)
- Verify dev/staging schedules are working correctly
- Review data transfer costs for new optimization opportunities
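For the idle-resource check, `describe-volumes` with a status filter does most of the work; the jq below runs against a trimmed sample of that output (in practice, pipe the real call from the comment):

```shell
# In practice: aws ec2 describe-volumes --filters Name=status,Values=available
# "available" means the volume is attached to nothing but still billed.
jq -r '.Volumes[] | select(.State == "available") | "\(.VolumeId)\t\(.Size) GiB\t\(.CreateTime)"' <<'EOF'
{"Volumes":[
  {"VolumeId":"vol-0aaa111","Size":100,"State":"available","CreateTime":"2023-03-01T00:00:00Z"},
  {"VolumeId":"vol-0bbb222","Size":50,"State":"in-use","CreateTime":"2023-06-15T00:00:00Z"}]}
EOF
```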
Conclusion
AWS cost optimization isn't a one-time project - it's an ongoing practice. The key is building cost awareness into your team's culture and making it part of every architecture decision. Start with the quick wins (right-sizing, Savings Plans, NAT Gateway optimization) and build from there. The 40% savings we achieved wasn't magic - it was systematic analysis and disciplined execution.
If your AWS bill is higher than it should be, feel free to book a consultation - I'd love to help you find those savings.