
Infrastructure Cost Optimization: How I Saved 40% on AWS

AWS Cost Management · August 15, 2024

When an e-commerce client came to me with a $45,000/month AWS bill growing 15% month over month, it was clear something had to change. Over 8 weeks, we systematically analyzed every line item and brought the bill down to $27,000/month - a 40% reduction - without degrading performance or reliability. Here's exactly how we did it.

Step 1: AWS Cost Explorer Deep Dive

Before optimizing anything, you need to understand where the money is going. Cost Explorer is your starting point:

  • Group by service to find the top cost drivers (usually EC2, RDS, and data transfer)
  • Group by tag to identify which teams or projects are spending the most
  • Enable hourly granularity to spot usage patterns (are dev environments running 24/7?)
  • Check untagged resources - these are often forgotten resources nobody owns

For this client, the breakdown was: EC2 (38%), RDS (22%), Data Transfer (15%), NAT Gateway (12%), S3 (8%), Other (5%).
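The same per-service breakdown is available from the command line via the Cost Explorer API (note that each API request is billed at $0.01). A sketch for one month of data - adjust the time period to the month you're analyzing:

```shell
# One month's spend grouped by service (Cost Explorer API - $0.01/request)
aws ce get-cost-and-usage \
  --time-period Start=2024-07-01,End=2024-08-01 \
  --granularity MONTHLY \
  --metrics "UnblendedCost" \
  --group-by Type=DIMENSION,Key=SERVICE \
  --query 'ResultsByTime[0].Groups[].{Service: Keys[0], Cost: Metrics.UnblendedCost.Amount}' \
  --output table
```

This is handy for dropping the numbers into a spreadsheet or a weekly report without clicking through the console.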

Step 2: Right-Sizing EC2 Instances

This is almost always the biggest win. AWS Compute Optimizer and CloudWatch metrics tell you exactly which instances are over-provisioned:

# Get right-sizing recommendations from Compute Optimizer
aws compute-optimizer get-ec2-instance-recommendations \
  --query 'instanceRecommendations[*].{
    InstanceId: instanceArn,
    Current: currentInstanceType,
    Recommended: recommendationOptions[0].instanceType,
    Savings: recommendationOptions[0].savingsOpportunity.estimatedMonthlySavings.value
  }' --output table

# Check actual CPU utilization over 14 days (GNU date syntax; on macOS use date -v-14d)
aws cloudwatch get-metric-statistics \
  --namespace AWS/EC2 \
  --metric-name CPUUtilization \
  --dimensions Name=InstanceId,Value=i-1234567890abcdef0 \
  --start-time $(date -d '-14 days' -u +%Y-%m-%dT%H:%M:%S) \
  --end-time $(date -u +%Y-%m-%dT%H:%M:%S) \
  --period 3600 \
  --statistics Average Maximum \
  --output table

What we found: 60% of EC2 instances were using less than 20% of their allocated CPU. We moved from m5.2xlarge to m5.large for most application servers, saving $8,400/month.

Step 3: Reserved Instances and Savings Plans

For workloads with predictable usage (databases, baseline application servers), Savings Plans provide up to 72% discount:

  • Compute Savings Plans - Most flexible, covers EC2, Fargate, and Lambda
  • EC2 Instance Savings Plans - Higher discount but locked to instance family
  • Reserved Instances for RDS - 1-year No Upfront RI is my default recommendation
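You don't have to size the commitment by hand - AWS will compute a recommended hourly commitment from your recent usage. A sketch using the Cost Explorer recommendations API, matching the plan type and term we bought:

```shell
# Ask AWS what hourly Compute Savings Plan commitment it recommends,
# based on the last 30 days of usage (1-year term, no upfront)
aws ce get-savings-plans-purchase-recommendation \
  --savings-plans-type COMPUTE_SP \
  --term-in-years ONE_YEAR \
  --payment-option NO_UPFRONT \
  --lookback-period-in-days THIRTY_DAYS
```

The response includes the recommended hourly commitment and estimated monthly savings; I generally commit to a bit less than the recommendation to leave headroom for right-sizing still in flight.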

We purchased 1-year Compute Savings Plans covering 70% of baseline usage, saving $6,200/month on EC2 and Fargate combined.

Step 4: NAT Gateway - The Hidden Bill Killer

NAT Gateway charges $0.045/GB for data processing plus hourly charges. For data-heavy applications, this adds up fast. Our client was paying $5,400/month on NAT Gateways alone.

Solutions we implemented:

  • VPC endpoints for S3 and DynamoDB - Free and keeps traffic off the NAT Gateway
  • Interface endpoints for ECR, CloudWatch, STS - roughly $7/month per AZ each vs hundreds in NAT data processing costs
  • Moved batch processing to public subnets with security groups instead of NAT

# Terraform: VPC endpoints to reduce NAT Gateway costs
resource "aws_vpc_endpoint" "s3" {
  vpc_id       = aws_vpc.main.id
  service_name = "com.amazonaws.us-east-1.s3"
  vpc_endpoint_type = "Gateway"
  route_table_ids   = [aws_route_table.private.id]
}

resource "aws_vpc_endpoint" "dynamodb" {
  vpc_id       = aws_vpc.main.id
  service_name = "com.amazonaws.us-east-1.dynamodb"
  vpc_endpoint_type = "Gateway"
  route_table_ids   = [aws_route_table.private.id]
}

resource "aws_vpc_endpoint" "ecr_api" {
  vpc_id              = aws_vpc.main.id
  service_name        = "com.amazonaws.us-east-1.ecr.api"
  vpc_endpoint_type   = "Interface"
  subnet_ids          = aws_subnet.private[*].id
  security_group_ids  = [aws_security_group.vpc_endpoints.id]
  private_dns_enabled = true
}

resource "aws_vpc_endpoint" "ecr_dkr" {
  vpc_id              = aws_vpc.main.id
  service_name        = "com.amazonaws.us-east-1.ecr.dkr"
  vpc_endpoint_type   = "Interface"
  subnet_ids          = aws_subnet.private[*].id
  security_group_ids  = [aws_security_group.vpc_endpoints.id]
  private_dns_enabled = true
}

Result: NAT Gateway costs dropped from $5,400 to $1,200/month.

Step 5: S3 Storage Optimization

The client had 15TB of S3 data, most of which hadn't been accessed in months. We implemented intelligent tiering and lifecycle policies:

# Terraform: S3 lifecycle rules
resource "aws_s3_bucket_lifecycle_configuration" "app_data" {
  bucket = aws_s3_bucket.app_data.id

  rule {
    id     = "transition-to-ia"
    status = "Enabled"

    # Empty filter applies the rule to all objects (a filter or prefix
    # is required per rule in AWS provider v4+)
    filter {}

    transition {
      days          = 30
      storage_class = "STANDARD_IA"
    }
    transition {
      days          = 90
      storage_class = "GLACIER_IR"
    }
    transition {
      days          = 365
      storage_class = "DEEP_ARCHIVE"
    }
  }

  rule {
    id     = "cleanup-incomplete-uploads"
    status = "Enabled"
    filter {}
    abort_incomplete_multipart_upload {
      days_after_initiation = 7
    }
  }

  rule {
    id     = "expire-old-versions"
    status = "Enabled"
    filter {}
    noncurrent_version_expiration {
      noncurrent_days = 30
    }
  }
}
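Before trusting the abort rule, it's worth seeing how much is already stranded in incomplete multipart uploads - they don't show up in normal object listings but are billed as storage until aborted. The bucket name here is a placeholder:

```shell
# List incomplete multipart uploads for a bucket (placeholder bucket name)
aws s3api list-multipart-uploads \
  --bucket app-data-bucket \
  --query 'Uploads[].{Key: Key, Started: Initiated}' \
  --output table
```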

Step 6: CloudWatch Log Costs

CloudWatch Logs ingestion costs $0.50/GB. Our client was ingesting 200GB/day of logs, most of which were debug-level application logs nobody looked at.

  • Reduced application log levels from DEBUG to INFO in production
  • Set log retention to 30 days instead of "Never Expire"
  • Moved long-term log storage to S3 via Kinesis Firehose at a fraction of the cost
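Finding the offenders is straightforward: log groups created without a retention policy default to Never Expire. A sketch, assuming a blanket 30-day policy (the log group name in the second command is a placeholder - review the list before applying it everywhere):

```shell
# List log groups that have no retention policy set (i.e. Never Expire)
aws logs describe-log-groups \
  --query 'logGroups[?!retentionInDays].logGroupName' \
  --output text

# Then cap each one, e.g.:
aws logs put-retention-policy \
  --log-group-name /aws/app/example \
  --retention-in-days 30
```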

Savings: $2,100/month on CloudWatch alone.

Step 7: Automated Cost Alerts

Prevention is better than cure. We set up AWS Budgets to catch cost spikes early:

# Terraform: AWS Budget alerts
resource "aws_budgets_budget" "monthly" {
  name         = "monthly-total-budget"
  budget_type  = "COST"
  limit_amount = "30000"
  limit_unit   = "USD"
  time_unit    = "MONTHLY"

  notification {
    comparison_operator       = "GREATER_THAN"
    threshold                 = 80
    threshold_type            = "PERCENTAGE"
    notification_type         = "FORECASTED"
    subscriber_email_addresses = ["platform-team@company.com"]
  }

  notification {
    comparison_operator       = "GREATER_THAN"
    threshold                 = 100
    threshold_type            = "PERCENTAGE"
    notification_type         = "ACTUAL"
    subscriber_email_addresses = ["cto@company.com", "platform-team@company.com"]
  }
}

Step 8: Dev/Staging Environment Scheduling

Dev and staging environments don't need to run 24/7. We implemented automatic shutdown schedules:

  • Dev environments: Running 8am-8pm weekdays only (60% savings)
  • Staging: Running 6am-10pm weekdays, 10am-6pm weekends (50% savings)
  • Used AWS Instance Scheduler for EC2 and RDS
  • Set ECS desired count to 0 during off-hours via EventBridge + Lambda
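The savings percentages above are easy to sanity-check with back-of-the-envelope math: an environment running 60 of the week's 168 hours only accrues about 36% of the compute-hours. The raw numbers come out slightly above what we actually banked, since shared services stay on around the clock. A tiny helper to do the arithmetic:

```shell
#!/usr/bin/env bash
# Percent of compute-hours saved by an on/off schedule,
# given the number of hours the environment runs per week
schedule_savings() {
  awk -v h="$1" 'BEGIN { printf "%.0f\n", (1 - h / 168) * 100 }'
}

schedule_savings 60   # dev: 12h x 5 weekdays -> 64
schedule_savings 96   # staging: 16h x 5 weekdays + 8h x 2 weekend days -> 43
```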

Results Summary

Here's the final breakdown of savings:

  • EC2 right-sizing: -$8,400/month
  • Savings Plans: -$6,200/month
  • NAT Gateway optimization: -$4,200/month
  • CloudWatch log reduction: -$2,100/month
  • S3 lifecycle policies: -$1,800/month
  • Dev/staging scheduling: -$2,300/month

Total monthly savings: ~$25,000 (from $45K to ~$20K once everything settled; the 40% figure in the title is where we stood at the end of the 8-week engagement, before the Savings Plans and lifecycle transitions had fully kicked in)

Monthly Cost Review Checklist

I now run this checklist monthly with every client:

  1. Review Cost Explorer for anomalies and trends
  2. Check Compute Optimizer for new right-sizing recommendations
  3. Audit untagged resources and assign ownership
  4. Review Savings Plan utilization and coverage
  5. Check for idle resources (unused EBS volumes, unattached EIPs, empty load balancers)
  6. Verify dev/staging schedules are working correctly
  7. Review data transfer costs for new optimization opportunities
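For item 5, two checks I run verbatim - unattached EBS volumes and unassociated Elastic IPs both bill whether or not anything is using them:

```shell
# EBS volumes not attached to any instance (status "available" = still billed)
aws ec2 describe-volumes \
  --filters Name=status,Values=available \
  --query 'Volumes[].{Id: VolumeId, SizeGiB: Size, Created: CreateTime}' \
  --output table

# Elastic IPs with no association (billed hourly while idle)
aws ec2 describe-addresses \
  --query 'Addresses[?!AssociationId].PublicIp' \
  --output text
```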

Conclusion

AWS cost optimization isn't a one-time project - it's an ongoing practice. The key is building cost awareness into your team's culture and making it part of every architecture decision. Start with the quick wins (right-sizing, Savings Plans, NAT Gateway optimization) and build from there. The 40% savings we achieved wasn't magic - it was systematic analysis and disciplined execution.

If your AWS bill is higher than it should be, feel free to book a consultation - I'd love to help you find those savings.