Cloud Cost Optimization: Stop Overpaying for Your MVP
Most startups waste 40-60% of their cloud budget. Here is a practical guide to right-sizing your infrastructure without sacrificing reliability.

Last quarter I audited a startup's AWS bill and found they were spending $2,800/month on infrastructure that could comfortably run for $900. They had a t3.2xlarge running a Node.js API that peaked at 12% CPU utilization. Three unused Elastic IPs sitting idle at $3.65/month each. An RDS Multi-AZ deployment for a database with 200 rows. A NAT gateway routing traffic that could have gone through a VPC endpoint for free. None of this was malicious or even particularly careless — it was the accumulation of "just provision something that works" decisions made during a sprint to ship.
This is the norm, not the exception. Most early-stage startups waste 40-60% of their cloud spend because nobody on the team has the time or incentive to optimize costs when there are features to ship. The problem compounds: once infrastructure is provisioned and working, nobody touches it.
The Biggest Cost Traps I Keep Seeing
Before diving into solutions, it helps to understand where the money actually goes. These are the traps I encounter most frequently when reviewing startup infrastructure.
Over-Provisioned Compute Instances
This is the single largest source of waste. Teams pick an instance size during initial setup — usually based on a blog post or a "just to be safe" mentality — and never revisit it. I have seen m5.xlarge instances running cron jobs that execute for 30 seconds every hour. The instance sits idle for 59.5 minutes, burning $0.192/hour ($140/month) to do essentially nothing.
The root cause is psychological: nobody wants to be the person who under-provisioned the server and caused an outage. So everyone rounds up. Twice.
# Check your EC2 instance CPU utilization over the past 2 weeks
# (date -d is GNU; on macOS, use: date -u -v-14d +%Y-%m-%dT%H:%M:%S)
aws cloudwatch get-metric-statistics \
  --namespace AWS/EC2 \
  --metric-name CPUUtilization \
  --dimensions Name=InstanceId,Value=i-0abcdef1234567890 \
  --start-time $(date -u -d '14 days ago' +%Y-%m-%dT%H:%M:%S) \
  --end-time $(date -u +%Y-%m-%dT%H:%M:%S) \
  --period 3600 \
  --statistics Average Maximum
If your average CPU is below 20% and your peak is below 50%, you are almost certainly over-provisioned. Drop down one or two instance sizes, monitor for a week, and adjust.
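That rule of thumb can be sketched as a tiny helper. The 20%/50% thresholds are the ones above; the "near capacity" branch at 80% peak is my own illustrative addition, not a figure from this article:

```python
def rightsizing_verdict(avg_cpu: float, peak_cpu: float) -> str:
    """Apply the rule of thumb: avg < 20% and peak < 50% suggests over-provisioning."""
    if avg_cpu < 20 and peak_cpu < 50:
        return "over-provisioned: drop one or two instance sizes"
    if peak_cpu > 80:  # assumed threshold, not from the article
        return "near capacity: consider scaling up or out"
    return "reasonably sized"

# The Node.js API from the intro: 12% average, modest peaks
print(rightsizing_verdict(12, 34))
```

Feed it the Average and Maximum statistics from the CloudWatch query above.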
Forgotten EBS Volumes and Snapshots
When you terminate an EC2 instance, the attached EBS volumes do not automatically delete unless you enabled delete-on-termination at launch. Terminated instances can therefore leave behind orphaned volumes that continue to accrue charges. I have seen accounts with dozens of unattached gp3 volumes, each costing $0.08/GB/month, totaling hundreds of dollars for storage nobody is using.
Snapshots are even sneakier. Automated backup scripts create daily snapshots with no lifecycle policy, so they accumulate indefinitely. A 500GB volume snapshotted daily for a year generates 365 snapshots, and while EBS snapshots are incremental, the costs add up to real money.
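A rough model makes the "incremental but still real money" point concrete. The $0.05/GB-month snapshot price is the standard-tier us-east-1 list price; the 5 GB/day changed-block rate is an assumption for illustration:

```python
# Rough EBS snapshot cost model: the first snapshot stores the full volume,
# each later one stores only the blocks that changed since the previous one.
SNAPSHOT_PRICE = 0.05   # USD per GB-month, standard tier (verify current pricing)
full_size_gb = 500
daily_change_gb = 5     # assumed changed blocks per day
days = 365

stored_gb = full_size_gb + daily_change_gb * (days - 1)
monthly_cost = stored_gb * SNAPSHOT_PRICE
print(f"{stored_gb} GB stored -> ${monthly_cost:.2f}/month")
```

Even with incremental storage, a year of daily snapshots on one volume lands at a three-figure monthly bill under these assumptions.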
# Find all unattached EBS volumes
aws ec2 describe-volumes \
  --filters Name=status,Values=available \
  --query 'Volumes[*].{ID:VolumeId,Size:Size,Created:CreateTime}' \
  --output table
# Find snapshots older than 90 days (compute the cutoff instead of hardcoding a date)
CUTOFF=$(date -u -d '90 days ago' +%Y-%m-%d)
aws ec2 describe-snapshots \
  --owner-ids self \
  --query "Snapshots[?StartTime<='${CUTOFF}'].{ID:SnapshotId,Size:VolumeSize,Date:StartTime}" \
  --output table
NAT Gateway: The Silent Budget Killer
NAT gateways cost $0.045/hour (roughly $33/month) just to exist, plus $0.045 per GB of data processed. For a startup making frequent API calls to external services from private subnets, the data processing charges alone can exceed $100/month. I have seen NAT gateway charges account for 15-20% of a small startup's total AWS bill.
The fix depends on what traffic is flowing through the NAT:
- Traffic to AWS services (S3, DynamoDB, SQS): Use VPC Gateway Endpoints (free for S3 and DynamoDB) or Interface Endpoints ($7.20/month each, but still cheaper than NAT for high-volume traffic).
- Traffic to the internet from Lambda functions: Consider whether those functions actually need to be in a VPC. Many Lambda functions are placed in a VPC "for security" when they do not access any VPC resources.
- Low-volume outbound traffic: A NAT instance on a t4g.nano ($3.02/month) handles modest traffic at a fraction of the NAT gateway cost.
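To see how quickly the endpoint option pays off, here is a small comparison using the NAT prices quoted above. Gateway endpoints for S3 and DynamoDB carry no hourly or per-GB charge; traffic volumes are illustrative:

```python
# Monthly cost of routing S3/DynamoDB traffic through a NAT gateway
# versus a free VPC gateway endpoint. us-east-1 list prices; verify before relying.
HOURS = 730
NAT_HOURLY, NAT_PER_GB = 0.045, 0.045
GATEWAY_ENDPOINT_COST = 0.0  # S3 and DynamoDB gateway endpoints are free

def nat_monthly(gb_processed: float) -> float:
    return NAT_HOURLY * HOURS + NAT_PER_GB * gb_processed

for gb in (50, 500, 2000):
    print(f"{gb:>5} GB/month: NAT ${nat_monthly(gb):.2f} vs endpoint ${GATEWAY_ENDPOINT_COST:.2f}")
```

Even at zero traffic the NAT gateway costs ~$33/month, so the endpoint wins immediately for AWS-service traffic.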
Data Transfer Charges
AWS charges $0.09/GB for data leaving a region, and cross-AZ traffic costs $0.01/GB in each direction. These charges are invisible until they are not. A chatty microservices architecture spread across three availability zones, with services calling each other hundreds of times per second, can generate surprising data transfer bills.
The architecture fix: co-locate services that communicate frequently in the same AZ for non-critical workloads, or use internal load balancers with AZ affinity.
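A quick back-of-the-envelope calculation shows how a chatty architecture turns a $0.01/GB rate into real money. The request rate and payload size below are assumed, not taken from a real bill:

```python
# Cross-AZ traffic is billed $0.01/GB in EACH direction, so a request whose
# response also crosses the AZ boundary effectively costs $0.02/GB.
CROSS_AZ_PER_GB_EACH_WAY = 0.01

def monthly_cross_az_cost(requests_per_sec: float, avg_payload_kb: float) -> float:
    gb_per_month = requests_per_sec * avg_payload_kb * 86400 * 30 / (1024 ** 2)
    return gb_per_month * CROSS_AZ_PER_GB_EACH_WAY * 2  # both directions billed

# Assumed: 200 req/s between services in different AZs, 20 KB average payload
print(f"${monthly_cross_az_cost(200, 20):.2f}/month")
```

Under these assumptions, one pair of chatty services crossing AZs costs nearly $200/month in transfer alone.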
Right-Sizing: A Systematic Approach
Right-sizing is not a one-time exercise. It is something you should revisit quarterly. Here is the process I follow.
Step 1: Gather Utilization Data
AWS Compute Optimizer is free and provides right-sizing recommendations for EC2, EBS, Lambda, and ECS. Enable it and let it collect at least 14 days of data before acting on its recommendations.
For more granular analysis, CloudWatch metrics are your primary data source. The key metrics to watch:
- EC2: CPUUtilization, NetworkIn/Out, MemoryUtilization (requires CloudWatch agent)
- RDS: CPUUtilization, DatabaseConnections, FreeableMemory, ReadIOPS, WriteIOPS
- ElastiCache: CPUUtilization, CurrConnections, BytesUsedForCache
Step 2: Categorize Workloads
Not every workload should be right-sized the same way. I categorize them into three buckets:
Steady-state workloads (API servers, databases): These need consistent capacity. Right-size by finding the smallest instance that handles peak load with 30% headroom. Use Reserved Instances or Savings Plans for the baseline.
Variable workloads (batch processing, build servers): These should use Auto Scaling Groups with appropriate scaling policies, or Spot Instances for fault-tolerant batch jobs.
Periodic workloads (cron jobs, scheduled reports): These should not run on dedicated instances at all. Move them to Lambda, Fargate Spot, or Step Functions.
Step 3: Act Incrementally
Never right-size everything at once. Drop one instance size at a time, monitor for a week, then decide whether to go further. The cost of a brief over-provisioning period is far less than the cost of a production outage caused by aggressive downsizing.
Reserved Instances vs Savings Plans vs Spot
This is where the real savings happen — 30-72% off on-demand pricing — but the commitment mechanics are confusing enough that many startups just avoid them entirely.
Reserved Instances (RIs)
You commit to a specific instance type in a specific region for 1 or 3 years. In exchange, you get 30-40% off (no upfront, 1 year) to 60% off (all upfront, 3 years). The catch: if you need to change instance types, Standard RIs are not flexible. Convertible RIs offer more flexibility at a slightly lower discount (around 45% for 3-year all-upfront).
When to use RIs: Databases and other workloads where the instance type and size are genuinely stable. RDS Reserved Instances are almost always worth it — databases rarely change size.
Savings Plans
Savings Plans are the modern alternative to RIs for compute. You commit to a dollar amount of compute per hour (e.g., $0.10/hour) for 1 or 3 years. The commitment applies automatically to EC2, Fargate, and Lambda usage, and it is flexible across instance families, sizes, OS, and regions.
When to use Savings Plans: Almost always preferable to EC2 RIs for compute workloads. Compute Savings Plans offer the best flexibility-to-discount ratio. Start with a commitment that covers your minimum baseline spend.
# Example: Calculating Savings Plan commitment
# Current on-demand spend: $500/month on EC2
# Minimum baseline (never drops below): $350/month
# Recommended Savings Plan commitment: $350/month = ~$0.48/hour
# Expected savings: ~30% on the committed amount = ~$105/month
# Remaining $150/month stays on-demand for flexibility
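The worked example above, as a small calculator. The 30% figure is this article's estimate for a 1-year, no-upfront Compute Savings Plan, not an official rate:

```python
HOURS_PER_MONTH = 730

def savings_plan(monthly_on_demand: float, monthly_baseline: float,
                 discount: float = 0.30) -> dict:
    """Convert a monthly baseline spend into an hourly commitment and savings."""
    return {
        "hourly_commitment": round(monthly_baseline / HOURS_PER_MONTH, 2),
        "monthly_savings": round(monthly_baseline * discount, 2),
        "remaining_on_demand": round(monthly_on_demand - monthly_baseline, 2),
    }

# $500/month on-demand, $350/month never-drops-below baseline
print(savings_plan(500, 350))
```

Committing only to the baseline keeps the variable $150/month on-demand, preserving flexibility.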
Spot Instances
Spot Instances offer 60-90% discounts but can be interrupted with two minutes' notice. They are ideal for:
- CI/CD build agents (use Spot Fleet with multiple instance types)
- Batch processing jobs that can checkpoint and resume
- Development and staging environments
- Load testing
They are not suitable for production API servers, databases, or anything that cannot handle sudden termination.
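To handle that two-minute window, EC2 publishes an interruption notice at the instance metadata path `/latest/meta-data/spot/instance-action`. This sketch only parses the notice payload to decide how long a job has left to checkpoint; actually fetching it (e.g. via IMDSv2 with a session token) is environment-specific and omitted:

```python
from datetime import datetime, timezone

def seconds_until_interruption(notice: dict, now: datetime) -> float:
    """Parse a spot interruption notice and return seconds until reclaim."""
    t = datetime.strptime(notice["time"], "%Y-%m-%dT%H:%M:%SZ")
    return (t.replace(tzinfo=timezone.utc) - now).total_seconds()

# Example payload shape as documented for spot/instance-action
notice = {"action": "terminate", "time": "2030-01-01T00:02:00Z"}
now = datetime(2030, 1, 1, 0, 0, 0, tzinfo=timezone.utc)
print(seconds_until_interruption(notice, now))  # → 120.0
```

A batch worker would poll for this notice and, on seeing it, checkpoint and drain before the deadline.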
# Example: GitHub Actions self-hosted runner on Spot
# This saves roughly 70% compared to GitHub-hosted runners
# for compute-heavy builds
Resources:
  SpotFleet:
    Type: AWS::EC2::SpotFleet
    Properties:
      SpotFleetRequestConfigData:
        IamFleetRole: !GetAtt SpotFleetRole.Arn
        TargetCapacity: 2
        AllocationStrategy: lowestPrice
        LaunchSpecifications:
          - InstanceType: c5.xlarge
            SpotPrice: "0.06"
          - InstanceType: c5a.xlarge
            SpotPrice: "0.06"
          - InstanceType: c6i.xlarge
            SpotPrice: "0.06"
Serverless Cost Models: Not Always Cheaper
There is a persistent myth that serverless is always cheaper for startups. It is often cheaper at low scale, but the economics shift as traffic grows.
Lambda Pricing Reality
Lambda charges per request ($0.20 per million) and per GB-second of compute ($0.0000166667). For a typical API endpoint that runs for 200ms with 256MB memory:
- 1,000 requests/day: ~$0.03/month. Effectively free.
- 100,000 requests/day: ~$3.00/month. Still very cheap.
- 1,000,000 requests/day: ~$30/month. Starting to compare with a small EC2 instance.
- 10,000,000 requests/day: ~$300/month. A t3.medium ($30/month) with proper caching handles this easily.
The crossover point depends heavily on execution duration and memory allocation, but for most API workloads, Lambda stops being cost-efficient somewhere between 1-5 million requests per day. Below that threshold, it is almost always the cheapest option.
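The per-tier figures above fall out of the two published rates directly. This reproduces them so you can plug in your own duration and memory:

```python
# Lambda pricing: $0.20 per million requests, $0.0000166667 per GB-second.
REQ_PRICE = 0.20 / 1_000_000
GB_SECOND_PRICE = 0.0000166667

def lambda_monthly(req_per_day: int, duration_ms: int = 200,
                   memory_mb: int = 256) -> float:
    """Monthly Lambda cost for a steady request rate (30-day month)."""
    monthly_req = req_per_day * 30
    gb_seconds = monthly_req * (duration_ms / 1000) * (memory_mb / 1024)
    return monthly_req * REQ_PRICE + gb_seconds * GB_SECOND_PRICE

for rpd in (1_000, 100_000, 1_000_000, 10_000_000):
    print(f"{rpd:>10,} req/day -> ${lambda_monthly(rpd):,.2f}/month")
```

At 200ms/256MB the crossover against a ~$30/month instance lands right around the 1M requests/day mark, matching the table above.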
API Gateway Costs Add Up
People forget that Lambda functions behind API Gateway incur additional charges. API Gateway costs $3.50 per million requests (REST API) or $1.00 per million (HTTP API). At scale, the API Gateway cost exceeds the Lambda cost. If you are running enough traffic to worry about Lambda costs, you should be using Application Load Balancer ($0.008 per LCU-hour) instead of API Gateway.
Fargate: The Middle Ground
AWS Fargate sits between Lambda and EC2 in both cost and operational complexity. You pay for vCPU and memory by the second, with no instance management. For workloads that need to run continuously but you do not want to manage servers, Fargate with Spot capacity providers offers a good balance.
A Fargate task with 0.25 vCPU and 0.5GB memory costs about $9/month running continuously. That is competitive with a t4g.nano and you get automatic placement, health checks, and scaling without managing the underlying host.
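The $9/month figure follows from the per-second vCPU and memory rates. These are us-east-1 on-demand list prices at time of writing; verify current rates, and note the Fargate Spot discount below is an assumed ~70%:

```python
# Fargate on-demand pricing (us-east-1): per-vCPU-hour and per-GB-hour rates.
VCPU_HOUR, GB_HOUR, HOURS = 0.04048, 0.004445, 730

def fargate_monthly(vcpu: float, memory_gb: float, spot_discount: float = 0.0) -> float:
    """Monthly cost of one task running continuously."""
    base = (vcpu * VCPU_HOUR + memory_gb * GB_HOUR) * HOURS
    return base * (1 - spot_discount)

print(f"0.25 vCPU / 0.5 GB on-demand: ${fargate_monthly(0.25, 0.5):.2f}/month")
print(f"same task at ~70% Spot discount: ${fargate_monthly(0.25, 0.5, 0.70):.2f}/month")
```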
Monitoring and Alerting on Costs
You cannot optimize what you do not measure. Here is the monitoring stack I set up for every project.
AWS Cost Explorer
Enable Cost Explorer and set up a monthly budget with alerts at 50%, 80%, and 100% thresholds. This is the bare minimum. Cost Explorer's "Cost by Service" view immediately shows where money is going.
# Create a budget with email alerts
aws budgets create-budget \
  --account-id 123456789012 \
  --budget '{
    "BudgetName": "Monthly-Total",
    "BudgetLimit": {"Amount": "500", "Unit": "USD"},
    "TimeUnit": "MONTHLY",
    "BudgetType": "COST"
  }' \
  --notifications-with-subscribers '[{
    "Notification": {
      "NotificationType": "ACTUAL",
      "ComparisonOperator": "GREATER_THAN",
      "Threshold": 80,
      "ThresholdType": "PERCENTAGE"
    },
    "Subscribers": [{
      "SubscriptionType": "EMAIL",
      "Address": "team@startup.com"
    }]
  }]'
Infracost for Infrastructure as Code
If you use Terraform, Infracost integrates into your CI pipeline and shows cost estimates for every infrastructure change before it is applied. This is transformative — it shifts cost awareness from a monthly surprise to a pull request conversation.
# .github/workflows/infracost.yml
name: Infracost
on: pull_request
jobs:
  infracost:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Setup Infracost
        uses: infracost/actions/setup@v3
        with:
          api-key: ${{ secrets.INFRACOST_API_KEY }}
      - name: Generate cost estimate
        run: |
          infracost breakdown --path=. \
            --format=json --out-file=/tmp/infracost.json
          infracost comment github \
            --path=/tmp/infracost.json \
            --repo=$GITHUB_REPOSITORY \
            --pull-request=${{ github.event.pull_request.number }} \
            --github-token=${{ secrets.GITHUB_TOKEN }}
Cost Anomaly Detection
AWS Cost Anomaly Detection uses machine learning to identify unusual spending patterns. It is free and sends alerts when spending deviates from historical patterns. Enable it for each service and set up SNS notifications to Slack.
Architecture Patterns That Save Money
Beyond right-sizing individual resources, certain architectural decisions fundamentally change your cost structure.
Caching Aggressively
A CloudFront distribution in front of your API for cacheable responses costs roughly $0.01 per 10,000 HTTPS requests and $0.085/GB of transfer (US pricing, first 10 TB) — but it eliminates those requests from hitting your backend. For read-heavy APIs (which most are), a 90% cache hit rate means your backend handles 10x less traffic.
Even simpler: an in-memory cache like Redis or just a local LRU cache in your application can eliminate redundant database queries. I have seen database costs drop 60% after adding a 15-minute cache on the 5 most frequently accessed endpoints.
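A minimal version of that 15-minute cache looks like this. It is an illustrative sketch, not production-hardened (no size bound, not thread-safe):

```python
import time

class TTLCache:
    """Memoize expensive reads (e.g. database queries) for a fixed TTL."""

    def __init__(self, ttl_seconds: float = 900):  # 900s = the 15-minute TTL above
        self.ttl = ttl_seconds
        self._store = {}

    def get_or_compute(self, key, compute):
        entry = self._store.get(key)
        now = time.monotonic()
        if entry and now - entry[1] < self.ttl:
            return entry[0]            # cache hit: skip the database entirely
        value = compute()              # cache miss: do the expensive read once
        self._store[key] = (value, now)
        return value

cache = TTLCache(ttl_seconds=900)
calls = []
fetch = lambda: calls.append(1) or "row-data"  # stand-in for a DB query

cache.get_or_compute("top-posts", fetch)
cache.get_or_compute("top-posts", fetch)
print(len(calls))  # → 1
```

Wrapping only the handful of hottest endpoints in something like this is often enough to produce the kind of database cost drop described above.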
Use S3 Intelligently
S3 has multiple storage classes, and using the right one matters:
| Storage Class | Cost/GB/month | Use Case |
|---|---|---|
| S3 Standard | $0.023 | Active data, frequent access |
| S3 Infrequent Access | $0.0125 | Backups, logs older than 30 days |
| S3 Glacier Instant | $0.004 | Archives needing immediate access |
| S3 Glacier Deep Archive | $0.00099 | Long-term archives, rare access |
Set up S3 Lifecycle Policies to automatically transition objects between storage classes:
{
  "Rules": [
    {
      "ID": "TransitionToIA",
      "Status": "Enabled",
      "Transitions": [
        {
          "Days": 30,
          "StorageClass": "STANDARD_IA"
        },
        {
          "Days": 90,
          "StorageClass": "GLACIER_INSTANT_RETRIEVAL"
        }
      ]
    }
  ]
}
Managed Services vs Self-Hosted
This trade-off is nuanced. Managed services (RDS, ElastiCache, OpenSearch Service) cost more per unit of compute than self-hosting on EC2, but they include patching, backups, failover, and monitoring. For a startup without a dedicated ops person, the engineering time saved almost always justifies the premium.
The exception: managed services at the highest tiers. A db.r6g.2xlarge RDS instance costs significantly more than the equivalent EC2 instance running PostgreSQL. Once you have an ops-capable team and are spending over $1,000/month on a single managed service, it is worth evaluating self-hosting.
Consolidate Where Possible
Running separate RDS instances for staging, QA, and development environments is expensive. Options:
- Use a single RDS instance with separate databases for non-production environments.
- Use Aurora Serverless v2 for dev/staging — it can scale down to zero capacity when idle.
- Run dev databases on a single EC2 Spot Instance with Docker.
A Real Cost Optimization Playbook
Here is the step-by-step process I follow when taking on a cost optimization project. This typically achieves 30-50% savings within the first month.
Week 1: Audit
- Enable AWS Cost Explorer if not already active.
- Pull a 3-month cost breakdown by service.
- Identify the top 5 services by spend.
- Run AWS Compute Optimizer and Trusted Advisor.
- List all unattached EBS volumes, unused Elastic IPs, and idle load balancers.
Week 2: Quick Wins
- Delete unused resources identified in the audit.
- Right-size the most over-provisioned instances (1 size down, monitor).
- Set up VPC endpoints for S3 and DynamoDB traffic.
- Enable S3 Lifecycle Policies on all buckets.
- Set up budget alerts.
Week 3: Commitments
- Analyze steady-state workloads for Savings Plan eligibility.
- Purchase Compute Savings Plans for baseline compute spend.
- Purchase RDS Reserved Instances for production databases.
- Convert suitable workloads to Spot Instances.
Week 4: Architecture
- Add caching layers where appropriate.
- Evaluate serverless migration for suitable workloads.
- Consolidate non-production environments.
- Set up Infracost in CI pipeline.
- Document the cost baseline for ongoing monitoring.
What Not to Optimize
Cost optimization has diminishing returns, and some "savings" create more problems than they solve.
Do not sacrifice reliability for cost. Running a production database on a single-AZ deployment to save on Multi-AZ charges is a false economy. The first outage will cost more than years of Multi-AZ premiums.
Do not over-optimize for current scale. If your startup is growing 20% month-over-month, spending a week optimizing a $50/month service is not a good use of time. Focus on the big-ticket items.
Do not fight the cloud provider's pricing model. AWS is designed to be expensive if you fight its conventions. Use managed services the way they are intended, take advantage of free tiers, and leverage the commitment discounts they offer. Trying to build around pricing with overly clever architectures usually backfires.
Do not remove monitoring to save money. CloudWatch costs money, but flying blind costs more. The $15/month you spend on detailed monitoring will save you from the $500 surprise that detailed monitoring would have caught.
The Compound Effect
The biggest argument for early cost optimization is not the immediate savings — it is the habits and infrastructure you put in place. A startup that sets up budget alerts, uses Infracost in CI, and reviews costs monthly will maintain efficient infrastructure as they scale. A startup that ignores costs until series B will find themselves with $30,000/month AWS bills that take six months of dedicated effort to untangle.
Start with the audit. The numbers will tell you where to focus.
Danil Ulmashev
Full Stack Developer