CASE STUDY
Cloud Cost Optimization
How I reduced cloud spend by 35% (~$75K/year) through visibility, governance, automation, and continuous monitoring.
FinOps · Cloud Governance · AWS · Kubernetes · Monitoring
The problem
In this case study from my previous role, I share the full cloud cost optimization journey, from identifying spend drivers to implementing long-term controls.
As cloud adoption grew, so did the bills. Without active management, spend was outgrowing our ability to plan and budget predictably.
The objective was not cost-cutting in isolation. The objective was to optimize cost while preserving performance, delivery speed, and scalability.

Why spend was increasing
As our cloud environments expanded, costs rose quickly with them. Teams were provisioning resources fast to support delivery, but centralized spend control was not keeping pace.
In my department, roughly 50–60 professionals across engineering, QA, product, and business functions depended on cloud environments daily. Without unified governance, inefficiencies compounded.
- Individual and distributed testing environments were provisioned freely, creating resource sprawl.
- Environments were often left running overnight and on weekends.
- Higher-spec machines were requested even when lower classes were sufficient.
- Unused environments were not consistently decommissioned.
- In Kubernetes, we saw over-provisioned pods and nodes, orphaned persistent volumes, and misconfigured autoscaling.
These issues made cost forecasting difficult and placed increasing pressure on budget efficiency and operational governance.
Finding the cost drivers
To address escalating costs, I applied a structured approach combining data analysis, team collaboration, automation, and continuous governance.
Monthly spend averaged around $20,000 and was projected to rise toward $24,000 without intervention. The initial target was to reduce spend by at least 30%.
Data Collection and Analysis
We used AWS Cost Explorer, custom tagging policies, and Prometheus/Grafana dashboards to map high-spend resource groups and recurring inefficiencies.
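Tagging only helps Cost Explorer if every resource actually carries the required tags. Below is a minimal sketch of the kind of tag-compliance check our governance automation ran; the required tag keys are illustrative, not the exact policy from the program.

```python
# Minimal tag-compliance check of the kind used to keep Cost Explorer
# groupings reliable. The required keys below are illustrative, not the
# exact policy from the original program.
REQUIRED_TAGS = {"team", "env", "owner"}

def missing_tags(resource_tags: dict) -> set:
    """Return the required tag keys absent from a resource's tag set."""
    return REQUIRED_TAGS - set(resource_tags)

def is_compliant(resource_tags: dict) -> bool:
    return not missing_tags(resource_tags)
```

Non-compliant resources surfaced in a report rather than being blocked outright, which kept delivery moving while the tagging culture caught up.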
Key inefficiency categories included:
- Idle databases
- Oversized or unused snapshots
- Outdated images and stale artifacts
- Off-hour runtime waste
- High network transfer patterns
- Over-provisioned instances and excessive node counts
- Orphaned persistent volumes
- Underused Spot opportunities

What we changed
With hotspots identified, implementation focused on targeted automation, policy enforcement, and recurring operational review loops. The actions below were executed as one program, not isolated fixes.
1. Idle Databases
Lambda workflows monitored database connection and CPU signals via CloudWatch. Development databases idle for sustained periods (for example, over 4 hours) were stopped automatically.
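The core of that workflow is the idleness decision. The sketch below shows it as a pure function; the 4-hour window matches the example above, while the CPU and connection thresholds are illustrative. In the real Lambda, the samples came from CloudWatch (e.g. via boto3 `get_metric_data`) and a stop call followed a `True` result.

```python
# Sketch of the idleness test run against CloudWatch metrics before
# stopping a development database. The 4-hour window matches the text;
# the CPU and connection floors are illustrative, tuned per team.
IDLE_WINDOW_HOURS = 4
CPU_IDLE_PCT = 2.0        # below this, CPU is considered idle
MAX_IDLE_CONNECTIONS = 0  # any active connection keeps the DB running

def should_stop(cpu_samples, connection_samples):
    """cpu_samples / connection_samples: hourly datapoints, newest last.
    Stop only when every sample in the idle window is below threshold."""
    recent_cpu = cpu_samples[-IDLE_WINDOW_HOURS:]
    recent_conn = connection_samples[-IDLE_WINDOW_HOURS:]
    if len(recent_cpu) < IDLE_WINDOW_HOURS:
        return False  # not enough data yet; err on the side of leaving it up
    return (all(c < CPU_IDLE_PCT for c in recent_cpu)
            and all(n <= MAX_IDLE_CONNECTIONS for n in recent_conn))
```

Requiring the *entire* window to be quiet, rather than an average, avoided stopping a database that was briefly idle between test runs.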
2. Oversized and Unused Snapshots
Snapshot lifecycle tagging (`usage=no_use_30`, `usage=no_use_60`) drove review and removal. Items inactive long enough (for example, around 75 days) were deleted after report-based verification.
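The bucketing behind those tags can be sketched as follows. The tag names come from the program itself; the 75-day deletion threshold is the "for example" figure above, not a hard rule.

```python
# Sketch of the snapshot-age bucketing behind the usage tags. The tag
# names come from the program; the 75-day threshold is the example
# figure from the text, and deletion still required report-based review.
DELETE_AFTER_DAYS = 75

def usage_tag(idle_days: int):
    """Map days since last use to the lifecycle tag applied to the snapshot."""
    if idle_days >= 60:
        return "usage=no_use_60"
    if idle_days >= 30:
        return "usage=no_use_30"
    return None  # still considered active, no lifecycle tag

def eligible_for_deletion(idle_days: int) -> bool:
    """Flags a snapshot for the deletion report; nothing is auto-deleted."""
    return idle_days >= DELETE_AFTER_DAYS
```

Separating tagging from deletion meant owners saw a snapshot move through `no_use_30` and `no_use_60` before it ever appeared on a deletion report.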
3. Outdated Images
ECR lifecycle policies and image scanning removed deprecated container images. AMI hygiene was tightened with regular cleanup, patching, and benchmark alignment.
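An ECR lifecycle policy of the kind used here can be sketched as below. The rule shape follows AWS's documented lifecycle-policy JSON; the 14-day expiry for untagged images is an illustrative value, not the program's exact setting.

```python
import json

# Sketch of an ECR lifecycle policy that expires old untagged images.
# The rule shape matches AWS's lifecycle-policy JSON; the 14-day expiry
# is an illustrative value, not the program's exact setting.
def untagged_expiry_policy(days: int = 14) -> str:
    policy = {
        "rules": [
            {
                "rulePriority": 1,
                "description": f"Expire untagged images older than {days} days",
                "selection": {
                    "tagStatus": "untagged",
                    "countType": "sinceImagePushed",
                    "countUnit": "days",
                    "countNumber": days,
                },
                "action": {"type": "expire"},
            }
        ]
    }
    # Would be applied per repository with boto3:
    # ecr.put_lifecycle_policy(repositoryName=..., lifecyclePolicyText=...)
    return json.dumps(policy)
```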
4. Resource Uptime and Off-Hour Usage
CloudWatch schedules with Lambda reduced non-production runtime at night and on weekends. Controlled override paths allowed critical environments to stay active when needed.
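The decision made on each scheduled invocation can be sketched as a single predicate. The business-hours window and the override tag name below are illustrative assumptions; the real schedule was driven by CloudWatch Events on a fixed cron.

```python
from datetime import datetime

# Sketch of the keep-running decision made by the scheduled Lambda.
# Business hours and the "schedule" override tag are illustrative
# assumptions; the real schedule ran on a CloudWatch Events cron.
BUSINESS_START, BUSINESS_END = 8, 20  # 08:00-20:00 local time

def should_be_running(now: datetime, tags: dict) -> bool:
    """Non-production instances run only in business hours on weekdays,
    unless tagged for an explicit override (e.g. a critical test run)."""
    if tags.get("schedule") == "always-on":  # controlled override path
        return True
    if now.weekday() >= 5:                   # Saturday/Sunday
        return False
    return BUSINESS_START <= now.hour < BUSINESS_END
```

Keeping the override a visible tag, rather than an exclusion list buried in the Lambda, made exceptions auditable in the same cost reports as everything else.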
5. High Network Costs
We reduced unnecessary transfer costs by improving regional placement and limiting cross-region replication to critical traffic.
6. Data Retention and Release Hygiene
S3 lifecycle rules automated transitions and deletion by age and access pattern. Unwanted logs and stale release artifacts were removed, with colder data moved to lower-cost storage tiers.
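A representative lifecycle rule looks like the sketch below. The prefixes, day counts, and storage-class steps are illustrative; the real rules varied per bucket and access pattern.

```python
# Sketch of an S3 lifecycle rule for log retention: transition aging
# data to cheaper tiers, then expire it. Prefix, day counts, and tiers
# are illustrative; the real rules varied per bucket.
def log_retention_rule(prefix: str = "logs/") -> dict:
    return {
        "ID": f"retention-{prefix.rstrip('/')}",
        "Filter": {"Prefix": prefix},
        "Status": "Enabled",
        "Transitions": [
            {"Days": 30, "StorageClass": "STANDARD_IA"},
            {"Days": 90, "StorageClass": "GLACIER"},
        ],
        "Expiration": {"Days": 365},
    }
    # Applied with boto3:
    # s3.put_bucket_lifecycle_configuration(
    #     Bucket=..., LifecycleConfiguration={"Rules": [log_retention_rule()]})
```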
7. Over-Provisioned Instances
Resource request patterns were normalized. Cases such as requesting 6 vCPUs where 4 were sufficient were corrected through provisioning guidance and review.
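The review guidance can be sketched as a simple right-sizing rule: size to observed peak plus headroom, rounded to the even vCPU counts instance families come in. The 30% headroom figure is an illustrative assumption.

```python
import math

# Sketch of the right-sizing rule used in provisioning review: size
# vCPUs to observed peak utilization plus headroom, rounded up to an
# even count. The 30% headroom figure is illustrative.
HEADROOM = 1.3

def recommended_vcpus(current_vcpus: int, peak_utilization: float) -> int:
    """peak_utilization: observed peak as a fraction of current capacity."""
    needed = current_vcpus * peak_utilization * HEADROOM
    rec = max(2, 2 * math.ceil(needed / 2))  # even counts match instance sizes
    return min(rec, current_vcpus)           # this rule only scales down
```

For the case in the text, a 6-vCPU instance peaking around 50% utilization comes out at 4 vCPUs, with headroom to spare.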
8. Excessive Kubernetes Node Counts
Cluster Autoscaler thresholds and scale-down timing were tuned. Pod requests and limits were standardized to improve packing efficiency and reduce over-allocation.
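Standardization was enforced as a spec check of roughly the shape below. The limit-to-request ratio bound and the field names are illustrative assumptions about the policy, not its exact form.

```python
# Sketch of the pod-spec check used when standardizing requests and
# limits for better bin-packing. The ratio bound and field names are
# illustrative assumptions about the policy.
MAX_LIMIT_TO_REQUEST_RATIO = 2.0  # keeps limits close to requests

def violations(container: dict) -> list:
    """container: e.g. {'cpu_request': 0.5, 'cpu_limit': 1.0} in cores."""
    problems = []
    req, lim = container.get("cpu_request"), container.get("cpu_limit")
    if req is None:
        problems.append("missing cpu request")  # scheduler can't pack it
    if lim is None:
        problems.append("missing cpu limit")
    if req and lim and lim / req > MAX_LIMIT_TO_REQUEST_RATIO:
        problems.append("limit more than 2x request")
    return problems
```

Missing requests are the expensive case: the scheduler packs on requests, so unset values defeat any autoscaler tuning done above them.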
9. Orphaned Persistent Volumes
Scheduled checks flagged unattached EBS volumes for review and safe cleanup, reducing persistent storage waste.
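The flagging logic itself is small. In the sketch below, the 7-day grace period is an illustrative assumption; flagged volumes went onto a review list, not straight to deletion.

```python
from datetime import datetime, timedelta, timezone

# Sketch of the scheduled check that flagged unattached EBS volumes.
# The 7-day grace period is illustrative; flagged volumes went to a
# review list, not straight to deletion.
GRACE = timedelta(days=7)

def orphaned(volumes, now=None):
    """volumes: [{'id': ..., 'state': 'available' | 'in-use',
                  'detached_at': datetime | None}, ...]"""
    now = now or datetime.now(timezone.utc)
    return [v["id"] for v in volumes
            if v["state"] == "available"
            and v.get("detached_at")
            and now - v["detached_at"] > GRACE]
```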
10. Spot Instances
Spot was adopted for short-lived build workloads (often 15 minutes to 1 hour), with on-demand fallback and diversified fleets to maintain reliability.
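The placement rule reduces to a short predicate. The one-hour cutoff mirrors the job lengths above; the fallback mechanics and fleet diversification are elided as assumptions.

```python
# Sketch of the workload-placement rule for builds: short,
# interruption-tolerant jobs go to Spot with an on-demand fallback.
# The one-hour cutoff mirrors the job lengths in the text; fleet
# diversification is elided.
SPOT_MAX_MINUTES = 60

def placement(job_minutes: int, interruption_tolerant: bool) -> str:
    if interruption_tolerant and job_minutes <= SPOT_MAX_MINUTES:
        return "spot-with-ondemand-fallback"
    return "on-demand"
```

Gating on interruption tolerance, not just duration, kept release-critical jobs off Spot even when they were short.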
Making savings sustainable
Sustained cloud savings require continuous monitoring and governance. Without operational follow-through, initial gains erode and waste patterns return.
Real-time dashboards and alerts helped detect anomalies early. Automated policy enforcement kept tagging, resource limits, and access controls consistent across teams and environments.
Key sustainability practices:
- Continuous monitoring with real-time dashboards and anomaly alerting
- Automated governance for tagging and resource usage rules
- Regular cost reviews with cross-functional stakeholders
- Training and culture-building around cost awareness
- Expanded lifecycle automation for recurring optimization tasks
- Agile adaptation of cost strategy as product and usage patterns evolve

Outcome and lessons learned
Cloud cost reduction is not a one-time activity. It requires an engineering culture that treats cost as an operational quality dimension alongside performance, reliability, and security.
Through automation, monitoring, and governance, we achieved substantial spend reduction while protecting delivery and platform stability.
The program delivered a 35% reduction in cloud spend, equivalent to roughly $75K/year. During active optimization cycles, monthly spend fell from about $20K toward $12K.
Key takeaways:
- Cloud cost management is a shared responsibility across engineering and leadership.
- Visibility and transparency align technical decisions with business objectives.
- Governance must be ongoing to prevent regression.
- Automation makes optimization repeatable and sustainable at scale.