A comprehensive infrastructure audit and modernization roadmap for a mid-market ERP provider spending $1.4M annually on Azure infrastructure across 42 VMs in 5 regions. We identified $959K–$1.36M in annual savings through six strategic initiatives targeting SQL Server infrastructure ($845K/year), IIS compute ($153K/year), backup overhead ($277K/year), and 123 TiB of storage at 59% waste — and designed a phased migration path from 150 manual RDP-based deployments to full CI/CD automation with AKS.
The client is a mid-market enterprise resource planning (ERP) software company headquartered in North America, serving over 40 enterprise customers across manufacturing, distribution, and professional services verticals. Their platform — an enterprise web application running on IIS backed by SQL Server — had been in production for over a decade, growing organically from a handful of servers to a fleet of 42 Azure VMs spanning 5 regions with per-customer isolation. Annual Azure billing had reached $1.43M ($119K/month) and was climbing steadily, while the engineering team had no visibility into what was driving costs or where savings opportunities existed. Infrastructure costs represented 30% of total customer revenue — a ratio that was already unsustainable at 40 customers and would become catastrophic at the client's growth targets of 100 customers within a year and 1,000 within five years.
The infrastructure broke down into two primary workloads. The SQL Server tier consisted of 23 VMs running SQL Server Standard on Azure IaaS, provisioned with 176 vCPUs and 1,408 GB RAM. Each customer had dedicated database instances ranging from 50 GB to over 500 GB, with tenant-specific customizations — stored procedures, reporting views, and integration endpoints accumulated over decade-long customer relationships. The application tier comprised 19 IIS VMs with 124 vCPUs and 976 GB RAM, hosting the enterprise web application. Both tiers ran on Premium SSD storage totaling 123 TiB, of which only 50 TiB was actually utilized — 59% of provisioned disk capacity was empty space the client was paying for every month.
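The storage waste figure is simple arithmetic on the provisioned-versus-utilized numbers above; a minimal sketch (using the audit's rounded TiB figures):

```python
# Back-of-envelope check of the storage waste cited above:
# 123 TiB of Premium SSD provisioned, only 50 TiB utilized.
provisioned_tib = 123.0
utilized_tib = 50.0

wasted_tib = provisioned_tib - utilized_tib
waste_fraction = 1 - utilized_tib / provisioned_tib

print(f"wasted: {wasted_tib:.1f} TiB ({waste_fraction:.0%} of provisioned)")
# wasted: 73.0 TiB (59% of provisioned)
```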
Deployments were entirely manual. Each release required engineers to connect via RDP to individual servers, stop IIS services, copy updated application files, execute database migration scripts by hand, restart application pools, and manually verify functionality — a process repeated across 150 deployment targets per release cycle (50 customers multiplied by 3 environments each). There was no version control on customer-specific customizations, no rollback procedure beyond restoring from nightly backups, and no CI/CD pipeline of any kind. The company was actively growing toward 100 and then 1,000+ customers, and the current operational model simply could not scale.
Leadership recognized that their infrastructure was both a cost liability and an operational risk, but the internal team lacked the specialized cloud architecture expertise to diagnose the full scope of the problem. They had already committed to a 3-year Azure Savings Plan worth $91K/year, but this represented about 6% of their total spend — a rounding error against the systemic inefficiencies they suspected but could not quantify. They engaged CloudForge to perform a structured multi-domain assessment and deliver an actionable modernization roadmap that would address cost, automation, and scalability in a single cohesive plan.
The audit navigated significant internal dynamics. Each department — development, operations, finance, and customer success — had a different perspective on what the infrastructure problems were and where investment should be prioritized. The development team wanted CI/CD automation above all else. The operations team wanted to reduce the manual burden of managing dozens of VMs across five regions. The finance team wanted immediate cost reductions they could show to the board. Customer success wanted assurance that any changes would not affect the uptime guarantees in their enterprise SLAs. Several senior engineers were protective of architectural decisions they had made years earlier, interpreting the audit as an implicit critique of their work. We addressed this by framing every finding as a consequence of organic growth rather than poor decision-making — which was accurate, because most of the inefficiencies had accumulated gradually and were invisible until someone mapped the entire landscape end to end.
The infrastructure challenges facing this client were not isolated issues — they were deeply interconnected cost centers that had compounded over years of organic growth without architectural governance. Every optimization we identified in one domain revealed dependencies and constraints in two others, which is why a piecemeal approach would have failed. The total analyzed infrastructure cost was $1,048,518/year across SQL, IIS, and backup workloads, with an additional $382K in networking, security, and management services that were out of scope for immediate optimization.
SQL Server infrastructure represented the single largest cost center at $844,914 per year — far more than most stakeholders expected. The cost structure was dominated not by compute but by supporting infrastructure: Premium SSD disk storage accounted for $298,566/year, backup storage and operations added $264,000/year, SQL Server Standard licensing consumed $199,700/year (176 vCPUs at $1,135/vCPU/year), the actual VM compute cost was only $61,400/year, and Windows Server licensing added another $75,200/year. This cost decomposition was revelatory for the client — they had been focused on VM SKU prices while the real cost drivers were disks, backups, and licensing. The SQL VMs showed a distinctive utilization pattern: 15–30% average CPU with periodic spikes to 70–100% during maintenance windows. These spikes were caused by overlapping backup operations and maintenance jobs — reindexing, DBCC CHECKDB, statistics updates — all scheduled during the same windows because nobody had ever analyzed the interaction effects.
IIS compute added $153,404 per year, broken down into $110,000 in VM compute, $26,400 in Windows licensing, and $13,000 in backup costs. The 19 IIS VMs were dramatically overprovisioned: CPU utilization averaged 10–15%, memory utilization 30–40%, and disk utilization 30–50%. All 19 VMs ran 24/7 despite carrying no meaningful load outside business hours. Five non-production VMs that were only needed during business hours operated around the clock at an effective utilization rate of 27.4% — paying for 72.6% idle time.
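The 27.4% effective utilization figure back-calculates to roughly 46 of the 168 hours in a week; the exact business-hours schedule below is our illustrative assumption, not the client's actual calendar:

```python
# Hedged back-calculation of the 27.4% effective utilization for the
# five always-on non-production VMs. The 46-hour weekly window is an
# assumption for illustration (roughly 9 hours/weekday plus a margin).
hours_per_week = 168
needed_hours = 46

utilization = needed_hours / hours_per_week   # ~0.274
idle_fraction = 1 - utilization               # ~0.726

# With scheduled start/stop, compute billing scales with powered-on hours:
scheduled_cost = 1.0 * utilization  # normalized against the 24/7 cost
print(f"utilization {utilization:.1%}, idle {idle_fraction:.1%}, "
      f"scheduled compute cost {scheduled_cost:.1%} of 24/7 cost")
```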
Storage waste was pervasive across the entire fleet. The 123 TiB of provisioned Premium SSD storage showed only 41% utilization. That meant 72.58 TiB of premium-tier storage — the most expensive disk tier Azure offers — was provisioned, allocated, and billed every month but contained no data. Beyond empty provisioned space, a critical architectural flaw inflated storage costs further: the application stored documents directly in SQL Server databases. Individual customer databases could reach 500 GB, with documents that were read approximately 10 times over their entire lifetime consuming premium database storage instead of commodity blob storage. Migrating these documents to Azure Blob Storage represented an 86% cost reduction per gigabyte stored, but the migration touched application code, not just infrastructure.
Backup infrastructure operated as a hidden cost multiplier. The $277,000 annual backup spend represented 26.4% of all analyzed infrastructure costs. Local SQL backups were consuming 30–44% of each VM's provisioned disk capacity, meaning the client was paying Premium SSD prices to store backup files that should have been on lower-tier storage. The backup data footprint was 2–3x the actual database size due to full backup retention and transaction log accumulation. Critically, backup operations running during maintenance windows caused the CPU spikes that made the SQL VMs appear to need their current sizing. This created a dependency loop: you could not confidently rightsize the SQL instances until you migrated the backups off-VM, and the backup migration was itself a multi-week effort requiring careful validation.
On the automation front, the situation was equally stark. There was no CI/CD pipeline — zero. The deployment process consisted of 20+ manual steps executed via RDP, including copying files to production servers, manually editing configuration files, running SQL scripts in Management Studio, and restarting IIS application pools by hand. A single release across all customer tenants required approximately 150 manual deployments. The failure rate was high, with most failures caused by configuration drift between environments, missed migration steps, or IIS restart timing issues. For a company planning to scale from 40 to 1,000 customers, this operational model was an existential bottleneck.
The overprovisioning pattern was nearly universal. Of the 42 VMs in the fleet, 35 — 83% — were CPU-overprovisioned based on sustained utilization data. Average CPU utilization across IIS VMs was 10–15%, and across SQL VMs 15–30% outside of maintenance windows. The SQL VMs showed high memory utilization (89–97%), but this was almost entirely SQL Server buffer pool behavior — SQL Server is designed to consume all available memory for caching — and did not indicate actual memory pressure requiring larger VMs. The distinction between buffer pool memory consumption and genuine memory contention was critical for accurate rightsizing recommendations, and it was a nuance the client's team had not previously analyzed.
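The distinction can be operationalized with two separate signals: sustained low CPU even at high percentiles, and memory counters that look past the buffer pool. The sketch below is illustrative (the threshold values and the classic 300-second Page Life Expectancy rule of thumb are assumptions, not the engagement's actual tooling):

```python
# Illustrative rightsizing heuristics, not the engagement's actual tooling.
from statistics import quantiles

def is_cpu_overprovisioned(cpu_samples, p95_threshold=40.0):
    """cpu_samples: per-interval CPU % readings (e.g. 5-minute granularity).
    Flag the VM if even the 95th percentile stays well below capacity."""
    p95 = quantiles(cpu_samples, n=20)[-1]  # last of 19 cut points ~ p95
    return p95 < p95_threshold

def memory_pressure_suspected(mem_pct, page_life_expectancy_s):
    """High memory % alone is normal SQL Server buffer-pool behavior;
    a low Page Life Expectancy is the signal of genuine pressure.
    The 300s threshold is a classic heuristic, used here for illustration."""
    return mem_pct > 85 and page_life_expectancy_s < 300

# An IIS VM idling at 10-15% CPU with an occasional small spike:
iis_cpu = [12, 14, 11, 10, 15, 13, 12, 30, 11, 14,
           12, 13, 10, 11, 16, 12, 14, 11, 13, 12]
print(is_cpu_overprovisioned(iis_cpu))                                    # True
print(memory_pressure_suspected(mem_pct=93, page_life_expectancy_s=4000)) # False
```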
We approached this engagement as a structured infrastructure audit with a clear deliverable: a prioritized roadmap where every recommendation was backed by quantified cost impact, implementation effort in FTE-months, expected ROI, and risk classification. We were not interested in producing a generic "move to Kubernetes" pitch — we needed to understand the specific economics of this client's 42-VM, 5-region, per-customer-isolation architecture before prescribing anything.
The engagement kicked off with a structured discovery phase spanning the first two weeks. We conducted interviews with the VP of Engineering, senior developers, the operations lead, and the finance team responsible for Azure billing. These interviews were essential for understanding not just the technical landscape but the organizational constraints — who owned what, where institutional knowledge lived, and which changes would face internal resistance. We also needed to understand the growth model: at $35,773 per customer per year in infrastructure costs, the path from 40 to 100 customers would push annual Azure spend to $3.58M, and the path to 1,000 customers would reach $35.8M — numbers that made the board's cost concerns existential rather than incremental.
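The growth-model arithmetic above is a straight linear extrapolation of the per-customer cost:

```python
# Linear extrapolation of per-customer infrastructure cost to the
# client's growth targets (figures from the audit).
cost_per_customer = 35_773  # annual Azure cost per customer

for customers in (40, 100, 1_000):
    print(f"{customers:>5} customers -> ${cost_per_customer * customers:,}/year")
# 40 -> $1,430,920; 100 -> $3,577,300 (~$3.58M); 1000 -> $35,773,000 (~$35.8M)
```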
In parallel, we ran a comprehensive technical assessment across six domains: SQL Server infrastructure, IIS compute infrastructure, backup systems, CI/CD maturity, networking and security, and FinOps practices. For each domain, we collected utilization metrics spanning multiple weeks — CPU, memory, disk I/O, and network throughput at 5-minute granularity — to build accurate workload profiles that captured both steady-state behavior and peak patterns. We correlated Azure Cost Management billing data with Azure Monitor utilization metrics to calculate the effective cost-per-utilized-unit across every VM, disk, and backup vault.
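The cost-per-utilized-unit correlation reduces to a join between billing and monitoring data. A minimal sketch, with hypothetical data shapes (the actual Cost Management and Azure Monitor export formats differ):

```python
# Minimal sketch of the billing-vs-utilization join. Names and data
# shapes are illustrative, not the actual export formats.
billing = {"sql-vm-01": {"monthly_cost": 2_400, "vcpus": 8}}
metrics = {"sql-vm-01": {"avg_cpu_pct": 22.0}}

for vm, bill in billing.items():
    utilized_vcpus = bill["vcpus"] * metrics[vm]["avg_cpu_pct"] / 100
    cost_per_utilized_vcpu = bill["monthly_cost"] / utilized_vcpus
    print(f"{vm}: ${cost_per_utilized_vcpu:,.0f}/utilized vCPU/month")
# 8 vCPUs at 22% average utilization ~ 1.76 utilized vCPUs,
# so $2,400/month ~ $1,364 per utilized vCPU
```

Ranking the fleet by this metric, rather than by raw monthly cost, is what surfaces the worst offenders first.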
The SQL analysis required particular care. SQL Server's buffer pool behavior means memory utilization is always 89–97% regardless of actual workload — the engine is designed to cache aggressively. We had to look past the headline memory number and analyze wait statistics, I/O patterns, and query execution metrics to determine whether the VMs were genuinely memory-constrained or simply exhibiting normal SQL Server behavior on appropriately-sized instances. Similarly, the 70–100% CPU spikes during maintenance windows were confounded by simultaneous backup operations — both backups and maintenance jobs (reindexing, DBCC CHECKDB, statistics updates) were scheduled in the same window, making it impossible to attribute the CPU load without isolating each contributor.
The output of this phase was not a slide deck — it was a structured dataset of findings across all six domains, ranked by annual cost impact, with explicit implementation sequences that respected cross-domain dependencies. The critical insight was phasing: backup migration had to precede SQL instance rightsizing because the backup-induced CPU spikes masked the true compute requirements. We prescribed a 30–60 day monitoring window after backup migration specifically to collect clean utilization data before making any rightsizing decisions. This dependency chain — our Phase A1.5 validation step — was the difference between confident recommendations and guesswork.
The final deliverable was a suite of 11 detailed reports, each addressing a specific infrastructure domain with findings, recommendations, implementation plans, and projected ROI. These were not executive summaries — each report was a working document that the client's engineering team could execute against directly. The reports covered: billing analysis, SQL infrastructure analysis, IIS infrastructure analysis, backup infrastructure analysis, utilization report, CI/CD architecture (v24 and v25 iterations with Infrastructure-as-Code), migration strategy, VMs-to-AKS analysis, Front Door analysis, containerization proof-of-concept, and an executive summary with roadmap and ROI projections.
The recommendations were organized into six strategic optimization initiatives, sequenced by risk, dependency, and time-to-value. Initiative 1 — Low-Hanging Fruit — addressed backup migration to Azure-native services, disk rightsizing from Premium SSD to Standard SSD where utilization warranted it, and scheduled start/stop for the five non-production VMs running 24/7 at 27.4% utilization. This initiative could be executed in 0–2 months with $40,500–$67,500 in implementation investment, delivering $182,921–$384,764 in annual savings at 170–470% first-year ROI with low risk. The backup migration alone removed the local backup files consuming 30–44% of each VM's Premium SSD capacity, eliminating the 2–3x data size overhead and enabling the disk capacity rightsizing that followed.
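The quoted 170–470% first-year ROI is reproducible if both ends of the savings range are measured against the high end of the implementation investment; that methodological detail is our assumption:

```python
# Reproducing Initiative 1's quoted first-year ROI range, assuming both
# savings figures are measured against the high-end investment estimate.
def first_year_roi(annual_savings, investment):
    return (annual_savings - investment) / investment

inv_high = 67_500
for savings in (182_921, 384_764):
    print(f"${savings:,} savings -> {first_year_roi(savings, inv_high):.0%} ROI")
# ~171% and 470%, matching the quoted 170-470% range
```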
Initiative 2 — Instance Rightsizing — targeted the 83% of VMs that were CPU-overprovisioned. With backup operations migrated off the SQL VMs, the 30–60 day validation window would reveal the true compute requirements stripped of backup-induced CPU spikes. The projected savings were $137,567/year, with an implementation cost of $54,000–$81,000 and medium risk. The critical nuance was timing: rightsizing before backup migration would have produced incorrect SKU recommendations because the CPU profiles still included backup overhead. This phasing dependency was the single most important architectural insight of the entire engagement.
Initiative 3 — Blob Storage Migration — addressed the fundamental architectural flaw of storing documents in SQL Server databases. With customer databases reaching 500 GB and documents read approximately 10 times over their lifetime, the cost of keeping this data in SQL Server (with associated compute, licensing, backup, and Premium SSD costs) was orders of magnitude higher than Azure Blob Storage. The projected savings were $50,000–$100,000/year, but this initiative required application code changes and data migration tooling, placing it in the medium-risk category with an implementation cost of $67,500–$135,000.
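The 86% per-GB figure is consistent with typical list-price gaps between Premium SSD and cool-tier blob storage; the prices below are illustrative assumptions, not the engagement's actual negotiated rates:

```python
# Illustrative per-GB economics behind the ~86% figure. Both prices are
# assumed for example only, not actual negotiated Azure rates.
premium_ssd_per_gb_month = 0.132   # assumed effective Premium SSD $/GB/month
blob_cool_per_gb_month = 0.0184    # assumed cool-tier blob $/GB/month

reduction = 1 - blob_cool_per_gb_month / premium_ssd_per_gb_month
print(f"per-GB storage cost reduction: {reduction:.0%}")  # ~86%
```

For documents read roughly 10 times over their lifetime, retrieval and transaction charges on the cool tier are negligible against the storage saving.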
Initiative 4 — AKS Migration — designed the containerization path for the IIS application tier. Moving from 19 dedicated IIS VMs to Azure Kubernetes Service would enable horizontal scaling, eliminate per-VM Windows licensing costs ($26,400/year), and unlock density improvements that reduce unit cost per customer. The projected savings were $98,318–$109,221/year with an implementation investment of $135,000–$270,000. We delivered a containerization proof-of-concept as part of the engagement to demonstrate feasibility and identify the specific application modifications required.
Initiative 5 — SQL Server Migration — represented the longest-horizon, highest-impact opportunity. Migrating from SQL Server Standard on IaaS VMs to Azure SQL Managed Instance or PostgreSQL would eliminate the $199,700/year in SQL licensing costs and the operational overhead of managing SQL Server patches, backups, and security at the VM level. The projected savings exceeded $275,000/year, but the 6–18+ month timeline and $270,000–$540,000 implementation cost reflected the complexity of migrating a decade of tenant-specific stored procedures, views, and integration endpoints. This was classified as high risk and positioned as a strategic initiative rather than a quick win.

Initiative 6 — CI/CD and Infrastructure-as-Code — addressed the 150-deployment manual process that was the operational ceiling on growth. We designed a dual-track pipeline: one track for the legacy IIS-based deployment model (automated via GitHub Actions with parameterized deployment scripts replacing RDP-based manual work) and a parallel track for the eventual migration to AKS with Helm charts and ArgoCD for GitOps-based continuous delivery. The design included a compatibility matrix mapping which customer tenants could be migrated to AKS immediately versus which required legacy IIS support, and a phased evolution plan from the current 40 to 1,000+ customers. The critical path was 12–14 months minimum due to validation periods required for each customer migration cohort.
The CI/CD architecture was structured as a progressive modernization path rather than a single destination. Phase 1 automated the existing IIS-based deployment model: GitHub Actions workflows would replace the manual RDP-and-copy process, with parameterized deployment scripts targeting each customer environment based on configuration files stored in version control. This phase could be implemented immediately with no changes to the application architecture. Phase 2 introduced containerization, allowing the application to run on AKS alongside the legacy IIS deployments during the transition period. Phase 3 completed the migration to AKS with Helm charts per customer tenant and ArgoCD managing the GitOps desired-state reconciliation. Each phase was self-contained — the client could stop after Phase 1 and still have automated deployments, or proceed through all three phases based on their growth trajectory and timeline. The phased model acknowledged a reality we see in almost every enterprise engagement: modernization is a journey with multiple valid stopping points, not a binary switch from legacy to modern.
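The Phase 1 idea can be sketched as a parameterized deployment driven by per-customer configuration held in version control. Everything below (config schema, step names) is a hypothetical illustration of the pattern, not the delivered pipeline:

```python
# Minimal sketch of Phase 1: config-driven deployment replacing manual
# RDP sessions. All names here are hypothetical illustrations.
import json

CONFIG = json.loads("""
{
  "customer": "acme",
  "environment": "staging",
  "app_server": "acme-stg-web-01",
  "db_server": "acme-stg-sql-01"
}
""")

def deploy(cfg, dry_run=True):
    """Run (or print, in dry-run mode) the deployment steps for one target."""
    steps = [
        f"stop IIS app pool on {cfg['app_server']}",
        f"copy release artifacts to {cfg['app_server']}",
        f"run DB migration scripts on {cfg['db_server']}",
        f"start IIS app pool on {cfg['app_server']}",
        f"run smoke tests against {cfg['customer']}/{cfg['environment']}",
    ]
    for step in steps:
        if dry_run:
            print(f"[dry-run] {step}")
        else:
            raise NotImplementedError("real execution would call remoting APIs")
    return steps

executed = deploy(CONFIG)  # the same function fans out across all 150 targets
```

The point of the pattern is that adding a customer becomes a config commit rather than a new set of manual runbooks.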
Across all six initiatives, the combined savings for Initiatives 1 through 4 (the 0–12 month horizon) totaled $468,806–$731,552 annually, with total implementation investment of $297,000–$553,500 and first-year ROI of 58–146%. The full roadmap including Initiatives 5 and 6 projected $959K–$1.36M in annual savings — a 67–95% reduction in the analyzed infrastructure spend — with a total implementation investment of $500,000–$750,000 over 12–18 months.
Stakeholder interviews, Azure environment access, billing data export, Azure Monitor metrics collection across all 42 VMs spanning 5 regions. Collected CPU, memory, disk I/O, and network metrics at 5-minute granularity.
SQL infrastructure cost decomposition, IIS utilization profiling, backup overhead analysis, disk utilization mapping (123 TiB provisioned, 41% utilized), VM rightsizing calculations, CI/CD maturity assessment.
Six-initiative optimization roadmap, dual-track CI/CD architecture (IIS legacy + AKS modern), AKS containerization POC, blob storage migration design, dependency sequencing with Phase A1.5 validation windows.
11-report deliverable with executive summary, per-initiative ROI projections, implementation investment estimates, and risk classifications. Hands-on walkthrough sessions with engineering and leadership teams.
The total identified savings ranged from $959K to $1.36M annually, representing a 67–95% reduction in analyzed infrastructure spend. The range spans the conservative and optimistic estimates for the full six-initiative roadmap, from the low-risk, high-confidence quick wins through the AKS migration and SQL Server platform migration. The target state after Initiatives 1–4 would reduce total Azure billing from $1.43M to $699K–$962K per year — a 33–51% reduction in the total bill, with cost per customer dropping from $35,773 to $17,484–$24,052 annually and infrastructure as a percentage of customer revenue falling from 30% to approximately 15–20%.
The savings decomposition by domain revealed where the real leverage existed. SQL Server infrastructure — at $844,914/year — contained the largest absolute savings potential through the combination of backup migration ($264,000 in backup costs attacked directly), disk rightsizing (59% of 123 TiB was unused Premium SSD), instance rightsizing (83% of VMs overprovisioned), and the long-horizon SQL licensing elimination ($199,700/year). IIS compute savings came primarily from AKS migration ($98K–$109K/year) and non-production scheduling. The backup infrastructure, at $277,000/year and 26.4% of analyzed costs, was the single most impactful quick win — migrating to Azure-native backup services eliminated the local storage overhead, removed the CPU spike confound from maintenance windows, and enabled the disk and instance rightsizing that followed.
The three-year ROI projection modeled the compounding effect of these savings against the client's growth trajectory. Without optimization, scaling from 40 to 100 customers at the current $35,773 per-customer cost would push annual Azure spend to $3.58M. At 1,000 customers, the projection reached $35.8M per year — a figure that made the case for optimization self-evident. With Initiatives 1–4 implemented, the cost per customer would drop to $17,484–$24,052, meaning 100 customers would cost $1.75M–$2.4M (vs $3.58M without optimization) and the savings would compound with every new customer onboarded. The three-year cumulative ROI was projected at 200–400%, depending on the actual growth rate and the pace of initiative execution.
Beyond direct cost savings, the CI/CD architecture design transformed the client's deployment model from a 150-touch manual process to an automated pipeline that could scale to 1,000+ customers without proportional headcount growth. The security assessment identified critical gaps — shared credentials across environments, management endpoints exposed to the public internet, no RBAC on Azure resources — that would have been audit findings in any SOC 2 or ISO 27001 assessment. And the FinOps framework gave the finance and engineering teams a shared model for evaluating infrastructure investment decisions based on per-customer unit economics rather than aggregate monthly billing.
The client began implementation planning immediately after report delivery, prioritizing Initiative 1 (backup migration, disk rightsizing, non-production scheduling) as the lowest-risk, highest-ROI starting point. The structured dependency sequencing we prescribed — backup migration first, then a 30–60 day monitoring window, then instance rightsizing based on clean utilization data — gave the engineering team confidence that each step was grounded in evidence rather than estimation. The 11-report deliverable became the client's internal reference architecture, and the per-initiative ROI projections gave the finance team the business case framework they needed to secure board approval for the $500K–$750K implementation investment.
The engagement validated a pattern we see consistently in mid-market SaaS companies that have grown organically on IaaS: the largest cost drivers are not the obvious ones. The client's leadership assumed compute was the primary expense; in reality, disks, backups, and licensing collectively represented roughly 90% of SQL infrastructure costs while compute was only 7%. The backup overhead — invisible because it was distributed across per-VM storage costs rather than appearing as a single line item — accounted for more than a quarter of all analyzed spending. No amount of VM rightsizing could have addressed these structural cost drivers. The multi-domain assessment approach, analyzing each cost component at the line-item level rather than the VM level, was what made the $959K–$1.36M savings identification possible.
IaaS database infrastructure analysis — 23 VMs, 176 vCPUs, cost decomposition across compute, disks, licensing, and backups
Application tier analysis — 19 VMs, 124 vCPUs, utilization profiling and rightsizing recommendations
Storage utilization analysis — 123 TiB provisioned, 59% waste identified, tier optimization to Standard SSD
Target for document migration from SQL databases — 86% cost reduction per GB for infrequently-read documents
Billing analysis, per-VM cost decomposition, savings plan evaluation, per-customer unit economics modeling
Multi-week utilization metrics collection at 5-minute granularity for accurate workload profiling
Target container orchestration platform for IIS modernization — containerization POC delivered as part of engagement
CI/CD pipeline design for both legacy IIS automation and modern AKS deployment tracks
GitOps continuous delivery for the AKS deployment track with per-customer Helm charts
Infrastructure-as-Code design for reproducible, version-controlled infrastructure across all 5 regions
Backup infrastructure is a hidden cost multiplier that distorts every other metric. At 26.4% of analyzed costs ($277K/year), backup overhead was the single largest quick-win category — yet it was invisible in the client's cost reporting because it was distributed across per-VM disk provisioning rather than appearing as a discrete line item. Local SQL backups consumed 30–44% of each VM's Premium SSD capacity, inflating disk costs and creating CPU spikes during maintenance windows that masked the true compute requirements of the workloads.
VM-level cost analysis misses the structural drivers. The client assumed compute was their primary expense. In reality, SQL Server infrastructure costs broke down to roughly: disks ~35%, backups ~31%, SQL Server licensing ~24%, Windows licensing ~9%, and compute ~7%. No amount of VM SKU optimization could have addressed the roughly 93% of costs that were not compute. Line-item decomposition across every cost component — not just the VM SKU price — is essential for accurate savings identification.
Dependency sequencing determines whether rightsizing recommendations are credible. SQL VMs showed 70–100% CPU spikes during maintenance windows, which would have justified their current sizing if analyzed in isolation. But those spikes were caused by simultaneous backup operations and maintenance jobs. Only after migrating backups off-VM and observing a 30–60 day clean monitoring window could we confidently determine the actual compute requirements. Prescribing rightsizing without this validation step would have been guesswork.
CI/CD ROI is measured in scalability ceiling, not just cost savings. The shift from 150 manual deployments to an automated pipeline did not reduce Azure spend directly — it removed the operational bottleneck that capped growth at the current customer count. For a company whose cost per customer needed to drop from $35,773 to $10,000–$15,000 to sustain growth from 40 to 1,000 customers, the CI/CD automation was the difference between a viable business model and an infrastructure cost wall.
“CloudForge didn't just tell us where to cut costs — they decomposed our entire infrastructure economics at the line-item level, from disk provisioning to SQL licensing to backup overhead. We thought compute was our biggest expense; turns out it was 7% of SQL costs while disks and backups were 66%. The six-initiative roadmap with dependency sequencing gave us confidence to start implementing immediately, and the backup quick wins alone justified the engagement within the first month.”
Every engagement starts with a conversation about your infrastructure challenges. Let's discuss how CloudForge can help.