CloudForge replaced a legacy Hadoop-style batch pipeline running on always-on VM clusters with Azure Functions Flex Consumption, achieving an 80% cost reduction ($72K/yr savings) and cutting batch processing time from 6 hours to 45 minutes while delivering real-time Power BI analytics dashboards.
The client is a mid-market insurance company processing over 200,000 claims annually across motor, property, and professional indemnity lines of business. Their data pipeline — the system responsible for ingesting raw claims data, validating it, enriching it with policy and actuarial information, and producing analytics outputs for underwriting and executive teams — had been built 5 years earlier using a Hadoop-inspired batch processing architecture running on Azure Virtual Machines. Four always-on D-series VMs ($7,500/month total) hosted a processing framework that read claims data from Azure Blob Storage, executed transformation and validation logic, wrote results to Azure SQL Database, and generated CSV exports consumed by the actuarial team's Excel models.
The batch processor ran nightly from midnight to 6 AM, producing results that were available to analysts when they arrived at 8 AM. This 6-hour processing window had been adequate when the company processed 50,000 claims per year, but at 200,000+ claims and growing, the window was approaching its capacity limit. More critically, the actuarial team had been requesting real-time analytics for over 18 months — the ability to see claims trends, reserve adequacy indicators, and loss ratio projections updated hourly rather than daily. The batch architecture made sub-hourly updates architecturally impossible: the pipeline ingested all pending claims at midnight, processed them sequentially, and produced a single output set. There was no mechanism for incremental or event-driven processing.
The client had attempted to modernize the pipeline once before. A previous vendor had proposed Azure Synapse Analytics as the target platform, estimating an 8-week implementation. The project was abandoned after 3 months: the Synapse implementation exceeded its budget by 60%, the team found the platform prohibitively complex to operate, and the resulting queries were slower than the existing batch pipeline for the specific analytical patterns the actuarial team relied on. The failed migration left the CTO and CFO deeply sceptical of any modernization proposal — they had allocated budget, invested time, and received nothing usable in return. CloudForge's engagement required convincing leadership that a second attempt would succeed where the first had failed, which meant proving ROI before touching infrastructure and choosing a technology stack the existing team could actually operate.
The always-on VM costs were the most straightforward problem to quantify but the hardest to justify operationally. Four D-series VMs costing $7,500 per month ($90,000 annually) ran 24 hours a day, 7 days a week, while the batch pipeline used them for only 6 hours per night — midnight to 6 AM — meaning the VMs sat essentially idle for 75% of every day. During business hours, the VMs served a secondary purpose as ad-hoc query servers for the data engineering team, but utilization during those hours averaged just 8%, with occasional spikes to 30% during manual data investigations. The cost structure was fundamentally mismatched to the workload: a pay-per-second model would have reduced compute costs by 70–80%, but the batch architecture could not run on ephemeral infrastructure because the processing framework required persistent local storage and long-running processes.
The legacy batch architecture was technically functional but architecturally brittle. The processing framework used a custom ETL (Extract, Transform, Load) system written in Python, running as a single long-lived process on one of the four VMs. The process read all pending claims from a staging table in Azure SQL, applied 47 validation rules sequentially, enriched each record with policy data from a second database, calculated actuarial metrics, and wrote results to the production analytics tables. A failure at any point in this 6-hour sequence required restarting the entire batch from the beginning. There was no checkpointing, no incremental processing, and no parallelism — each claim was processed one at a time in insertion order. The ETL system had been written by a contractor who had since left the company, and while the current team could maintain it, they did not have the confidence to restructure it fundamentally.
The failed Synapse POC had created organizational trauma that shaped every decision we made. The previous vendor had presented Synapse Analytics as a modern, scalable replacement for the batch pipeline. The reality was that Synapse's serverless SQL pool performed poorly on the client's specific query patterns (highly selective queries against deeply nested claim structures), the dedicated SQL pool pricing exceeded the existing VM costs, and the data engineering team found the Synapse Studio interface and management model impenetrable after 3 months of effort. The project was cancelled with $45,000 spent and nothing usable delivered. The CFO specifically stated that any future modernization proposal must demonstrate ROI on paper before any infrastructure spending was authorized, and that the technology choice must be simple enough for the existing 3-person data team to operate without specialized training.
Data quality issues compounded the batch architecture's limitations. Approximately 4% of incoming claims had data quality errors — missing required fields, invalid date ranges, inconsistent policy references, or formatting issues. The batch pipeline had no dedicated validation layer; instead, validation rules were embedded in the transformation logic, meaning errors were discovered at varying points during the 6-hour processing window. An error in a record near the beginning of the batch could propagate through downstream calculations before being caught by a later validation rule. The team's only recourse was to inspect the batch logs after 6 AM, identify affected records, manually correct them, and re-process those records individually — a process that consumed 2–3 hours of analyst time each morning.
The absence of infrastructure-as-code made the existing environment fragile and undocumented. All four VMs, the Azure SQL databases, the storage accounts, and the network configuration had been provisioned through the Azure Portal over a period of years. Changes were tracked in a shared OneNote document that listed "what was changed and when" in chronological order but did not capture the current desired state of any resource. If a VM needed to be rebuilt from scratch — due to a disk failure, for example — the team would need to reconstruct its configuration from the OneNote history, a process estimated to take 2–3 days. There was no automated disaster recovery, no Terraform, no ARM templates, and no repeatability in the infrastructure layer.
Given the organizational context — a failed migration, a sceptical CFO, and a cautious CTO — we led with a cost/benefit analysis rather than an architecture proposal. During week 1, we built a detailed financial model comparing the current state (4 always-on VMs, $90K/year) against three alternative architectures: Azure Functions Consumption (pay-per-execution), Azure Functions Flex Consumption (pay-per-execution with reserved baseline), and Azure Container Instances (pay-per-second containers). We modelled each alternative against the client's actual workload profile: 200,000 claims per year, 47 validation rules per claim, average processing time per claim, and peak/off-peak volume patterns. The model showed that the serverless options would reduce compute costs by 78–83% depending on traffic patterns, with Flex Consumption providing the best balance of cost efficiency and cold-start performance.
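The core of the week-1 argument can be sketched as a back-of-envelope utilization model: always-on VMs bill for every hour, while pay-per-second compute bills only for busy time. The utilization figures below come from the case study itself; the linear scaling is a deliberate simplification of the real financial model.

```python
# Back-of-envelope version of the week-1 cost comparison. Always-on VMs
# bill 24/7; pay-per-second compute bills only for time actually worked.
# Utilization figures are from the case study; the scaling is simplified.

VM_MONTHLY_COST = 7_500.0    # 4 always-on D-series VMs

HOURS_PER_DAY = 24
BATCH_HOURS = 6              # nightly batch window, midnight to 6 AM
BUSINESS_HOURS = 10          # daytime ad-hoc query use
BUSINESS_UTILIZATION = 0.08  # average daytime utilization

# Effective fraction of the day the VMs did useful work.
busy_fraction = (BATCH_HOURS + BUSINESS_HOURS * BUSINESS_UTILIZATION) / HOURS_PER_DAY

# Pay-per-second compute at the same effective hourly rate would cost:
pay_per_second_cost = VM_MONTHLY_COST * busy_fraction
reduction = 1 - pay_per_second_cost / VM_MONTHLY_COST

print(f"Busy fraction:       {busy_fraction:.1%}")
print(f"Pay-per-second cost: ${pay_per_second_cost:,.0f}/month")
print(f"Projected reduction: {reduction:.0%}")
```

Even this crude model lands in the 70–80% range quoted above; the full model refined it with per-claim execution times and Azure pricing-calculator rates to reach the 78–83% projection.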
We presented the financial model to the CFO and CTO in week 2 alongside a technology comparison that addressed the Synapse failure directly. Azure Functions was positioned as the opposite of Synapse: simple (each function does one thing), familiar (Python, the same language the existing pipeline used), transparent (per-execution billing visible in Azure Cost Management), and incremental (we could migrate one pipeline stage at a time rather than replacing everything simultaneously). The key slide showed that even if the migration delivered only 50% of the projected savings — accounting for conservative error margins — the annual saving would still exceed $36,000, paying for the engagement within 6 months. The CFO approved the project with the condition that a 30-day parallel run would validate the savings before the old pipeline was decommissioned.
The architecture design in week 3 decomposed the monolithic batch pipeline into event-driven Azure Functions. Each of the 47 validation rules became an independent function triggered by Azure Service Bus messages. The claim ingestion process was converted from a nightly batch read to an event-driven model: when a new claim record appeared in the staging table, a Service Bus message was published, triggering the validation pipeline. Validated claims triggered enrichment functions, which in turn triggered metric calculation functions, which wrote to the analytics database. This decomposition enabled parallelism (multiple claims processed simultaneously), incremental processing (each claim processed as it arrived rather than waiting for a nightly batch), and fault isolation (a failure in one function affected only that claim, not the entire batch).
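The message-chained flow described above can be illustrated with a small in-memory simulation: each stage consumes a message and publishes the next one, which is how the real pipeline chains Azure Functions via Service Bus. The stage names, fields, and single-rule validation below are illustrative stand-ins, not the client's actual code.

```python
# Local, in-memory illustration of the event-driven decomposition.
# The deque stands in for Service Bus; the dispatch loop stands in for
# the Azure Functions runtime. All names and rules are illustrative.
from collections import deque

queue = deque()        # stand-in for Service Bus
analytics_db = []      # stand-in for the PostgreSQL analytics tables

def ingest(claim: dict) -> None:
    """A staging-table insert publishes a message instead of waiting for a batch."""
    queue.append(("validate", claim))

def validate(claim: dict) -> None:
    # The real pipeline runs 47 independent validation functions here.
    if claim.get("policy_id") is None:
        queue.append(("dead_letter", claim))
    else:
        queue.append(("enrich", claim))

def enrich(claim: dict) -> None:
    claim["line_of_business"] = "motor"   # stand-in for the policy lookup
    queue.append(("calculate", claim))

def calculate(claim: dict) -> None:
    claim["reserve_estimate"] = round(claim["amount"] * 1.1, 2)  # stand-in metric
    analytics_db.append(claim)

handlers = {"validate": validate, "enrich": enrich, "calculate": calculate,
            "dead_letter": lambda c: print("dead-lettered:", c)}

ingest({"claim_id": 1, "policy_id": "P-100", "amount": 1000.0})
while queue:                       # dispatch loop: one message, one function
    stage, claim = queue.popleft()
    handlers[stage](claim)

print(analytics_db[0]["reserve_estimate"])  # 1100.0
```

Because every stage communicates only through messages, claims can be processed in parallel, arrive incrementally, and fail independently — the three properties the nightly batch lacked.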
The analytics layer was deliberately simple. We chose PostgreSQL as the analytics database — replacing Azure SQL's Premium tier for analytics workloads — and Power BI for visualization. PostgreSQL provided more than adequate query performance for the actuarial team's analytical patterns at a fraction of the cost, and Power BI's DirectQuery mode enabled dashboards that refreshed every 15 minutes against live data. This was a direct response to the Synapse failure: the actuarial team needed dashboards that updated frequently, not a complex data warehouse platform they could not operate. PostgreSQL + Power BI delivered 95% of the analytical capability Synapse had promised, at approximately 10% of the complexity and cost.
The Azure Functions Flex Consumption deployment replaced the 4 always-on D-series VMs with a serverless event-driven architecture that charged only for actual execution time. Each validation rule, enrichment step, and metric calculation was implemented as an independent Azure Function triggered by Service Bus messages. The Flex Consumption plan provided a reserved baseline of 1 instance for consistent low-latency response during business hours, with auto-scaling to 20 instances during peak claim submission periods (typically Monday mornings and month-end). Off-peak processing (evenings and weekends) scaled to zero instances, meaning the client paid nothing for compute during periods of inactivity — a fundamental change from the always-on model where $7,500/month was consumed regardless of workload volume.
The data validation pipeline was the most architecturally significant change. In the legacy batch system, a claim with a data quality error at step 3 of 47 would continue processing through steps 4–47 before the error was caught by a later validation rule, potentially corrupting downstream calculations. In the new event-driven model, each claim passed through all 47 validation functions before any enrichment or calculation occurred. A failure at any validation step immediately routed the claim to a dead-letter queue with a detailed error report, preventing propagation. The data engineering team built a simple error resolution interface that displayed dead-lettered claims, their specific error details, and one-click resubmission after correction. The 4% error rate — previously discovered hours after batch completion — was now caught in real-time, typically within 30 seconds of claim submission.
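The validation stage's routing logic can be sketched as follows: run every rule, collect every failure, and dead-letter the claim with a full error report so the resolution interface can display it. The two rules shown are hypothetical examples; the production pipeline has 47.

```python
# Sketch of the validation stage: all rules run, every failure is
# collected, and failing claims go to a dead-letter queue with a detailed
# report for the one-click resubmission UI. Rules here are illustrative.
from datetime import date

def require_policy_ref(claim):
    if not claim.get("policy_ref"):
        return "missing policy reference"

def valid_date_range(claim):
    if claim["loss_date"] > claim["reported_date"]:
        return "loss date after reported date"

VALIDATION_RULES = [require_policy_ref, valid_date_range]  # ... 47 in production

dead_letter_queue = []

def validate(claim: dict) -> bool:
    errors = [msg for rule in VALIDATION_RULES if (msg := rule(claim))]
    if errors:
        # The detailed report surfaces in the error-resolution interface.
        dead_letter_queue.append({"claim": claim, "errors": errors})
        return False
    return True   # claim proceeds to enrichment

bad = {"policy_ref": "", "loss_date": date(2024, 5, 2),
       "reported_date": date(2024, 5, 1)}
validate(bad)
print(dead_letter_queue[0]["errors"])
# ['missing policy reference', 'loss date after reported date']
```

Collecting all failures in one pass, rather than stopping at the first, is what lets an analyst correct every problem with a record in a single resubmission.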
The PostgreSQL analytics layer replaced the previous Azure SQL Premium-tier analytics database. We deployed Azure Database for PostgreSQL Flexible Server with the Burstable tier, which provided adequate IOPS for the actuarial team's query patterns at approximately 15% of the Premium-tier SQL cost. The enriched claims data was written to PostgreSQL by the final-stage Azure Functions, with partitioning by claim year and line of business to optimize the most common analytical queries. Power BI dashboards connected to PostgreSQL via DirectQuery, refreshing every 15 minutes. The actuarial team received four dashboards: claims volume trends (real-time), reserve adequacy indicators (15-minute refresh), loss ratio projections (15-minute refresh), and data quality monitoring (real-time). These dashboards replaced the morning CSV exports that the team had been manually loading into Excel — a workflow they described as "the part of the job we look forward to least."
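The partitioning scheme described above — range partitioning by claim year, with line of business indexed for the common filters — can be sketched as generated DDL. Table and column names here are illustrative assumptions, not the client's actual schema.

```python
# Sketch of the claim-year partitioning scheme for the analytics table.
# Table and column names are illustrative, not the client's schema.
def partition_ddl(years):
    stmts = ["""CREATE TABLE claims_analytics (
    claim_id         bigint  NOT NULL,
    claim_year       int     NOT NULL,
    line_of_business text    NOT NULL,
    reserve_estimate numeric(12,2)
) PARTITION BY RANGE (claim_year);"""]
    for y in years:
        # One partition per claim year keeps the common analytical
        # queries (current-year trends) scanning a single partition.
        stmts.append(
            f"CREATE TABLE claims_analytics_{y} PARTITION OF claims_analytics "
            f"FOR VALUES FROM ({y}) TO ({y + 1});")
    # Leading line_of_business supports the per-line filters the
    # actuarial dashboards apply.
    stmts.append("CREATE INDEX ON claims_analytics (line_of_business, claim_year);")
    return stmts

for stmt in partition_ddl([2022, 2023, 2024]):
    print(stmt)
```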
Full Terraform IaC was implemented for every Azure resource: Functions, Service Bus, PostgreSQL, Key Vault, Storage Accounts, Application Insights, and the network configuration. The Terraform state was stored in Azure Blob Storage with state locking to prevent concurrent modifications. All infrastructure changes were submitted as pull requests in the team's GitHub repository, reviewed by at least one engineer, and deployed through a GitHub Actions pipeline. This was a deliberate response to the OneNote-documented infrastructure the previous environment relied on — the Terraform codebase became the authoritative record of the infrastructure's desired state, and any drift from that state would be detected and corrected automatically.
The 30-day parallel run was structured to provide the CFO with unambiguous ROI evidence. For the full 30 days, both the legacy batch pipeline and the new event-driven pipeline processed the same claims simultaneously. Results were compared nightly: claim counts, validation outcomes, enriched data values, and analytical outputs were verified for exact parity. The parallel run confirmed that the new pipeline produced identical results to the legacy system while completing even the most complex claims end-to-end within 45 minutes rather than a 6-hour batch window (the 45-minute figure represents the longest claim's end-to-end processing time on a typical day; the average claim completed in under 3 minutes). The cost comparison showed the new pipeline consuming $1,480 in Azure Functions compute over the month versus $7,500 for the legacy VMs over the same period — in line with the projected 80% cost reduction.
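The nightly parity check can be sketched as a simple set comparison between the two pipelines' outputs: any claim present in one output but not the other, or any claim whose values differ, is flagged. Field names are illustrative.

```python
# Sketch of the nightly parity check run during the parallel period:
# compare the two pipelines' outputs claim-by-claim and report any
# divergence. Field names are illustrative.
def parity_report(legacy: dict, new: dict) -> dict:
    """Each argument maps claim_id -> output record from one pipeline."""
    missing = sorted(set(legacy) ^ set(new))          # in one output, not both
    mismatched = sorted(
        cid for cid in set(legacy) & set(new) if legacy[cid] != new[cid]
    )
    return {"count_match": len(legacy) == len(new),
            "missing": missing,
            "mismatched": mismatched}

legacy_out = {1: {"reserve": 1100.0}, 2: {"reserve": 530.0}}
new_out    = {1: {"reserve": 1100.0}, 2: {"reserve": 530.0}}
print(parity_report(legacy_out, new_out))
# {'count_match': True, 'missing': [], 'mismatched': []}
```

Thirty consecutive nights of empty `missing` and `mismatched` lists is what converted the projected savings into the observed savings the CFO required.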
Built detailed financial model comparing current VMs against Azure Functions, Container Instances, and Functions Flex Consumption. Presented ROI analysis to CFO and CTO, directly addressing the failed Synapse POC. Secured project approval with 30-day parallel run requirement.
Decomposed monolithic batch pipeline into event-driven Azure Functions triggered by Service Bus messages. Designed data validation pipeline with dead-letter error handling. Selected PostgreSQL + Power BI for analytics layer.
Implemented 47 validation functions, enrichment functions, and metric calculation functions on Azure Functions Flex Consumption. Configured Service Bus topics and subscriptions. Built dead-letter error resolution interface.
Deployed PostgreSQL Flexible Server with claim-year partitioning. Built four Power BI dashboards with DirectQuery: claims volume, reserve adequacy, loss ratio projections, and data quality monitoring. Full Terraform IaC for all resources.
Both legacy batch and new event-driven pipelines processing identical claims simultaneously. Nightly parity checks on claim counts, validation outcomes, and analytical outputs. Cost comparison validating projected 80% reduction.
Legacy VM decommissioning after parallel run validation. Knowledge transfer covering pipeline operations, Function monitoring, error resolution workflow, and Terraform-managed infrastructure changes.
The cost reduction was the headline result: monthly infrastructure spend dropped from $7,500 to $1,500, a saving of $72,000 per year (80% reduction). The Functions Flex Consumption plan charged an average of $1,200/month for compute, with the remaining $300 covering the PostgreSQL Flexible Server and Azure Service Bus messaging costs. During the 30-day parallel run, actual costs tracked within 3% of the financial model's projections, which gave the CFO sufficient confidence to authorize decommissioning the legacy VMs. The annual saving of $72,000, against an engagement cost that was a fraction of that figure, delivered a payback period of under 2 months.
Batch processing time was effectively eliminated as a concept. In the legacy model, claims accumulated during the day and were processed in a 6-hour nightly batch starting at midnight, with results available at 6 AM. In the new model, each claim was processed individually within seconds of submission. The 45-minute figure represents the end-to-end processing time for the longest claim on a typical day — a complex professional indemnity claim that triggered all 47 validation rules and required enrichment from multiple policy databases. The average claim processed in under 3 minutes. For the actuarial team, this meant dashboards showing current-day data rather than data that was 18–24 hours old — a capability they had been requesting for over 18 months.
Data quality improved measurably. The real-time validation pipeline caught errors within 30 seconds of claim submission, compared to hours later in the batch model. The error resolution workflow — dead-letter queue with detailed error reports and one-click resubmission — reduced the average time to correct a data quality issue from 2–3 hours to 15 minutes. The effective error pass-through rate (claims reaching the analytics layer with undetected errors) dropped from 4% to 0.1%, because the 47 validation functions were now executed unconditionally before any enrichment occurred, whereas the legacy batch pipeline had allowed partial processing before validation completed.
The infrastructure-as-code transformation was equally significant for operational resilience. Every Azure resource was codified in Terraform, reviewed via PR, and deployed through GitHub Actions. When the team needed to create a disaster recovery environment for their annual business continuity test, they achieved it in 45 minutes by running terraform apply against a second Azure subscription — a process that would have taken 2–3 days of manual Portal provisioning under the previous model. The Terraform codebase also served as living documentation: any question about the infrastructure — "what tier is the PostgreSQL server?", "what's the Service Bus message TTL?" — could be answered by reading the Terraform configuration rather than logging into the Azure Portal and navigating through resource settings.
The stakeholder dynamic shifted fundamentally. The CFO, who had been the most sceptical voice after the Synapse failure, became the engagement's strongest advocate after the parallel run validated the projected savings. The CTO reported that the simplicity of the architecture — Azure Functions processing messages from Service Bus, writing to PostgreSQL, displayed in Power BI — meant the data engineering team understood every component and could troubleshoot issues independently. No external expertise was required post-handover. Three months after the engagement, the team independently extended the pipeline to process a new line of business (cyber insurance claims) by adding 6 validation functions and a new Power BI dashboard — a change that took them 4 days to implement and deploy.
Azure Functions Flex Consumption: Event-driven compute with pay-per-execution and auto-scaling
Azure Service Bus: Message-based pipeline orchestration with dead-letter error handling
Azure Database for PostgreSQL: Analytics database replacing Azure SQL Premium tier at 15% of the cost
Power BI: DirectQuery dashboards refreshing every 15 minutes for real-time analytics
Terraform: Full infrastructure-as-code with state locking and PR-based changes
Azure Blob Storage: Terraform state backend and claim document storage
GitHub Actions: CI/CD pipeline for infrastructure and function deployments
Azure Key Vault: Secrets management for database credentials and API keys
Prove ROI on paper before touching infrastructure — especially after a failed migration. The CFO's trust had been broken by the Synapse POC, and no architecture diagram would have rebuilt it. The financial model, built on the client's actual workload data and validated against Azure pricing calculators, provided the concrete evidence needed to secure approval. The 30-day parallel run then converted projected savings into observed savings, completing the trust restoration. Any modernization proposal following a failed initiative must lead with quantified, verifiable economics.
Simpler is better: PostgreSQL + Power BI beat Synapse Analytics for 95% of use cases at one-tenth the complexity. The actuarial team's queries — filtered aggregations, time-series trends, ratio calculations — did not require a dedicated analytics warehouse. PostgreSQL handled these patterns efficiently, Power BI provided the visualization layer, and the team could operate both without specialized training. The 5% of use cases that genuinely required a warehouse-class platform could be addressed later if and when the need materialized — not pre-emptively at significant cost and complexity.
Event-driven beats batch for any workload with variable volume. The legacy batch pipeline processed 200,000 claims in a 6-hour nightly window regardless of whether 100 or 10,000 claims arrived that day. Azure Functions processed each claim individually as it arrived, scaling compute to match actual volume. On light days, costs were minimal. On peak days, the pipeline scaled automatically without human intervention. This elastic cost model is the fundamental advantage of event-driven architecture: you pay for work performed, not for capacity reserved.
A 30-day parallel run eliminates stakeholder risk perception entirely. Running both pipelines simultaneously — with nightly parity validation and real-time cost comparison — removed every objection the CFO and CTO had about the migration. The results were not projections or estimates; they were observed facts from processing real production data through both systems. By the time we proposed decommissioning the legacy VMs, the parallel run had already proven that the new pipeline was functionally equivalent, faster, and cheaper. The decision was self-evident rather than contentious.
“After the Synapse disaster, I told the team I would not approve another infrastructure project until someone could show me the numbers first. CloudForge led with numbers — a financial model built on our actual data that projected $72K in annual savings. Then they backed it up with a 30-day parallel run that proved the projection was right. The new pipeline processes claims in minutes instead of hours, our actuarial team finally has real-time dashboards, and we're saving $6,000 every month. More importantly, our data team operates the whole thing without any external help. That's what the Synapse project was supposed to deliver, and CloudForge actually did it — in half the time and a fraction of the cost.”
Legacy ERP provider with $1.4M hybrid infrastructure spend across Azure and on-premises Windows/Linux VMs. Zero automation—150 RDP-based deployments per release, no version control on customizations, 100% manual clickops. Rising costs with no visibility into optimization opportunities.
Healthcare SaaS provider spending $204K/year on infrastructure and CI/CD with sprawling environments, always-on CI runners at $1,600/month, and no cost attribution. Growing customer base but costs growing faster than revenue.
Every engagement starts with a conversation about your infrastructure challenges. Let's discuss how CloudForge can help.
Schedule a Consultation