CloudForge designed and deployed a dual-region AKS platform across Europe and Asia-Pacific for an EV IoT authentication system, cutting CI build times by 90% and delivering comprehensive DevSecOps hardening while maintaining zero production downtime.
The client is a global electric vehicle manufacturer expanding aggressively into the Asia-Pacific market after establishing a strong presence across Western Europe. Their IoT platform authenticates and manages over 50,000 EV charging stations deployed across highways, commercial parking facilities, and residential complexes. Each charging session involves a multi-step authentication handshake between the vehicle, the charging station, and the cloud platform — a process that must complete in under 2 seconds to meet user experience targets. The platform handles 8 million authentication events per day at peak, with traffic patterns that follow commute schedules and show dramatic regional variation between European and Asian time zones.
Data sovereignty requirements were the primary driver for the engagement. GDPR mandated that European customer authentication data remain within EU boundaries, while China's PIPL and broader APAC data residency regulations required a separate deployment for the Asia-Pacific region. The existing single-region Azure deployment in EU-West served all global traffic, which meant APAC authentication requests traversed international network links with 180–220ms additional latency — a significant fraction of the 2-second budget. More critically, a compliance deadline for ending the routing of APAC customer data through EU infrastructure was approaching; missing it would block the company's market entry into three countries.
Beyond the regional expansion, the platform suffered from several operational pain points that compounded the complexity of the project. Docker builds took 45 minutes per image, serializing the CI pipeline into a queue that cost the 15-engineer development team 6–7 hours of cumulative wait time per day. The observability stack — Datadog ingesting over 100 GB of telemetry data per month — cost $15,000 monthly, with no auto-scaling and no cost controls. A recent security audit had flagged critical gaps: shared service principal credentials across environments, public-facing Azure Functions with no private endpoint restrictions, Azure Key Vault deployed but not integrated with secrets rotation, and no managed identity strategy for workload-to-service authentication. CloudForge was engaged to address all of these concerns in a single coordinated engagement: deploy the dual-region infrastructure, harden the security posture, optimize CI build times, and bring observability costs under control.
The Docker build bottleneck was the most immediately painful problem the development team faced. Each build produced a 4.2 GB container image that included the Go-based IoT authentication service, embedded TLS certificate bundles, protocol buffer definitions, and a set of legacy C libraries required for backward compatibility with older charging station firmware. Builds ran sequentially on a single GitHub Actions runner because the Dockerfile had never been restructured for parallelism — every layer depended on the previous layer, meaning no step could execute concurrently. With 15 engineers pushing an average of 8–10 times per day, the 45-minute build time created a persistent queue that delayed feature integration by hours. Developers had adopted the workaround of batching their commits and pushing once or twice per day, which reduced CI load but introduced its own problems: larger changesets, harder-to-bisect regressions, and merge conflicts from infrequent integration.
The dual-region deployment challenge went far beyond provisioning a second AKS cluster. The existing Terraform codebase was written as a monolithic configuration targeting a single Azure region. Resource names were hardcoded, region-specific parameters were embedded in module definitions rather than extracted as variables, and the state file was a single blob with no workspace separation. Replicating this infrastructure in a second region would have required either duplicating the entire Terraform codebase (creating a maintenance nightmare) or refactoring it into region-agnostic modules with parameterized inputs — the correct approach but one that required touching every resource definition in the existing configuration without breaking the production EU deployment.
Data sovereignty enforcement required more than geographic separation. Authentication tokens generated in Europe could not be valid in Asia-Pacific and vice versa — each region needed its own token signing keys, its own Key Vault instance, and its own certificate chain. Cross-region session migration (when a European EV owner charges their vehicle while traveling in Asia) required a secure token exchange protocol that transferred the minimum necessary data without replicating the full session state across regions. The existing authentication service had no concept of regional token scoping — every token was globally valid, which was a compliance violation waiting to happen.
The observability cost crisis stemmed from a common anti-pattern: logging everything at debug level in production. When the platform was small, the cost of debug-level logging was negligible. As the platform scaled to 50,000 charging stations and 8 million daily authentication events, the telemetry volume exploded. Analysis of the Datadog ingestion showed that 60% of the data was debug-level logs that served no operational purpose in production — verbose protocol traces, individual authentication step timestamps, and raw request/response payloads. The remaining 40% included genuinely useful operational telemetry, but it was buried in the noise. Alerts fired constantly because thresholds were set against absolute values rather than percentile-based metrics, creating alert fatigue that caused the operations team to ignore notifications — including, on one occasion, a genuine partial outage that went unnoticed for 47 minutes.
The security gaps identified in the audit were individually manageable but collectively represented a significant exposure surface. The shared service principal — a single Azure AD application registration used by all services across all environments — meant that a credential compromise in the development environment would grant access to production resources. Azure Functions processing IoT webhook events were publicly accessible with no private endpoint or network restriction, meaning they could be targeted directly from the internet. Key Vault contained secrets but had no rotation policy — several secrets had not been rotated in over 14 months. There was no managed identity strategy: services authenticated to Azure resources using client secrets stored in environment variables rather than using Azure Managed Identity, which would have eliminated stored credentials entirely.
The absence of any DevSecOps practices in the CI/CD pipeline meant that these security gaps were not being detected or prevented systematically. Container images were pulled from Docker Hub without vulnerability scanning. No SAST tool analyzed the codebase for common vulnerability patterns. No DAST tool tested deployed endpoints for runtime security issues. Pod security policies were not configured on the AKS clusters, meaning containers could run as root, mount host filesystems, and escalate privileges. The security audit had identified these gaps, but without automated detection integrated into the development workflow, new vulnerabilities were being introduced as fast as existing ones were being addressed.
We structured the engagement around a security-first assessment that would identify blast radius and exposure before any infrastructure changes were made. The rationale was straightforward: deploying a second region would double the attack surface, so hardening the security posture of the existing region first would prevent us from replicating security debt into the new deployment. Weeks 1–2 were dedicated to cataloging every security finding from the audit, classifying each by severity and exploitability, and designing remediation plans that could be executed in parallel with the infrastructure work.
The engagement was organized into four parallel workstreams that could proceed independently after the initial security assessment: build optimization (reducing Docker build times), multi-region infrastructure-as-code (refactoring Terraform for dual-region deployment), observability redesign (reducing costs while improving signal-to-noise ratio), and DevSecOps integration (embedding security scanning and enforcement into the CI/CD pipeline). Each workstream had a clear set of deliverables with weekly milestones. The three-engineer team divided ownership accordingly: one engineer focused on infrastructure, another on security, and the third on both build optimization and observability.
For the dual-region cutover, we designed a blue-green deployment strategy that would bring the APAC region online without any downtime in the EU region. The APAC cluster would be provisioned, configured, and tested with synthetic traffic before any real authentication requests were routed to it. DNS-based traffic management using Azure Traffic Manager would gradually shift APAC-originating requests to the new region over a 72-hour period, with automatic failback to the EU region if error rates exceeded thresholds. This approach meant the EU deployment remained the authoritative environment throughout the transition, minimizing risk to existing customers.
We adopted a principle of infrastructure parity: every resource deployed in EU-West would have an identical counterpart in the APAC region, provisioned from the same Terraform modules with region-specific variable files. This extended to observability (identical Grafana dashboards and alert configurations per region), security (identical network policies, pod security standards, and Key Vault configurations per region), and CI/CD (identical test matrices with region-specific endpoint targets). The parity principle meant that any operational procedure developed for one region would work identically in the other, eliminating the cognitive overhead of managing asymmetric deployments.
The Docker build optimization delivered the most dramatic improvement in developer experience. We restructured the monolithic Dockerfile into a multi-stage build with BuildKit parallel execution. The first stage compiled the Go authentication service. The second stage, running in parallel, prepared the TLS certificate bundles and protocol buffer definitions. The third stage assembled the legacy C libraries. The final stage combined the outputs from all three stages into a minimal runtime image based on distroless. We configured Azure Container Registry (ACR) as the layer cache backend, meaning that unchanged layers were not rebuilt — only modified code triggered a rebuild of the affected stage. Additional optimizations included separating dependency resolution (go mod download) from code compilation, so that dependency layers were cached until go.mod changed. The cumulative effect reduced build times from 45 minutes to 4.5 minutes — a 90% reduction. We also pushed shared base images to ACR on a weekly schedule, further reducing the cold-start build time for the rare cases where the cache was invalidated.
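The restructured build can be sketched as a multi-stage Dockerfile. Stage names, file paths, and base images below are illustrative assumptions rather than the client's actual configuration:

```dockerfile
# syntax=docker/dockerfile:1
# Stage 1: compile the Go service. go.mod/go.sum are copied first so
# the dependency layer stays cached until dependencies actually change.
FROM golang:1.22 AS build-go
WORKDIR /src
COPY go.mod go.sum ./
RUN --mount=type=cache,target=/go/pkg/mod go mod download
COPY . .
RUN --mount=type=cache,target=/go/pkg/mod go build -o /out/authsvc ./cmd/authsvc

# Stage 2: TLS certificate bundles and protobuf definitions.
# Independent of stage 1, so BuildKit executes it in parallel.
FROM debian:bookworm-slim AS certs
COPY certs/ protos/ /out/

# Stage 3: legacy C libraries for older charging-station firmware,
# also built in parallel with the stages above.
FROM gcc:13 AS legacy
COPY legacy/ /src/legacy/
RUN make -C /src/legacy install DESTDIR=/out

# Final stage: minimal distroless runtime assembled from the others.
FROM gcr.io/distroless/base-debian12
COPY --from=build-go /out/authsvc /authsvc
COPY --from=certs /out/ /etc/authsvc/
COPY --from=legacy /out/ /opt/legacy/
ENTRYPOINT ["/authsvc"]
```

With BuildKit, the ACR-backed layer cache is wired in at build time, e.g. `docker buildx build --cache-from type=registry,ref=<acr-name>.azurecr.io/authsvc:cache --cache-to type=registry,ref=<acr-name>.azurecr.io/authsvc:cache,mode=max .` (registry name is a placeholder).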
The Terraform refactoring transformed the monolithic single-region configuration into a modular architecture with region-agnostic modules and region-specific variable files. Each Azure resource — AKS cluster, Key Vault, ACR, virtual network, network security groups, private endpoints — was defined in a reusable module that accepted region, naming prefix, and configuration parameters as inputs. The EU-West deployment was imported into the refactored modules using terraform import, preserving the existing state without recreation. The APAC deployment was provisioned from the same modules with different variable files specifying the target region, naming convention, and region-specific parameters such as Key Vault access policies and network CIDR ranges. Both regions shared a common Terraform backend in Azure Blob Storage with workspace-level state isolation, preventing cross-region state corruption.
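A sketch of the resulting module shape, with hypothetical names and a deliberately reduced variable list (the real module set was more extensive):

```hcl
# modules/regional-platform/main.tf (excerpt); names are illustrative
variable "region"      { type = string } # e.g. "westeurope", "southeastasia"
variable "name_prefix" { type = string }
variable "vnet_cidr"   { type = string }

resource "azurerm_resource_group" "platform" {
  name     = "${var.name_prefix}-${var.region}-rg"
  location = var.region
}

resource "azurerm_kubernetes_cluster" "aks" {
  name                = "${var.name_prefix}-${var.region}-aks"
  location            = azurerm_resource_group.platform.location
  resource_group_name = azurerm_resource_group.platform.name
  dns_prefix          = "${var.name_prefix}-${var.region}"

  default_node_pool {
    name       = "system"
    node_count = 3
    vm_size    = "Standard_D4s_v5"
  }

  identity {
    type = "SystemAssigned"
  }
}

# envs/apac/main.tf: a second region is just a new variable set
module "platform" {
  source      = "../../modules/regional-platform"
  region      = "southeastasia"
  name_prefix = "evauth"
  vnet_cidr   = "10.40.0.0/16"
}
```

Workspace-level isolation (`terraform workspace new apac`) keeps each region's state in a separate object under the shared Azure Blob Storage backend.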
The dual-region AKS architecture implemented private endpoints for every PaaS service: ACR, Key Vault, Azure Monitor, and Azure Service Bus. Each cluster ran with Azure CNI networking, network policies enforcing pod-to-pod communication restrictions, and pod security standards preventing privileged container execution. Key Vault per region stored region-specific token signing keys, TLS certificates, and service credentials — with automated rotation policies configured for 90-day cycles. The cross-region token exchange for traveling EV owners was implemented as a dedicated microservice that validated the originating region's token, extracted the minimum necessary claims, and issued a time-limited regional token that expired after the charging session completed. This design ensured that no persistent customer data crossed regional boundaries.
The observability redesign replaced Datadog with a self-managed stack built on Azure Monitor for metrics collection, Grafana for visualization, and a tiered log retention strategy. Hot retention (7 days) stored full-fidelity logs in Azure Monitor Logs for recent incident investigation. Warm retention (90 days) stored aggregated metrics and sampled logs in Azure Blob Storage for trend analysis. Cold retention (365 days) stored compressed archives for compliance audit requirements. Log level governance was enforced at the application level: production deployments automatically set log level to INFO, with DEBUG available only through a feature flag that expired after 4 hours. KEDA auto-scaling managed the log processing pipeline, scaling from 1 to 30 processor nodes based on ingestion volume — ensuring that log spikes during incident scenarios did not create processing backlogs while keeping steady-state costs minimal.
The DevSecOps pipeline integrated security scanning at every stage of the CI/CD workflow. Trivy scanned container images for known vulnerabilities before they were pushed to ACR, blocking any image with critical or high-severity CVEs from reaching the registry. SonarQube SAST analyzed every pull request for code quality issues and common vulnerability patterns, with quality gate enforcement preventing merges that introduced new issues. Azure Managed Identity replaced all service principal credentials — every workload authenticated to Azure resources using its pod-level managed identity, eliminating stored secrets entirely. KEDA autoscaling for event-driven workloads ensured that the Azure Functions processing IoT webhook events scaled based on Service Bus queue depth rather than running at fixed capacity. All Azure Functions were migrated to private endpoints, removing public internet exposure. The combined effect was a security posture built to withstand stringent enterprise security audits, a requirement the client anticipated as they pursued contracts with government-operated charging networks in Asia-Pacific.
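A scan-then-push stage of roughly this shape enforces the blocking behavior; this sketch uses the public aquasecurity/trivy-action, and the job, image, and registry names are hypothetical:

```yaml
# .github/workflows/ci.yml (excerpt); illustrative, not the client's file
jobs:
  image-scan:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Build candidate image
        run: docker build -t authsvc:${{ github.sha }} .
      - name: Scan with Trivy, fail on HIGH/CRITICAL
        uses: aquasecurity/trivy-action@master
        with:
          image-ref: authsvc:${{ github.sha }}
          severity: HIGH,CRITICAL
          exit-code: "1"   # non-zero exit fails the job, blocking the push
      - name: Push to ACR (only reached if the scan passed)
        run: |
          az acr login --name evauthacr   # registry name is a placeholder
          docker tag authsvc:${{ github.sha }} evauthacr.azurecr.io/authsvc:${{ github.sha }}
          docker push evauthacr.azurecr.io/authsvc:${{ github.sha }}
```

Ordering the push after the scan, rather than scanning images already in the registry, is what turns the scanner from a reporting tool into a gate.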
Comprehensive security audit across both existing and planned infrastructure. Cataloged all findings by severity and exploitability. Designed remediation plans for shared credentials, public endpoints, Key Vault gaps, and missing managed identity strategy.
Restructured Docker builds into multi-stage BuildKit pipeline with parallel execution. Configured ACR layer caching, separated dependency resolution from compilation. Reduced build times from 45 minutes to 4.5 minutes.
Refactored monolithic Terraform into region-agnostic modules. Imported existing EU-West resources into modular state. Deployed private endpoints, network policies, and Managed Identity across all workloads. Migrated observability to Azure Monitor + Grafana with tiered retention.
Provisioned APAC AKS cluster from identical Terraform modules with region-specific variables. Deployed regional Key Vault with independent token signing keys. Built cross-region token exchange microservice. Ran synthetic load testing against APAC endpoints.
Integrated Trivy container scanning and SonarQube SAST into CI/CD pipeline. Configured KEDA autoscaling for event-driven workloads. Migrated all Azure Functions to private endpoints. Enforced pod security standards on both AKS clusters.
Blue-green cutover of APAC traffic over 72 hours with continuous monitoring. Independent data sovereignty audit. Knowledge transfer to platform engineering team covering dual-region operations, security tooling, and observability management.
The Docker build optimization delivered a 90% reduction in CI build times, from 45 minutes to 4.5 minutes per image. For the 15-engineer team pushing 8–10 times per day, this eliminated approximately 6 hours of daily cumulative queue time. Developers returned to frequent small commits, which improved code review quality (smaller diffs are easier to review), reduced merge conflicts, and enabled faster regression bisection. The build optimization alone was cited by the engineering lead as the change that had the most immediate positive impact on team productivity and morale.
The dual-region deployment was completed within the 14-week timeline, with the APAC region accepting production traffic in week 12 after two weeks of synthetic load testing. The blue-green cutover proceeded without incident — APAC traffic was shifted over 72 hours with continuous monitoring, and at no point did error rates in either region exceed baseline levels. APAC authentication latency dropped from 180–220ms (routed through EU) to 18–25ms (served locally), bringing it well within the 2-second session budget. Data sovereignty compliance was verified by an independent audit that confirmed zero cross-region data leakage across authentication tokens, session data, and telemetry.
Observability costs dropped by 30%, saving approximately $54,000 annually. The transition from Datadog to the self-managed Azure Monitor and Grafana stack reduced per-GB ingestion costs, while the log level governance policy eliminated the 60% of data volume that was debug-level noise. The KEDA-managed log processing pipeline handled traffic spikes without manual intervention — during a simulated incident scenario, the pipeline scaled from 3 to 22 processor nodes within 90 seconds and returned to baseline within 10 minutes after the spike subsided. The tiered retention strategy met the compliance audit requirements while keeping hot storage costs minimal.
The DevSecOps integration transformed the security posture from reactive (audit findings addressed after discovery) to proactive (vulnerabilities caught and blocked before reaching production). In the first month after deployment, Trivy blocked 7 container images with critical CVEs that would have previously been deployed without review. SonarQube caught 23 code quality issues across pull requests, including 3 that were classified as security vulnerabilities. Managed Identity adoption eliminated all stored credentials from the platform — the client's security team confirmed that there were zero service principal secrets in any environment, a state they described as unprecedented in their infrastructure.
The platform met its 99% availability SLA across both regions during the first quarter of dual-region operation, with zero unplanned downtime events. Idle VM costs were reduced by 60% through KEDA-driven scaling that matched provisioned capacity to actual demand. The regional infrastructure parity principle proved its value within the first month when a configuration change needed to be applied to both regions — the identical Terraform modules meant the change was a single PR affecting two variable files, deployed through the same pipeline, with identical validation steps. The client's platform engineering team confirmed they could operate both regions independently without CloudForge involvement, meeting the project's explicit goal of zero ongoing consulting dependency.
Dual-region Kubernetes clusters with private endpoints and pod security standards
Region-specific secrets management with 90-day automated rotation
Private container registry with BuildKit layer caching
Region-agnostic modules with workspace-level state isolation
Event-driven autoscaling for log processors and IoT webhook functions
Pod-level workload identity replacing all stored credentials
Container vulnerability scanning blocking critical CVEs from deployment
SAST analysis with quality gate enforcement on pull requests
Unified dashboards with per-region views and alert configurations
Metrics collection and tiered log retention (hot/warm/cold)
Parallel multi-stage Docker builds with aggressive layer caching
Network isolation for all PaaS services across both regions
Docker layer caching is the single highest-ROI CI optimization for most teams. The 45-minute to 4.5-minute reduction required no new tooling — just restructuring the existing Dockerfile for multi-stage builds with proper layer separation. The key insight is separating dependency resolution from code compilation: dependencies change rarely and should be cached aggressively, while application code changes frequently and should be in the final, thinnest layer. This pattern applies to virtually every language and framework.
Managed Identity eliminates credential rotation as an operational task entirely. The shift from service principal secrets to Azure Managed Identity removed an entire category of security risk — stored credentials cannot be leaked if they do not exist. The migration effort was modest (each workload required a Managed Identity assignment and a role binding), but the operational benefit was permanent: no more rotation schedules, no more expired credentials causing production incidents, no more secrets stored in environment variables.
Log level governance is an overlooked cost lever. Debug-level logs in production are pure waste — they consume storage, processing capacity, and analyst attention without providing operational value. The 60% volume reduction from enforcing INFO-level in production translated directly to cost savings and improved signal-to-noise ratio. The 4-hour expiring feature flag for temporary DEBUG access balanced the need for detailed diagnostics during incident investigation with the cost control requirement.
Region replication via Terraform modules scales linearly when designed modularly from day one. The refactoring effort was front-loaded in the EU region, but once the modules were parameterized, deploying the APAC region was a matter of creating variable files and running terraform apply. If the infrastructure had been modular from the beginning, the dual-region deployment would have been a one-week effort instead of a six-week effort. Every organization planning multi-region expansion should invest in modular Terraform architecture before the second region is needed, not after.
“CloudForge delivered our dual-region platform on a timeline that our internal team estimated would take 6 months. The security hardening was equally impressive — we went from a security audit with 14 open findings to a clean report in 14 weeks. But what surprised us most was the build optimization: cutting our Docker builds from 45 minutes to 4.5 minutes had more impact on developer productivity than any other single change we've made in three years. The team is faster, the platform is secure, and we hit our APAC compliance deadline with weeks to spare.”
40-customer multi-tenant SaaS with 60% deploy success rate and 20+ manual steps per deployment. One engineer spending 100% of their time (~$75K/yr) on manual RDP/GUI releases. Platform needed to scale to 1,000+ customers without adding ops headcount.
Telecom provider running critical workloads on bare-metal and VM infrastructure with 4-hour update cycles, manual node management, and growing operational complexity. Operations team lacked Kubernetes expertise to adopt container orchestration.
Every engagement starts with a conversation about your infrastructure challenges. Let's discuss how CloudForge can help.
Schedule a Consultation