Manufacturing & Industrial

Telecom Infrastructure Modernization — On-Premises Kubernetes Migration

CloudForge migrated a European telecom provider's bare-metal VM infrastructure to Kubernetes, building custom Go operators that enabled the existing ops team to self-manage clusters without external support, while achieving 80% faster update cycles and $60K/yr operational savings.

4h → 45 min (80%)
Update cycle
$60K/yr savings
Operational efficiency
50+, zero rollbacks
Deployments
99.5%
Uptime
16 weeks · 2 engineers
Kubernetes · On-Premises · Custom Operators · Bare Metal

A European telecommunications provider

The client is a European telecommunications provider operating critical network infrastructure that serves over 2 million subscribers across mobile, broadband, and enterprise connectivity services. Their infrastructure spans three geographically distributed data centres within a single country, hosting network management systems, billing platforms, subscriber provisioning services, and real-time traffic analytics workloads. The operational backbone of this infrastructure was a fleet of bare-metal servers and VMware virtual machines that had been accumulated over 7 years of organic growth, managed by an 8-person operations team using a combination of Python scripts, Bash automation, Ansible playbooks, and Java management tools.

The modernization imperative came from two converging pressures. First, routine platform updates — things as simple as deploying a configuration change or rolling out a security patch — required a 4-hour change window that involved executing Ansible playbooks sequentially across server groups, manually verifying each step, preparing rollback scripts, and coordinating with the network operations centre to monitor for service impact. The 4-hour window limited the team to two updates per week, creating a backlog of 30+ pending changes at any given time. Second, the company was preparing to launch 5G services, which would require deploying new network functions at a pace incompatible with the current 4-hour update cycle. Management had committed to Kubernetes as the target platform but faced a critical gap: the 8-person operations team had zero container or Kubernetes experience.

The vendor dependency concern was central to the client's decision-making. Previous modernization initiatives involving external consultants had created ongoing dependencies — the team would learn enough to operate the new system during the engagement, but when edge cases or unusual failures occurred, they had to call the consultants back at premium rates. Management explicitly stated that the outcome they valued most was not the Kubernetes migration itself but the operations team's ability to manage the platform independently after CloudForge's engagement ended. This requirement shaped every technical decision we made: we would not just deploy Kubernetes; we would build the tooling and transfer the knowledge needed for the ops team to own it completely.

The infrastructure supported workloads with distinctly different operational profiles and criticality levels. Network management systems monitored link health, traffic routing, and equipment status across the provider's physical infrastructure — these systems required sub-second response times and five-nines availability because a monitoring blind spot could delay fault detection during a network outage affecting thousands of subscribers. Subscriber provisioning services handled activation, plan changes, and number portability requests — batch-oriented during off-peak hours but requiring real-time processing during retail store operating hours when new customers were being onboarded. Billing integration workloads ran nightly reconciliation jobs against the financial systems and were subject to strict data integrity requirements mandated by the national telecommunications regulator. The regulatory environment added a compliance dimension to every technical decision: the regulator required documented change management procedures, auditable configuration histories, and demonstrated recovery capabilities that could restore service within defined time windows after any infrastructure failure. Any Kubernetes migration would need to satisfy these regulatory requirements from day one, not as a post-migration remediation item.

Legacy Operations at Scale with Zero Container Experience

The 4-hour update cycle was not a property of the technology — it was a property of the process. Each update followed a rigid sequence: the operations lead would draft a change request with a detailed list of affected servers, the change request would be reviewed by the network operations centre, an Ansible playbook would be selected or written for the specific change, the playbook would be executed against a staging group of servers while the team monitored real-time service metrics, verification would be performed manually (SSH into each server, check logs, verify service status), and only then would the playbook be executed against the remaining server groups. The rigidity of this process existed for good reason — a misconfigured network management system could disrupt service for hundreds of thousands of subscribers — but the process had not evolved as the infrastructure grew, meaning updates that would have taken 30 minutes on the original 10-server fleet now took 4 hours across 120+ servers.

The automation landscape was a case study in institutional knowledge trapped in ad-hoc scripts. The team maintained 47 Ansible playbooks, 23 Bash scripts, 15 Python utilities, and 8 Java management tools. These 93 separate automation artifacts had been written by different team members over a 7-year period, with varying coding styles, no consistent error handling, and minimal documentation. Several scripts had been written by engineers who had since left the company, and their behavior was understood only through tribal knowledge passed between team members. Three Ansible playbooks contained hardcoded IP addresses that referenced servers decommissioned two years earlier — the playbooks still "worked" because the unreachable hosts simply timed out, adding 3–5 minutes of silent delay to every execution.

Node management — adding new compute capacity or decommissioning retired hardware — was a particularly painful process. Adding a single bare-metal server to the operational fleet required a 14-step manual procedure spanning 2 days: physical rackmounting and cabling (day 1), OS installation and base configuration, network interface configuration and VLAN assignment, firewall rule updates, monitoring agent installation, joining the Ansible inventory, running the baseline configuration playbook, service-specific configuration, integration testing against the existing fleet, updating the CMDB (configuration management database), notifying dependent teams, and final sign-off from the operations lead. Removing a server was equally laborious, with the additional risk that services might still depend on the hardware through undocumented references in one of the 93 automation scripts.

Resource utilization across the fleet was strikingly poor. Average CPU utilization was 35% across all bare-metal and VM workloads, a consequence of the one-workload-per-VM deployment model. Each service — whether it consumed 5% or 50% of a server's capacity — was deployed on a dedicated VM or bare-metal host. This isolation model had been adopted for operational simplicity (a failing service could not impact its neighbours) but at the cost of massive compute waste. With 120+ servers, the 65% idle capacity represented significant capital expenditure on hardware that was doing nothing.
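The scale of the waste follows directly from the figures in the text; a quick sketch of the arithmetic (the 72% figure is the post-migration utilization reported later in the results):

```python
SERVERS = 120          # fleet size from the assessment
UTIL_BEFORE = 0.35     # measured average CPU utilization, one workload per VM
UTIL_AFTER = 0.72      # post-migration figure reported in the results

def idle_server_equivalents(servers: int, utilization: float) -> float:
    """Idle capacity expressed as whole-server equivalents."""
    return servers * (1.0 - utilization)

before = idle_server_equivalents(SERVERS, UTIL_BEFORE)  # roughly 78 servers' worth idle
after = idle_server_equivalents(SERVERS, UTIL_AFTER)    # roughly 34 servers' worth idle
```

At 35% utilization, the equivalent of nearly 78 whole servers sat idle; bin-packing later cut that by more than half.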

The skill gap was the most sensitive challenge. The 8-person operations team was experienced, competent, and deeply knowledgeable about their current infrastructure — but they had never worked with containers, orchestration systems, or declarative infrastructure management. Several team members were openly sceptical that Kubernetes was appropriate for telecom-grade workloads, citing concerns about networking complexity, persistent storage reliability, and the maturity of Kubernetes for latency-sensitive real-time systems. These concerns were legitimate and needed to be addressed with evidence rather than dismissed. Any migration strategy that left the team feeling dependent on external expertise would be rejected culturally, regardless of technical merit.

The 85 automation scripts (the 93 artifacts minus the 8 Java management tools) represented the most complex migration challenge because they encoded seven years of institutional knowledge in code that nobody fully understood. Of the 47 Ansible playbooks, 12 had not been modified in over 3 years and were effectively black boxes — they worked, but the logic behind specific configuration choices was undocumented and the original authors had left the company. The 23 Bash scripts ranged from 15-line convenience wrappers to 800-line orchestration scripts with nested conditionals, undocumented environment variable dependencies, and hardcoded paths to files that had been moved or renamed. The 15 Python utilities included a custom monitoring dashboard, a capacity planning calculator, and a change management tracker — each written by a different engineer with different coding conventions, different Python versions (2.7 through 3.9), and different approaches to configuration management. Version control was inconsistent: 31 of the 85 scripts existed only on the servers where they ran, with no repository backup and no change history beyond file modification timestamps. Migrating to Kubernetes meant not just replacing these scripts with manifests and operators, but first understanding what each script actually did — a reverse-engineering effort that consumed a significant portion of the planning phase.

Skills Assessment Plus Parallel Migration and Training Tracks

We began with a skills assessment that went beyond a generic "Kubernetes knowledge" survey. We mapped each team member's expertise against the specific capabilities they would need to operate the target platform: container debugging, YAML manifest management, kubectl operations, log aggregation analysis, Prometheus metric interpretation, and operator development fundamentals. The assessment revealed that while the team had no Kubernetes experience, they had strong Linux systems knowledge, solid networking fundamentals, and extensive experience with automation scripting — all transferable skills that would accelerate their Kubernetes learning curve. This assessment shaped the training curriculum to build on existing strengths rather than starting from zero.

The technical approach used Kubespray for on-premises Kubernetes deployment rather than a managed Kubernetes service (the client's on-premises requirement excluded cloud-managed options) or kubeadm (which would have required more manual configuration and provided fewer production-grade defaults out of the box). Kubespray provided a battle-tested Ansible-based deployment that could provision a production-grade cluster across the client's bare-metal servers, configure networking (Calico for network policies), set up etcd in HA configuration, and deploy core services (CoreDNS, MetalLB for bare-metal load balancing) — all through the Ansible framework the team already understood.

The migration followed a strangler-fig pattern: existing workloads continued running on their current VMs while containerized versions were deployed alongside them on the new Kubernetes cluster. Traffic was gradually shifted to the containerized versions, with the VM-based deployments remaining available as instant fallback. This approach eliminated the risk of a big-bang migration and allowed the team to gain operational experience with Kubernetes while still having the familiar VM infrastructure as a safety net. Each workload migration was a reversible operation — if any issue appeared, traffic could be shifted back to the VM version within minutes.
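The traffic-shifting mechanics can be sketched as a weighted router. This is a simplified in-memory model — the actual shift happened at the load-balancer layer, and the class name and step size are illustrative:

```python
import random

class StranglerRouter:
    """Routes a growing fraction of traffic to the new containerized
    backend while the legacy VM backend remains an instant fallback."""

    def __init__(self, k8s_weight: float = 0.0):
        self.k8s_weight = k8s_weight  # 0.0 = all VM, 1.0 = all Kubernetes

    def pick_backend(self, rng=random.random) -> str:
        """Choose a backend for one request according to the current weight."""
        return "kubernetes" if rng() < self.k8s_weight else "vm"

    def shift(self, step: float = 0.25) -> None:
        """Gradually move more traffic toward Kubernetes."""
        self.k8s_weight = min(1.0, self.k8s_weight + step)

    def rollback(self) -> None:
        """Instant fallback: send everything back to the VM deployment."""
        self.k8s_weight = 0.0
```

A migration starts at weight 0.0, shifts in steps while metrics are watched, and can drop back to the VM version with a single `rollback()` — the "reversible within minutes" property described above.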

The most distinctive aspect of our approach was designing custom Go operators specifically so the ops team could self-manage the cluster without ongoing CloudForge dependency. Rather than configuring the cluster with complex Helm charts or custom controllers that only we understood, we built three purpose-built operators that encoded the team's existing domain knowledge into Kubernetes-native automation. The operators were designed to be readable, modifiable, and extensible by the team — written in straightforward Go with extensive inline documentation, comprehensive test suites, and architecture that mirrored the team's mental model of their infrastructure. The goal was that within 4 weeks of handover, any team member could modify an operator's behaviour without external assistance.
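All three operators follow the same declarative pattern: observe actual state, diff it against desired state, and emit corrective actions. A minimal Python sketch of one reconciliation pass — the real operators are written in Go against controller-runtime, and the action names here are illustrative:

```python
def reconcile(desired: dict, actual: dict) -> list:
    """One pass of a reconciliation loop: compute the actions that move
    actual state toward desired state (simplified, in-memory model)."""
    actions = []
    for name, spec in desired.items():
        if name not in actual:
            actions.append(("create", name, spec))   # missing resource
        elif actual[name] != spec:
            actions.append(("update", name, spec))   # drifted resource
    for name in actual:
        if name not in desired:
            actions.append(("delete", name, None))   # orphaned resource
    return actions
```

Because the loop always converges toward declared state, the same logic handles first-time creation, drift correction, and cleanup — which is why the team's procedural scripts collapsed into so few artifacts.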

Production Kubernetes with Custom Operators and Comprehensive Knowledge Transfer

The Kubernetes cluster was deployed across bare-metal servers in two of the three data centres using Kubespray, with etcd running in a 5-node HA configuration spanning both sites. Calico provided the container networking layer with full network policy support, enabling pod-to-pod communication restrictions that replicated the network isolation the team had previously achieved through VLANs and firewall rules. MetalLB was configured for bare-metal load balancing, providing stable IP addresses for services that needed to be reachable from the existing non-containerized infrastructure. RBAC was configured from day one, with role bindings that matched the team's existing responsibility model — each engineer received permissions aligned with their operational scope, and a cluster-admin role was restricted to the two senior team leads.
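The role-to-permission mapping can be sketched as follows. The role names are hypothetical, and real enforcement used Kubernetes Role and RoleBinding objects rather than application code:

```python
# Hypothetical role names mapped to allowed verbs; the real cluster used
# Kubernetes Role/RoleBinding objects aligned with the team's responsibilities.
ROLE_PERMISSIONS = {
    "cluster-admin": {"*"},                                    # two senior leads only
    "platform-operator": {"get", "list", "watch", "update", "patch"},
    "read-only": {"get", "list", "watch"},
}

def is_allowed(role: str, verb: str) -> bool:
    """Check whether a role may perform a verb (simplified RBAC semantics)."""
    verbs = ROLE_PERMISSIONS.get(role, set())
    return "*" in verbs or verb in verbs
```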

The three custom Go operators were the centrepiece of the solution, designed to automate the most time-consuming and error-prone operational tasks while remaining fully under the team's control. The NodeManager operator automated the server lifecycle: when a new bare-metal server was added to the cluster, the operator detected it, ran configuration validation, applied baseline security policies, installed monitoring agents, and registered the node as available for workload scheduling — replacing the 14-step, 2-day manual process with a 15-minute automated workflow. When a server was marked for decommissioning, the operator cordoned it, drained workloads to healthy nodes, verified that no service was degraded, and updated the CMDB — all automatically, with human approval gates at each critical step.
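The decommissioning flow described above amounts to a state machine with a human approval gate between steps. A minimal sketch, with step names taken from the text:

```python
class NodeDecommission:
    """Decommissioning workflow as a gated state machine. Each transition
    requires explicit human approval, mirroring the operator's approval
    gates at critical steps (names follow the text; the model is a sketch)."""

    STEPS = ["cordon", "drain", "verify_services", "update_cmdb", "done"]

    def __init__(self):
        self.step = 0

    @property
    def state(self) -> str:
        return self.STEPS[self.step]

    def approve(self) -> str:
        """A human approval advances the workflow exactly one step."""
        if self.step < len(self.STEPS) - 1:
            self.step += 1
        return self.state
```

The operator performs each step automatically but will not advance past a gate on its own — automation does the work, humans retain control.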

The ServiceHealth operator implemented self-healing logic for telecom workloads. It monitored service health endpoints, correlated metrics with defined SLO thresholds, and took corrective action when degradation was detected: restarting unhealthy pods, scaling replica counts during traffic spikes, and triggering alerts when automated remediation was insufficient. The operator's health check logic was specifically designed around the team's existing monitoring practices — the same metrics they watched in Grafana dashboards were the metrics the operator evaluated programmatically. This made the operator's behaviour predictable and debuggable by team members who understood the underlying health criteria.
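The operator's decision logic can be sketched as threshold checks with an escalation order. The metric and threshold names here are assumptions, not the operator's real schema:

```python
def remediation_action(metrics: dict, slo: dict) -> str:
    """Map observed metrics against SLO thresholds to one corrective action.
    Names and escalation order are illustrative, not the real schema."""
    if metrics.get("restarts_last_hour", 0) >= slo["max_restarts"]:
        return "alert"        # automated remediation insufficient: page a human
    if metrics["error_rate"] > slo["max_error_rate"]:
        return "restart_pod"  # unhealthy pod gets restarted first
    if metrics["p99_latency_ms"] > slo["max_p99_latency_ms"]:
        return "scale_up"     # latency pressure: add replicas
    return "none"
```

Because the thresholds mirror the same Grafana metrics the team already watched, an engineer can predict what the operator will do just by reading the dashboard.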

The ConfigSync operator automated configuration distribution across the cluster. When a configuration change was committed to the team's Git repository, the operator validated the change against a schema, applied it to the appropriate pods with a rolling update strategy, and verified that the new configuration took effect without service disruption. This replaced the Ansible-based configuration management workflow that had been the primary source of the 4-hour update cycle. ConfigSync respected the same staged rollout pattern the team was accustomed to — changes were applied to a canary group first, verified, and then rolled out to the remaining pods — but the entire process completed in minutes rather than hours because the verification and rollout steps were automated.
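The staged rollout reduces to four phases: validate, apply to a canary, verify, then roll out to the rest. This simplified model treats pods as dictionaries and takes the schema-validation and post-change verification checks as caller-supplied functions:

```python
def staged_rollout(pods, new_config, validate, verify):
    """Apply new_config to a one-pod canary, verify it, then roll out to
    the remaining pods. Returns the names of updated pods, or [] if the
    schema validation or canary verification fails (sketch, not real API)."""
    if not validate(new_config):
        return []                      # schema validation gate
    canary, rest = pods[:1], pods[1:]
    for pod in canary:                 # canary stage
        pod["previous"], pod["config"] = pod["config"], new_config
    if not all(verify(pod) for pod in canary):
        for pod in canary:             # revert the canary; the rest untouched
            pod["config"] = pod.pop("previous")
        return []
    for pod in canary:
        pod.pop("previous", None)
    for pod in rest:                   # full rollout after canary passes
        pod["config"] = new_config
    return [pod["name"] for pod in pods]
```

A failed canary leaves production pods on the old configuration, which is why bad changes never progressed past the canary stage in the deployment record below.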

The automation consolidation was substantial. The 47 Ansible playbooks, 23 Bash scripts, and 15 Python utilities were replaced by 12 Kubernetes manifests and the 3 Go operators. The 8 Java management tools were replaced by kubectl commands and Grafana dashboards. The total codebase reduction was approximately 20% by line count, but the reduction in operational complexity was far greater — instead of 93 disparate automation artifacts with inconsistent behaviour, the team now had a unified operational model based on Kubernetes declarative state management and operator-driven automation.

The knowledge transfer programme occupied two full weeks and was structured as hands-on workshops rather than classroom lectures. Each of the 8 engineers completed a 40-hour curriculum covering Kubernetes fundamentals (architecture, scheduling, networking, storage), operational procedures (kubectl troubleshooting, log analysis, metric interpretation), operator development (Go basics, controller-runtime framework, reconciliation loops), and failure scenarios (node failures, network partitions, etcd quorum loss). The final assessment required each engineer to independently diagnose and resolve a simulated production incident — all 8 passed with scores that indicated certification readiness for the CKA (Certified Kubernetes Administrator) examination.

The Kubespray deployment was tailored to the client's bare-metal environment with specific configuration choices driven by telecom workload requirements. Calico was selected as the CNI plugin not only for its network policy support but for its BGP peering capability, which allowed the Kubernetes cluster to integrate with the existing data centre network fabric without overlay networking overhead — critical for the network management workloads that required predictable sub-millisecond local network latency. Persistent storage was provisioned using local volumes on dedicated storage servers with Longhorn providing replication across data centre boundaries, ensuring that stateful workloads like the billing reconciliation database survived single-node failures without data loss. MetalLB was configured with a dedicated IP address pool per service tier — critical services received addresses from a high-priority pool with pre-configured firewall rules, while internal tooling used a separate pool with more relaxed access controls. The entire cluster configuration was codified in a Git repository that the operations team managed through pull requests, establishing the GitOps workflow that would eventually replace the ad-hoc Ansible-based change management process and satisfy the regulator's requirement for auditable configuration histories.
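The per-tier address pooling can be sketched as follows. The CIDR ranges are illustrative placeholders, and real allocation is handled by MetalLB itself rather than custom code:

```python
import ipaddress

# Illustrative address ranges; the real pools are site-specific.
POOLS = {
    "critical": ipaddress.ip_network("10.10.1.0/28"),  # high-priority tier
    "internal": ipaddress.ip_network("10.10.2.0/24"),  # internal tooling
}

def allocate(tier: str, allocated: set) -> str:
    """Hand out the next free address from the tier's pool, mimicking
    MetalLB's per-pool allocation behaviour."""
    for host in POOLS[tier].hosts():
        addr = str(host)
        if addr not in allocated:
            allocated.add(addr)
            return addr
    raise RuntimeError(f"address pool exhausted for tier {tier!r}")
```

Keeping the critical tier in its own small pool means its firewall rules can be pre-configured once for a known address range.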

How We Delivered

1. Assessment & Planning (Weeks 1–3)

Skills assessment for 8-person ops team. Inventory of 47 Ansible playbooks, 23 Bash scripts, 15 Python utilities, and 8 Java management tools. Workload dependency mapping across 120+ servers in 3 data centres. Migration sequence planning using strangler-fig pattern.

2. Kubernetes Cluster Deployment (Weeks 4–6)

Kubespray-based deployment across bare-metal servers. 5-node HA etcd, Calico networking with network policies, MetalLB load balancing, RBAC with role-mapped permissions. Core services: CoreDNS, Prometheus, Grafana.

3. Operator Development (Weeks 7–10)

Built three custom Go operators: NodeManager (automated server lifecycle), ServiceHealth (self-healing with SLO-based remediation), ConfigSync (Git-driven configuration distribution with staged rollout). Comprehensive test suites and inline documentation for each operator.

4. Workload Migration (Weeks 11–13)

Strangler-fig migration of production workloads from VMs to Kubernetes. Traffic shifted gradually with VM-based fallback maintained throughout. Consolidated 93 automation scripts into 12 manifests + 3 operators. Validated 99.5% uptime target across all migrated services.

5. Team Training (Weeks 14–15)

40-hour hands-on curriculum for 8 engineers covering Kubernetes fundamentals, operational procedures, operator development in Go, and failure scenario diagnosis. Final assessment: simulated production incidents requiring independent resolution.

6. Validation & Handover (Week 16)

Team operated the platform independently for one week with CloudForge in advisory-only mode. Handled one unplanned incident (network policy misconfiguration) and one planned deployment without external assistance. Formal handover with runbooks and escalation paths.

Self-Sufficient Team Operating Modern Infrastructure


Update cycles dropped from 4 hours to 45 minutes — an 80% reduction — with the majority of the remaining time consumed by the human approval gate rather than automated execution. The ConfigSync operator handled configuration distribution, validation, and staged rollout automatically; the 45-minute duration reflected the time for the operations lead to review the proposed change, approve it, monitor the canary deployment, and approve the full rollout. For routine changes (security patches, configuration updates), the entire process could be completed by a single engineer without coordinating with the network operations centre, because the operator's automated validation provided the same assurance that manual verification previously delivered.

Operational savings reached $60,000 annually, composed of reduced overtime (the 4-hour update windows frequently extended into evenings), elimination of two VMware licence renewals that were no longer needed, and reduced hardware procurement from improved resource utilization. CPU utilization improved from 35% to 72% through Kubernetes bin-packing — workloads that previously required dedicated VMs now shared compute resources efficiently, with resource limits and requests ensuring that critical services maintained performance guarantees. The improved utilization deferred the next hardware procurement cycle by an estimated 18 months.

The deployment track record was exceptional: 50+ deployments were executed in the first quarter post-handover with zero rollbacks required. The strangler-fig migration pattern and the operators' automated health verification meant that failing deployments were caught by the canary stage before they could impact production traffic. The 99.5% uptime target was met across all services during the migration period and exceeded (99.7%) in the first full quarter of Kubernetes-native operation. The only unplanned downtime event (14 minutes) was caused by an incorrectly configured network policy that blocked inter-pod communication for a non-critical internal analytics service — the ServiceHealth operator detected the issue and alerted the team, who resolved it by reverting the network policy change.

The team's self-sufficiency — the outcome management valued most — was validated within 4 weeks of handover. During this period, the team independently handled a bare-metal server failure (the NodeManager operator drained workloads automatically while the team replaced the hardware), deployed a new network function for the 5G preparation programme (using the standard manifest-plus-operator workflow without CloudForge involvement), and made a modification to the ServiceHealth operator's SLO thresholds (a Go code change, tested locally, reviewed by a peer, and deployed through the standard CI pipeline). The operations lead reported that the team's confidence in managing Kubernetes had progressed from apprehension to ownership within 6 weeks — faster than the original 3-month estimate.

The custom Go operators proved to be the key differentiator that previous consulting engagements had lacked. By encoding the team's domain knowledge — their health check criteria, their rollout procedures, their node provisioning steps — into operators that the team could read, understand, and modify, we transferred not just a technology but a capability. Six months after the engagement, the team had written a fourth operator independently (a BackupManager that automated etcd snapshot management), confirming that the knowledge transfer had been thorough enough to enable ongoing autonomous development.

The team transformation — from manual VM operators to Kubernetes-native platform engineers — was the most significant long-term outcome of the engagement. The cultural shift was visible in how the team described their own work: before the migration, they spoke about running playbooks and managing servers; after the migration, they spoke about declaring desired state and operating the platform. The CKA certification readiness confirmed during the final assessment translated into concrete qualifications: within 6 months, 5 of the 8 engineers had achieved CKA certification, and two pursued the CKS (Certified Kubernetes Security Specialist) qualification independently. Management reported that recruitment became significantly easier — the Kubernetes platform attracted candidates who would not have considered a role managing VMware and Ansible infrastructure, and the team filled two open positions within 6 weeks of posting compared to the 4-month average hiring cycle under the previous technology stack. The 5G network function deployment, which had been the original catalyst for the modernization initiative, proceeded on schedule using the standard Kubernetes manifest and operator workflow, with the operations team handling the deployment end-to-end without external assistance — delivering on management's primary success criterion of complete operational independence.

Tools & Platforms

Kubernetes (Kubespray)

On-premises cluster deployment with Ansible-based provisioning

Go Operators

Three custom operators: NodeManager, ServiceHealth, ConfigSync

Bare Metal

120+ servers across 3 data centres migrated from VMware VMs

RBAC

Role-based access control mapped to existing team responsibility model

Network Policies (Calico)

Pod-to-pod communication restrictions replacing VLAN isolation

Prometheus

Metrics collection powering ServiceHealth operator SLO evaluation

Grafana

Operational dashboards replacing custom Java management tools

etcd

5-node HA configuration spanning two data centres for cluster state

CoreDNS

Cluster DNS service replacing legacy internal DNS infrastructure

MetalLB

Bare-metal load balancer providing stable service IP addresses

Lessons Learned

1. Custom operators that encode domain knowledge eliminate vendor dependency. The three Go operators were not generic Kubernetes controllers — they were codified versions of the procedures the ops team had been performing manually for 7 years. Because the operators reflected the team's own mental model of their infrastructure, the team could understand, debug, and extend them independently. The proof came 6 months later when the team wrote a fourth operator without external assistance.

2. On-premises Kubernetes is viable but requires Kubespray or kubeadm expertise that most teams lack initially. The bare-metal deployment surface area — networking (CNI selection, MetalLB configuration), storage (local persistent volumes, CSI drivers), and cluster lifecycle (etcd backup, certificate rotation, version upgrades) — is significantly larger than managed cloud Kubernetes. Kubespray handled most of this complexity, but the team needed to understand the underlying components to troubleshoot production issues effectively.

3. Team training ROI exceeds infrastructure savings. The $60K annual operational saving was the quantifiable outcome, but the team's transformation from Kubernetes sceptics to independent platform operators was the outcome that changed the company's technical trajectory. Self-sufficient teams scale better than teams dependent on external expertise, because they can respond to novel situations without waiting for consultant availability.

4. Consolidating 93 automation scripts into Kubernetes manifests and operators reduced incidents by approximately 60% in the first quarter. The incident reduction was not primarily because Kubernetes is more reliable than VMs — it was because the consolidated automation eliminated the inconsistencies, hardcoded references, and undocumented behaviours that were the root cause of most operational incidents under the previous model. Fewer, better-structured automation artifacts mean fewer opportunities for misconfiguration.

We've worked with three different consulting firms on infrastructure modernization projects over the past 5 years, and every time we ended up calling them back within 6 months because our team couldn't manage the new systems independently. CloudForge was different — the custom operators they built are written in a way our team actually understands, and the training programme gave us the confidence to modify and extend them ourselves. We wrote our own fourth operator 6 months after handover. That's never happened before.
Lars Eriksson
Head of Network Operations, European Telecommunications Provider

Ready to Achieve Similar Results?

Every engagement starts with a conversation about your infrastructure challenges. Let's discuss how CloudForge can help.

Schedule a Consultation