Sovereign infrastructure for regulated industries
We design, build, and operate private cloud environments for organizations where data sovereignty, compliance, and physical control are non-negotiable. From VMware to OpenStack, we deliver public-cloud agility on your own hardware.
Not everything belongs in public cloud. Data sovereignty laws, sub-millisecond latency requirements, regulatory mandates for physical control, and GPU density for ML training sometimes demand on-premises or colocation infrastructure that no hyperscaler region can satisfy. CloudForge builds private clouds that feel like public cloud to developers: self-service provisioning through a portal or CLI, automated horizontal scaling based on resource pressure, full observability with metrics, logs, and traces unified in a single pane, and infrastructure-as-code workflows that make the private cloud as reproducible as any Terraform-managed AWS account.
We use OpenStack, VMware Tanzu, and bare-metal Kubernetes to create enterprise private clouds tailored to the workload profile. OpenStack for organizations that need VM-based compute with software-defined networking and block storage. Tanzu for enterprises invested in the VMware ecosystem that want container orchestration without abandoning vSphere. Bare-metal Kubernetes with MetalLB, Calico BGP networking, and Rook-Ceph storage for workloads that cannot tolerate the hypervisor tax — GPU-dense ML training, latency-sensitive trading systems, and high-throughput data pipelines.
Our approach is not "keep the old datacenter running"; it is building cloud-native infrastructure that happens to run on your hardware. That means every server is provisioned through Cluster API or MAAS, every configuration is managed in Git with Ansible or Terraform, OS patching is automated with rolling drain-and-cordon workflows, certificate rotation is handled by cert-manager, and developer self-service portals are built on Backstage or a custom platform engineering layer. The private cloud we build is one your engineers want to use, not one they work around.
Enterprise virtualization with software-defined networking, micro-segmentation, and DRS automation.
Production OpenStack clouds with automated day-2 operations and rolling upgrades.
Unified management across on-prem and public cloud with consistent policy enforcement.
Server provisioning, firmware updates, and capacity planning with automated workflows.
Pre-audited infrastructure patterns with continuous compliance monitoring and evidence generation.
RPO/RTO-driven DR strategies with automated failover testing and runbook validation.
Kubernetes clusters running directly on physical servers without a hypervisor layer. MetalLB provides Layer 2 or BGP-based load balancer IP allocation. Calico handles pod networking with BGP peering to physical switches for routable pod IPs. Node provisioning automated through Cluster API with IPAM integration for hardware inventory management.
Workloads requiring maximum compute density, GPU pass-through without virtualization overhead, or latency-sensitive applications where hypervisor scheduling jitter is unacceptable.
Full IaaS platform with Nova for VM compute, Neutron for software-defined networking with VXLAN overlays, Cinder for block storage backed by Ceph, and Keystone for identity federation with enterprise AD/LDAP. Heat templates provide infrastructure-as-code. Horizon dashboard offers self-service for development teams.
Organizations with existing VM-based workloads that need private cloud agility without refactoring to containers, or teams that require multi-tenant isolation at the infrastructure level with per-project quotas and network segmentation.
Tanzu Kubernetes Grid integrated with vSphere for container orchestration on existing VMware infrastructure. vSphere CSI driver provides persistent storage from VSAN or external arrays. NSX-T handles load balancing and micro-segmentation for pod traffic. vCenter provides unified management for both VMs and Kubernetes workloads.
Enterprises with significant VMware investment, existing vSphere operational expertise, and a goal of running Kubernetes workloads alongside traditional VMs without introducing a separate infrastructure stack.
On-premises Kubernetes cluster connected to public cloud Kubernetes (AKS, EKS, GKE) through a unified control plane. Azure Arc or Anthos manages policy, observability, and GitOps across both environments. Service mesh (Istio or Linkerd) handles cross-cluster service discovery and encrypted communication. Workload placement decisions based on data locality, cost, and latency requirements.
Organizations that need to keep sensitive data on-premises while bursting compute-intensive workloads to public cloud, or teams migrating incrementally from private to public cloud with workloads running in both environments during transition.
NVIDIA DGX or HGX systems managed as Kubernetes nodes with GPU Operator for driver lifecycle, device plugin for GPU scheduling, and MIG for GPU partitioning. RDMA networking with InfiniBand or RoCE for distributed training. Shared storage via NFS-over-RDMA or Lustre for training dataset access. Priority-based scheduling with preemption for batch training jobs.
ML teams that need dedicated GPU infrastructure for training large models, organizations with data sovereignty requirements that prevent using cloud GPU instances, or cost optimization when sustained GPU utilization exceeds 60% making reserved cloud instances more expensive than owned hardware.
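As a concrete sketch of GPU scheduling on such a cluster, the pod below requests a single MIG partition through the device plugin's extended resources. The pod name, image, priority class, and the specific MIG profile are illustrative assumptions; the resource naming follows the NVIDIA GPU Operator's mixed MIG strategy.

```yaml
# Hypothetical training pod requesting one MIG slice of an A100.
# Names marked "illustrative" are assumptions, not part of any real cluster.
apiVersion: v1
kind: Pod
metadata:
  name: train-llm                          # illustrative name
spec:
  priorityClassName: batch-training        # assumes a preemptible PriorityClass exists
  containers:
    - name: trainer
      image: registry.example.com/ml/trainer:latest   # placeholder image
      resources:
        limits:
          nvidia.com/mig-3g.40gb: "1"      # one 3g.40gb MIG partition
```

With the single MIG strategy, or without MIG, the request would instead be the whole-device resource `nvidia.com/gpu`.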
Cluster API (CAPI) with the Metal3 or Tinkerbell provider manages the full lifecycle of bare-metal Kubernetes clusters declaratively. MAAS (Metal as a Service) handles hardware discovery, PXE boot, OS provisioning, and IPAM. Each machine is represented as a BareMetalHost custom resource with firmware, BMC credentials, and hardware profile. CAPI MachineDeployments manage node pools with rolling update strategy — new nodes are provisioned and joined before old nodes are drained and decommissioned.
Use Cluster API for lifecycle management — not kubeadm directly. Kubeadm is a bootstrap tool; CAPI provides the declarative, reconciliation-based lifecycle (scaling, upgrades, repair) that bare-metal clusters need for day-2 operations. Maintain a management cluster that is separate from workload clusters.
Rook-Ceph deploys a distributed storage system on cluster nodes, providing block (RBD), filesystem (CephFS), and object (RGW) storage from a single platform. Storage classes define performance tiers: NVMe-backed SSD pools for databases requiring sub-millisecond latency, HDD pools for log archives and bulk data, and erasure-coded pools for cost-efficient object storage. NVMe-oF (NVMe over Fabrics) enables disaggregated storage for bare-metal workloads requiring shared block devices without the Ceph overhead.
Separate the storage network from the pod network: Ceph replication and recovery traffic can saturate cluster networking during OSD failures. Run a minimum of three OSD nodes for data durability. Size Ceph MON nodes with SSD-backed storage for the MON database, and avoid co-locating MONs with high-IO OSDs. Use BlueStore on dedicated block devices, never on partitions.
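The network-separation advice above can be expressed directly in a Rook CephCluster spec. This is an abridged sketch assuming Multus is available; the NetworkAttachmentDefinition names, MON count, and device filter are site-specific assumptions.

```yaml
# Abridged CephCluster sketch: public (client) and cluster (replication)
# traffic on separate networks. Network names are assumptions.
apiVersion: ceph.rook.io/v1
kind: CephCluster
metadata:
  name: rook-ceph
  namespace: rook-ceph
spec:
  network:
    provider: multus
    selectors:
      public: rook-ceph/public-net      # client-facing Ceph traffic
      cluster: rook-ceph/cluster-net    # OSD replication and recovery traffic
  mon:
    count: 3                            # odd count for quorum
  storage:
    useAllNodes: true
    useAllDevices: false
    deviceFilter: "^nvme"               # whole NVMe devices, not partitions
```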
Calico in BGP mode peers with physical Top-of-Rack (ToR) switches to advertise pod CIDR ranges, making pod IPs routable from outside the cluster without NAT or overlay encapsulation. MetalLB allocates service LoadBalancer IPs from a configured pool and advertises them via BGP or ARP. For environments requiring network isolation between tenants, VXLAN overlays segment traffic at Layer 2 while BGP handles cross-segment routing at Layer 3.
Use BGP in production: it gives you routable pod IPs, eliminates overlay encapsulation overhead, and integrates cleanly with existing network infrastructure. Use VXLAN only when you need Layer 2 adjacency for legacy applications or strict multi-tenant isolation. Always run the management and data planes on separate physical interfaces.
Harbor provides an enterprise container registry with built-in vulnerability scanning via Trivy, per-project RBAC, image signing with Cosign, replication policies for multi-site synchronization, and garbage collection for storage reclamation. Proxy cache functionality mirrors public registries (Docker Hub, GCR, Quay) locally, eliminating external dependencies for image pulls during deployment.
Scan on push and enforce a policy that blocks deployment of images with Critical or High CVEs. Use replication policies to synchronize images between the primary registry and DR-site registries. Configure a proxy cache for every external registry dependency so deployments keep working when Docker Hub rate-limits pulls or goes down.
Automated OS patching with kured (Kubernetes Reboot Daemon) detects pending kernel updates and coordinates node reboots with drain-and-cordon workflow to avoid workload disruption. Certificate rotation for kubelet client certs, etcd peer certs, and ingress TLS is handled by cert-manager with automatic renewal 30 days before expiry. Capacity planning monitors resource utilization trends and projects when additional nodes will be needed based on workload growth patterns.
Never patch all nodes simultaneously — use rolling updates with drain and cordon. Maintain at least N+1 node capacity so a single node failure during patching does not cause workload disruption. Schedule patching windows during low-traffic periods and test the patch on a canary node pool before rolling to production nodes.
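The maintenance-window discipline above maps directly onto kured's CLI flags. This is an abridged sketch of the DaemonSet container spec; the image tag and time zone are assumptions, while the flag names follow kured's documented options.

```yaml
# Abridged kured container spec: reboots only at weekends, 01:00-05:00.
# Image tag and time zone are assumptions for illustration.
containers:
  - name: kured
    image: ghcr.io/kubereboot/kured:1.15.0
    command:
      - /usr/bin/kured
      - --reboot-days=sat,sun       # restrict reboots to weekend windows
      - --start-time=01:00
      - --end-time=05:00
      - --time-zone=Europe/Berlin
      - --period=1h                 # how often to check for the reboot sentinel
```

kured drains and cordons a node before rebooting it and uses a cluster-wide lock so only one node reboots at a time, which is what preserves the N+1 capacity guarantee during patching.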
Declarative cluster provisioning with Cluster API and the Metal3 provider. The manifest defines a Cluster resource with control plane and worker MachineDeployments. Each Machine references a BareMetalHost from the hardware inventory, specifying CPU, memory, and storage requirements. The KubeadmControlPlane resource configures the control plane with etcd encryption, audit logging, and OIDC integration. Worker MachineDeployments define node pools with auto-repair policies.
# Cluster API resources:
# ├── Cluster — network CIDR, service CIDR, API endpoint
# ├── Metal3Cluster — BMC credentials, IPAM pool reference
# ├── KubeadmControlPlane — 3 control plane nodes, etcd encryption
# ├── MachineDeployment — worker pool with min/max node count
# ├── Metal3MachineTemplate — hardware profile (CPU, RAM, disk)
# └── BareMetalHost[] — hardware inventory with BMC addresses
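A minimal sketch of the worker-pool piece of that resource tree is shown below. The cluster, template, and bootstrap names are illustrative; the structure follows the Cluster API v1beta1 contract, where each MachineDeployment ties a bootstrap config to an infrastructure machine template.

```yaml
# Minimal MachineDeployment sketch for a bare-metal worker pool.
# All names (prod-baremetal, workers-*) are illustrative assumptions.
apiVersion: cluster.x-k8s.io/v1beta1
kind: MachineDeployment
metadata:
  name: workers
spec:
  clusterName: prod-baremetal
  replicas: 4
  template:
    spec:
      clusterName: prod-baremetal
      version: v1.29.4                  # target Kubernetes version (assumed)
      bootstrap:
        configRef:
          apiVersion: bootstrap.cluster.x-k8s.io/v1beta1
          kind: KubeadmConfigTemplate
          name: workers-bootstrap
      infrastructureRef:
        apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
        kind: Metal3MachineTemplate
        name: workers-hw-profile
```

Scaling the pool is then a one-line change to `replicas`; the Metal3 provider claims matching BareMetalHosts from inventory and rolls nodes in before draining the old ones.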
Tiered storage configuration with Rook-Ceph. An SSD-backed pool for databases requiring low-latency IOPS, configured with 3x replication and a CRUSH rule targeting nodes with NVMe drives. An HDD-backed pool for log archives and bulk storage with erasure coding (4+2) for cost-efficient durability. Each StorageClass references the appropriate Ceph pool and sets volume expansion, reclaim policy, and filesystem type.
# Storage tiers:
# ├── StorageClass: ssd-replicated
# │   └── CephBlockPool: 3x replication, NVMe CRUSH rule
# ├── StorageClass: hdd-erasure-coded
# │   └── CephBlockPool: EC 4+2, HDD CRUSH rule
# └── StorageClass: cephfs-shared
#     └── CephFilesystem: for ReadWriteMany workloads
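The replicated SSD tier from that layout can be sketched as a CephBlockPool plus its StorageClass. Pool and class names mirror the tree above; the filesystem type and reclaim policy are assumptions a real deployment would set per workload.

```yaml
# Sketch of the ssd-replicated tier: a 3x-replicated pool on NVMe OSDs
# and the StorageClass that exposes it to workloads.
apiVersion: ceph.rook.io/v1
kind: CephBlockPool
metadata:
  name: ssd-replicated
  namespace: rook-ceph
spec:
  replicated:
    size: 3                     # 3x replication
  deviceClass: nvme             # CRUSH rule targets NVMe-backed OSDs
---
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: ssd-replicated
provisioner: rook-ceph.rbd.csi.ceph.com   # Ceph CSI, namespace-prefixed
allowVolumeExpansion: true
reclaimPolicy: Delete                     # assumption; use Retain for databases you cannot lose
parameters:
  clusterID: rook-ceph
  pool: ssd-replicated
  csi.storage.k8s.io/fstype: ext4         # assumption
```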
Calico BGP peering configuration for routable pod IPs. BGPConfiguration resource sets the cluster AS number and specifies the pod CIDR to advertise. BGPPeer resources define peering sessions with each Top-of-Rack switch using their AS number and IP. IPPool resource defines the pod CIDR range with BGP mode enabled and NAT disabled for direct pod-to-external routing without encapsulation overhead.
# BGP resources:
# ├── BGPConfiguration — clusterASN: 64512, advertise pod CIDR
# ├── BGPPeer[] — ToR switch peers (AS 64513, 64514)
# ├── IPPool — 10.244.0.0/16, natOutgoing: false
# └── NetworkPolicy — default deny + allow rules per namespace
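Those resources look roughly like the manifests below. The ASNs and pod CIDR are the example values from the tree; the peer IP uses a documentation address range and would be each ToR switch's real address in practice.

```yaml
# Calico BGP sketch matching the resource tree above.
# Peer IP 192.0.2.1 is a placeholder from the documentation range.
apiVersion: projectcalico.org/v3
kind: BGPConfiguration
metadata:
  name: default
spec:
  asNumber: 64512                 # cluster ASN
---
apiVersion: projectcalico.org/v3
kind: BGPPeer
metadata:
  name: tor-rack1
spec:
  peerIP: 192.0.2.1               # ToR switch (one BGPPeer per switch)
  asNumber: 64513
---
apiVersion: projectcalico.org/v3
kind: IPPool
metadata:
  name: default-pool
spec:
  cidr: 10.244.0.0/16
  natOutgoing: false              # pod IPs are routable; no SNAT on egress
  ipipMode: Never                 # no encapsulation in BGP mode
  vxlanMode: Never
```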
Legacy infrastructure refresh with containerization and software-defined networking.
Patient data stays on-prem with encrypted storage and audit-ready access controls.
Air-gapped infrastructure meeting ITAR, FedRAMP, and national security requirements.
On-prem baseline with automatic burst to Azure or AWS during peak demand periods.
500+ VMs on aging VMware infrastructure with manual provisioning taking 2-3 weeks per environment. Data sovereignty regulations prohibited public cloud for production workloads. Storage performance degrading as VSAN cluster approached capacity limits.
Migrated to bare-metal Kubernetes with Rook-Ceph storage across two colocation sites. Calico BGP networking integrated with existing physical switches. Harbor registry with vulnerability scanning for supply chain security. Backstage developer portal for self-service namespace and environment provisioning. Velero-based DR replication between sites with 15-minute RPO.
“CloudForge gave us a private cloud that our developers actually want to use. Provisioning went from filing a ticket and waiting three weeks to clicking a button and getting a namespace in minutes. The infrastructure team went from firefighting to building platform features.”
— Director of IT, European Manufacturing Company
Our private cloud practice was built in datacenters, not cloud consoles. We have designed and deployed bare-metal Kubernetes clusters with Calico BGP networking, Rook-Ceph distributed storage, and MetalLB load balancing for organizations where public cloud was not an option. Our engineers hold CNCF Certified Kubernetes Administrator certifications, VMware Certified Professional credentials, and Red Hat Certified Engineer qualifications — we speak both the legacy VMware language and the cloud-native Kubernetes language fluently.
What distinguishes our approach is treating private cloud as a platform engineering problem, not an infrastructure procurement exercise. We do not just rack servers and install an OS — we build self-service developer platforms with GitOps-driven configuration, automated certificate rotation, rolling OS patching with zero-downtime guarantees, and Backstage-based service catalogs that give development teams the same experience they would expect from AWS or Azure. The private cloud we deliver is one developers choose to use, not one they are forced to use.
We design hybrid cloud bridges that connect private infrastructure to public cloud through unified control planes using Azure Arc, Google Anthos, or Rancher. This gives organizations the flexibility to keep sensitive workloads on-premises while bursting compute to public cloud, migrate incrementally without a big-bang cutover, and maintain consistent policy enforcement across both environments. Our goal is infrastructure that makes the private/public distinction invisible to application teams.
The comprehensive map of cloud-native technologies — container runtimes, orchestrators, service meshes, observability tools, and storage solutions. Essential for understanding which components to select for a private cloud stack.
The official Cluster API documentation covering concepts, providers, and lifecycle management for declarative Kubernetes cluster provisioning. The foundation for any bare-metal Kubernetes deployment at scale.
Complete guide to deploying and operating Ceph storage on Kubernetes with Rook. Covers block, filesystem, and object storage, performance tuning, failure recovery, and multi-site replication.
NVIDIA's reference deployment tool for GPU-accelerated Kubernetes clusters. Automates NVIDIA driver installation, GPU Operator deployment, and InfiniBand configuration for DGX and HGX systems.
Our certified engineers are ready to design, build, and operate Private Cloud solutions tailored to your technical requirements.
Get Your Free Cloud Audit