Private Cloud

Sovereign infrastructure for regulated industries

We design, build, and operate private cloud environments for organizations where data sovereignty, compliance, and physical control are non-negotiable. From VMware to OpenStack, we deliver public-cloud agility on your own hardware.

VMware Certified Professional | Red Hat Certified Engineer
99.999%
Uptime Achieved
50+
On-Prem Clusters
< 4h
DR Recovery Time
100%
Compliance Score

Overview

Not everything belongs in public cloud. Data sovereignty laws, sub-millisecond latency requirements, regulatory mandates for physical control, and GPU density for ML training sometimes demand on-premises or colocation infrastructure that no hyperscaler region can satisfy. CloudForge builds private clouds that feel like public cloud to developers: self-service provisioning through a portal or CLI, automated horizontal scaling based on resource pressure, full observability with metrics, logs, and traces unified in a single pane, and infrastructure-as-code workflows that make the private cloud as reproducible as any Terraform-managed AWS account.

We use OpenStack, VMware Tanzu, and bare-metal Kubernetes to create enterprise private clouds tailored to the workload profile. OpenStack for organizations that need VM-based compute with software-defined networking and block storage. Tanzu for enterprises invested in the VMware ecosystem that want container orchestration without abandoning vSphere. Bare-metal Kubernetes with MetalLB, Calico BGP networking, and Rook-Ceph storage for workloads that cannot tolerate the hypervisor tax — GPU-dense ML training, latency-sensitive trading systems, and high-throughput data pipelines.

Our approach is not "keep the old datacenter running" — it is building cloud-native infrastructure that happens to run on your hardware. That means every server provisioned through Cluster API or MAAS, every configuration managed in Git with Ansible or Terraform, automated OS patching with rolling drain-and-cordon workflows, certificate rotation handled by cert-manager, and developer self-service portals built on Backstage or a custom platform engineering layer. The private cloud we build is one your engineers want to use, not one they work around.

Capabilities

VMware vSphere & NSX Management

Enterprise virtualization with software-defined networking, micro-segmentation, and DRS automation.

OpenStack Deployment

Production OpenStack clouds with automated day-2 operations and rolling upgrades.

Hybrid Cloud Integration (Azure Arc, AWS Outposts)

Unified management across on-prem and public cloud with consistent policy enforcement.

Hardware Lifecycle Management

Server provisioning, firmware updates, and capacity planning with automated workflows.

Compliance-Ready Infrastructure (ISO 27001, SOC 2)

Pre-audited infrastructure patterns with continuous compliance monitoring and evidence generation.

Disaster Recovery & Business Continuity

RPO/RTO-driven DR strategies with automated failover testing and runbook validation.

Architecture Patterns

Bare-Metal Kubernetes with MetalLB

Kubernetes clusters running directly on physical servers without a hypervisor layer. MetalLB provides Layer 2 or BGP-based load balancer IP allocation. Calico handles pod networking with BGP peering to physical switches for routable pod IPs. Node provisioning automated through Cluster API with IPAM integration for hardware inventory management.

When to use

Workloads requiring maximum compute density, GPU pass-through without virtualization overhead, or latency-sensitive applications where hypervisor scheduling jitter is unacceptable.
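To make the pattern concrete, a minimal MetalLB BGP configuration might look like the following sketch. The address range, AS numbers, and peer IP are placeholders, not values from any deployment described here.

```yaml
# MetalLB in BGP mode: an address pool, a ToR peer, and an advertisement.
apiVersion: metallb.io/v1beta1
kind: IPAddressPool
metadata:
  name: production-pool
  namespace: metallb-system
spec:
  addresses:
    - 192.0.2.0/24            # placeholder LoadBalancer IP range
---
apiVersion: metallb.io/v1beta2
kind: BGPPeer
metadata:
  name: tor-switch-1
  namespace: metallb-system
spec:
  myASN: 64512                # cluster AS (placeholder)
  peerASN: 64513              # ToR switch AS (placeholder)
  peerAddress: 192.0.2.254    # placeholder switch IP
---
apiVersion: metallb.io/v1beta1
kind: BGPAdvertisement
metadata:
  name: production-adv
  namespace: metallb-system
spec:
  ipAddressPools:
    - production-pool
```

With this in place, any Service of type LoadBalancer receives an IP from the pool and MetalLB advertises it to the peered switches over BGP.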

OpenStack Private Cloud

Full IaaS platform with Nova for VM compute, Neutron for software-defined networking with VXLAN overlays, Cinder for block storage backed by Ceph, and Keystone for identity federation with enterprise AD/LDAP. Heat templates provide infrastructure-as-code. Horizon dashboard offers self-service for development teams.

When to use

Organizations with existing VM-based workloads that need private cloud agility without refactoring to containers, or teams that require multi-tenant isolation at the infrastructure level with per-project quotas and network segmentation.
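As an illustration of the infrastructure-as-code workflow Heat enables, here is a minimal template sketch. The image, network, and flavor names are assumptions for the example, not part of any reference deployment.

```yaml
heat_template_version: wallaby

description: Single VM with an attached Cinder volume (illustrative names)

parameters:
  flavor:
    type: string
    default: m1.medium

resources:
  app_server:
    type: OS::Nova::Server
    properties:
      image: ubuntu-22.04          # placeholder Glance image name
      flavor: { get_param: flavor }
      networks:
        - network: tenant-net      # placeholder Neutron network

  data_volume:
    type: OS::Cinder::Volume
    properties:
      size: 100                    # GB, served from Ceph-backed Cinder

  volume_attach:
    type: OS::Cinder::VolumeAttachment
    properties:
      instance_uuid: { get_resource: app_server }
      volume_id: { get_resource: data_volume }
```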

VMware Tanzu on vSphere

Tanzu Kubernetes Grid integrated with vSphere for container orchestration on existing VMware infrastructure. vSphere CSI driver provides persistent storage from VSAN or external arrays. NSX-T handles load balancing and micro-segmentation for pod traffic. vCenter provides unified management for both VMs and Kubernetes workloads.

When to use

Enterprises with significant VMware investment, existing vSphere operational expertise, and a goal of running Kubernetes workloads alongside traditional VMs without introducing a separate infrastructure stack.
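For reference, a hedged sketch of a Tanzu Kubernetes Grid cluster definition under the v1alpha1 API — the exact schema, virtual machine classes, and storage class names vary with the vSphere with Tanzu release and your namespace configuration:

```yaml
apiVersion: run.tanzu.vmware.com/v1alpha1
kind: TanzuKubernetesCluster
metadata:
  name: workload-cluster-01
  namespace: dev-team              # vSphere Namespace (assumed)
spec:
  distribution:
    version: v1.23                 # a TKR version available in your environment
  topology:
    controlPlane:
      count: 3
      class: best-effort-small     # VM class defined in vCenter
      storageClass: vsan-default-storage-policy
    workers:
      count: 5
      class: best-effort-large
      storageClass: vsan-default-storage-policy
```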

Hybrid Cloud Bridge

On-premises Kubernetes cluster connected to public cloud Kubernetes (AKS, EKS, GKE) through a unified control plane. Azure Arc or Anthos manages policy, observability, and GitOps across both environments. Service mesh (Istio or Linkerd) handles cross-cluster service discovery and encrypted communication. Workload placement decisions based on data locality, cost, and latency requirements.

When to use

Organizations that need to keep sensitive data on-premises while bursting compute-intensive workloads to public cloud, or teams migrating incrementally from private to public cloud with workloads running in both environments during transition.

GPU-Dense Private Cloud

NVIDIA DGX or HGX systems managed as Kubernetes nodes with GPU Operator for driver lifecycle, device plugin for GPU scheduling, and MIG for GPU partitioning. RDMA networking with InfiniBand or RoCE for distributed training. Shared storage via NFS-over-RDMA or Lustre for training dataset access. Priority-based scheduling with preemption for batch training jobs.

When to use

ML teams that need dedicated GPU infrastructure for training large models, organizations with data sovereignty requirements that prevent using cloud GPU instances, or cost-driven cases where sustained GPU utilization exceeds roughly 60% — the point at which owned hardware typically becomes cheaper than reserved cloud instances.
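A training job that consumes a MIG slice through the NVIDIA device plugin can be sketched as follows; the MIG profile name depends on how the GPUs are partitioned, and the image tag and PriorityClass name are illustrative:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: train-run-example
spec:
  template:
    spec:
      restartPolicy: Never
      priorityClassName: batch-training   # assumed PriorityClass used for preemption
      containers:
        - name: trainer
          image: nvcr.io/nvidia/pytorch:24.01-py3   # example NGC image
          command: ["python", "train.py"]
          resources:
            limits:
              nvidia.com/mig-1g.5gb: 1    # one MIG slice; use nvidia.com/gpu for full GPUs
```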

Technical Deep Dive

Bare-Metal K8s Bootstrap

Cluster API (CAPI) with the Metal3 or Tinkerbell provider manages the full lifecycle of bare-metal Kubernetes clusters declaratively. MAAS (Metal as a Service) handles hardware discovery, PXE boot, OS provisioning, and IPAM. Each machine is represented as a BareMetalHost custom resource with firmware, BMC credentials, and hardware profile. CAPI MachineDeployments manage node pools with rolling update strategy — new nodes are provisioned and joined before old nodes are drained and decommissioned.

Best Practice

Use Cluster API for lifecycle management — not kubeadm directly. Kubeadm is a bootstrap tool; CAPI provides the declarative, reconciliation-based lifecycle (scaling, upgrades, repair) that bare-metal clusters need for day-2 operations. Maintain a management cluster that is separate from workload clusters.

Storage Architecture

Rook-Ceph deploys a distributed storage system on cluster nodes, providing block (RBD), filesystem (CephFS), and object (RGW) storage from a single platform. Storage classes define performance tiers: NVMe-backed SSD pools for databases requiring sub-millisecond latency, HDD pools for log archives and bulk data, and erasure-coded pools for cost-efficient object storage. NVMe-oF (NVMe over Fabrics) enables disaggregated storage for bare-metal workloads requiring shared block devices without the Ceph overhead.

Best Practice

Separate storage network from pod network — Ceph replication and recovery traffic can saturate cluster networking during OSD failures. Minimum 3 OSD nodes for data durability. Size Ceph MON nodes with SSD-backed storage for the MON database and avoid co-locating MONs with high-IO OSDs. Use BlueStore on dedicated block devices, never on partitions.

Network Fabric Design

Calico in BGP mode peers with physical Top-of-Rack (ToR) switches to advertise pod CIDR ranges, making pod IPs routable from outside the cluster without NAT or overlay encapsulation. MetalLB allocates service LoadBalancer IPs from a configured pool and advertises them via BGP or ARP. For environments requiring network isolation between tenants, VXLAN overlays segment traffic at Layer 2 while BGP handles cross-segment routing at Layer 3.

Best Practice

Use BGP for production — it gives you routable pod IPs, eliminates overlay encapsulation overhead, and integrates cleanly with existing network infrastructure. Reserve VXLAN for cases that need Layer 2 adjacency for legacy applications or strict multi-tenant isolation. Always run the management and data planes on separate physical interfaces.

Private Container Registry

Harbor provides an enterprise container registry with built-in vulnerability scanning (Trivy or Clair), per-project RBAC, image signing with Cosign, replication policies for multi-site synchronization, and garbage collection for storage reclamation. Proxy cache functionality mirrors public registries (Docker Hub, GCR, Quay) locally, eliminating external dependencies for image pulls during deployment.

Best Practice

Scan on push and enforce a policy that blocks deployment of images with Critical or High CVEs. Use replication policies to synchronize images between the primary registry and DR site registries. Configure proxy cache for all external registry dependencies to prevent outages when Docker Hub rate limits or goes down.

Day-2 Operations

kured (the Kubernetes Reboot Daemon) watches for the reboot-required sentinel left by OS package updates and coordinates node reboots with a drain-and-cordon workflow to avoid workload disruption. Certificate rotation is layered: kubelet client certificates rotate through the kubelet's built-in rotation, etcd peer and control-plane certificates are renewed via kubeadm or Cluster API, and ingress TLS is handled by cert-manager with automatic renewal 30 days before expiry. Capacity planning monitors resource utilization trends and projects when additional nodes will be needed based on workload growth patterns.

Best Practice

Never patch all nodes simultaneously — use rolling updates with drain and cordon. Maintain at least N+1 node capacity so a single node failure during patching does not cause workload disruption. Schedule patching windows during low-traffic periods and test the patch on a canary node pool before rolling to production nodes.
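The 30-day renewal window maps directly to cert-manager's `renewBefore` field. A minimal Certificate sketch, assuming an internal ClusterIssuer (the issuer name, namespace, and DNS name are placeholders):

```yaml
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: ingress-tls
  namespace: ingress-nginx
spec:
  secretName: ingress-tls
  duration: 2160h        # 90-day certificate lifetime
  renewBefore: 720h      # renew 30 days before expiry
  issuerRef:
    name: internal-ca    # placeholder ClusterIssuer
    kind: ClusterIssuer
  dnsNames:
    - apps.example.internal
```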

Configuration Examples

Cluster API Manifest for Bare-Metal

Declarative cluster provisioning with Cluster API and the Metal3 provider. The manifest defines a Cluster resource with control plane and worker MachineDeployments. Each Machine references a BareMetalHost from the hardware inventory, specifying CPU, memory, and storage requirements. The KubeadmControlPlane resource configures the control plane with etcd encryption, audit logging, and OIDC integration. Worker MachineDeployments define node pools with auto-repair policies.

# Cluster API resources:
# ├── Cluster           — network CIDR, service CIDR, API endpoint
# ├── Metal3Cluster     — BMC credentials, IPAM pool reference
# ├── KubeadmControlPlane — 3 control plane nodes, etcd encryption
# ├── MachineDeployment  — worker pool with min/max node count
# ├── Metal3MachineTemplate — hardware profile (CPU, RAM, disk)
# └── BareMetalHost[]    — hardware inventory with BMC addresses
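Expanding the inventory and worker-pool entries from the outline above, a hedged sketch of the corresponding resources — the BMC address, MAC, names, and Kubernetes version are placeholders:

```yaml
# Hardware inventory entry consumed by the Metal3 provider.
apiVersion: metal3.io/v1alpha1
kind: BareMetalHost
metadata:
  name: rack1-node-01
  namespace: metal3
spec:
  online: true
  bootMACAddress: "00:00:5e:00:53:01"    # placeholder MAC
  bmc:
    address: redfish://192.0.2.10/redfish/v1/Systems/1   # placeholder BMC endpoint
    credentialsName: rack1-node-01-bmc-secret
  rootDeviceHints:
    deviceName: /dev/nvme0n1
---
# CAPI worker pool referencing a Metal3 machine template.
apiVersion: cluster.x-k8s.io/v1beta1
kind: MachineDeployment
metadata:
  name: workers
spec:
  clusterName: prod-cluster
  replicas: 5
  selector:
    matchLabels: {}
  template:
    spec:
      clusterName: prod-cluster
      version: v1.28.4                   # placeholder Kubernetes version
      bootstrap:
        configRef:
          apiVersion: bootstrap.cluster.x-k8s.io/v1beta1
          kind: KubeadmConfigTemplate
          name: workers-bootstrap
      infrastructureRef:
        apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
        kind: Metal3MachineTemplate
        name: workers-template
```

Scaling the pool is then a matter of changing `replicas` in Git; CAPI reconciles by provisioning a free BareMetalHost, joining it, and only then draining any node being replaced.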
Rook-Ceph Storage Class

Tiered storage configuration with Rook-Ceph. SSD-backed pool for databases requiring low-latency IOPS, configured with 3x replication and crush rule targeting nodes with NVMe drives. HDD-backed pool for log archives and bulk storage with erasure coding (4+2) for cost-efficient durability. Each StorageClass references the appropriate Ceph pool and sets volume expansion, reclaim policy, and filesystem type.

# Storage tiers:
# ├── StorageClass: ssd-replicated
# │   └── CephBlockPool: 3x replication, NVMe crush rule
# ├── StorageClass: hdd-erasure-coded
# │   └── CephBlockPool: EC 4+2, HDD crush rule
# └── StorageClass: cephfs-shared
#     └── CephFilesystem: for ReadWriteMany workloads
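The SSD tier from the outline above can be sketched as a CephBlockPool plus its StorageClass. Pool and secret names follow Rook's defaults; treat them as assumptions to verify against your cluster:

```yaml
# NVMe-backed replicated pool and the StorageClass that exposes it.
apiVersion: ceph.rook.io/v1
kind: CephBlockPool
metadata:
  name: ssd-replicated
  namespace: rook-ceph
spec:
  failureDomain: host
  deviceClass: nvme        # CRUSH rule targets OSDs on NVMe devices
  replicated:
    size: 3
---
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: ssd-replicated
provisioner: rook-ceph.rbd.csi.ceph.com
allowVolumeExpansion: true
reclaimPolicy: Delete
parameters:
  clusterID: rook-ceph
  pool: ssd-replicated
  imageFormat: "2"
  csi.storage.k8s.io/fstype: ext4
  csi.storage.k8s.io/provisioner-secret-name: rook-csi-rbd-provisioner
  csi.storage.k8s.io/provisioner-secret-namespace: rook-ceph
  csi.storage.k8s.io/controller-expand-secret-name: rook-csi-rbd-provisioner
  csi.storage.k8s.io/controller-expand-secret-namespace: rook-ceph
  csi.storage.k8s.io/node-stage-secret-name: rook-csi-rbd-node
  csi.storage.k8s.io/node-stage-secret-namespace: rook-ceph
```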
BGP Configuration for Calico

Calico BGP peering configuration for routable pod IPs. BGPConfiguration resource sets the cluster AS number and specifies the pod CIDR to advertise. BGPPeer resources define peering sessions with each Top-of-Rack switch using their AS number and IP. IPPool resource defines the pod CIDR range with BGP mode enabled and NAT disabled for direct pod-to-external routing without encapsulation overhead.

# BGP resources:
# ├── BGPConfiguration  — clusterASN: 64512, advertise pod CIDR
# ├── BGPPeer[]         — ToR switch peers (AS 64513, 64514)
# ├── IPPool            — 10.244.0.0/16, natOutgoing: false
# └── NetworkPolicy     — default deny + allow rules per namespace
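Fleshing out the BGP outline above, a sketch of the Calico resources — the ASNs, peer IP, and CIDR are the same placeholders used in the outline, not production values:

```yaml
apiVersion: projectcalico.org/v3
kind: BGPConfiguration
metadata:
  name: default
spec:
  nodeToNodeMeshEnabled: false   # ToR switches carry routes between racks
  asNumber: 64512                # placeholder cluster ASN
---
apiVersion: projectcalico.org/v3
kind: BGPPeer
metadata:
  name: tor-rack1
spec:
  peerIP: 192.0.2.254            # placeholder ToR switch IP
  asNumber: 64513                # placeholder switch ASN
---
apiVersion: projectcalico.org/v3
kind: IPPool
metadata:
  name: default-pool
spec:
  cidr: 10.244.0.0/16
  natOutgoing: false   # pod IPs are routed directly, no SNAT
  ipipMode: Never      # no encapsulation; routes are advertised via BGP
  vxlanMode: Never
```

With `natOutgoing: false` and encapsulation disabled, pods are reachable by their real IPs from anywhere the ToR switches advertise the CIDR.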

Use Cases

Financial Institution On-Prem Modernization

Legacy infrastructure refresh with containerization and software-defined networking.

Healthcare Data Sovereignty

Patient data stays on-prem with encrypted storage and audit-ready access controls.

Government Classified Workloads

Air-gapped infrastructure meeting ITAR, FedRAMP, and national security requirements.

Hybrid Burst to Public Cloud

On-prem baseline with automatic burst to Azure or AWS during peak demand periods.

Case Study

European Manufacturing Company

Challenge

500+ VMs on aging VMware infrastructure with manual provisioning taking 2-3 weeks per environment. Data sovereignty regulations prohibited public cloud for production workloads. Storage performance degrading as VSAN cluster approached capacity limits.

Solution

Migrated to bare-metal Kubernetes with Rook-Ceph storage across two colocation sites. Calico BGP networking integrated with existing physical switches. Harbor registry with vulnerability scanning for supply chain security. Backstage developer portal for self-service namespace and environment provisioning. Velero-based DR replication between sites with 15-minute RPO.

35% reduction
Infrastructure Cost
3 weeks → 12 minutes
Provisioning Time
4x improvement
Storage IOPS
99.995%
Platform Uptime

CloudForge gave us a private cloud that our developers actually want to use. Provisioning went from filing a ticket and waiting three weeks to clicking a button and getting a namespace in minutes. The infrastructure team went from firefighting to building platform features.

Director of IT, European Manufacturing Company

Tools & Technology Stack

VMware vSphere | NSX-T | OpenStack | Ansible | Terraform | Packer | Vault

Why CloudForge for Private Cloud

Our private cloud practice was built in datacenters, not cloud consoles. We have designed and deployed bare-metal Kubernetes clusters with Calico BGP networking, Rook-Ceph distributed storage, and MetalLB load balancing for organizations where public cloud was not an option. Our engineers hold CNCF Certified Kubernetes Administrator certifications, VMware Certified Professional credentials, and Red Hat Certified Engineer qualifications — we speak both the legacy VMware language and the cloud-native Kubernetes language fluently.

What distinguishes our approach is treating private cloud as a platform engineering problem, not an infrastructure procurement exercise. We do not just rack servers and install an OS — we build self-service developer platforms with GitOps-driven configuration, automated certificate rotation, rolling OS patching with zero-downtime guarantees, and Backstage-based service catalogs that give development teams the same experience they would expect from AWS or Azure. The private cloud we deliver is one developers choose to use, not one they are forced to use.

We design hybrid cloud bridges that connect private infrastructure to public cloud through unified control planes using Azure Arc, Google Anthos, or Rancher. This gives organizations the flexibility to keep sensitive workloads on-premises while bursting compute to public cloud, migrate incrementally without a big-bang cutover, and maintain consistent policy enforcement across both environments. Our goal is infrastructure that makes the private/public distinction invisible to application teams.

Learning Resources

ecosystem-overview

CNCF Cloud Native Landscape

The comprehensive map of cloud-native technologies — container runtimes, orchestrators, service meshes, observability tools, and storage solutions. Essential for understanding which components to select for a private cloud stack.

documentation

Cluster API Book

The official Cluster API documentation covering concepts, providers, and lifecycle management for declarative Kubernetes cluster provisioning. The foundation for any bare-metal Kubernetes deployment at scale.

documentation

Rook Documentation

Complete guide to deploying and operating Ceph storage on Kubernetes with Rook. Covers block, filesystem, and object storage, performance tuning, failure recovery, and multi-site replication.

github-repo

NVIDIA DeepOps

NVIDIA's reference deployment tool for GPU-accelerated Kubernetes clusters. Automates NVIDIA driver installation, GPU Operator deployment, and InfiniBand configuration for DGX and HGX systems.

Build with Private Cloud

Our certified engineers are ready to design, build, and operate Private Cloud solutions tailored to your technical requirements.

Get Your Free Cloud Audit