Kubernetes

Container orchestration at any scale

We design, deploy, and operate Kubernetes clusters across cloud and on-prem environments. From single-cluster startups to multi-cluster federations, our CKA/CKAD/CKS-certified engineers build platforms that developers love and SREs trust.

CKA — Certified Kubernetes Administrator
CKAD — Certified Kubernetes Application Developer
CKS — Certified Kubernetes Security Specialist
50+
Clusters Managed
3
CKA/CKAD/CKS Certs
10K+
Pods Orchestrated Daily
60%
Faster Deployments

Overview

Kubernetes is no longer just container orchestration — it has become the control plane for modern infrastructure. Service meshes handle inter-service communication, GitOps controllers manage deployment lifecycle, custom operators automate domain-specific operations, and developer portals abstract cluster complexity into self-service workflows. The CNCF ecosystem now spans over 1,000 projects, and the challenge has shifted from "how do we run Kubernetes" to "how do we build a platform on Kubernetes that makes developers productive and SREs confident."

CloudForge's Kubernetes practice is CKA, CKAD, and CKS certified. We have designed, deployed, and operated 50+ production clusters across EKS, GKE, AKS, and bare-metal environments. Our team has contributed to CNCF projects and maintains deep expertise across the ecosystem — from Istio service mesh to Crossplane infrastructure abstractions to Backstage developer portals. We do not treat Kubernetes as a destination; we treat it as the foundation for Internal Developer Platforms that reduce cognitive load and accelerate delivery.

Our approach is platform engineering, not cluster administration. We build opinionated platforms with golden paths for common workloads (stateless services, async workers, cron jobs, stateful databases) while preserving escape hatches for teams with specialized requirements. Every platform we deliver includes GitOps-driven deployment, namespace-level multi-tenancy with resource quotas, network policies for micro-segmentation, and full observability with Prometheus, Grafana, and distributed tracing. The goal is a platform that developers can deploy to in 5 minutes and SREs can debug at 3 AM.

Capabilities

Cluster Design & Multi-Tenancy

Namespace isolation, resource quotas, network policies, and RBAC for secure multi-team clusters.

Helm Chart Development

Reusable, versioned Helm charts with values overlays for dev, staging, and production environments.

GitOps with ArgoCD & Flux

Declarative deployments synced from Git with automated drift detection and rollback.

Service Mesh (Istio, Linkerd)

mTLS, traffic splitting, circuit breaking, and observability without application code changes.

Observability (Prometheus, Grafana, Loki)

Full-stack monitoring with custom dashboards, alerting, and log aggregation.

Platform Engineering (Backstage, Crossplane)

Internal developer portals and infrastructure abstractions that enable self-service provisioning.

Architecture Patterns

Multi-Cluster Federation

Workloads distributed across multiple K8s clusters in different regions or cloud providers. Cluster API manages lifecycle. Submariner or Skupper provides cross-cluster networking. KubeFed or ArgoCD ApplicationSets synchronize resource definitions across clusters.

When to use

Organizations needing geographic redundancy, multi-cloud portability, or isolation between tenants/business units that exceeds what namespace-level multi-tenancy can provide.

GitOps with ArgoCD ApplicationSets

ArgoCD manages the entire deployment lifecycle — sync, health checks, rollback. ApplicationSets generate Application resources from Git directory structure, cluster labels, or pull request events. Sync waves order dependent resources. Notifications trigger on sync failures via Slack or PagerDuty.

When to use

Any team deploying to Kubernetes that wants auditable, declarative deployments. Essential for regulated environments where every deployment must trace back to a Git commit with reviewer approval.

Service Mesh with Istio Ambient Mode

Istio ambient mode provides mTLS encryption and L4 authorization without sidecar proxies, eliminating per-pod resource overhead. Waypoint proxies handle L7 features (traffic splitting, retries, fault injection) only where needed. This hybrid model reduces mesh overhead by 60-80% compared to sidecar-based deployments.

When to use

Organizations needing zero-trust networking (mTLS everywhere) and traffic management but concerned about the resource cost and operational complexity of sidecar proxies in every pod.

Internal Developer Platform (Backstage + Crossplane)

Backstage provides the developer portal — software catalog, scaffolding templates, TechDocs, and plugin ecosystem. Crossplane provisions cloud resources as Kubernetes CRDs, composing managed resources (RDS, S3, CloudSQL) into simplified platform APIs that developers claim without cloud console access.

When to use

Platform teams supporting 10+ development teams who need self-service infrastructure provisioning, service catalog visibility, and standardized project templates without giving developers direct cloud IAM access.

GPU Scheduling with NVIDIA Operator + MIG

NVIDIA GPU Operator automates driver installation, device plugin registration, and GPU monitoring. Multi-Instance GPU (MIG) partitions A100/H100 GPUs into isolated slices for fine-grained sharing. Kueue manages job queuing and fair-sharing across teams with priority classes.

When to use

ML/AI teams running training and inference workloads that need GPU fractional sharing, job queuing, and cost allocation per team or project.
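A minimal sketch of what a GPU job under this pattern might look like, assuming the GPU Operator has exposed MIG slices as extended resources and a Kueue LocalQueue named ml-team exists (both names are illustrative):

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: train-small                       # hypothetical training job
  labels:
    kueue.x-k8s.io/queue-name: ml-team    # Kueue local queue (illustrative name)
spec:
  suspend: true                           # Kueue admits the job when quota is available
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: trainer
          image: registry.example.com/trainer:latest   # placeholder image
          resources:
            limits:
              nvidia.com/mig-1g.10gb: 1   # one MIG slice of an A100/H100
```

The MIG resource name (here `nvidia.com/mig-1g.10gb`) depends on the MIG profile configured on the node; Kueue holds the job suspended until the team's quota admits it.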

Technical Deep Dive

Multi-Tenant Cluster Design

Namespace isolation with ResourceQuotas (CPU, memory, storage, object count limits), LimitRanges (default and max container resources), and NetworkPolicies (deny-all default, explicit allow rules per namespace). The Hierarchical Namespace Controller (HNC) enables team-level inheritance — create a parent namespace with shared secrets and policies, and child namespaces inherit them automatically. OPA/Gatekeeper enforces admission policies: required labels, image registry restrictions, privilege escalation prevention.

Best Practice

Use Hierarchical Namespaces for team isolation rather than separate clusters. Separate clusters only when regulatory requirements mandate control-plane isolation or when blast radius concerns require independent etcd datastores. For 90% of multi-tenancy scenarios, namespace isolation with proper RBAC and NetworkPolicies is sufficient.
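A minimal sketch of the per-team baseline described above — a quota plus a default-deny NetworkPolicy. The namespace name team-a and the specific limits are illustrative:

```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-a-quota
  namespace: team-a            # hypothetical team namespace
spec:
  hard:
    requests.cpu: "20"
    requests.memory: 40Gi
    limits.cpu: "40"
    limits.memory: 80Gi
    persistentvolumeclaims: "10"
    count/deployments.apps: "30"
---
# Deny all traffic by default; teams add explicit allow rules per workload.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: team-a
spec:
  podSelector: {}              # selects every pod in the namespace
  policyTypes:
    - Ingress
    - Egress
```

With the empty podSelector, every pod in the namespace is isolated until a more specific policy explicitly allows traffic.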

GitOps Pipeline Design

ArgoCD ApplicationSets with Git directory generators scan a config repo for environment directories, creating Application resources dynamically. Sync waves order deployments: wave 0 for namespaces and RBAC, wave 1 for ConfigMaps and secrets, wave 2 for deployments and services. Health checks validate custom resource readiness before proceeding. Automated rollback triggers on degraded health status, reverting to the last known-good commit.

Best Practice

Separate application config repositories from code repositories. Application code repos trigger CI (build, test, push image). Config repos contain declarative manifests and trigger CD (ArgoCD sync). This separation ensures deployment configuration is independently versioned, reviewed, and auditable without rebuilding application artifacts.
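The wave ordering described above is driven by a single ArgoCD annotation on each manifest. A sketch, with a hypothetical payments namespace and deployment:

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: payments                            # illustrative namespace
  annotations:
    argocd.argoproj.io/sync-wave: "0"       # wave 0: namespaces and RBAC first
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: payments-api                        # illustrative workload
  namespace: payments
  annotations:
    argocd.argoproj.io/sync-wave: "2"       # wave 2: after ConfigMaps/secrets (wave 1)
spec:
  replicas: 3
  selector:
    matchLabels:
      app: payments-api
  template:
    metadata:
      labels:
        app: payments-api
    spec:
      containers:
        - name: api
          image: registry.example.com/payments-api:1.4.2   # placeholder image
```

ArgoCD syncs lower waves to healthy before starting higher ones, so the deployment never races its namespace or configuration.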

Service Mesh Implementation

Istio ambient mode deploys a per-node ztunnel for L4 mTLS encryption and authorization — no sidecar containers required. Waypoint proxies deploy per-namespace or per-service for L7 features: traffic splitting (VirtualService weighted routing), retries with budgets, circuit breaking (DestinationRule outlier detection), and request-level authorization (AuthorizationPolicy). Kiali provides service graph visualization with traffic metrics.

Best Practice

Start with strict mTLS and L4 authorization before adding L7 traffic management rules. Enabling mTLS in strict mode across the mesh immediately encrypts all inter-service traffic and provides service identity. Add VirtualService traffic rules only when you have a specific need — traffic splitting for canary deployments, fault injection for chaos testing.
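The "strict mTLS and L4 authorization first" starting point amounts to two small resources. A sketch, assuming hypothetical orders and payments namespaces:

```yaml
# Mesh-wide strict mTLS, applied in the Istio root namespace.
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: istio-system
spec:
  mtls:
    mode: STRICT
---
# L4 authorization: only the orders service account may reach workloads
# in the payments namespace (service account names are illustrative).
apiVersion: security.istio.io/v1beta1
kind: AuthorizationPolicy
metadata:
  name: payments-allow-orders
  namespace: payments
spec:
  action: ALLOW
  rules:
    - from:
        - source:
            principals:
              - cluster.local/ns/orders/sa/orders
```

In ambient mode the per-node ztunnel enforces both policies with no sidecars; VirtualService and DestinationRule resources come later, only when a concrete traffic-management need appears.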

Platform Engineering with Backstage

Backstage software catalog ingests service metadata from catalog-info.yaml in each repository. Scaffolder templates generate new services with pre-configured CI/CD, monitoring, and infrastructure. TechDocs renders markdown documentation alongside the service catalog. Custom plugins integrate with internal systems — on-call schedules, deployment history, cost dashboards. The Tech Radar plugin tracks organizational technology adoption.

Best Practice

Build golden paths, not golden cages. Provide Backstage templates for the common case — a typical web service gets a Helm chart, ArgoCD Application, Prometheus alerts, and Grafana dashboard automatically. But allow teams to deviate from templates when their requirements are genuinely different. Forced standardization breeds shadow infrastructure.
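The catalog-info.yaml that the software catalog ingests from each repository might look like this; the component, owner, and system names are illustrative:

```yaml
apiVersion: backstage.io/v1alpha1
kind: Component
metadata:
  name: payments-api                         # hypothetical service
  description: Handles payment processing
  annotations:
    backstage.io/techdocs-ref: dir:.         # TechDocs sources live in this repo
    github.com/project-slug: example-org/payments-api
spec:
  type: service
  lifecycle: production
  owner: team-payments                       # maps to a catalog Group entity
  system: checkout                           # groups related components
```

Scaffolder templates generate this file alongside the service skeleton, so every new service appears in the catalog with ownership and documentation wired up from day one.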

Kubernetes Security Hardening

Pod Security Standards (PSS) enforce three profiles: privileged, baseline, and restricted. OPA/Gatekeeper policies provide fine-grained control beyond PSS — image signature verification with Cosign/Sigstore, required security contexts, blocked host paths. Falco provides runtime threat detection by monitoring syscalls for anomalous behavior (unexpected shell spawns, sensitive file access, network connections to known-bad IPs). Network Policies enforce microsegmentation with deny-all default.

Best Practice

Adopt the restricted Pod Security Standard by default: run it in audit mode for two weeks to surface violations, then switch to enforce. Exemptions should be documented and time-bounded. Combine PSS with OPA/Gatekeeper for policies that PSS does not cover: image registry allowlisting, required resource requests, and annotation-based access controls.
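The staged rollout is configured entirely through Pod Security Admission labels on the namespace. A sketch for the audit phase, with an illustrative namespace name:

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: team-a                                      # hypothetical namespace
  labels:
    pod-security.kubernetes.io/enforce: baseline    # blocks only baseline violations for now
    pod-security.kubernetes.io/audit: restricted    # logs restricted-profile violations
    pod-security.kubernetes.io/warn: restricted     # warns users at admission time
```

Once the audit log is clean, flipping the enforce label to restricted completes the rollout without any workload changes.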

Configuration Examples

ArgoCD ApplicationSet for Multi-Cluster GitOps
ApplicationSet using a Git directory generator to discover cluster configuration directories and a merge generator to overlay cluster-specific values. Each directory contains a values file with environment-specific settings (replicas, resource limits, feature flags). Sync waves ensure CRDs and namespaces deploy before workloads. The ApplicationSet watches the config repo main branch and automatically creates or deletes ArgoCD Applications as directories are added or removed.

# Config repo structure:
# ├── clusters/
# │   ├── prod-us-east/
# │   │   ├── values.yaml       — replicas: 5, resources: large
# │   │   └── kustomization.yaml
# │   ├── prod-eu-west/
# │   │   ├── values.yaml       — replicas: 3, resources: medium
# │   │   └── kustomization.yaml
# │   └── staging/
# │       ├── values.yaml       — replicas: 1, resources: small
# │       └── kustomization.yaml
# └── base/
#     ├── deployment.yaml
#     ├── service.yaml
#     └── kustomization.yaml
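The ApplicationSet itself might be sketched as follows. The repo URL is a placeholder, and the destination assumes each cluster is registered in ArgoCD under the same name as its config directory:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: clusters
  namespace: argocd
spec:
  generators:
    - git:
        repoURL: https://github.com/example-org/config-repo   # placeholder
        revision: main
        directories:
          - path: clusters/*            # one Application per cluster directory
  template:
    metadata:
      name: '{{path.basename}}'         # prod-us-east, prod-eu-west, staging
    spec:
      project: default
      source:
        repoURL: https://github.com/example-org/config-repo
        targetRevision: main
        path: '{{path}}'
      destination:
        name: '{{path.basename}}'       # assumes matching registered cluster names
        namespace: default
      syncPolicy:
        automated:
          prune: true                   # delete resources removed from Git
          selfHeal: true                # revert manual drift
```

Adding a clusters/prod-ap-south directory to the repo would create a new Application automatically; deleting a directory prunes it.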
Helm Chart Values Overlay Pattern
Microservice Helm chart with a base values.yaml and per-environment override files. The chart defines deployment, service, ingress, HPA, PDB, ServiceMonitor, and PrometheusRule resources. Environment overlays control replica counts, resource requests/limits, ingress hostnames, and feature flags. CI pipelines use `helm upgrade --install -f values.yaml -f values-{env}.yaml` with the environment determined by the target cluster.

# Chart structure:
# charts/microservice/
# ├── Chart.yaml
# ├── values.yaml              — sensible defaults for all envs
# ├── values-dev.yaml           — 1 replica, debug logging
# ├── values-staging.yaml       — 2 replicas, info logging
# ├── values-prod.yaml          — 5 replicas, warn logging, PDB
# └── templates/
#     ├── deployment.yaml
#     ├── service.yaml
#     ├── ingress.yaml
#     ├── hpa.yaml
#     ├── pdb.yaml
#     └── servicemonitor.yaml
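A sketch of how the base and production override files might relate; the keys shown are illustrative chart values, not a fixed schema:

```yaml
# values.yaml — sensible defaults shared by all environments
replicaCount: 1
logLevel: info
resources:
  requests:
    cpu: 100m
    memory: 128Mi
podDisruptionBudget:
  enabled: false
---
# values-prod.yaml — production overrides; later -f files win on conflict
replicaCount: 5
logLevel: warn
podDisruptionBudget:
  enabled: true
  minAvailable: 3
```

The CI pipeline then runs `helm upgrade --install myapp charts/microservice -f values.yaml -f values-prod.yaml`, so only the keys that genuinely differ per environment live in the overlay files.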
Crossplane Composition for Cloud Resources
Crossplane CompositeResourceDefinition (XRD) that exposes a simplified PostgresDatabase CRD to developers. The Composition maps this to managed resources: an RDS instance (or CloudSQL, or Azure Database) based on the provider label, a security group, a Kubernetes secret with connection details, and a ServiceMonitor for database metrics. Developers create a PostgresDatabase CR with size (small/medium/large) and the platform handles all cloud-specific implementation details.

# Developer creates:
# apiVersion: platform.cloudforge.dev/v1
# kind: PostgresDatabase
# spec:
#   size: medium          → maps to db.r6g.large RDS instance
#   region: eu-west-1
#   backup: daily
#
# Platform provisions:
# ├── RDS Instance (db.r6g.large, multi-AZ, encrypted)
# ├── Security Group (port 5432, VPC-only access)
# ├── K8s Secret (host, port, username, password)
# └── ServiceMonitor (CloudWatch metrics → Prometheus)
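The claim a developer actually writes might be sketched as follows, using the platform.cloudforge.dev/v1 API shown above; the resource and secret names are illustrative:

```yaml
apiVersion: platform.cloudforge.dev/v1
kind: PostgresDatabase
metadata:
  name: orders-db                 # hypothetical database claim
  namespace: team-orders
spec:
  size: medium                    # platform maps this to an instance class
  region: eu-west-1
  backup: daily
  writeConnectionSecretToRef:
    name: orders-db-conn          # Crossplane writes host/port/credentials here
```

The developer never sees RDS, security groups, or IAM; the Composition resolves size: medium to the cloud-specific details, and the application mounts the orders-db-conn secret.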

Use Cases

Microservices Migration from Monolith

Incremental decomposition with sidecar proxies and shared service mesh for gradual adoption.

Multi-Cluster Federation

Workload distribution across clusters and regions with unified policy and service discovery.

Developer Self-Service Platform

Backstage-powered portal with Crossplane compositions for on-demand environment provisioning.

GPU Workload Scheduling for ML

NVIDIA GPU Operator with fractional GPU sharing and priority-based scheduling for training jobs.

Case Study

European SaaS Company

Challenge

200+ microservices deployed via manual 2-hour release processes. Rollbacks required SSH access to production servers. No service-to-service encryption. Each deployment was a coordination event requiring 4 teams and a change advisory board.

Solution

Built a Kubernetes Internal Developer Platform on EKS with ArgoCD GitOps for declarative deployments, Istio ambient mode for zero-trust networking, and Backstage for service catalog and self-service onboarding. Helm charts with environment overlays replaced manual deployment scripts.

2 hours → 5 minutes
Deployment Time
20+
Self-Serving Teams
45 min → 30 seconds
Rollback Time
100%
Zero-Downtime Rollouts

We went from deployment being a stressful coordination event to something developers do ten times a day without thinking about it. The platform CloudForge built did not just speed up deployments — it fundamentally changed our engineering culture.

VP of Engineering, European SaaS Company

Tools & Technology Stack

kubectl, Helm, ArgoCD, Flux, Istio, Prometheus, Grafana, Crossplane, Backstage

Why CloudForge for Kubernetes

Our Kubernetes team holds CKA, CKAD, and CKS certifications — the complete CNCF certification trifecta covering administration, application development, and security. We have designed, deployed, and operated 50+ production clusters across EKS, GKE, AKS, and bare-metal environments running workloads from regulatory-compliant financial services to GPU-accelerated ML inference. Our engineers have contributed to CNCF projects and maintain deep expertise across the ecosystem.

We practice platform engineering, not cluster administration. The difference: a cluster admin installs Kubernetes and hands over kubectl access. A platform engineer builds an Internal Developer Platform with golden paths for common workloads, self-service namespace provisioning, GitOps-driven deployments, and guardrails that prevent teams from creating security and cost problems. Our platforms include Backstage developer portals, Crossplane infrastructure abstractions, and ArgoCD GitOps — integrated into a coherent experience.

We are opinionated by design. After 50+ clusters, we know what works and what creates operational debt. We will not build a "flexible" platform that supports every possible configuration — we will build an opinionated platform that handles 90% of your use cases elegantly and provides escape hatches for the 10% that genuinely need custom handling. Our engagements include knowledge transfer — pair programming, architecture decision records, and runbook documentation — so your team operates independently after we leave.

Learning Resources

Hands-on learning

Kubernetes the Hard Way

Kelsey Hightower's canonical guide to bootstrapping Kubernetes from scratch. The best way to understand what managed K8s services abstract away — and why you should use them anyway.

Ecosystem overview

CNCF Landscape

Interactive map of the entire cloud-native ecosystem organized by category. Essential for understanding which tools solve which problems and identifying mature projects versus early-stage experiments.

Book

Production Kubernetes

O'Reilly book covering production patterns for networking, storage, observability, security, and multi-tenancy. Goes beyond cluster setup into the operational concerns that determine production success.

Design patterns

Kubernetes Patterns

Design patterns for container-based distributed systems organized into foundational, behavioral, structural, and configuration categories. The Gang of Four for Kubernetes application architecture.

Build with Kubernetes

Our certified engineers are ready to design, build, and operate Kubernetes solutions tailored to your technical requirements.

Get Your Free Cloud Audit