Hybrid/On-Prem GPU: The Boring GitOps Path
December 29, 2025 · 4 min read
Cloud GPU pricing is brutal at scale. If you have predictable, high-utilization workloads, on-prem or hybrid GPU infrastructure can cut costs 50-80%. But only if you can operate it reliably.
Here's the "boring" path: using Kubernetes and GitOps patterns to make on-prem GPU feel as reliable as managed cloud services.
When on-prem GPU makes sense
On-prem is worth considering when:
- Utilization is high: 60%+ average GPU utilization
- Workloads are predictable: Steady-state inference, not burst experimentation
- You have ops capacity: Team can handle hardware and Kubernetes
- Data locality matters: Compliance, latency, or bandwidth constraints
On-prem does not make sense for:
- Experimentation and R&D (use cloud spot instances)
- Burst workloads with unpredictable demand
- Teams without Kubernetes experience
The break-even point varies, but roughly: if you are spending $50k+/month on GPU cloud compute with high utilization, on-prem starts to pay off within 12-18 months.
Hybrid architecture overview
A typical hybrid setup separates the GPU workloads from the control plane and storage.
Key decisions:
- Control plane: Cloud-managed (simpler) or self-hosted (more control)
- Networking: VPN for dev, dedicated link for production
- Storage: Local NVMe for models, object storage for artifacts
Hardware selection: NVIDIA vs AMD
NVIDIA (CUDA)
| Pros | Cons |
|---|---|
| Best ecosystem and tooling | Higher cost |
| Widest model compatibility | Supply constraints |
| Mature drivers | Vendor lock-in risk |
| Extensive documentation | Less price competition |
Popular choices: A100 (80GB), H100, L40S
AMD (ROCm)
| Pros | Cons |
|---|---|
| 30-50% lower cost | Smaller ecosystem |
| Growing rapidly | Some model compatibility gaps |
| Open-source stack | Less documentation |
| Good for inference workloads | Requires validation effort |
Popular choices: MI300X, MI250, RX 7900 XTX (consumer, dev only)
For production inference with popular models (LLaMA, Mistral, etc.), both work. NVIDIA is the safe choice; AMD is the cost-optimized choice if you are willing to validate compatibility.
Kubernetes GPU scheduling basics
Kubernetes handles GPU scheduling via:
- Device plugins: Expose GPUs to the scheduler (NVIDIA, AMD provide these)
- Resource requests/limits: Pods request nvidia.com/gpu: 1 or amd.com/gpu: 1
- Node selectors/taints: Route GPU workloads to GPU nodes
Installing the NVIDIA device plugin
# Add the NVIDIA Helm repo
helm repo add nvdp https://nvidia.github.io/k8s-device-plugin
helm repo update
# Install the device plugin
helm install nvidia-device-plugin nvdp/nvidia-device-plugin \
--namespace kube-system \
--set gfd.enabled=true \
--set deviceListStrategy=volume-mounts
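Once the plugin pods are running, confirm the GPUs are advertised as allocatable resources. A quick check, assuming jq is installed (it mirrors the AMD check below):
# Verify GPUs are visible to the scheduler
kubectl get nodes -o json | jq '.items[].status.allocatable["nvidia.com/gpu"]'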
Installing the AMD device plugin
# Apply the AMD device plugin
kubectl apply -f https://raw.githubusercontent.com/ROCm/k8s-device-plugin/master/k8s-ds-amdgpu-dp.yaml
# Verify GPUs are visible
kubectl get nodes -o json | jq '.items[].status.allocatable["amd.com/gpu"]'
GPU pod example
apiVersion: v1
kind: Pod
metadata:
name: inference-gpu
spec:
containers:
- name: model
image: vllm/vllm-openai:latest
resources:
limits:
nvidia.com/gpu: 1 # or amd.com/gpu: 1
env:
- name: CUDA_VISIBLE_DEVICES
value: '0'
nodeSelector:
accelerator: nvidia # or amd
tolerations:
- key: nvidia.com/gpu
operator: Exists
effect: NoSchedule
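The nodeSelector and toleration above only take effect if the GPU nodes carry a matching label and taint. A minimal sketch, assuming a node named gpu-node-1 (the node name is illustrative; the NVIDIA GPU Operator can also apply labels automatically):
# Label the node so the accelerator: nvidia nodeSelector matches
kubectl label node gpu-node-1 accelerator=nvidia
# Taint it so only pods that tolerate nvidia.com/gpu are scheduled here
kubectl taint node gpu-node-1 nvidia.com/gpu=present:NoSchedule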
Multi-tenant GPU scheduling
For shared clusters, add quotas and priorities:
# Resource quota per namespace
apiVersion: v1
kind: ResourceQuota
metadata:
name: gpu-quota
namespace: team-ml
spec:
hard:
requests.nvidia.com/gpu: '4'
limits.nvidia.com/gpu: '4'
---
# Priority class for critical inference
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
name: inference-critical
value: 1000000
globalDefault: false
description: 'Critical inference workloads'
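Workloads opt into a priority class by name in their pod spec. A sketch of the relevant slice of an inference Deployment's pod template:
# In the Deployment's pod template
spec:
  priorityClassName: inference-critical
  containers:
    - name: model
      resources:
        limits:
          nvidia.com/gpu: 1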
GitOps for inference: Flux patterns
GitOps means: the desired state of your cluster is defined in Git, and a controller (Flux, ArgoCD) reconciles the cluster to match.
For GPU workloads, this gives you:
- Reproducible deployments: Same manifest = same result
- Auditable changes: Every change is a commit
- Rollback paths: git revert to undo
Repository structure
gitops/
├── clusters/
│ ├── production/
│ │ ├── flux-system/ # Flux components
│ │ └── infrastructure.yaml # Infrastructure Kustomization
│ └── staging/
│ └── ...
├── infrastructure/
│ ├── gpu-operator/ # NVIDIA GPU Operator
│ ├── monitoring/ # Prometheus + DCGM
│ └── storage/ # Longhorn or local-path
└── apps/
└── inference/
├── base/
│ ├── deployment.yaml
│ ├── service.yaml
│ └── kustomization.yaml
└── overlays/
├── staging/
│ └── kustomization.yaml
└── production/
└── kustomization.yaml
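Flux still needs Kustomization objects under clusters/production/ pointing at these paths. A minimal sketch for the apps path (the name, interval, and dependsOn wiring are illustrative, and the API version depends on your Flux release):
# clusters/production/apps.yaml
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: apps
  namespace: flux-system
spec:
  interval: 10m
  path: ./apps/inference/overlays/production
  prune: true
  sourceRef:
    kind: GitRepository
    name: flux-system
  dependsOn:
    - name: infrastructure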
HelmRelease for model deployments
apiVersion: helm.toolkit.fluxcd.io/v2beta1
kind: HelmRelease
metadata:
name: inference-api
namespace: inference
spec:
interval: 10m
chart:
spec:
chart: ./charts/inference
sourceRef:
kind: GitRepository
name: flux-system
values:
replicaCount: 3
image:
repository: vllm/vllm-openai
tag: v0.4.0
resources:
limits:
nvidia.com/gpu: 1
memory: 32Gi
requests:
nvidia.com/gpu: 1
memory: 24Gi
model:
name: meta-llama/Llama-2-7b-chat-hf
cacheDir: /models
nodeSelector:
accelerator: nvidia
tolerations:
- key: nvidia.com/gpu
operator: Exists
effect: NoSchedule
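Failed rollouts tie up expensive GPUs, so it is worth letting Flux retry and roll back automatically. A sketch of the relevant HelmRelease spec fields (retry counts are illustrative):
# Added alongside chart and values in the HelmRelease spec
install:
  remediation:
    retries: 3
upgrade:
  remediation:
    retries: 3
    remediateLastFailure: true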
Kustomize overlays for environments
# apps/inference/overlays/production/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
namespace: inference
resources:
- ../../base
patches:
- patch: |-
- op: replace
path: /spec/replicas
value: 5
target:
kind: Deployment
name: inference-api
- patch: |-
- op: add
path: /spec/template/spec/containers/0/resources/limits/nvidia.com~1gpu
value: 2
target:
kind: Deployment
name: inference-api
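The ../../base directory referenced above is just a kustomization listing the shared manifests:
# apps/inference/base/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - deployment.yaml
  - service.yaml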
ImagePolicy for automated model updates
apiVersion: image.toolkit.fluxcd.io/v1beta2
kind: ImageRepository
metadata:
name: vllm
namespace: flux-system
spec:
image: vllm/vllm-openai
interval: 1h
---
apiVersion: image.toolkit.fluxcd.io/v1beta2
kind: ImagePolicy
metadata:
name: vllm
namespace: flux-system
spec:
imageRepositoryRef:
name: vllm
policy:
semver:
range: '>=0.4.0 <1.0.0'
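On its own, an ImagePolicy only tracks which tag is newest. To have Flux commit that tag back to Git, you also need an ImageUpdateAutomation plus a setter marker on the field it should rewrite. A sketch, assuming manifests live under ./apps and commits go to main (branch, author, and path are illustrative):
apiVersion: image.toolkit.fluxcd.io/v1beta2
kind: ImageUpdateAutomation
metadata:
  name: inference-images
  namespace: flux-system
spec:
  interval: 30m
  sourceRef:
    kind: GitRepository
    name: flux-system
  git:
    checkout:
      ref:
        branch: main
    commit:
      author:
        name: fluxbot
        email: flux@example.com
      messageTemplate: 'chore: bump inference image'
    push:
      branch: main
  update:
    path: ./apps
    strategy: Setters
Then mark the tag field in the HelmRelease values with the setter comment # {"$imagepolicy": "flux-system:vllm:tag"} so the controller knows which value to rewrite.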
Networking and storage considerations
Networking
Inference traffic is often bursty and GPU nodes are expensive, so restrict what can reach the inference pods and what they can reach:
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: inference-isolation
namespace: inference
spec:
podSelector:
matchLabels:
app: inference-api
policyTypes:
- Ingress
- Egress
ingress:
- from:
- namespaceSelector:
matchLabels:
name: ingress-nginx
ports:
- port: 8000
  egress:
    - to:
        - namespaceSelector:
            matchLabels:
              name: model-storage
      ports:
        - port: 443
    # Allow DNS lookups; with Egress in policyTypes, anything not listed is blocked
    # (kubernetes.io/metadata.name is set automatically on every namespace)
    - to:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: kube-system
      ports:
        - protocol: UDP
          port: 53
        - protocol: TCP
          port: 53
Storage for model artifacts
Models need fast, reliable storage:
# Local NVMe storage class (fastest cold start)
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
name: local-nvme
provisioner: kubernetes.io/no-provisioner
volumeBindingMode: WaitForFirstConsumer
---
# PVC for model cache
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: model-cache
namespace: inference
spec:
storageClassName: local-nvme
accessModes:
- ReadWriteOnce
resources:
requests:
storage: 500Gi
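A no-provisioner storage class creates nothing by itself: each local NVMe disk needs a PersistentVolume, created by hand or by the local static provisioner. A sketch, assuming the disk is mounted at /mnt/nvme0 on a node named gpu-node-1 (both names are illustrative):
apiVersion: v1
kind: PersistentVolume
metadata:
  name: gpu-node-1-nvme0
spec:
  capacity:
    storage: 500Gi
  accessModes:
    - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  storageClassName: local-nvme
  local:
    path: /mnt/nvme0
  nodeAffinity:
    required:
      nodeSelectorTerms:
        - matchExpressions:
            - key: kubernetes.io/hostname
              operator: In
              values:
                - gpu-node-1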
Use init containers to pre-load models:
initContainers:
- name: model-download
image: alpine:latest
command:
- sh
- -c
- |
if [ ! -f /models/model.safetensors ]; then
wget -O /models/model.safetensors $MODEL_URL
fi
volumeMounts:
- name: model-cache
mountPath: /models
env:
- name: MODEL_URL
valueFrom:
secretKeyRef:
name: model-credentials
key: url
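Both the init container and the model container mount model-cache, which the pod wires to the PVC defined earlier through its volumes section:
volumes:
  - name: model-cache
    persistentVolumeClaim:
      claimName: model-cache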
Cost comparison with cloud
TCO calculation example
Assumptions: 4x A100 80GB equivalent, 3-year horizon
| Cost Component | Cloud (Reserved 1yr) | On-Prem |
|---|---|---|
| Hardware | ||
| GPU servers (4x A100) | - | $120,000 (upfront) |
| Networking/switches | - | $10,000 |
| Ongoing (per year) | ||
| Compute cost | $80,000 | - |
| Power + cooling | Included | $8,000 |
| Datacenter/colo | Included | $12,000 |
| Ops/maintenance | Low (~$5,000) | $20,000 |
| 3-Year Total | $255,000 | $250,000 |
| Monthly effective cost | $7,083 | $6,944 |
On-prem comes out ahead in this small four-GPU scenario, but only narrowly; the savings grow with scale (see the break-even analysis below), and they only hold if:
- Utilization stays above 60%
- You have ops capacity
- Hardware lasts the full 3 years
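As a rough sanity check, break-even is the upfront cost divided by the monthly savings. A sketch using the numbers from the TCO table (it ignores financing, resale value, and utilization drift):
# Break-even for the 4-GPU example
upfront=130000                     # servers + networking
onprem_monthly=$(( 40000 / 12 ))   # power + colo + ops per month
cloud_monthly=7083                 # reserved-cloud equivalent
echo $(( upfront / (cloud_monthly - onprem_monthly) ))   # prints 34, i.e. just under 35 months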
Break-even analysis
| Monthly cloud spend | Break-even point |
|---|---|
| $25,000 | 24+ months |
| $50,000 | 12-18 months |
| $100,000 | 6-12 months |
| $200,000+ | 3-6 months |
The math works if:
- You can maintain high utilization
- You have or can build ops capability
- Your workloads are stable enough to plan capacity
Operational checklist
Before going on-prem:
- GPU utilization consistently above 50%
- Workload patterns predictable (not burst-heavy)
- Team has Kubernetes experience
- Datacenter/colo space available
- Power budget sufficient (A100: ~400W each)
- Networking planned (10-25 Gbps to storage)
- Monitoring stack ready (Prometheus + DCGM)
- Backup/DR strategy defined
- Escalation path for hardware failures
Next steps: If you are considering hybrid or on-prem GPU and want help with architecture and cost modeling, the AI Infra Readiness Audit includes a cloud-vs-hybrid TCO analysis.
Related reading:
- GPU Cost Baseline - How to measure cost accurately
- Rancher + Harvester GPU Platform - A real implementation example
- AI Infra Readiness Audit checklist - The full diagnostic framework