Hybrid/On-Prem GPU: The Boring GitOps Path

December 29, 2025 · 4 min read

Professional · gpu · kubernetes · gitops · on-prem · hybrid · ai-infra-readiness

Cloud GPU pricing is brutal at scale. If you have predictable, high-utilization workloads, on-prem or hybrid GPU infrastructure can cut costs by 50-80%, but only if you can operate it reliably.

Here's the "boring" path: using Kubernetes and GitOps patterns to make on-prem GPU feel as reliable as managed cloud services.

When on-prem GPU makes sense

On-prem is worth considering when:

  • Utilization is high: 60%+ average GPU utilization
  • Workloads are predictable: Steady-state inference, not burst experimentation
  • You have ops capacity: Team can handle hardware and Kubernetes
  • Data locality matters: Compliance, latency, or bandwidth constraints

On-prem does not make sense for:

  • Experimentation and R&D (use cloud spot instances)
  • Burst workloads with unpredictable demand
  • Teams without Kubernetes experience

The break-even point varies, but roughly: if you are spending $50k+/month on GPU cloud compute with high utilization, on-prem starts to pay off within 12-18 months.

Hybrid architecture overview

A typical hybrid setup separates GPU workloads from control plane and storage:

[Figure: hybrid architecture diagram showing a cloud control plane and object storage connected over a VPN or direct link to on-prem GPU Kubernetes workers, with a local model cache, monitoring, and optional cloud-burst capacity.]
Figure 1. Keep the control plane boring; keep GPU capacity close to data; use cloud burst as a pressure valve.

Key decisions:

  • Control plane: Cloud-managed (simpler) or self-hosted (more control)
  • Networking: VPN for dev, dedicated link for production
  • Storage: Local NVMe for models, object storage for artifacts

Hardware selection: NVIDIA vs AMD

NVIDIA (CUDA)

Pros:

  • Best ecosystem and tooling
  • Widest model compatibility
  • Mature drivers
  • Extensive documentation

Cons:

  • Higher cost
  • Supply constraints
  • Vendor lock-in risk
  • Less price competition

Popular choices: A100 (80GB), H100, L40S

AMD (ROCm)

Pros:

  • 30-50% lower cost
  • Growing rapidly
  • Open-source stack
  • Good for inference workloads

Cons:

  • Smaller ecosystem
  • Some model compatibility gaps
  • Less documentation
  • Requires validation effort

Popular choices: MI300X, MI250, RX 7900 XTX (consumer, dev only)

For production inference with popular models (LLaMA, Mistral, etc.), both work. NVIDIA is the safe choice; AMD is the cost-optimized choice if you are willing to validate compatibility.
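
Whichever vendor you choose, verify that the host drivers actually see the hardware before involving Kubernetes. A quick check on each GPU node (nvidia-smi ships with the NVIDIA driver, rocm-smi with the ROCm stack):

# NVIDIA: list GPUs, driver version, and utilization
nvidia-smi

# AMD: the ROCm equivalent
rocm-smi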

Kubernetes GPU scheduling basics

Kubernetes handles GPU scheduling via:

  1. Device plugins: Expose GPUs to the scheduler (NVIDIA, AMD provide these)
  2. Resource requests/limits: Pods request nvidia.com/gpu: 1 or amd.com/gpu: 1
  3. Node selectors/taints: Route GPU workloads to GPU nodes
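
The labels and taints in step 3 are not created for you (unless you run something like GPU Feature Discovery). A minimal manual setup, assuming a hypothetical node named gpu-node-01, looks like this:

# Label the node so GPU workloads can target it via nodeSelector
kubectl label nodes gpu-node-01 accelerator=nvidia

# Taint it so non-GPU workloads stay off unless they tolerate the taint
kubectl taint nodes gpu-node-01 nvidia.com/gpu=present:NoSchedule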

Installing the NVIDIA device plugin

# Add the NVIDIA Helm repo
helm repo add nvdp https://nvidia.github.io/k8s-device-plugin
helm repo update

# Install the device plugin
helm install nvidia-device-plugin nvdp/nvidia-device-plugin \
  --namespace kube-system \
  --set gfd.enabled=true \
  --set deviceListStrategy=volume-mounts
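
Once the plugin pods are running, the GPUs should show up as allocatable resources (mirroring the AMD check below):

# Verify GPUs are advertised to the scheduler
kubectl get nodes -o json | jq '.items[].status.allocatable["nvidia.com/gpu"]'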

Installing the AMD device plugin

# Apply the AMD device plugin
kubectl apply -f https://raw.githubusercontent.com/ROCm/k8s-device-plugin/master/k8s-ds-amdgpu-dp.yaml

# Verify GPUs are visible
kubectl get nodes -o json | jq '.items[].status.allocatable["amd.com/gpu"]'

GPU pod example

apiVersion: v1
kind: Pod
metadata:
  name: inference-gpu
spec:
  containers:
    - name: model
      image: vllm/vllm-openai:latest
      resources:
        limits:
          nvidia.com/gpu: 1 # or amd.com/gpu: 1
      env:
        - name: CUDA_VISIBLE_DEVICES
          value: '0'
  nodeSelector:
    accelerator: nvidia # or amd
  tolerations:
    - key: nvidia.com/gpu
      operator: Exists
      effect: NoSchedule
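
Assuming the NVIDIA container toolkit is installed on the node, you can confirm the pod landed on a GPU node and can see its device:

# Check placement and GPU visibility from inside the pod
kubectl get pod inference-gpu -o wide
kubectl exec inference-gpu -- nvidia-smi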

Multi-tenant GPU scheduling

For shared clusters, add quotas and priorities:

# Resource quota per namespace
apiVersion: v1
kind: ResourceQuota
metadata:
  name: gpu-quota
  namespace: team-ml
spec:
  hard:
    requests.nvidia.com/gpu: '4'
    limits.nvidia.com/gpu: '4'
---
# Priority class for critical inference
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: inference-critical
value: 1000000
globalDefault: false
description: 'Critical inference workloads'
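
A PriorityClass has no effect until workloads reference it. A minimal sketch of wiring it into an inference Deployment (names here are illustrative):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: inference-api
  namespace: team-ml
spec:
  replicas: 1
  selector:
    matchLabels:
      app: inference-api
  template:
    metadata:
      labels:
        app: inference-api
    spec:
      # Higher-priority pods can preempt lower-priority ones when GPUs are scarce
      priorityClassName: inference-critical
      containers:
        - name: model
          image: vllm/vllm-openai:latest
          resources:
            limits:
              nvidia.com/gpu: 1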

GitOps for inference: Flux patterns

GitOps means: the desired state of your cluster is defined in Git, and a controller (Flux, ArgoCD) reconciles the cluster to match.

For GPU workloads, this gives you:

  • Reproducible deployments: Same manifest = same result
  • Auditable changes: Every change is a commit
  • Rollback paths: git revert to undo
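
If you are starting from scratch, Flux can be bootstrapped against a repository laid out like the structure below (the owner and repository names here are placeholders):

# Install Flux and point it at the production cluster path
flux bootstrap github \
  --owner=your-org \
  --repository=gpu-gitops \
  --branch=main \
  --path=clusters/production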

Repository structure

gitops/
├── clusters/
│   ├── production/
│   │   ├── flux-system/           # Flux components
│   │   └── infrastructure.yaml    # Infrastructure Kustomization
│   └── staging/
│       └── ...
├── infrastructure/
│   ├── gpu-operator/              # NVIDIA GPU Operator
│   ├── monitoring/                # Prometheus + DCGM
│   └── storage/                   # Longhorn or local-path
└── apps/
    └── inference/
        ├── base/
        │   ├── deployment.yaml
        │   ├── service.yaml
        │   └── kustomization.yaml
        └── overlays/
            ├── staging/
            │   └── kustomization.yaml
            └── production/
                └── kustomization.yaml
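
The infrastructure.yaml above is a Flux Kustomization that tells the controller which path to reconcile. A minimal sketch (the apps overlay gets a similar Kustomization that depends on this one):

# clusters/production/infrastructure.yaml
apiVersion: kustomize.toolkit.fluxcd.io/v1beta2
kind: Kustomization
metadata:
  name: infrastructure
  namespace: flux-system
spec:
  interval: 10m
  path: ./infrastructure
  prune: true
  sourceRef:
    kind: GitRepository
    name: flux-system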

HelmRelease for model deployments

apiVersion: helm.toolkit.fluxcd.io/v2beta1
kind: HelmRelease
metadata:
  name: inference-api
  namespace: inference
spec:
  interval: 10m
  chart:
    spec:
      chart: ./charts/inference
      sourceRef:
        kind: GitRepository
        name: flux-system
  values:
    replicaCount: 3
    image:
      repository: vllm/vllm-openai
      tag: v0.4.0
    resources:
      limits:
        nvidia.com/gpu: 1
        memory: 32Gi
      requests:
        nvidia.com/gpu: 1
        memory: 24Gi
    model:
      name: meta-llama/Llama-2-7b-chat-hf
      cacheDir: /models
    nodeSelector:
      accelerator: nvidia
    tolerations:
      - key: nvidia.com/gpu
        operator: Exists
        effect: NoSchedule

Kustomize overlays for environments

# apps/inference/overlays/production/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
namespace: inference
resources:
  - ../../base
patches:
  - patch: |-
      - op: replace
        path: /spec/replicas
        value: 5
    target:
      kind: Deployment
      name: inference-api
  - patch: |-
      - op: add
        path: /spec/template/spec/containers/0/resources/limits/nvidia.com~1gpu
        value: 2
    target:
      kind: Deployment
      name: inference-api
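
Before committing an overlay change, render it locally to confirm the patches apply cleanly:

# Render the production overlay without applying it
kubectl kustomize apps/inference/overlays/production

# Or, with the standalone kustomize binary
kustomize build apps/inference/overlays/production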

ImagePolicy for automated model updates

apiVersion: image.toolkit.fluxcd.io/v1beta2
kind: ImageRepository
metadata:
  name: vllm
  namespace: flux-system
spec:
  image: vllm/vllm-openai
  interval: 1h
---
apiVersion: image.toolkit.fluxcd.io/v1beta2
kind: ImagePolicy
metadata:
  name: vllm
  namespace: flux-system
spec:
  imageRepositoryRef:
    name: vllm
  policy:
    semver:
      range: '>=0.4.0 <1.0.0'
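
Note that an ImagePolicy only tracks tags; to have Flux write updated tags back to Git you also need an ImageUpdateAutomation resource plus a marker comment on the field it should edit, along these lines:

# In the HelmRelease values, mark the tag field for Flux to update
image:
  repository: vllm/vllm-openai
  tag: v0.4.0 # {"$imagepolicy": "flux-system:vllm:tag"}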

Networking and storage considerations

Networking

Inference traffic is often bursty; protect GPU nodes:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: inference-isolation
  namespace: inference
spec:
  podSelector:
    matchLabels:
      app: inference-api
  policyTypes:
    - Ingress
    - Egress
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              name: ingress-nginx
      ports:
        - port: 8000
  egress:
    - to:
        - namespaceSelector:
            matchLabels:
              name: model-storage
      ports:
        - port: 443
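
One caveat: because this policy restricts egress, it also blocks DNS, so the pod cannot resolve the storage endpoint. You will usually need an extra egress rule to kube-dns, roughly like this (the kubernetes.io/metadata.name label is set automatically on namespaces in recent Kubernetes versions):

# Additional egress rule allowing DNS lookups via kube-dns
- to:
    - namespaceSelector:
        matchLabels:
          kubernetes.io/metadata.name: kube-system
  ports:
    - port: 53
      protocol: UDP
    - port: 53
      protocol: TCP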

Storage for model artifacts

Models need fast, reliable storage:

# Local NVMe storage class (fastest cold start)
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: local-nvme
provisioner: kubernetes.io/no-provisioner
volumeBindingMode: WaitForFirstConsumer
---
# PVC for model cache
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-cache
  namespace: inference
spec:
  storageClassName: local-nvme
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 500Gi
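
Because the class uses no-provisioner, PersistentVolumes are not created automatically: each local NVMe disk needs a manually defined PV pinned to its node. A sketch, assuming a hypothetical node gpu-node-01 with NVMe mounted at /mnt/nvme0:

apiVersion: v1
kind: PersistentVolume
metadata:
  name: model-cache-gpu-node-01
spec:
  capacity:
    storage: 500Gi
  accessModes:
    - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  storageClassName: local-nvme
  local:
    path: /mnt/nvme0 # pre-formatted and mounted on the node
  nodeAffinity:
    required:
      nodeSelectorTerms:
        - matchExpressions:
            - key: kubernetes.io/hostname
              operator: In
              values:
                - gpu-node-01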

Use init containers to pre-load models:

initContainers:
  - name: model-download
    image: alpine:latest
    command:
      - sh
      - -c
      - |
        if [ ! -f /models/model.safetensors ]; then
          wget -O /models/model.safetensors $MODEL_URL
        fi
    volumeMounts:
      - name: model-cache
        mountPath: /models
    env:
      - name: MODEL_URL
        valueFrom:
          secretKeyRef:
            name: model-credentials
            key: url

Cost comparison with cloud

TCO calculation example

Assumptions: 4x A100 80GB equivalent, 3-year horizon

Cost component            Cloud (Reserved 1yr)    On-Prem
Hardware (upfront)
  GPU servers (4x A100)   -                       $120,000
  Networking/switches     -                       $10,000
Ongoing (per year)
  Compute cost            $80,000                 -
  Power + cooling         Included                $8,000
  Datacenter/colo         Included                $12,000
  Ops/maintenance         Low (~$5,000)           $20,000
3-Year Total              $255,000                $190,000
Monthly effective cost    $7,083                  $5,278

On-prem wins by ~25% in this scenario, but only if:

  • Utilization stays above 60%
  • You have ops capacity
  • Hardware lasts the full 3 years

Break-even analysis

Monthly cloud spend    Break-even point
$25,000                24+ months
$50,000                12-18 months
$100,000               6-12 months
$200,000+              3-6 months

The math works if:

  • You can maintain high utilization
  • You have or can build ops capability
  • Your workloads are stable enough to plan capacity

Operational checklist

Before going on-prem:

  • GPU utilization consistently above 50%
  • Workload patterns predictable (not burst-heavy)
  • Team has Kubernetes experience
  • Datacenter/colo space available
  • Power budget sufficient (A100: ~400W each)
  • Networking planned (10-25 Gbps to storage)
  • Monitoring stack ready (Prometheus + DCGM)
  • Backup/DR strategy defined
  • Escalation path for hardware failures

Next steps: If you are considering hybrid or on-prem GPU and want help with architecture and cost modeling, the AI Infra Readiness Audit includes a cloud-vs-hybrid TCO analysis.
