Standing Up a GPU-Ready Private AI Platform (Harvester + K3s + Flux + GitLab)
December 29, 2025 · 4 min read
A GPU-ready private AI platform (without the Rancher/Fleet stack)
If you want to run GPU inference 24/7, “just throw it on AWS” is a great default… until it isn’t. In my case, the constraints were more like:
- I want predictable cost for always-on services.
- I want data locality (and the option to keep sensitive data off the public internet).
- I want a workflow that’s boring: Git push → CI → Flux sync → running workloads.
This post is a case study in how I stood up a small GPU-ready private AI platform using Harvester for virtualization, K3s for the workload cluster, and GitLab + Flux for GitOps. My GPU fleet is mostly AMD right now (the one NVIDIA 980 Ti I still have is out of commission), so my defaults lean ROCm-first. I’m intentionally not using the “unified Rancher ecosystem” (and I’m not using Fleet): the core idea is that the control plane should be composable and easy to evolve.
TL;DR
- Harvester gives me a clean virtualization + storage foundation (KubeVirt + Longhorn) for cluster nodes.
- K3s keeps the workload cluster small and fast to operate.
- GitLab CI builds/publishes containers; Flux reconciles Kubernetes state from Git.
- GPU enablement is mostly about consistency: drivers + device plugin + node labeling/taints + runtime assumptions.
- The interesting part is the product layer: how services/flexinfer can ship inference features safely on top of this.
Context: what “done” means
I’m not trying to recreate a hyperscaler. “Done” for this platform looks like:
- A clean path to deploy and roll back GPU workloads (Helm/Kustomize + Flux).
- A repeatable way to stand up new services under services/flexinfer without snowflake cluster changes.
- Guardrails: secrets management, resource isolation, observability, and upgrades that don’t require heroics.
Layer 0: Harvester as the substrate
Harvester (KubeVirt on Kubernetes) is a practical middle ground between “pure bare metal k8s” and “traditional virtualization with a separate storage stack”:
- VMs for cluster nodes: workload cluster nodes are just VMs with predictable sizing.
- Storage: Longhorn as the default makes PVC behavior consistent (and debuggable); a minimal PVC sketch follows below.
- Networking: I keep it simple and avoid cleverness unless I need it (GPU inference doesn’t benefit from exotic networking).
The operational win is that node lifecycle becomes “treat nodes like cattle again,” even when the underlying hardware is not.
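As a concrete example of that consistency, a workload volume is just a standard PVC. A minimal sketch, assuming a longhorn StorageClass is exposed to the workload cluster (via Longhorn in-cluster or the Harvester CSI driver); the name, namespace, and size are purely illustrative:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-cache            # illustrative: a local cache for model weights
  namespace: flexinfer
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: longhorn   # assumes a Longhorn-backed class with this name
  resources:
    requests:
      storage: 50Gi            # adjust to the models you actually cache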
Layer 1: K3s for the workload cluster
I’m using K3s for the workload cluster because it’s lightweight, well-understood, and easy to repair. The tradeoff is that you have to be disciplined about what you add:
- Prefer “one controller per concern” (Flux, cert-manager, ingress, observability) instead of kitchen-sink bundles; the cert-manager sketch below shows the shape.
- Keep the cluster API surface small: fewer CRDs, fewer moving parts.
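“One controller per concern” in practice means each of those concerns lands in Git as its own pinned release. A cert-manager sketch, assuming a recent Flux 2.x (the API versions and chart version are examples; pin whatever you have actually validated):

apiVersion: source.toolkit.fluxcd.io/v1
kind: HelmRepository
metadata:
  name: jetstack
  namespace: flux-system
spec:
  interval: 1h
  url: https://charts.jetstack.io
---
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
  name: cert-manager
  namespace: cert-manager
spec:
  interval: 30m
  install:
    createNamespace: true       # create the target namespace on first install
  chart:
    spec:
      chart: cert-manager
      version: "v1.15.x"        # pin an exact, validated version in practice
      sourceRef:
        kind: HelmRepository
        name: jetstack
        namespace: flux-system
  values:
    installCRDs: true           # let the chart manage its CRDs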
Layer 2: GitLab + Flux GitOps (no Fleet)
The contract I want is simple: the cluster is a projection of Git.
At a high level:
- services/* repos build and test artifacts in GitLab CI.
- Images publish to a registry with immutable tags.
- A separate GitOps repo declares desired state (HelmRelease/Kustomization).
- Flux reconciles that state into the cluster.
A minimal Flux Kustomization looks like this (pseudocode-ish, but close to what I run):
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: flexinfer-platform
  namespace: flux-system
spec:
  interval: 10m
  path: ./clusters/prod
  prune: true
  sourceRef:
    kind: GitRepository
    name: platform-gitops
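That sourceRef points at a GitRepository object Flux watches; a minimal sketch of the source, with the URL and secret name as placeholders for my GitLab setup:

apiVersion: source.toolkit.fluxcd.io/v1
kind: GitRepository
metadata:
  name: platform-gitops
  namespace: flux-system
spec:
  interval: 5m
  url: https://gitlab.example.com/platform/gitops.git   # placeholder URL
  ref:
    branch: main
  secretRef:
    name: gitops-deploy-token                           # read-only credentials for a private repo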
The key operational move is to make “how things ship” legible:
- CI owns tests + builds + SBOM/signing (if/when needed); a minimal pipeline sketch follows this list.
- GitOps owns rollout policy and drift correction.
- The cluster is not a place for manual edits.
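On the CI side, each service pipeline stays small. A minimal .gitlab-ci.yml sketch, assuming the GitLab container registry and a Docker-in-Docker runner (the test job’s toolchain is illustrative):

stages: [test, build]

test:
  stage: test
  image: python:3.12-slim                # whatever the service’s toolchain actually is
  script:
    - pip install -r requirements.txt
    - pytest

build:
  stage: build
  image: docker:27
  services:
    - docker:27-dind
  variables:
    IMAGE: "$CI_REGISTRY_IMAGE:$CI_COMMIT_SHORT_SHA"   # immutable tag, never :latest
  script:
    - docker login -u "$CI_REGISTRY_USER" -p "$CI_REGISTRY_PASSWORD" "$CI_REGISTRY"
    - docker build -t "$IMAGE" .
    - docker push "$IMAGE"
  rules:
    - if: $CI_COMMIT_BRANCH == "main"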
GPU enablement: make it boring
GPU support fails in predictable ways: driver mismatch, runtime mismatch, or scheduling mismatch. I aim for boring invariants:
- Drivers/runtime: pick a known-good AMD driver + ROCm combo and don’t drift casually.
- Device plugin / operator: deploy via GitOps, pin versions, and treat upgrades like real changes.
- Scheduling: label/taint GPU nodes and make workloads declare intent.
On AMD nodes, my “sanity check” is intentionally unglamorous: confirm /dev/kfd + /dev/dri exist, validate the host with rocminfo/rocm-smi, then validate the container runtime can see the device nodes (before I even look at model code).
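Once the device plugin is healthy, the same check has an in-cluster version: a throwaway pod that requests a GPU and runs rocm-smi. A sketch, assuming the AMD device plugin exposes amd.com/gpu (the image is whichever ROCm userspace image you already trust):

apiVersion: v1
kind: Pod
metadata:
  name: rocm-smoke-test
  namespace: default
spec:
  restartPolicy: Never
  containers:
    - name: rocm-smi
      image: rocm/rocm-terminal:latest   # pin a real tag in practice
      command: ["rocm-smi"]              # should list the allocated GPU if everything is wired up
      resources:
        limits:
          amd.com/gpu: 1                 # allocated by the AMD device plugin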
Example patterns I rely on:
- Taint GPU nodes and require a toleration.
- Use node selectors or affinity for “GPU-capable” pools.
- Request GPUs explicitly (for me: amd.com/gpu: 1 via an AMD GPU device plugin) and enforce limits with policy; see the pod sketch below.
If the platform is healthy, a GPU workload should fail fast and obviously when it’s misconfigured.
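Putting those patterns together, the scheduling half of a GPU workload looks roughly like this; a sketch in which the gpu label/taint key, the flexinfer names, and the image reference are all illustrative:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: flexinfer-worker
  namespace: flexinfer
spec:
  replicas: 1
  selector:
    matchLabels: { app: flexinfer-worker }
  template:
    metadata:
      labels: { app: flexinfer-worker }
    spec:
      nodeSelector:
        gpu: "true"                      # label applied to GPU-capable nodes
      tolerations:
        - key: gpu                       # matches the taint on GPU nodes
          operator: Equal
          value: "true"
          effect: NoSchedule
      containers:
        - name: inference
          image: registry.example.com/flexinfer/worker:sha-abc123   # placeholder immutable tag
          resources:
            requests:
              cpu: "2"
              memory: 8Gi
            limits:
              cpu: "4"
              memory: 16Gi
              amd.com/gpu: 1             # the explicit GPU request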
How this maps to services/flexinfer (feature ideas that fit the stack)
Infrastructure is only interesting if it enables product velocity. Here are the features I’d prioritize in services/flexinfer, because they play nicely with GitLab + Flux and make GPU ops safer:
- Inference gateway: a single entrypoint service that handles auth, routing, and per-tenant quotas; deploy it like any other stateless service.
- Model registry + promotion: treat models like artifacts with environments (dev → staging → prod), promotion by commit/tag, and auditability.
- Canary rollouts for model versions: use progressive delivery (e.g., weighted routing at the gateway; see the routing sketch at the end of this section) and measure error/latency deltas before full cutover.
- GPU pool isolation: separate “latency-critical inference” from “batch” using dedicated node pools and taints.
- Capacity signals: expose a simple API that answers “do we have free VRAM / GPU slots?” so schedulers and CI can make decisions.
None of these require a monolithic platform. They require crisp contracts and the discipline to keep the control loop boring.
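For the canary rollouts specifically, the simplest version doesn’t need a mesh: if the inference gateway sits behind the Gateway API, weighted backendRefs give you coarse traffic splitting. A sketch assuming Gateway API (v1) and two model-version Services, with all names illustrative:

apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: flexinfer-inference
  namespace: flexinfer
spec:
  parentRefs:
    - name: flexinfer-gateway            # the Gateway fronting inference traffic
  rules:
    - backendRefs:
        - name: model-v1                 # current model version
          port: 8080
          weight: 90
        - name: model-v2                 # canary model version
          port: 8080
          weight: 10                     # shift weight only while error/latency deltas stay green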
What I’d do next
- Add policy (Kyverno/Gatekeeper) for “no GPU workloads without requests/limits”; a Kyverno sketch follows this list.
- Add a small SLO dashboard for inference: p95 latency, error rate, GPU utilization, and saturation.
- Formalize “break-glass” operational procedures (because you will need them).
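For the policy item, a Kyverno sketch of the “no GPU workloads without requests/limits” rule; the workload-type: gpu label is an illustrative convention, not something Kyverno requires:

apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-gpu-resources
spec:
  validationFailureAction: Enforce
  rules:
    - name: gpu-pods-must-declare-resources
      match:
        any:
          - resources:
              kinds: [Pod]
              selector:
                matchLabels:
                  workload-type: gpu            # illustrative label for GPU workloads
      validate:
        message: "GPU workloads must declare CPU/memory requests and limits plus an amd.com/gpu limit."
        pattern:
          spec:
            containers:
              - resources:
                  requests:
                    cpu: "?*"
                    memory: "?*"
                  limits:
                    cpu: "?*"
                    memory: "?*"
                    amd.com/gpu: "?*"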
If you’re building something similar, I’m happy to compare notes, especially around upgrade playbooks and the failure modes you only see after a few months of real usage.