
Standing Up a GPU-Ready Private AI Platform (Harvester + K3s + Flux + GitLab)

December 29, 2025 · 4 min read

Professional · case-study · platform-engineering · kubernetes · k3s · harvester · gpu · amd · radeon

A GPU-ready private AI platform (without the Rancher/Fleet stack)

If you want to run GPU inference 24/7, “just throw it on AWS” is a great default… until it isn’t. In my case, the constraints were more like:

  • I want predictable cost for always-on services.
  • I want data locality (and the option to keep sensitive data off the public internet).
  • I want a workflow that’s boring: Git push → CI → Flux sync → running workloads.

This post is a case study in how I stood up a small GPU-ready private AI platform using Harvester for virtualization, K3s for the workload cluster, and GitLab + Flux for GitOps. My GPU fleet is mostly AMD right now (the one NVIDIA 980 Ti I still have is out of commission), so my defaults lean ROCm-first. I’m intentionally not using the “unified Rancher ecosystem” (and I’m not using Fleet): the core idea is that the control plane should be composable and easy to evolve.

TL;DR

  • Harvester gives me a clean virtualization + storage foundation (KubeVirt + Longhorn) for cluster nodes.
  • K3s keeps the workload cluster small and fast to operate.
  • GitLab CI builds/publishes containers; Flux reconciles Kubernetes state from Git.
  • GPU enablement is mostly about consistency: drivers + device plugin + node labeling/taints + runtime assumptions.
  • The interesting part is the product layer: how services/flexinfer can ship inference features safely on top of this.
Figure 1. The “boring” control loop: GitLab → CI → registry → Flux → K3s, with Harvester underneath providing the VM substrate (and GPUs on the right).

Context: what “done” means

I’m not trying to recreate a hyperscaler. “Done” for this platform looks like:

  • A clean path to deploy and roll back GPU workloads (Helm/Kustomize + Flux).
  • A repeatable way to stand up new services under services/flexinfer without snowflake cluster changes.
  • Guardrails: secrets management, resource isolation, observability, and upgrades that don’t require heroics.

Layer 0: Harvester as the substrate

Harvester (KubeVirt on Kubernetes) is a practical middle ground between “pure bare metal k8s” and “traditional virtualization with a separate storage stack”:

  • VMs for cluster nodes: workload cluster nodes are just VMs with predictable sizing.
  • Storage: Longhorn as the default makes PVC behavior consistent (and debuggable).
  • Networking: I keep it simple and avoid cleverness unless I need it (GPU inference doesn’t benefit from exotic networking).

The operational win is that node lifecycle becomes “treat nodes like cattle again,” even when the underlying hardware is not.
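
Harvester normally provisions these VMs through its UI or Terraform provider, but under the hood each node is just a KubeVirt VirtualMachine backed by a Longhorn PVC. A trimmed sketch of what that object looks like (the name, sizing, and PVC name are illustrative, not my actual config):

apiVersion: kubevirt.io/v1
kind: VirtualMachine
metadata:
  name: k3s-worker-gpu-01                 # hypothetical node name
  namespace: default
spec:
  running: true
  template:
    spec:
      domain:
        cpu:
          cores: 8
        resources:
          requests:
            memory: 32Gi
        devices:
          disks:
            - name: rootdisk
              disk:
                bus: virtio
      volumes:
        - name: rootdisk
          persistentVolumeClaim:
            claimName: k3s-worker-gpu-01-root   # Longhorn-backed PVC created by Harvester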

Layer 1: K3s for the workload cluster

I’m using K3s for the workload cluster because it’s lightweight, well-understood, and easy to repair. The tradeoff is that you have to be disciplined about what you add:

  • Prefer “one controller per concern” (Flux, cert-manager, ingress, observability) instead of kitchen-sink bundles.
  • Keep the cluster API surface small: fewer CRDs, fewer moving parts.
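
In practice that discipline starts at install time: turn off the bundled extras and let GitOps own those concerns explicitly. A sketch of a server config along those lines (the flag choices are illustrative, not a recommendation):

# /etc/rancher/k3s/config.yaml (server node; a sketch, not a complete config)
disable:
  - traefik          # ingress controller is deployed and versioned via Flux instead
  - servicelb        # same story for the load balancer
write-kubeconfig-mode: "0644"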

Layer 2: GitLab + Flux GitOps (no Fleet)

The contract I want is simple: the cluster is a projection of Git.

At a high level:

  1. services/* repos build and test artifacts in GitLab CI.
  2. Images publish to a registry with immutable tags.
  3. A separate GitOps repo declares desired state (HelmRelease/Kustomization).
  4. Flux reconciles that state into the cluster.

A minimal Flux Kustomization looks like this (pseudocode-ish, but close to what I run):

apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: flexinfer-platform
  namespace: flux-system
spec:
  interval: 10m
  path: ./clusters/prod
  prune: true
  sourceRef:
    kind: GitRepository
    name: platform-gitops
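
That Kustomization pulls from a GitRepository source; a matching minimal sketch (the URL, branch, and secret name are placeholders):

apiVersion: source.toolkit.fluxcd.io/v1
kind: GitRepository
metadata:
  name: platform-gitops
  namespace: flux-system
spec:
  interval: 5m
  url: https://gitlab.example.com/platform/gitops.git   # placeholder URL
  ref:
    branch: main
  secretRef:
    name: gitlab-read-token                             # deploy token with read-only access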

The key operational move is to make “how things ship” legible:

  • CI owns tests + builds + SBOM/signing (if/when needed); there's a trimmed CI sketch after this list.
  • GitOps owns rollout policy and drift correction.
  • The cluster is not a place for manual edits.
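
To make the CI half of that contract concrete, here's a trimmed .gitlab-ci.yml sketch. The job names, toolchain image, and docker-in-docker approach are illustrative (kaniko or buildah work just as well); the point is tests plus an immutable, SHA-tagged image:

# .gitlab-ci.yml (trimmed sketch; job names and the dind approach are illustrative)
stages:
  - test
  - build

variables:
  DOCKER_TLS_CERTDIR: "/certs"       # needed for the dind service's TLS setup

test:
  stage: test
  image: python:3.12-slim            # whatever the service's toolchain actually is
  script:
    - pip install -r requirements.txt
    - pytest

build-image:
  stage: build
  image: docker:27
  services:
    - docker:27-dind                 # the runner must allow privileged dind
  script:
    - docker login -u "$CI_REGISTRY_USER" -p "$CI_REGISTRY_PASSWORD" "$CI_REGISTRY"
    # Immutable tag: the short commit SHA, never a mutable "latest"
    - docker build -t "$CI_REGISTRY_IMAGE:$CI_COMMIT_SHORT_SHA" .
    - docker push "$CI_REGISTRY_IMAGE:$CI_COMMIT_SHORT_SHA"
  rules:
    - if: $CI_COMMIT_BRANCH == $CI_DEFAULT_BRANCH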

GPU enablement: make it boring

GPU support fails in predictable ways: driver mismatch, runtime mismatch, or scheduling mismatch. I aim for boring invariants:

  • Drivers/runtime: pick a known-good AMD driver + ROCm combo and don’t drift casually.
  • Device plugin / operator: deploy via GitOps, pin versions, and treat upgrades like real changes (see the HelmRelease sketch after this list).
  • Scheduling: label/taint GPU nodes and make workloads declare intent.
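
For the device plugin item, "deploy via GitOps, pin versions" translates to a Flux HelmRelease with an explicit chart version. The chart and repository names below are placeholders (check the AMD device plugin docs for the real ones); the pattern is what matters:

apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
  name: amd-gpu-device-plugin
  namespace: kube-system
spec:
  interval: 30m
  chart:
    spec:
      chart: amd-gpu                 # placeholder chart name
      version: "0.16.0"              # pin an exact version; upgrades are deliberate Git changes
      sourceRef:
        kind: HelmRepository
        name: amd-gpu-charts         # placeholder HelmRepository defined in flux-system
        namespace: flux-system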

On AMD nodes, my “sanity check” is intentionally unglamorous: confirm /dev/kfd + /dev/dri exist, validate the host with rocminfo/rocm-smi, then validate the container runtime can see the device nodes (before I even look at model code).
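
To check the runtime side from inside Kubernetes, I use a throwaway privileged pod pinned to the node under test. A minimal sketch, assuming any image that ships the ROCm user-space tools (the pod name, node name, and image are placeholders):

apiVersion: v1
kind: Pod
metadata:
  name: rocm-node-check                           # hypothetical throwaway diagnostic pod
spec:
  restartPolicy: Never
  nodeName: k3s-worker-gpu-01                     # pin to the node under test (placeholder)
  containers:
    - name: check
      image: docker.io/rocm/rocm-terminal:latest  # any image with the ROCm user-space tools
      command: ["/bin/sh", "-c", "ls -l /dev/kfd /dev/dri && rocm-smi"]
      securityContext:
        privileged: true                          # exposes host device nodes to this one-off pod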

Example patterns I rely on:

  • Taint GPU nodes and require a toleration.
  • Use node selectors or affinity for “GPU-capable” pools.
  • Request GPUs explicitly (for me: amd.com/gpu: 1 via an AMD GPU device plugin) and enforce limits with policy.
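
Put together, a GPU workload sketch looks something like this; the namespace, labels, image, and sizing are placeholders, and amd.com/gpu assumes the AMD GPU device plugin is running:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: flexinfer-llm                      # hypothetical inference service
  namespace: flexinfer
spec:
  replicas: 1
  selector:
    matchLabels:
      app: flexinfer-llm
  template:
    metadata:
      labels:
        app: flexinfer-llm
    spec:
      nodeSelector:
        gpu.example.com/present: "true"    # my node label; the key is illustrative
      tolerations:
        - key: gpu
          operator: Exists
          effect: NoSchedule
      containers:
        - name: server
          image: registry.example.com/flexinfer/llm-server:v0.3.1   # placeholder image/tag
          resources:
            requests:
              cpu: "2"
              memory: 8Gi
            limits:
              memory: 8Gi
              amd.com/gpu: 1               # surfaced by the AMD GPU device plugin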

If the platform is healthy, a GPU workload should fail fast and obviously when it’s misconfigured.

How this maps to services/flexinfer (feature ideas that fit the stack)

Infrastructure is only interesting if it enables product velocity. Here are the features I'd prioritize in services/flexinfer, because they play nicely with GitLab + Flux and make GPU ops safer:

  • Inference gateway: a single entrypoint service that handles auth, routing, and per-tenant quotas; deploy it like any other stateless service.
  • Model registry + promotion: treat models like artifacts with environments (dev → staging → prod), promotion by commit/tag, and auditability.
  • Canary rollouts for model versions: use progressive delivery (e.g., weighted routing at the gateway) and measure error/latency deltas before full cutover; see the routing sketch after this list.
  • GPU pool isolation: separate “latency-critical inference” from “batch” using dedicated node pools and taints.
  • Capacity signals: expose a simple API that answers “do we have free VRAM / GPU slots?” so schedulers and CI can make decisions.
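
For the canary item specifically, weighted routing is straightforward to sketch with the Gateway API, assuming an implementation that honors backendRef weights; every name below is a placeholder:

apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: flexinfer-canary                 # hypothetical route in front of the model backends
  namespace: flexinfer
spec:
  parentRefs:
    - name: inference-gateway            # placeholder Gateway
  rules:
    - backendRefs:
        - name: llm-server-v1            # stable model version
          port: 8080
          weight: 90
        - name: llm-server-v2            # canary model version
          port: 8080
          weight: 10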

None of these require a monolithic platform. They require crisp contracts and the discipline to keep the control loop boring.

What I’d do next

  • Add policy (Kyverno/Gatekeeper) for "no GPU workloads without requests/limits"; a sketch follows this list.
  • Add a small SLO dashboard for inference: p95 latency, error rate, GPU utilization, and saturation.
  • Formalize “break-glass” operational procedures (because you will need them).
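
For the first item, here's a minimal Kyverno sketch, assuming GPU workloads live in a dedicated namespace; the namespace name and the all-containers strictness are illustrative:

apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-gpu-requests-limits
spec:
  validationFailureAction: Audit         # start in Audit, flip to Enforce once it's quiet
  rules:
    - name: gpu-pods-must-set-gpu-limits
      match:
        any:
          - resources:
              kinds:
                - Pod
              namespaces:
                - flexinfer              # hypothetical namespace where GPU workloads live
      validate:
        message: "GPU workloads must declare amd.com/gpu limits."
        pattern:
          spec:
            containers:
              # Every container in matched pods must set the limit; intentionally strict for this sketch
              - resources:
                  limits:
                    amd.com/gpu: "?*"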

If you’re building something similar, I’m happy to compare notes, especially around upgrade playbooks and the failure modes you only see after a few months of real usage.
