
Standing Up a GPU-Ready Private AI Platform (Harvester + K3s + Flux + GitLab)

December 29, 2025 · 4 min read

Professional · case-study · platform-engineering · kubernetes · k3s · harvester · gpu · amd · radeon

A GPU-ready private AI platform (without the Rancher/Fleet stack)

If you want to run GPU inference 24/7, “just throw it on AWS” is a great default… until it isn’t. In my case, the constraints were more like:

  • I want predictable cost for always-on services.
  • I want data locality (and the option to keep sensitive data off the public internet).
  • I want a workflow that’s boring: Git push → CI → Flux sync → running workloads.

This post is a case study in how I stood up a small GPU-ready private AI platform using Harvester for virtualization, K3s for the workload cluster, and GitLab + Flux for GitOps. My GPU fleet is mostly AMD right now (the one NVIDIA 980 Ti I still have is out of commission), so my defaults lean ROCm-first. I’m intentionally not using the “unified Rancher ecosystem” (and I’m not using Fleet): the core idea is that the control plane should be composable and easy to evolve.

TL;DR

  • Harvester gives me a clean virtualization + storage foundation (KubeVirt + Longhorn) for cluster nodes.
  • K3s keeps the workload cluster small and fast to operate.
  • GitLab CI builds/publishes containers; Flux reconciles Kubernetes state from Git.
  • GPU enablement is mostly about consistency: drivers + device plugin + node labeling/taints + runtime assumptions.
  • The interesting part is the product layer: how services/flexinfer can ship inference features safely on top of this.
Figure 1. The “boring” control loop: GitLab → CI → registry → Flux → K3s, with Harvester underneath providing the VM substrate (and GPUs on the right).

Context: what “done” means

I’m not trying to recreate a hyperscaler. “Done” for this platform looks like:

  • A clean path to deploy and roll back GPU workloads (Helm/Kustomize + Flux).
  • A repeatable way to stand up new services under services/flexinfer without snowflake cluster changes.
  • Guardrails: secrets management, resource isolation, observability, and upgrades that don’t require heroics.

Layer 0: Harvester as the substrate

Harvester (KubeVirt on Kubernetes) is a practical middle ground between “pure bare metal k8s” and “traditional virtualization with a separate storage stack”:

  • VMs for cluster nodes: workload cluster nodes are just VMs with predictable sizing.
  • Storage: Longhorn as the default makes PVC behavior consistent (and debuggable).
  • Networking: I keep it simple and avoid cleverness unless I need it (GPU inference doesn’t benefit from exotic networking).

The operational win is that node lifecycle becomes “treat nodes like cattle again,” even when the underlying hardware is not.
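
Harvester normally provisions these VMs through its UI or Terraform provider, but under the hood each node is just a KubeVirt VirtualMachine backed by a Longhorn PVC. A trimmed sketch of what that object looks like (the name, sizing, and PVC name are illustrative, not my actual config):

apiVersion: kubevirt.io/v1
kind: VirtualMachine
metadata:
  name: k3s-worker-gpu-01                 # hypothetical node name
  namespace: default
spec:
  running: true
  template:
    spec:
      domain:
        cpu:
          cores: 8
        resources:
          requests:
            memory: 32Gi
        devices:
          disks:
            - name: rootdisk
              disk:
                bus: virtio
      volumes:
        - name: rootdisk
          persistentVolumeClaim:
            claimName: k3s-worker-gpu-01-root   # Longhorn-backed PVC created by Harvester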

Layer 1: K3s for the workload cluster

I’m using K3s for the workload cluster because it’s lightweight, well-understood, and easy to repair. The tradeoff is that you have to be disciplined about what you add:

  • Prefer “one controller per concern” (Flux, cert-manager, ingress, observability) instead of kitchen-sink bundles.
  • Keep the cluster API surface small: fewer CRDs, fewer moving parts.
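
In practice that discipline starts at install time: turn off the bundled extras and let GitOps own those concerns explicitly. A sketch of a server config along those lines (the flag choices are illustrative, not a recommendation):

# /etc/rancher/k3s/config.yaml (server node; a sketch, not a complete config)
disable:
  - traefik          # ingress controller is deployed and versioned via Flux instead
  - servicelb        # same story for the load balancer
write-kubeconfig-mode: "0644"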

Layer 2: GitLab + Flux GitOps (no Fleet)

The contract I want is simple: the cluster is a projection of Git.

At a high level:

  1. services/* repos build and test artifacts in GitLab CI.
  2. Images publish to a registry with immutable tags.
  3. A separate GitOps repo declares desired state (HelmRelease/Kustomization).
  4. Flux reconciles that state into the cluster.

A minimal Flux Kustomization looks like this (pseudocode-ish, but close to what I run):

apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: flexinfer-platform
  namespace: flux-system
spec:
  interval: 10m
  path: ./clusters/prod
  prune: true
  sourceRef:
    kind: GitRepository
    name: platform-gitops
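
That Kustomization pulls from a GitRepository source; a matching minimal sketch (the URL, branch, and secret name are placeholders):

apiVersion: source.toolkit.fluxcd.io/v1
kind: GitRepository
metadata:
  name: platform-gitops
  namespace: flux-system
spec:
  interval: 5m
  url: https://gitlab.example.com/platform/gitops.git   # placeholder URL
  ref:
    branch: main
  secretRef:
    name: gitlab-read-token                             # deploy token with read-only access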

The key operational move is to make “how things ship” legible:

  • CI owns tests + builds + SBOM/signing (if/when needed); there's a trimmed CI sketch after this list.
  • GitOps owns rollout policy and drift correction.
  • The cluster is not a place for manual edits.
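
To make the CI half of that contract concrete, here's a trimmed .gitlab-ci.yml sketch. The job names, toolchain image, and docker-in-docker approach are illustrative (kaniko or buildah work just as well); the point is tests plus an immutable, SHA-tagged image:

# .gitlab-ci.yml (trimmed sketch; job names and the dind approach are illustrative)
stages:
  - test
  - build

variables:
  DOCKER_TLS_CERTDIR: "/certs"       # needed for the dind service's TLS setup

test:
  stage: test
  image: python:3.12-slim            # whatever the service's toolchain actually is
  script:
    - pip install -r requirements.txt
    - pytest

build-image:
  stage: build
  image: docker:27
  services:
    - docker:27-dind                 # the runner must allow privileged dind
  script:
    - docker login -u "$CI_REGISTRY_USER" -p "$CI_REGISTRY_PASSWORD" "$CI_REGISTRY"
    # Immutable tag: the short commit SHA, never a mutable "latest"
    - docker build -t "$CI_REGISTRY_IMAGE:$CI_COMMIT_SHORT_SHA" .
    - docker push "$CI_REGISTRY_IMAGE:$CI_COMMIT_SHORT_SHA"
  rules:
    - if: $CI_COMMIT_BRANCH == $CI_DEFAULT_BRANCH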

GPU enablement: make it boring

GPU support fails in predictable ways: driver mismatch, runtime mismatch, or scheduling mismatch. I aim for boring invariants:

  • Drivers/runtime: pick a known-good AMD driver + ROCm combo and don’t drift casually.
  • Device plugin / operator: deploy via GitOps, pin versions, and treat upgrades like real changes (see the HelmRelease sketch after this list).
  • Scheduling: label/taint GPU nodes and make workloads declare intent.
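
For the device plugin item, "deploy via GitOps, pin versions" translates to a Flux HelmRelease with an explicit chart version. The chart and repository names below are placeholders (check the AMD device plugin docs for the real ones); the pattern is what matters:

apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
  name: amd-gpu-device-plugin
  namespace: kube-system
spec:
  interval: 30m
  chart:
    spec:
      chart: amd-gpu                 # placeholder chart name
      version: "0.16.0"              # pin an exact version; upgrades are deliberate Git changes
      sourceRef:
        kind: HelmRepository
        name: amd-gpu-charts         # placeholder HelmRepository defined in flux-system
        namespace: flux-system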

On AMD nodes, my “sanity check” is intentionally unglamorous: confirm /dev/kfd + /dev/dri exist, validate the host with rocminfo/rocm-smi, then validate the container runtime can see the device nodes (before I even look at model code).
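
To check the runtime side from inside Kubernetes, I use a throwaway privileged pod pinned to the node under test. A minimal sketch, assuming any image that ships the ROCm user-space tools (the pod name, node name, and image are placeholders):

apiVersion: v1
kind: Pod
metadata:
  name: rocm-node-check                           # hypothetical throwaway diagnostic pod
spec:
  restartPolicy: Never
  nodeName: k3s-worker-gpu-01                     # pin to the node under test (placeholder)
  containers:
    - name: check
      image: docker.io/rocm/rocm-terminal:latest  # any image with the ROCm user-space tools
      command: ["/bin/sh", "-c", "ls -l /dev/kfd /dev/dri && rocm-smi"]
      securityContext:
        privileged: true                          # exposes host device nodes to this one-off pod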

Example patterns I rely on:

  • Taint GPU nodes and require a toleration.
  • Use node selectors or affinity for “GPU-capable” pools.
  • Request GPUs explicitly (for me: amd.com/gpu: 1 via an AMD GPU device plugin) and enforce limits with policy.
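
Put together, a GPU workload sketch looks something like this; the namespace, labels, image, and sizing are placeholders, and amd.com/gpu assumes the AMD GPU device plugin is running:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: flexinfer-llm                      # hypothetical inference service
  namespace: flexinfer
spec:
  replicas: 1
  selector:
    matchLabels:
      app: flexinfer-llm
  template:
    metadata:
      labels:
        app: flexinfer-llm
    spec:
      nodeSelector:
        gpu.example.com/present: "true"    # my node label; the key is illustrative
      tolerations:
        - key: gpu
          operator: Exists
          effect: NoSchedule
      containers:
        - name: server
          image: registry.example.com/flexinfer/llm-server:v0.3.1   # placeholder image/tag
          resources:
            requests:
              cpu: "2"
              memory: 8Gi
            limits:
              memory: 8Gi
              amd.com/gpu: 1               # surfaced by the AMD GPU device plugin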

If the platform is healthy, a GPU workload should fail fast and obviously when it’s misconfigured.

How this maps to services/flexinfer (feature ideas that fit the stack)

Infrastructure is only interesting if it enables product velocity. Here are the features I'd prioritize in services/flexinfer, because they play nicely with GitLab + Flux and make GPU ops safer:

  • Inference gateway: a single entrypoint service that handles auth, routing, and per-tenant quotas; deploy it like any other stateless service.
  • Model registry + promotion: treat models like artifacts with environments (dev → staging → prod), promotion by commit/tag, and auditability.
  • Canary rollouts for model versions: use progressive delivery (e.g., weighted routing at the gateway) and measure error/latency deltas before full cutover; see the routing sketch after this list.
  • GPU pool isolation: separate “latency-critical inference” from “batch” using dedicated node pools and taints.
  • Capacity signals: expose a simple API that answers “do we have free VRAM / GPU slots?” so schedulers and CI can make decisions.
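
For the canary item specifically, weighted routing is straightforward to sketch with the Gateway API, assuming an implementation that honors backendRef weights; every name below is a placeholder:

apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: flexinfer-canary                 # hypothetical route in front of the model backends
  namespace: flexinfer
spec:
  parentRefs:
    - name: inference-gateway            # placeholder Gateway
  rules:
    - backendRefs:
        - name: llm-server-v1            # stable model version
          port: 8080
          weight: 90
        - name: llm-server-v2            # canary model version
          port: 8080
          weight: 10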

None of these require a monolithic platform. They require crisp contracts and the discipline to keep the control loop boring.

What I’d do next

  • Add policy (Kyverno/Gatekeeper) for "no GPU workloads without requests/limits"; a sketch follows this list.
  • Add a small SLO dashboard for inference: p95 latency, error rate, GPU utilization, and saturation.
  • Formalize “break-glass” operational procedures (because you will need them).
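
For the first item, here's a minimal Kyverno sketch, assuming GPU workloads live in a dedicated namespace; the namespace name and the all-containers strictness are illustrative:

apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-gpu-requests-limits
spec:
  validationFailureAction: Audit         # start in Audit, flip to Enforce once it's quiet
  rules:
    - name: gpu-pods-must-set-gpu-limits
      match:
        any:
          - resources:
              kinds:
                - Pod
              namespaces:
                - flexinfer              # hypothetical namespace where GPU workloads live
      validate:
        message: "GPU workloads must declare amd.com/gpu limits."
        pattern:
          spec:
            containers:
              # Every container in matched pods must set the limit; intentionally strict for this sketch
              - resources:
                  limits:
                    amd.com/gpu: "?*"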

If you’re building something similar, I’m happy to compare notes, especially around upgrade playbooks and the failure modes you only see after a few months of real usage.
