SLOs for Inference: Latency, Errors, Saturation
December 29, 2025 · 4 min read
Uptime is not enough for inference workloads. A model endpoint can be "up" while returning garbage, timing out on every third request, or queuing so deeply that users give up.
Here's how to define SLOs that actually reflect user experience for GPU-based inference.
Why inference needs SLOs (not just uptime)
Traditional uptime metrics (99.9%, 99.99%) measure availability: is the service responding at all? For inference, that is necessary but not sufficient.
Users care about:
- Latency: How long until they get a response?
- Accuracy: Is the response useful?
- Throughput: Can the system handle their load?
- Consistency: Do they get the same quality every time?
An inference endpoint with 99.9% uptime but p99 latency of 30 seconds is not meeting user expectations.
SLIs, SLOs, and SLAs: The hierarchy
Before defining targets, understand the relationships:
- SLI (Service Level Indicator): what you measure, e.g. the fraction of requests answered within 500ms over the last 5 minutes
- SLO (Service Level Objective): the internal target for an SLI, e.g. 99% of requests within 500ms over a rolling 30 days
- SLA (Service Level Agreement): the external, contractual commitment, typically looser than the SLO and backed by penalties
For inference, you typically need SLIs for latency, error rate, and throughput, then set an SLO on each.
The 4 golden signals for inference
Google's SRE book defines four golden signals for monitoring. For inference, they map like this:
1. Latency
- p50: Median response time (what most users experience)
- p95: Tail latency; 1 in 20 requests is at least this slow
- p99: Worst-case experience for 1 in 100 requests
- Cold start: Time to first response after scaling up a new replica (see the sketch after the queries below)
Inference-specific considerations:
- Batch size affects latency (larger batches = higher latency per request)
- Model size affects cold start (large models take longer to load)
- Token count affects response time (longer outputs take longer)
Prometheus queries for latency SLIs:
# p50, p95, p99 latency
histogram_quantile(0.50, rate(inference_request_duration_seconds_bucket[5m]))
histogram_quantile(0.95, rate(inference_request_duration_seconds_bucket[5m]))
histogram_quantile(0.99, rate(inference_request_duration_seconds_bucket[5m]))
# Latency by model
histogram_quantile(0.99,
sum by (le, model) (rate(inference_request_duration_seconds_bucket[5m]))
)
# Time to first token (for streaming)
histogram_quantile(0.95, rate(inference_ttft_seconds_bucket[5m]))
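The cold-start bullet above has no query because request-duration histograms do not capture it. A minimal sketch, assuming a hypothetical inference_model_load_duration_seconds histogram emitted each time a replica loads model weights:
# p95 model load time (hypothetical metric; emit it from your serving layer)
histogram_quantile(0.95,
  sum by (le) (rate(inference_model_load_duration_seconds_bucket[1h]))
)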
2. Error rate
- Hard errors: 5xx responses, timeouts, OOMs
- Soft errors: Successful responses with low-quality outputs
- Retries: Hidden failures that succeed on second attempt
For ML workloads, error rate should include:
- Model inference failures
- Input validation failures
- Output format failures (malformed JSON, truncated responses)
Prometheus queries for error rate SLIs:
# Overall error rate
sum(rate(inference_requests_total{status=~"5.."}[5m]))
/ sum(rate(inference_requests_total[5m]))
# Error rate by error type
sum by (error_type) (rate(inference_errors_total[5m]))
/ scalar(sum(rate(inference_requests_total[5m])))
# Timeout rate specifically
sum(rate(inference_requests_total{status="504"}[5m]))
/ sum(rate(inference_requests_total[5m]))
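Soft errors and output-format failures will not show up as 5xx responses, so they need their own counter. A sketch, assuming a hypothetical inference_soft_errors_total counter incremented by the serving layer and labeled by reason (for example malformed_json, truncated, low_confidence):
# Soft-error rate by reason (hypothetical counter)
sum by (reason) (rate(inference_soft_errors_total[5m]))
/ scalar(sum(rate(inference_requests_total[5m])))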
3. Saturation
- GPU utilization: Percentage of compute used
- Memory pressure: GPU and system memory usage
- Queue depth: Requests waiting to be processed
High saturation is not bad if latency stays within SLO. But saturation above 80-90% often predicts latency spikes.
Prometheus queries for saturation:
# GPU utilization (requires DCGM exporter)
avg(DCGM_FI_DEV_GPU_UTIL)
# GPU memory pressure
DCGM_FI_DEV_FB_USED / DCGM_FI_DEV_FB_TOTAL
# Request queue depth
inference_queue_depth
# Pending requests (waiting for GPU)
inference_requests_pending
4. Traffic
- Request rate: Requests per second
- Token throughput: Tokens generated per second (for LLMs)
- Concurrent requests: Active inference calls
Traffic patterns inform capacity planning and help explain why other signals change.
# Request rate
sum(rate(inference_requests_total[5m]))
# Token throughput
sum(rate(inference_tokens_generated_total[5m]))
# Concurrent requests
inference_requests_in_flight
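Capacity planning usually cares about peaks rather than averages. A sketch using a Prometheus subquery to find the busiest 5-minute window in the last week:
# Peak 5-minute request rate over the last 7 days
max_over_time(sum(rate(inference_requests_total[5m]))[7d:5m])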
Example SLO specification
Here is a concrete example for an inference API:
# slo-spec.yaml
service: inference-api
owner: ml-platform-team
slos:
- name: availability
description: 'Service responds to requests'
sli:
metric: |
sum(rate(inference_requests_total{status!~"5.."}[5m]))
/ sum(rate(inference_requests_total[5m]))
target: 99.9
window: 30d
- name: latency-p99
description: '99th percentile response time'
sli:
metric: |
histogram_quantile(0.99,
rate(inference_request_duration_seconds_bucket[5m]))
target: 500ms
window: 30d
- name: latency-p50
description: 'Median response time'
sli:
metric: |
histogram_quantile(0.50,
rate(inference_request_duration_seconds_bucket[5m]))
target: 100ms
window: 30d
error_budget:
- name: availability
monthly_budget: 43.2m # 0.1% of 30 days
alert_thresholds:
- burned: 50%
action: review
- burned: 75%
action: freeze_deploys
- burned: 90%
action: incident
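Something has to compute how fast that budget is burning. A minimal sketch of the multi-window burn-rate pattern from the Google SRE Workbook, assuming the same request metrics and the 99.9% availability target (0.001 error budget); it would slot into a Prometheus rule group like the one shown later in this post:
- alert: InferenceErrorBudgetFastBurn
  # 14.4x burn rate: at this pace, the 30-day budget is gone in about 2 days
  expr: |
    (
      sum(rate(inference_requests_total{status=~"5.."}[1h]))
      / sum(rate(inference_requests_total[1h]))
    ) > (14.4 * 0.001)
    and
    (
      sum(rate(inference_requests_total{status=~"5.."}[5m]))
      / sum(rate(inference_requests_total[5m]))
    ) > (14.4 * 0.001)
  for: 2m
  labels:
    severity: critical
  annotations:
    summary: 'Error budget burning too fast to last the month'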
Error budgets for ML: accuracy vs availability
Traditional error budgets measure reliability: if your SLO is 99.9% uptime, you have an error budget of 0.1% downtime per month.
For ML, you might also track:
- Accuracy budget: Acceptable percentage of low-quality responses
- Latency budget: Acceptable percentage of slow responses (see the query after this list)
- Cost budget: Acceptable variance in cost per request
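A latency budget can be tracked with the same request-duration histogram as a good-requests ratio. A sketch, assuming the histogram has a bucket boundary at the 500ms SLO threshold:
# Fraction of requests completing within 500ms
sum(rate(inference_request_duration_seconds_bucket{le="0.5"}[5m]))
/ sum(rate(inference_request_duration_seconds_count[5m]))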
Error budget calculation
The budget is simply the allowed failure fraction multiplied by the window: a 99.9% target over 30 days allows 30 × 24 × 60 × 0.001 = 43.2 minutes of violation.
| SLO Target | Monthly Error Budget |
|---|---|
| 99.0% | 7.2 hours |
| 99.5% | 3.6 hours |
| 99.9% | 43.2 minutes |
| 99.95% | 21.6 minutes |
| 99.99% | 4.3 minutes |
These help you make tradeoffs. If you are burning accuracy budget, maybe you need a better model. If you are burning latency budget, maybe you need more capacity.
When to spend error budget
Error budget is meant to be spent on:
- Deploying new model versions (risky but necessary)
- Infrastructure migrations
- Performance experiments
- Incident learning (some failures are acceptable if you learn from them)
If you never spend error budget, your SLOs are probably too loose.
Saturation and queueing
GPU inference has a queueing problem. Unlike CPU workloads where you can scale horizontally quickly, GPU scaling is:
- Slower (cold starts, model loading)
- More expensive (GPU instances cost more)
- Less granular (you add whole GPUs, not fractional compute)
This means:
- Queue depth is a leading indicator of latency problems
- Autoscaling needs to trigger earlier than for CPU workloads (see the scaling sketch below)
- Backpressure strategies (rate limiting, load shedding) matter more
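One way to trigger earlier is to scale on queue depth instead of GPU utilization. A minimal sketch, assuming KEDA is installed and the inference_queue_depth gauge is scraped by Prometheus; the deployment name and server address are placeholders:
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: inference-api-queue-scaler
spec:
  scaleTargetRef:
    name: inference-api            # placeholder Deployment name
  minReplicaCount: 2
  maxReplicaCount: 10
  cooldownPeriod: 600              # GPUs are expensive; scale down slowly
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring.svc:9090   # placeholder
        query: sum(inference_queue_depth)
        threshold: "10"            # target queue depth per replica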
Queue depth monitoring
# Queue depth over time
inference_queue_depth
# Queue wait time
histogram_quantile(0.95, rate(inference_queue_wait_seconds_bucket[5m]))
# Correlation: rising queue depth predicts rising latency
# (use on a dashboard, not for alerting)
delta(inference_queue_depth[5m]) > 10
and increase(inference_request_duration_seconds_sum[5m]) > 0
Alerting strategy
Good alerts for inference workloads:
| Signal | Warning threshold | Critical threshold |
|---|---|---|
| p99 latency | 2x SLO target | 5x SLO target |
| Error rate | 0.5% | 2% |
| GPU utilization | 80% sustained | 95% sustained |
| Queue depth | 2x normal | 10x normal |
Alert on symptoms (latency, errors) first, then investigate causes (saturation, queue depth).
Example Prometheus alerting rules
groups:
- name: inference-slos
rules:
- alert: InferenceHighLatency
expr: |
histogram_quantile(0.99,
rate(inference_request_duration_seconds_bucket[5m])
) > 1.0
for: 5m
labels:
severity: warning
annotations:
summary: 'Inference p99 latency above 1s'
description: 'p99 latency is {{ $value | humanizeDuration }}'
- alert: InferenceHighErrorRate
expr: |
sum(rate(inference_requests_total{status=~"5.."}[5m]))
/ sum(rate(inference_requests_total[5m])) > 0.02
for: 5m
labels:
severity: critical
annotations:
summary: 'Inference error rate above 2%'
- alert: InferenceQueueBacklog
expr: inference_queue_depth > 100
for: 10m
labels:
severity: warning
annotations:
summary: 'Inference request queue backing up'
description: 'Queue depth is {{ $value }}'
- alert: GPUSaturation
expr: avg(DCGM_FI_DEV_GPU_UTIL) > 95
for: 15m
labels:
severity: warning
annotations:
summary: 'GPU utilization sustained above 95%'
Defining your SLOs
Start with user expectations:
- What latency do users tolerate before abandoning the request?
- What error rate causes users to lose trust?
- What throughput do you need to handle peak load?
Then set SLOs slightly tighter than minimum viable:
- If users tolerate 5s latency, set SLO at 3s p99
- If 2% error rate causes churn, set SLO at 1%
- If peak load is 1000 req/s, plan for 1500
This gives you buffer to catch problems before they become user-visible.
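It also helps to record each SLI once so dashboards and alerts share a single definition. A sketch of Prometheus recording rules, assuming the metric names used throughout this post:
groups:
  - name: inference-sli-recordings
    rules:
      # Availability SLI: fraction of non-5xx responses
      - record: sli:inference_availability:ratio_rate5m
        expr: |
          sum(rate(inference_requests_total{status!~"5.."}[5m]))
          / sum(rate(inference_requests_total[5m]))
      # Latency SLI: p99 over 5-minute windows
      - record: sli:inference_latency_p99:5m
        expr: |
          histogram_quantile(0.99,
            sum by (le) (rate(inference_request_duration_seconds_bucket[5m])))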
Quick SLO checklist
- Latency SLI defined (p50, p95, p99)
- Error rate SLI defined (including soft errors)
- Saturation metrics collected
- SLO targets set based on user expectations
- Error budget calculated and tracked
- Alerting rules configured
- Dashboard showing current SLO status
Next steps: If you need help defining SLOs and building the observability to track them, the AI Infra Readiness Audit includes SLI/SLO definition as a core deliverable.
Related reading:
- GPU Cost Baseline - Cost metrics that complement reliability metrics
- GPU Failure Modes - Common failures that break SLOs
- AI Infra Readiness Audit checklist - The full diagnostic framework