SLOs for Inference: Latency, Errors, Saturation

December 29, 2025 · 4 min read

Professional · reliability · slo · inference · mlops · ai-infra-readiness

Uptime is not enough for inference workloads. A model endpoint can be "up" while returning garbage, timing out on every third request, or queuing so deeply that users give up.

Here's how to define SLOs that actually reflect user experience for GPU-based inference.

Why inference needs SLOs (not just uptime)

Traditional uptime metrics (99.9%, 99.99%) measure availability: is the service responding at all? For inference, that is necessary but not sufficient.

Users care about:

  • Latency: How long until they get a response?
  • Accuracy: Is the response useful?
  • Throughput: Can the system handle their load?
  • Consistency: Do they get the same quality every time?

An inference endpoint with 99.9% uptime but p99 latency of 30 seconds is not meeting user expectations.

SLIs, SLOs, and SLAs: The hierarchy

Before defining targets, understand the relationships:

Diagram showing the hierarchy from SLIs to SLOs to SLAs, with arrows labeled measured by and informed by, and examples for each layer.
Figure 1. Get the order right: measurable signals (SLIs) drive internal targets (SLOs), which drive external commitments (SLAs).

For inference, you typically need SLIs for latency, error rate, and throughput - then set SLOs on each.

The 4 golden signals for inference

Google's SRE book defines four golden signals for monitoring. For inference, they map like this:

1. Latency

  • p50: Median response time (what most users experience)
  • p95: Tail latency; 1 in 20 requests is slower than this
  • p99: Worst-case experience for 1 in 100 requests
  • Cold start: Time for first request after scaling up

Inference-specific considerations:

  • Batch size affects latency (larger batches = higher latency per request)
  • Model size affects cold start (large models take longer to load)
  • Token count affects response time (longer outputs take longer)

Prometheus queries for latency SLIs:

# p50, p95, p99 latency
histogram_quantile(0.50, rate(inference_request_duration_seconds_bucket[5m]))
histogram_quantile(0.95, rate(inference_request_duration_seconds_bucket[5m]))
histogram_quantile(0.99, rate(inference_request_duration_seconds_bucket[5m]))

# Latency by model
histogram_quantile(0.99,
  sum by (le, model) (rate(inference_request_duration_seconds_bucket[5m]))
)

# Time to first token (for streaming)
histogram_quantile(0.95, rate(inference_ttft_seconds_bucket[5m]))
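Because longer outputs legitimately take longer, it can also help to track time per generated token, which separates "the model is slow" from "the answers are long". A minimal sketch, assuming the inference_tokens_generated_total counter introduced under Traffic below exists and shares a model label with the duration histogram:

# Average seconds per generated token (rough output-length-normalized latency)
sum(rate(inference_request_duration_seconds_sum[5m]))
  / sum(rate(inference_tokens_generated_total[5m]))

# Same, broken down by model
sum by (model) (rate(inference_request_duration_seconds_sum[5m]))
  / sum by (model) (rate(inference_tokens_generated_total[5m]))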

2. Error rate

  • Hard errors: 5xx responses, timeouts, OOMs
  • Soft errors: Successful responses with low-quality outputs
  • Retries: Hidden failures that succeed on second attempt

For ML workloads, error rate should include:

  • Model inference failures
  • Input validation failures
  • Output format failures (malformed JSON, truncated responses)

Prometheus queries for error rate SLIs:

# Overall error rate
sum(rate(inference_requests_total{status=~"5.."}[5m]))
  / sum(rate(inference_requests_total[5m]))

# Error rate by error type
sum by (error_type) (rate(inference_errors_total[5m]))
  / sum(rate(inference_requests_total[5m]))

# Timeout rate specifically
sum(rate(inference_requests_total{status="504"}[5m]))
  / sum(rate(inference_requests_total[5m]))
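The queries above only capture hard errors. If your serving layer also increments a counter when output validation fails (malformed JSON, truncated responses), the same pattern gives you a soft-error rate. The counter name here is hypothetical; use whatever your validator actually emits:

# Hypothetical soft-error rate, assuming the serving layer increments
# inference_soft_errors_total (e.g. reason="malformed_json" or "truncated")
# whenever a response fails output validation
sum(rate(inference_soft_errors_total[5m]))
  / sum(rate(inference_requests_total[5m]))

# Soft errors by reason
sum by (reason) (rate(inference_soft_errors_total[5m]))
  / sum(rate(inference_requests_total[5m]))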

3. Saturation

  • GPU utilization: Percentage of compute used
  • Memory pressure: GPU and system memory usage
  • Queue depth: Requests waiting to be processed

High saturation is not bad if latency stays within SLO. But saturation above 80-90% often predicts latency spikes.

Prometheus queries for saturation:

# GPU utilization (requires DCGM exporter)
avg(DCGM_FI_DEV_GPU_UTIL)

# GPU memory pressure
DCGM_FI_DEV_FB_USED / DCGM_FI_DEV_FB_TOTAL

# Request queue depth
inference_queue_depth

# Pending requests (waiting for GPU)
inference_requests_pending

4. Traffic

  • Request rate: Requests per second
  • Token throughput: Tokens generated per second (for LLMs)
  • Concurrent requests: Active inference calls

Traffic patterns inform capacity planning and help explain why other signals change.

# Request rate
sum(rate(inference_requests_total[5m]))

# Token throughput
sum(rate(inference_tokens_generated_total[5m]))

# Concurrent requests
inference_requests_in_flight

Example SLO specification

Here is a concrete example for an inference API:

# slo-spec.yaml
service: inference-api
owner: ml-platform-team

slos:
  - name: availability
    description: 'Service responds to requests'
    sli:
      metric: |
        sum(rate(inference_requests_total{status!~"5.."}[5m]))
        / sum(rate(inference_requests_total[5m]))
    target: 99.9
    window: 30d

  - name: latency-p99
    description: '99th percentile response time'
    sli:
      metric: |
        histogram_quantile(0.99,
          rate(inference_request_duration_seconds_bucket[5m]))
    target: 500ms
    window: 30d

  - name: latency-p50
    description: 'Median response time'
    sli:
      metric: |
        histogram_quantile(0.50,
          rate(inference_request_duration_seconds_bucket[5m]))
    target: 100ms
    window: 30d

error_budget:
  - name: availability
    monthly_budget: 43.2m # 0.1% of 30 days
    alert_thresholds:
      - burned: 50%
        action: review
      - burned: 75%
        action: freeze_deploys
      - burned: 90%
        action: incident
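The alert_thresholds above assume you can tell how fast the budget is burning. One way to make that concrete is a pair of recording rules that turn the raw error ratio into a burn rate (observed error ratio divided by the budget, here 0.001 for the 99.9% availability target). This is a minimal sketch using the metric names from this post; the rule names are illustrative:

# burn-rate-rules.yaml (sketch)
groups:
  - name: inference-error-budget
    rules:
      # Error ratio over the last hour
      - record: inference:error_ratio:rate1h
        expr: |
          sum(rate(inference_requests_total{status=~"5.."}[1h]))
          / sum(rate(inference_requests_total[1h]))
      # Burn rate relative to a 99.9% target: 1 means burning exactly at budget;
      # 14.4 sustained for an hour consumes ~2% of a 30-day budget and is a
      # common paging threshold
      - record: inference:availability_burn_rate:1h
        expr: inference:error_ratio:rate1h / 0.001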

Error budgets for ML: accuracy vs availability

Traditional error budgets measure reliability: if your SLO is 99.9% uptime, you have an error budget of 0.1% downtime per month.

For ML, you might also track:

  • Accuracy budget: Acceptable percentage of low-quality responses
  • Latency budget: Acceptable percentage of slow responses
  • Cost budget: Acceptable cost per request variance

Error budget calculation

SLO target     Monthly error budget (30-day window)
99.0%          7.2 hours
99.5%          3.6 hours
99.9%          43.2 minutes
99.95%         21.6 minutes
99.99%         4.3 minutes

These help you make tradeoffs. If you are burning accuracy budget, maybe you need a better model. If you are burning latency budget, maybe you need more capacity.

When to spend error budget

Error budget is meant to be spent on:

  • Deploying new model versions (risky but necessary)
  • Infrastructure migrations
  • Performance experiments
  • Incident learning (some failures are acceptable if you learn from them)

If you never spend error budget, your SLOs are probably too loose.

Saturation and queueing

GPU inference has a queueing problem. Unlike CPU workloads where you can scale horizontally quickly, GPU scaling is:

  • Slower (cold starts, model loading)
  • More expensive (GPU instances cost more)
  • Less granular (you add whole GPUs, not fractional compute)

This means:

  • Queue depth is a leading indicator of latency problems
  • Autoscaling needs to trigger earlier than for CPU workloads (see the sketch after this list)
  • Backpressure strategies (rate limiting, load shedding) matter more
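
One way to scale on queue depth rather than CPU or GPU utilization is an event-driven autoscaler. The sketch below assumes KEDA is installed and that inference_queue_depth is scraped by Prometheus; the names, the Prometheus address, and the threshold are illustrative, not a recommendation for your workload:

# keda-scaledobject.yaml (sketch)
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: inference-server
spec:
  scaleTargetRef:
    name: inference-server        # Deployment serving the model
  minReplicaCount: 2              # keep warm capacity; cold starts are slow
  maxReplicaCount: 8
  cooldownPeriod: 600             # scale down slowly to avoid thrashing
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring.svc:9090
        query: sum(inference_queue_depth)
        threshold: "20"           # scale out well before the queue gets deep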

Queue depth monitoring

# Queue depth over time
inference_queue_depth

# Queue wait time
histogram_quantile(0.95, rate(inference_queue_wait_seconds_bucket[5m]))

# Correlation: queue depth growth predicts latency
# (Use for dashboards, not alerting; inference_queue_depth is a gauge, so delta())
delta(inference_queue_depth[5m]) > 10
  and rate(inference_request_duration_seconds_sum[5m]) > 0

Alerting strategy

Good alerts for inference workloads:

Signal             Warning threshold    Critical threshold
p99 latency        2x SLO target        5x SLO target
Error rate         0.5%                 2%
GPU utilization    80% sustained        95% sustained
Queue depth        2x normal            10x normal

Alert on symptoms (latency, errors) first, then investigate causes (saturation, queue depth).

Example Prometheus alerting rules

groups:
  - name: inference-slos
    rules:
      - alert: InferenceHighLatency
        expr: |
          histogram_quantile(0.99,
            rate(inference_request_duration_seconds_bucket[5m])
          ) > 1.0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: 'Inference p99 latency above 1s'
          description: 'p99 latency is {{ $value | humanizeDuration }}'

      - alert: InferenceHighErrorRate
        expr: |
          sum(rate(inference_requests_total{status=~"5.."}[5m]))
          / sum(rate(inference_requests_total[5m])) > 0.02
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: 'Inference error rate above 2%'

      - alert: InferenceQueueBacklog
        expr: inference_queue_depth > 100
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: 'Inference request queue backing up'
          description: 'Queue depth is {{ $value }}'

      - alert: GPUSaturation
        expr: avg(DCGM_FI_DEV_GPU_UTIL) > 95
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: 'GPU utilization sustained above 95%'

Defining your SLOs

Start with user expectations:

  1. What latency do users tolerate before abandoning the request?
  2. What error rate causes users to lose trust?
  3. What throughput do you need to handle peak load?

Then set SLOs slightly tighter than minimum viable:

  • If users tolerate 5s latency, set SLO at 3s p99
  • If 2% error rate causes churn, set SLO at 1%
  • If peak load is 1000 req/s, plan for 1500

This gives you buffer to catch problems before they become user-visible.
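
For the last point, a simple dashboard expression comparing the observed daily peak against planned capacity keeps that headroom visible. The 1500 req/s figure below is just the illustrative number from the list above:

# Fraction of planned capacity (1500 req/s) used at the daily peak
max_over_time(
  sum(rate(inference_requests_total[5m]))[1d:5m]
) / 1500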

Quick SLO checklist

  • Latency SLI defined (p50, p95, p99)
  • Error rate SLI defined (including soft errors)
  • Saturation metrics collected
  • SLO targets set based on user expectations
  • Error budget calculated and tracked
  • Alerting rules configured
  • Dashboard showing current SLO status

Next steps: If you need help defining SLOs and building the observability to track them, the AI Infra Readiness Audit includes SLI/SLO definition as a core deliverable.
