SLOs for Inference: Latency, Errors, Saturation
December 29, 2025 · 4 min read
Uptime is not enough for inference workloads. A model endpoint can be "up" while returning garbage, timing out on every third request, or queuing so deeply that users give up.
Here's how to define SLOs that actually reflect user experience for GPU-based inference.
Why inference needs SLOs (not just uptime)
Traditional uptime metrics (99.9%, 99.99%) measure availability: is the service responding at all? For inference, that is necessary but not sufficient.
Users care about:
- Latency: How long until they get a response?
- Accuracy: Is the response useful?
- Throughput: Can the system handle their load?
- Consistency: Do they get the same quality every time?
An inference endpoint with 99.9% uptime but p99 latency of 30 seconds is not meeting user expectations.
SLIs, SLOs, and SLAs: The hierarchy
Before defining targets, understand the relationships:
- SLI (Service Level Indicator): what you measure, e.g. the fraction of requests answered within 500ms over the last 5 minutes
- SLO (Service Level Objective): the internal target for an SLI, e.g. 99% of requests within 500ms over a rolling 30 days
- SLA (Service Level Agreement): the external, contractual commitment, typically looser than the SLO and backed by penalties
For inference, you typically need SLIs for latency, error rate, and throughput, then set an SLO on each.
The 4 golden signals for inference
Google's SRE book defines four golden signals for monitoring. For inference, they map like this:
1. Latency
- p50: Median response time (what most users experience)
- p95: Tail latency; 1 in 20 requests is at least this slow
- p99: Worst-case experience for 1 in 100 requests
- Cold start: Time to first response after scaling up a new replica (see the sketch after the queries below)
Inference-specific considerations:
- Batch size affects latency (larger batches = higher latency per request)
- Model size affects cold start (large models take longer to load)
- Token count affects response time (longer outputs take longer)
Prometheus queries for latency SLIs:
# p50, p95, p99 latency
histogram_quantile(0.50, rate(inference_request_duration_seconds_bucket[5m]))
histogram_quantile(0.95, rate(inference_request_duration_seconds_bucket[5m]))
histogram_quantile(0.99, rate(inference_request_duration_seconds_bucket[5m]))
# Latency by model
histogram_quantile(0.99,
sum by (le, model) (rate(inference_request_duration_seconds_bucket[5m]))
)
# Time to first token (for streaming)
histogram_quantile(0.95, rate(inference_ttft_seconds_bucket[5m]))
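The cold-start bullet above has no query because request-duration histograms do not capture it. A minimal sketch, assuming a hypothetical inference_model_load_duration_seconds histogram emitted each time a replica loads model weights:
# p95 model load time (hypothetical metric; emit it from your serving layer)
histogram_quantile(0.95,
  sum by (le) (rate(inference_model_load_duration_seconds_bucket[1h]))
)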
2. Error rate
- Hard errors: 5xx responses, timeouts, OOMs
- Soft errors: Successful responses with low-quality outputs
- Retries: Hidden failures that succeed on second attempt
For ML workloads, error rate should include:
- Model inference failures
- Input validation failures
- Output format failures (malformed JSON, truncated responses)
Prometheus queries for error rate SLIs:
# Overall error rate
sum(rate(inference_requests_total{status=~"5.."}[5m]))
/ sum(rate(inference_requests_total[5m]))
# Error rate by error type
sum by (error_type) (rate(inference_errors_total[5m]))
/ scalar(sum(rate(inference_requests_total[5m])))
# Timeout rate specifically
sum(rate(inference_requests_total{status="504"}[5m]))
/ sum(rate(inference_requests_total[5m]))
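Soft errors and output-format failures will not show up as 5xx responses, so they need their own counter. A sketch, assuming a hypothetical inference_soft_errors_total counter incremented by the serving layer and labeled by reason (for example malformed_json, truncated, low_confidence):
# Soft-error rate by reason (hypothetical counter)
sum by (reason) (rate(inference_soft_errors_total[5m]))
/ scalar(sum(rate(inference_requests_total[5m])))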
3. Saturation
- GPU utilization: Percentage of compute used
- Memory pressure: GPU and system memory usage
- Queue depth: Requests waiting to be processed
High saturation is not bad if latency stays within SLO. But saturation above 80-90% often predicts latency spikes.
Prometheus queries for saturation:
# GPU utilization (requires DCGM exporter)
avg(DCGM_FI_DEV_GPU_UTIL)
# GPU memory pressure
DCGM_FI_DEV_FB_USED / DCGM_FI_DEV_FB_TOTAL
# Request queue depth
inference_queue_depth
# Pending requests (waiting for GPU)
inference_requests_pending
4. Traffic
- Request rate: Requests per second
- Token throughput: Tokens generated per second (for LLMs)
- Concurrent requests: Active inference calls
Traffic patterns inform capacity planning and help explain why other signals change.
# Request rate
sum(rate(inference_requests_total[5m]))
# Token throughput
sum(rate(inference_tokens_generated_total[5m]))
# Concurrent requests
inference_requests_in_flight
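Capacity planning usually cares about peaks rather than averages. A sketch using a Prometheus subquery to find the busiest 5-minute window in the last week:
# Peak 5-minute request rate over the last 7 days
max_over_time(sum(rate(inference_requests_total[5m]))[7d:5m])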
Example SLO specification
Here is a concrete example for an inference API:
# slo-spec.yaml
service: inference-api
owner: ml-platform-team
slos:
- name: availability
description: 'Service responds to requests'
sli:
metric: |
sum(rate(inference_requests_total{status!~"5.."}[5m]))
/ sum(rate(inference_requests_total[5m]))
target: 99.9
window: 30d
- name: latency-p99
description: '99th percentile response time'
sli:
metric: |
histogram_quantile(0.99,
rate(inference_request_duration_seconds_bucket[5m]))
target: 500ms
window: 30d
- name: latency-p50
description: 'Median response time'
sli:
metric: |
histogram_quantile(0.50,
rate(inference_request_duration_seconds_bucket[5m]))
target: 100ms
window: 30d
error_budget:
- name: availability
monthly_budget: 43.2m # 0.1% of 30 days
alert_thresholds:
- burned: 50%
action: review
- burned: 75%
action: freeze_deploys
- burned: 90%
action: incident
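Something has to compute how fast that budget is burning. A minimal sketch of the multi-window burn-rate pattern from the Google SRE Workbook, assuming the same request metrics and the 99.9% availability target (0.001 error budget); it would slot into a Prometheus rule group like the one shown later in this post:
- alert: InferenceErrorBudgetFastBurn
  # 14.4x burn rate: at this pace, the 30-day budget is gone in about 2 days
  expr: |
    (
      sum(rate(inference_requests_total{status=~"5.."}[1h]))
      / sum(rate(inference_requests_total[1h]))
    ) > (14.4 * 0.001)
    and
    (
      sum(rate(inference_requests_total{status=~"5.."}[5m]))
      / sum(rate(inference_requests_total[5m]))
    ) > (14.4 * 0.001)
  for: 2m
  labels:
    severity: critical
  annotations:
    summary: 'Error budget burning too fast to last the month'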
Error budgets for ML: accuracy vs availability
Traditional error budgets measure reliability: if your SLO is 99.9% uptime, you have an error budget of 0.1% downtime per month.
For ML, you might also track:
- Accuracy budget: Acceptable percentage of low-quality responses
- Latency budget: Acceptable percentage of slow responses (see the query after this list)
- Cost budget: Acceptable variance in cost per request
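A latency budget can be tracked with the same request-duration histogram as a good-requests ratio. A sketch, assuming the histogram has a bucket boundary at the 500ms SLO threshold:
# Fraction of requests completing within 500ms
sum(rate(inference_request_duration_seconds_bucket{le="0.5"}[5m]))
/ sum(rate(inference_request_duration_seconds_count[5m]))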
Error budget calculation
The budget is simply the allowed failure fraction multiplied by the window: a 99.9% target over 30 days allows 30 × 24 × 60 × 0.001 = 43.2 minutes of violation.
| SLO Target | Monthly Error Budget |
|---|---|
| 99.0% | 7.2 hours |
| 99.5% | 3.6 hours |
| 99.9% | 43.2 minutes |
| 99.95% | 21.6 minutes |
| 99.99% | 4.3 minutes |
These help you make tradeoffs. If you are burning accuracy budget, maybe you need a better model. If you are burning latency budget, maybe you need more capacity.
When to spend error budget
Error budget is meant to be spent on:
- Deploying new model versions (risky but necessary)
- Infrastructure migrations
- Performance experiments
- Incident learning (some failures are acceptable if you learn from them)
If you never spend error budget, your SLOs are probably too loose.
Saturation and queueing
GPU inference has a queueing problem. Unlike CPU workloads where you can scale horizontally quickly, GPU scaling is:
- Slower (cold starts, model loading)
- More expensive (GPU instances cost more)
- Less granular (you add whole GPUs, not fractional compute)
This means:
- Queue depth is a leading indicator of latency problems
- Autoscaling needs to trigger earlier than for CPU workloads (see the scaling sketch below)
- Backpressure strategies (rate limiting, load shedding) matter more
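One way to trigger earlier is to scale on queue depth instead of GPU utilization. A minimal sketch, assuming KEDA is installed and the inference_queue_depth gauge is scraped by Prometheus; the deployment name and server address are placeholders:
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: inference-api-queue-scaler
spec:
  scaleTargetRef:
    name: inference-api            # placeholder Deployment name
  minReplicaCount: 2
  maxReplicaCount: 10
  cooldownPeriod: 600              # GPUs are expensive; scale down slowly
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring.svc:9090   # placeholder
        query: sum(inference_queue_depth)
        threshold: "10"            # target queue depth per replica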
Queue depth monitoring
# Queue depth over time
inference_queue_depth
# Queue wait time
histogram_quantile(0.95, rate(inference_queue_wait_seconds_bucket[5m]))
# Correlation: rising queue depth predicts rising latency
# (use on a dashboard, not for alerting)
delta(inference_queue_depth[5m]) > 10
and increase(inference_request_duration_seconds_sum[5m]) > 0
Alerting strategy
Good alerts for inference workloads:
| Signal | Warning threshold | Critical threshold |
|---|---|---|
| p99 latency | 2x SLO target | 5x SLO target |
| Error rate | 0.5% | 2% |
| GPU utilization | 80% sustained | 95% sustained |
| Queue depth | 2x normal | 10x normal |
Alert on symptoms (latency, errors) first, then investigate causes (saturation, queue depth).
Example Prometheus alerting rules
groups:
- name: inference-slos
rules:
- alert: InferenceHighLatency
expr: |
histogram_quantile(0.99,
rate(inference_request_duration_seconds_bucket[5m])
) > 1.0
for: 5m
labels:
severity: warning
annotations:
summary: 'Inference p99 latency above 1s'
description: 'p99 latency is {{ $value | humanizeDuration }}'
- alert: InferenceHighErrorRate
expr: |
sum(rate(inference_requests_total{status=~"5.."}[5m]))
/ sum(rate(inference_requests_total[5m])) > 0.02
for: 5m
labels:
severity: critical
annotations:
summary: 'Inference error rate above 2%'
- alert: InferenceQueueBacklog
expr: inference_queue_depth > 100
for: 10m
labels:
severity: warning
annotations:
summary: 'Inference request queue backing up'
description: 'Queue depth is {{ $value }}'
- alert: GPUSaturation
expr: avg(DCGM_FI_DEV_GPU_UTIL) > 95
for: 15m
labels:
severity: warning
annotations:
summary: 'GPU utilization sustained above 95%'
Defining your SLOs
Start with user expectations:
- What latency do users tolerate before abandoning the request?
- What error rate causes users to lose trust?
- What throughput do you need to handle peak load?
Then set SLOs slightly tighter than minimum viable:
- If users tolerate 5s latency, set SLO at 3s p99
- If 2% error rate causes churn, set SLO at 1%
- If peak load is 1000 req/s, plan for 1500
This gives you buffer to catch problems before they become user-visible.
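It also helps to record each SLI once so dashboards and alerts share a single definition. A sketch of Prometheus recording rules, assuming the metric names used throughout this post:
groups:
  - name: inference-sli-recordings
    rules:
      # Availability SLI: fraction of non-5xx responses
      - record: sli:inference_availability:ratio_rate5m
        expr: |
          sum(rate(inference_requests_total{status!~"5.."}[5m]))
          / sum(rate(inference_requests_total[5m]))
      # Latency SLI: p99 over 5-minute windows
      - record: sli:inference_latency_p99:5m
        expr: |
          histogram_quantile(0.99,
            sum by (le) (rate(inference_request_duration_seconds_bucket[5m])))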
Quick SLO checklist
- Latency SLI defined (p50, p95, p99)
- Error rate SLI defined (including soft errors)
- Saturation metrics collected
- SLO targets set based on user expectations
- Error budget calculated and tracked
- Alerting rules configured
- Dashboard showing current SLO status
Next steps: If you need help defining SLOs and building the observability to track them, the AI Infra Readiness Audit includes SLI/SLO definition as a core deliverable.
Related reading:
- GPU Cost Baseline - Cost metrics that complement reliability metrics
- GPU Failure Modes - Common failures that break SLOs
- AI Infra Readiness Audit checklist - The full diagnostic framework