fi-fhir Operations Runbook
Operational procedures for running fi-fhir in production.
Service Information
| Field | Value |
|---|---|
| Service Name | fi-fhir |
| Repository | https://gitlab.flexinfer.ai/libs/fi-fhir |
| Primary Language | Go |
| Default Port | 8080 (API), 9090 (metrics) |
| Health Endpoint | /health |
| Ready Endpoint | /ready |
Quick Reference
Common Commands
# Check deployment status
kubectl -n fi-fhir get pods
kubectl -n fi-fhir describe deployment fi-fhir
# View logs
kubectl -n fi-fhir logs -f deployment/fi-fhir
# View logs with trace ID filter
kubectl -n fi-fhir logs deployment/fi-fhir | jq 'select(.trace_id == "abc123")'
# Port forward for debugging
kubectl -n fi-fhir port-forward svc/fi-fhir 8080:80
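# Check metrics (the port-forward above exposes only the API port; the metrics port 9090 must also be reachable on localhost)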
curl http://localhost:9090/metrics | grep workflow_
# Restart deployment
kubectl -n fi-fhir rollout restart deployment/fi-fhir
# Scale deployment
kubectl -n fi-fhir scale deployment/fi-fhir --replicas=5
CLI Commands
# Parse a message
./fi-fhir parse --format hl7v2 --pretty < message.hl7
# Validate workflow configuration
./fi-fhir workflow validate workflow.yaml
# Run workflow in dry-run mode
./fi-fhir workflow run --dry-run --config workflow.yaml < events.json
# Check configuration
./fi-fhir config show
./fi-fhir config validate
# View version
./fi-fhir version
Monitoring
Key Metrics
| Metric | Description | Alert Threshold |
|---|---|---|
| workflow_events_processed_total | Total events processed | N/A (counter) |
| workflow_events_in_progress | Currently processing | > 100 |
| workflow_action_duration_seconds | Action latency | p99 > 1s |
| workflow_action_errors_total | Action failures | rate > 0.01 |
| workflow_dlq_size | Dead letter queue depth | > 100 |
| workflow_circuit_breaker_state | Circuit breaker status | == 2 (open) |
| workflow_rate_limiter_rejected_total | Rate limited requests | rate > 10/s |
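For a quick manual check without Prometheus alerting, the gauge thresholds in the table can be applied to a single metrics scrape. The sketch below is an illustration, not part of fi-fhir: the function name `check_gauges` is hypothetical, it assumes samples carry no trailing timestamp, and it covers only the gauge thresholds, since the rate and p99 thresholds need more than one sample.

```shell
# check_gauges: read Prometheus text-format metrics on stdin and print one
# line per gauge sample that crosses its threshold from the Key Metrics table.
# Assumption: each sample line ends with its value (no trailing timestamp).
check_gauges() {
  awk '
    /^#/ { next }                          # skip HELP/TYPE comment lines
    {
      name = $1; sub(/[{].*/, "", name)    # strip any {label="..."} suffix
      value = $NF + 0                      # sample value is the last field
      if (name == "workflow_events_in_progress" && value > 100)
        print "ALERT " name " " value " > 100"
      if (name == "workflow_dlq_size" && value > 100)
        print "ALERT " name " " value " > 100"
      if (name == "workflow_circuit_breaker_state" && value == 2)
        print "ALERT " name " open"
    }
  '
}

# Usage (assumes the metrics port is reachable on localhost:9090):
#   curl -s http://localhost:9090/metrics | check_gauges
```

No output means none of the three gauges crossed its threshold at scrape time.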
Dashboards
- Grafana: dashboards/grafana/workflow-overview.json
- Import the dashboard via the Grafana UI or provision it via ConfigMap
Log Queries (Loki/Elasticsearch)
# Find errors in last hour
{namespace="fi-fhir"} |= "error" | json | level="error"
# Find slow actions (> 500ms)
{namespace="fi-fhir"} | json | duration_ms > 500
# Find by trace ID
{namespace="fi-fhir"} | json | trace_id="abc123"
# Find DLQ events
{namespace="fi-fhir"} |= "dlq" | json
Common Operations
Scaling
Manual scaling:
kubectl -n fi-fhir scale deployment/fi-fhir --replicas=5
Enable autoscaling:
helm upgrade fi-fhir deploy/helm/fi-fhir/ \
--set autoscaling.enabled=true \
--set autoscaling.minReplicas=2 \
--set autoscaling.maxReplicas=10 \
--reuse-values
Configuration Updates
Update workflow configuration:
# Edit configmap
kubectl -n fi-fhir edit configmap fi-fhir
# Or via Helm
helm upgrade fi-fhir deploy/helm/fi-fhir/ \
--set-file workflowConfig=new-workflow.yaml \
--reuse-values
# With Helm, pods restart automatically (checksum annotation).
# A manual configmap edit does not change the checksum, so a rollout restart may be needed.
Update environment variables:
helm upgrade fi-fhir deploy/helm/fi-fhir/ \
--set config.observability.logLevel=debug \
--reuse-values
Deployment
Rolling update:
# Update image tag
helm upgrade fi-fhir deploy/helm/fi-fhir/ \
--set image.tag=v1.2.0 \
--reuse-values
# Monitor rollout
kubectl -n fi-fhir rollout status deployment/fi-fhir
Rollback:
# View history
kubectl -n fi-fhir rollout history deployment/fi-fhir
# Rollback to previous
kubectl -n fi-fhir rollout undo deployment/fi-fhir
# Rollback to specific revision
kubectl -n fi-fhir rollout undo deployment/fi-fhir --to-revision=3
# Helm rollback
helm rollback fi-fhir 1
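After a rolling update or rollback, it can help to confirm that the service actually answers on /ready instead of relying on rollout status alone. The following is a minimal sketch under stated assumptions: `wait_ready` is a hypothetical helper, and the URL assumes the API port is forwarded to localhost:8080 as in Quick Reference.

```shell
# wait_ready: poll a readiness URL until it returns an HTTP success status
# or the attempt budget runs out. Returns 0 when ready, 1 otherwise.
#   $1 = URL, $2 = max attempts (default 30), $3 = delay in seconds (default 2)
wait_ready() {
  url=$1
  attempts=${2:-30}
  delay=${3:-2}
  i=1
  while [ "$i" -le "$attempts" ]; do
    # -f makes curl exit non-zero on HTTP errors such as 503
    if curl -fsS -o /dev/null "$url" 2>/dev/null; then
      echo "ready after $i attempt(s)"
      return 0
    fi
    sleep "$delay"
    i=$((i + 1))
  done
  echo "not ready after $attempts attempts" >&2
  return 1
}

# Usage (with the API port-forward active):
#   wait_ready http://localhost:8080/ready
```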
Incident Response
Severity Levels
| Level | Description | Response Time | Examples |
|---|---|---|---|
| P1 | Complete outage | Immediate | All pods down, no events processing |
| P2 | Degraded service | 15 minutes | High error rate, circuit breaker open |
| P3 | Minor issue | 1 hour | Elevated latency, DLQ growing |
| P4 | Low impact | Next business day | Single failed event, log warnings |
Triage Steps
- Check service health:
kubectl -n fi-fhir get pods
kubectl -n fi-fhir describe pod <pod-name>
- Check recent logs:
kubectl -n fi-fhir logs deployment/fi-fhir --since=10m | tail -100
- Check metrics:
curl -s http://localhost:9090/metrics | grep -E "workflow_(action_errors|dlq|circuit)"
- Check dependencies:
# Database connectivity
kubectl -n fi-fhir exec deployment/fi-fhir -- nc -zv postgres 5432
# FHIR server connectivity
kubectl -n fi-fhir exec deployment/fi-fhir -- nc -zv fhir-server 443
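The dependency checks in the last triage step can be wrapped in one helper that probes every dependency and summarizes the result. This is a sketch only: `check_deps` is a hypothetical name, and it assumes `nc` is available in the fi-fhir container, as the commands above already do.

```shell
# check_deps: probe each host:port pair from inside a fi-fhir pod.
# Prints OK/FAIL per target and returns non-zero if any probe fails.
check_deps() {
  failed=0
  for target in "$@"; do
    host=${target%:*}      # text before the last colon
    port=${target##*:}     # text after the last colon
    if kubectl -n fi-fhir exec deployment/fi-fhir -- nc -z "$host" "$port" >/dev/null 2>&1; then
      echo "OK   $target"
    else
      echo "FAIL $target"
      failed=1
    fi
  done
  return "$failed"
}

# Usage:
#   check_deps postgres:5432 fhir-server:443
```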
Troubleshooting
Pod Not Starting
Symptoms: Pod in CrashLoopBackOff or Error state
Check:
kubectl -n fi-fhir describe pod <pod-name>
kubectl -n fi-fhir logs <pod-name> --previous
Common causes:
- Invalid configuration: Check config validate output
- Missing secrets: Verify secret exists and has required keys
- Resource limits: Check if OOMKilled
Resolution:
# Fix configuration
./fi-fhir config validate
# Check secrets
kubectl -n fi-fhir get secret fi-fhir -o yaml
# Increase resources
helm upgrade fi-fhir deploy/helm/fi-fhir/ \
--set resources.limits.memory=1Gi \
--reuse-values
High Error Rate
Symptoms: FiFhirHighErrorRate alert firing
Check:
# Find error patterns
kubectl -n fi-fhir logs deployment/fi-fhir | jq 'select(.level=="error")' | head -20
# Check specific action errors
curl -s http://localhost:9090/metrics | grep workflow_action_errors
Common causes:
- External service down (FHIR server, database)
- Authentication expired (OAuth token)
- Rate limiting by external service
Resolution:
# Check circuit breaker state
curl -s http://localhost:9090/metrics | grep circuit_breaker
# If circuit breaker is open, wait for half-open or force reset
kubectl -n fi-fhir delete pod -l app.kubernetes.io/name=fi-fhir
# Check OAuth token
kubectl -n fi-fhir logs deployment/fi-fhir | grep -i oauth
Dead Letter Queue Growing
Symptoms: FiFhirDLQBacklog alert firing
Check:
# Check DLQ size
curl -s http://localhost:9090/metrics | grep workflow_dlq_size
# View DLQ entries in logs
kubectl -n fi-fhir logs deployment/fi-fhir | jq 'select(.message | contains("dlq"))'
Resolution:
# Investigate root cause first
# Then replay when issue is resolved
./fi-fhir workflow replay --dlq --since 24h --dry-run # Preview
./fi-fhir workflow replay --dlq --since 24h # Execute
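To make the preview step hard to skip, the two replay commands can be tied together so the real replay runs only after a successful dry run and an explicit confirmation. The wrapper below is a sketch: `replay_dlq`, the `--yes` argument, and the `FI_FHIR_BIN` variable are hypothetical conveniences of this example, not fi-fhir features. Only the `workflow replay` flags come from the commands above.

```shell
# replay_dlq: always run the dry-run preview first; run the real replay only
# when the caller passes "--yes" as the second argument.
#   $1 = --since window (default 24h), $2 = "--yes" to execute
# FI_FHIR_BIN is a hypothetical override for the binary path.
replay_dlq() {
  since=${1:-24h}
  confirm=${2:-}
  bin=${FI_FHIR_BIN:-./fi-fhir}
  # Stop here if the preview itself fails
  "$bin" workflow replay --dlq --since "$since" --dry-run || return 1
  if [ "$confirm" = "--yes" ]; then
    "$bin" workflow replay --dlq --since "$since"
  else
    echo "Preview only. Re-run with --yes to replay." >&2
  fi
}

# Usage:
#   replay_dlq 24h          # preview
#   replay_dlq 24h --yes    # preview, then execute
```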
High Latency
Symptoms: FiFhirHighLatency alert firing, p99 > 1s
Check:
# Check latency metrics
curl -s http://localhost:9090/metrics | grep workflow_action_duration
# Find slow requests in logs
kubectl -n fi-fhir logs deployment/fi-fhir | jq 'select(.duration_ms > 500)'
Common causes:
- Database slow queries
- External service latency
- Insufficient resources
Resolution:
# Scale up
kubectl -n fi-fhir scale deployment/fi-fhir --replicas=5
# Check database
kubectl -n fi-fhir exec deployment/fi-fhir -- \
psql -c "SELECT * FROM pg_stat_activity WHERE state='active'"
# Increase connection pool
helm upgrade fi-fhir deploy/helm/fi-fhir/ \
--set config.database.maxOpenConns=50 \
--reuse-values
Circuit Breaker Open
Symptoms: FiFhirCircuitBreakerOpen alert firing
Check:
curl -s http://localhost:9090/metrics | grep circuit_breaker_state
# 0 = closed, 1 = half-open, 2 = open
Resolution:
- Identify failing external service from logs
- Verify external service is healthy
- Wait for circuit breaker to transition to half-open
- If urgent, restart pods to reset circuit breaker state
# Force circuit breaker reset
kubectl -n fi-fhir rollout restart deployment/fi-fhir
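When several breakers are reported, translating the numeric state into its name makes the scrape easier to read during an incident. A small sketch, assuming the 0/1/2 encoding listed above and samples with no trailing timestamp; `breaker_states` is a hypothetical helper name.

```shell
# breaker_states: read Prometheus text-format metrics on stdin and print each
# workflow_circuit_breaker_state sample with its state name.
breaker_states() {
  awk '
    /^workflow_circuit_breaker_state/ {
      state = $NF + 0                      # sample value is the last field
      if (state == 0) label = "closed"
      else if (state == 1) label = "half-open"
      else if (state == 2) label = "open"
      else label = "unknown"
      print $1 " " label
    }
  '
}

# Usage:
#   curl -s http://localhost:9090/metrics | breaker_states
```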
Memory Issues (OOMKilled)
Symptoms: Pod restarts with reason OOMKilled
Check:
kubectl -n fi-fhir describe pod <pod-name> | grep -A5 "Last State"
kubectl top pods -n fi-fhir
Resolution:
# Increase memory limits
helm upgrade fi-fhir deploy/helm/fi-fhir/ \
--set resources.limits.memory=1Gi \
--set resources.requests.memory=512Mi \
--reuse-values
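Before raising limits, it is worth seeing which pods are close to the current limit. The sketch below filters `kubectl top pods` output by a memory threshold; `pods_over_mem` is a hypothetical helper, and it assumes the MEMORY column is reported in Mi (for example 480Mi), which should be confirmed against the cluster's actual output.

```shell
# pods_over_mem: read `kubectl top pods` output on stdin and print pods whose
# memory usage is at or above the threshold given in Mi as $1.
# Assumption: columns are NAME, CPU, MEMORY and memory values end in "Mi".
pods_over_mem() {
  awk -v limit="$1" '
    NR == 1 { next }                       # skip the header row
    $3 ~ /Mi$/ {
      used = $3; sub(/Mi$/, "", used)      # drop the unit suffix
      if (used + 0 >= limit + 0) print $1 " " used "Mi"
    }
  '
}

# Usage (820 is roughly 80% of a 1Gi limit):
#   kubectl top pods -n fi-fhir | pods_over_mem 820
```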
Maintenance
Certificate Rotation
# Check certificate expiry
kubectl -n fi-fhir get certificate fi-fhir-tls -o yaml | grep -A5 status
# Force renewal (cert-manager)
kubectl -n fi-fhir delete certificate fi-fhir-tls
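# cert-manager issues a new certificate once the Certificate resource exists again
# (recreated by ingress-shim or the Helm chart if one of them owns it; otherwise re-apply the manifest)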
Secret Rotation
# Update database password
kubectl -n fi-fhir create secret generic fi-fhir-new \
--from-literal=database-password=newpassword
# Update deployment to use new secret
# Then delete old secret after verification
Database Maintenance
# Vacuum and analyze
kubectl -n fi-fhir exec deployment/fi-fhir -- \
psql -c "VACUUM ANALYZE workflow_events;"
# Check table sizes
kubectl -n fi-fhir exec deployment/fi-fhir -- \
psql -c "SELECT relname, pg_size_pretty(pg_total_relation_size(relid)) FROM pg_stat_user_tables ORDER BY pg_total_relation_size(relid) DESC;"
Log Rotation
Logs are managed by Kubernetes. For long-term retention:
- Configure log aggregation (Loki, Elasticsearch)
- Set retention policies appropriate for HIPAA (typically 6 years)
Emergency Procedures
Complete Service Outage
- Verify outage scope:
kubectl -n fi-fhir get all
- Check cluster health:
kubectl get nodes
kubectl get events --all-namespaces --sort-by='.lastTimestamp' | tail -20
- Attempt restart:
kubectl -n fi-fhir rollout restart deployment/fi-fhir
- If restart fails, redeploy:
helm upgrade fi-fhir deploy/helm/fi-fhir/ -f production-values.yaml
- If namespace is corrupted:
kubectl delete namespace fi-fhir
helm install fi-fhir deploy/helm/fi-fhir/ -f production-values.yaml -n fi-fhir --create-namespace
Data Recovery
See PRODUCTION-HARDENING.md for backup/restore procedures.
Rollback Bad Release
# Identify last good release
helm history fi-fhir
# Rollback
helm rollback fi-fhir <revision>
# Verify
kubectl -n fi-fhir rollout status deployment/fi-fhir
Contact Information
| Role | Contact |
|---|---|
| On-call Engineer | PagerDuty: fi-fhir-oncall |
| Team Lead | @team-lead |
| Security | [email protected] |
| Database Admin | [email protected] |
Appendix
Environment Variables
Run fi-fhir config env for the complete list:
./fi-fhir config env
Helm Values Reference
# View all configurable values
helm show values deploy/helm/fi-fhir/
API Endpoints
| Endpoint | Method | Description |
|---|---|---|
| /health | GET | Liveness check |
| /ready | GET | Readiness check |
| /metrics | GET | Prometheus metrics |
| /api/v1/parse | POST | Parse message |
| /api/v1/workflow | POST | Process event |