fi-fhir Operations Runbook

Operational procedures for running fi-fhir in production.

Service Information

| Field | Value |
|---|---|
| Service Name | fi-fhir |
| Repository | https://gitlab.flexinfer.ai/libs/fi-fhir |
| Primary Language | Go |
| Default Port | 8080 (API), 9090 (metrics) |
| Health Endpoint | /health |
| Ready Endpoint | /ready |

Quick Reference

Common Commands

# Check deployment status
kubectl -n fi-fhir get pods
kubectl -n fi-fhir describe deployment fi-fhir

# View logs
kubectl -n fi-fhir logs -f deployment/fi-fhir

# View logs with trace ID filter
kubectl -n fi-fhir logs deployment/fi-fhir | jq 'select(.trace_id == "abc123")'

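# Port forward for debugging (API, via the service)
kubectl -n fi-fhir port-forward svc/fi-fhir 8080:80

# Check metrics (the metrics port must be forwarded first, for example
#   kubectl -n fi-fhir port-forward deployment/fi-fhir 9090:9090
# assuming the container listens on the default metrics port)
curl http://localhost:9090/metrics | grep workflow_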

# Restart deployment
kubectl -n fi-fhir rollout restart deployment/fi-fhir

# Scale deployment
kubectl -n fi-fhir scale deployment/fi-fhir --replicas=5

CLI Commands

# Parse a message
./fi-fhir parse --format hl7v2 --pretty < message.hl7

# Validate workflow configuration
./fi-fhir workflow validate workflow.yaml

# Run workflow in dry-run mode
./fi-fhir workflow run --dry-run --config workflow.yaml < events.json

# Check configuration
./fi-fhir config show
./fi-fhir config validate

# View version
./fi-fhir version

Monitoring

Key Metrics

| Metric | Description | Alert Threshold |
|---|---|---|
| workflow_events_processed_total | Total events processed | N/A (counter) |
| workflow_events_in_progress | Currently processing | > 100 |
| workflow_action_duration_seconds | Action latency | p99 > 1s |
| workflow_action_errors_total | Action failures | rate > 0.01 |
| workflow_dlq_size | Dead letter queue depth | > 100 |
| workflow_circuit_breaker_state | Circuit breaker status | == 2 (open) |
| workflow_rate_limiter_rejected_total | Rate limited requests | rate > 10/s |
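
The two gauge thresholds in the table above (`workflow_events_in_progress` and `workflow_dlq_size`) can be spot-checked by hand against a scrape of the metrics endpoint. The following is a minimal sketch, not part of fi-fhir: `check_gauges` is a hypothetical helper name, and it assumes the endpoint serves the Prometheus text format without trailing timestamps.

```shell
# check_gauges (hypothetical helper): read Prometheus text-format metrics on
# stdin and print one ALERT line per gauge sample above its threshold.
# Assumption: the sample value is the last field (no trailing timestamp).
check_gauges() {
  awk '
    /^#/   { next }                  # skip HELP/TYPE comment lines
    NF < 2 { next }                  # skip blank or malformed lines
    {
      name = $0
      sub(/[{ ].*/, "", name)        # keep only the metric name
      value = $NF + 0                # force numeric interpretation
      if ((name == "workflow_events_in_progress" || name == "workflow_dlq_size") && value > 100)
        print "ALERT " $0 " (threshold: > 100)"
    }
  '
}
```

With the metrics port forwarded, `curl -s http://localhost:9090/metrics | check_gauges` prints one line per breaching series and nothing when both gauges are within limits. The rate and p99 thresholds need a time window and are not covered by this sketch.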

Dashboards

  • Grafana: dashboards/grafana/workflow-overview.json
  • Import the dashboard via the Grafana UI or provision it via a ConfigMap

Log Queries (Loki/Elasticsearch)

# Find errors (LogQL; set the query time range to the last hour)
{namespace="fi-fhir"} |= "error" | json | level="error"

# Find slow actions (> 500ms)
{namespace="fi-fhir"} | json | duration_ms > 500

# Find by trace ID
{namespace="fi-fhir"} | json | trace_id="abc123"

# Find DLQ events
{namespace="fi-fhir"} |= "dlq" | json

Common Operations

Scaling

Manual scaling:

kubectl -n fi-fhir scale deployment/fi-fhir --replicas=5

Enable autoscaling:

helm upgrade fi-fhir deploy/helm/fi-fhir/ \
  --set autoscaling.enabled=true \
  --set autoscaling.minReplicas=2 \
  --set autoscaling.maxReplicas=10 \
  --reuse-values

Configuration Updates

Update workflow configuration:

# Edit configmap
kubectl -n fi-fhir edit configmap fi-fhir

# Or via Helm
helm upgrade fi-fhir deploy/helm/fi-fhir/ \
  --set-file workflowConfig=new-workflow.yaml \
  --reuse-values

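# With the Helm path, pods restart automatically (checksum annotation).
# A manual configmap edit does not change the annotation, so a restart may be needed:
#   kubectl -n fi-fhir rollout restart deployment/fi-fhir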

Update environment variables:

helm upgrade fi-fhir deploy/helm/fi-fhir/ \
  --set config.observability.logLevel=debug \
  --reuse-values

Deployment

Rolling update:

# Update image tag
helm upgrade fi-fhir deploy/helm/fi-fhir/ \
  --set image.tag=v1.2.0 \
  --reuse-values

# Monitor rollout
kubectl -n fi-fhir rollout status deployment/fi-fhir

Rollback:

# View history
kubectl -n fi-fhir rollout history deployment/fi-fhir

# Rollback to previous
kubectl -n fi-fhir rollout undo deployment/fi-fhir

# Rollback to specific revision
kubectl -n fi-fhir rollout undo deployment/fi-fhir --to-revision=3

# Helm rollback to revision 1 (pick the revision from: helm history fi-fhir)
helm rollback fi-fhir 1

Incident Response

Severity Levels

| Level | Description | Response Time | Examples |
|---|---|---|---|
| P1 | Complete outage | Immediate | All pods down, no events processing |
| P2 | Degraded service | 15 minutes | High error rate, circuit breaker open |
| P3 | Minor issue | 1 hour | Elevated latency, DLQ growing |
| P4 | Low impact | Next business day | Single failed event, log warnings |

Triage Steps

  1. Check service health:

    kubectl -n fi-fhir get pods
    kubectl -n fi-fhir describe pod <pod-name>
  2. Check recent logs:

    kubectl -n fi-fhir logs deployment/fi-fhir --since=10m | tail -100
  3. Check metrics:

    curl -s http://localhost:9090/metrics | grep -E "workflow_(action_errors|dlq|circuit)"
  4. Check dependencies:

    # Database connectivity
    kubectl -n fi-fhir exec deployment/fi-fhir -- nc -zv postgres 5432
    
    # FHIR server connectivity
    kubectl -n fi-fhir exec deployment/fi-fhir -- nc -zv fhir-server 443
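
Step 4 can be wrapped in a small loop so that every dependency is probed in one pass. This is a sketch under assumptions: the container image ships `nc` (as the commands above already presume), `postgres:5432` and `fhir-server:443` are the right targets for your environment, and `check_dependencies` is a hypothetical helper name.

```shell
# check_dependencies (hypothetical helper): probe each dependency from inside
# the fi-fhir deployment and report OK/FAIL per target.
# Returns non-zero if any target is unreachable.
check_dependencies() {
  local rc target host port
  rc=0
  for target in postgres:5432 fhir-server:443; do
    host="${target%%:*}"             # text before the colon
    port="${target##*:}"             # text after the colon
    if kubectl -n fi-fhir exec deployment/fi-fhir -- nc -z "$host" "$port" >/dev/null 2>&1; then
      echo "OK   $target"
    else
      echo "FAIL $target"
      rc=1
    fi
  done
  return "$rc"
}
```

Calling `check_dependencies` prints one line per target, and the exit status can gate further triage steps in a script.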

Troubleshooting

Pod Not Starting

Symptoms: Pod in CrashLoopBackOff or Error state

Check:

kubectl -n fi-fhir describe pod <pod-name>
kubectl -n fi-fhir logs <pod-name> --previous

Common causes:

  • Invalid configuration: Check the output of ./fi-fhir config validate
  • Missing secrets: Verify the secret exists and has the required keys
  • Resource limits: Check whether the pod was OOMKilled

Resolution:

# Fix the configuration, then confirm it validates
./fi-fhir config validate

# Check secrets (values are base64-encoded in the output)
kubectl -n fi-fhir get secret fi-fhir -o yaml

# Increase resources
helm upgrade fi-fhir deploy/helm/fi-fhir/ \
  --set resources.limits.memory=1Gi \
  --reuse-values

High Error Rate

Symptoms: FiFhirHighErrorRate alert firing

Check:

# Find error patterns
kubectl -n fi-fhir logs deployment/fi-fhir | jq 'select(.level=="error")' | head -20

# Check specific action errors
curl -s http://localhost:9090/metrics | grep workflow_action_errors

Common causes:

  • External service down (FHIR server, database)
  • Authentication expired (OAuth token)
  • Rate limiting by external service

Resolution:

# Check circuit breaker state
curl -s http://localhost:9090/metrics | grep circuit_breaker

# If circuit breaker is open, wait for half-open or force a reset by restarting pods.
# A rolling restart avoids deleting every pod at the same time.
kubectl -n fi-fhir rollout restart deployment/fi-fhir

# Check OAuth token
kubectl -n fi-fhir logs deployment/fi-fhir | grep -i oauth

Dead Letter Queue Growing

Symptoms: FiFhirDLQBacklog alert firing

Check:

# Check DLQ size
curl -s http://localhost:9090/metrics | grep workflow_dlq_size

# View DLQ entries in logs
kubectl -n fi-fhir logs deployment/fi-fhir | jq 'select(.message | contains("dlq"))'

Resolution:

# Investigate root cause first
# Then replay when issue is resolved
./fi-fhir workflow replay --dlq --since 24h --dry-run  # Preview
./fi-fhir workflow replay --dlq --since 24h            # Execute
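
The preview-then-execute sequence above can be tied together so that the real replay never runs without a dry run and an explicit confirmation first. This is a sketch only: `replay_dlq` and the `FI_FHIR` override variable are hypothetical names, and the flags are the ones shown in the commands above.

```shell
# replay_dlq (hypothetical helper): run the dry-run replay, then ask for
# confirmation before running the real one.
# FI_FHIR is an assumed override for the binary path (default: ./fi-fhir).
replay_dlq() {
  local since bin answer
  since="${1:-24h}"
  bin="${FI_FHIR:-./fi-fhir}"
  # Stop here if the preview itself fails
  "$bin" workflow replay --dlq --since "$since" --dry-run || return 1
  printf 'Replay these events for real? [y/N] '
  read -r answer
  case "$answer" in
    y|Y) "$bin" workflow replay --dlq --since "$since" ;;
    *)   echo "Aborted; nothing replayed."; return 2 ;;
  esac
}
```

`replay_dlq 24h` mirrors the two commands above; any answer other than `y` leaves the queue untouched.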

High Latency

Symptoms: FiFhirHighLatency alert firing, p99 > 1s

Check:

# Check latency metrics
curl -s http://localhost:9090/metrics | grep workflow_action_duration

# Find slow requests in logs
kubectl -n fi-fhir logs deployment/fi-fhir | jq 'select(.duration_ms > 500)'

Common causes:

  • Database slow queries
  • External service latency
  • Insufficient resources

Resolution:

# Scale up
kubectl -n fi-fhir scale deployment/fi-fhir --replicas=5

# Check database
kubectl -n fi-fhir exec deployment/fi-fhir -- \
  psql -c "SELECT * FROM pg_stat_activity WHERE state='active'"

# Increase connection pool
helm upgrade fi-fhir deploy/helm/fi-fhir/ \
  --set config.database.maxOpenConns=50 \
  --reuse-values

Circuit Breaker Open

Symptoms: FiFhirCircuitBreakerOpen alert firing

Check:

curl -s http://localhost:9090/metrics | grep circuit_breaker_state
# 0 = closed, 1 = half-open, 2 = open
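
When several breakers are exported, reading raw 0/1/2 values is error-prone. The sketch below annotates each sample with the state name from the mapping above; `breaker_states` is a hypothetical helper name, and it assumes the sample value is the last field on the line.

```shell
# breaker_states (hypothetical helper): read Prometheus text-format metrics on
# stdin and append the state name to each workflow_circuit_breaker_state sample.
breaker_states() {
  awk '
    /^workflow_circuit_breaker_state/ {
      state = "unknown"
      if ($NF == 0)      state = "closed"
      else if ($NF == 1) state = "half-open"
      else if ($NF == 2) state = "open"
      print $0 "  # " state
    }
  '
}
```

Usage: `curl -s http://localhost:9090/metrics | breaker_states`.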

Resolution:

  1. Identify failing external service from logs
  2. Verify external service is healthy
  3. Wait for circuit breaker to transition to half-open
  4. If urgent, restart pods to reset circuit breaker state:

    # Force circuit breaker reset
    kubectl -n fi-fhir rollout restart deployment/fi-fhir

Memory Issues (OOMKilled)

Symptoms: Pod restarts with reason OOMKilled

Check:

kubectl -n fi-fhir describe pod <pod-name> | grep -A5 "Last State"
kubectl top pods -n fi-fhir

Resolution:

# Increase memory limits
helm upgrade fi-fhir deploy/helm/fi-fhir/ \
  --set resources.limits.memory=1Gi \
  --set resources.requests.memory=512Mi \
  --reuse-values

Maintenance

Certificate Rotation

# Check certificate expiry
kubectl -n fi-fhir get certificate fi-fhir-tls -o yaml | grep -A5 status

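# Force renewal (cert-manager)
# Deleting the Certificate triggers re-issuance only if something re-creates the
# resource (for example ingress-shim); otherwise re-apply its manifest afterwards.
kubectl -n fi-fhir delete certificate fi-fhir-tls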

Secret Rotation

# Update database password
kubectl -n fi-fhir create secret generic fi-fhir-new \
  --from-literal=database-password='<new-password>'

# Update deployment to use new secret
# Then delete old secret after verification

Database Maintenance

# Vacuum and analyze
kubectl -n fi-fhir exec deployment/fi-fhir -- \
  psql -c "VACUUM ANALYZE workflow_events;"

# Check table sizes
kubectl -n fi-fhir exec deployment/fi-fhir -- \
  psql -c "SELECT relname, pg_size_pretty(pg_total_relation_size(relid)) FROM pg_stat_user_tables ORDER BY pg_total_relation_size(relid) DESC;"

Log Rotation

Logs are managed by Kubernetes. For long-term retention:

  • Configure log aggregation (Loki, Elasticsearch)
  • Set retention policies appropriate for HIPAA (typically 6 years)

Emergency Procedures

Complete Service Outage

  1. Verify outage scope:

    kubectl -n fi-fhir get all
  2. Check cluster health:

    kubectl get nodes
    kubectl get events --all-namespaces --sort-by='.lastTimestamp' | tail -20
  3. Attempt restart:

    kubectl -n fi-fhir rollout restart deployment/fi-fhir
  4. If restart fails, redeploy:

    helm upgrade fi-fhir deploy/helm/fi-fhir/ -f production-values.yaml
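  5. If the namespace is corrupted (last resort: this deletes every resource in it, including secrets and persistent volume claims):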

    kubectl delete namespace fi-fhir
    helm install fi-fhir deploy/helm/fi-fhir/ -f production-values.yaml -n fi-fhir --create-namespace

Data Recovery

See PRODUCTION-HARDENING.md for backup/restore procedures.

Rollback Bad Release

# Identify last good release
helm history fi-fhir

# Rollback
helm rollback fi-fhir <revision>

# Verify
kubectl -n fi-fhir rollout status deployment/fi-fhir

Contact Information

| Role | Contact |
|---|---|
| On-call Engineer | PagerDuty: fi-fhir-oncall |
| Team Lead | @team-lead |
| Security | [email protected] |
| Database Admin | [email protected] |

Appendix

Environment Variables

Run fi-fhir config env for the complete list:

./fi-fhir config env

Helm Values Reference

# View all configurable values
helm show values deploy/helm/fi-fhir/

API Endpoints

| Endpoint | Method | Description |
|---|---|---|
| /health | GET | Liveness check |
| /ready | GET | Readiness check |
| /metrics | GET | Prometheus metrics |
| /api/v1/parse | POST | Parse message |
| /api/v1/workflow | POST | Process event |
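
The liveness and readiness endpoints can be probed together after a restart or rollback. This is a minimal sketch: `probe` and `check_endpoints` are hypothetical helper names, the API port is assumed to be forwarded to localhost:8080, and a healthy endpoint is assumed to answer with HTTP 200 (the runbook does not state the status codes).

```shell
# probe (hypothetical helper): print the HTTP status code for one URL.
# curl prints 000 when the connection itself fails.
probe() {
  curl -s -o /dev/null -w '%{http_code}' --max-time 5 "$1"
}

# check_endpoints (hypothetical helper): probe /health and /ready on a base URL.
# Assumption: healthy endpoints return 200. Returns non-zero otherwise.
check_endpoints() {
  local base rc path code
  base="${1:-http://localhost:8080}"
  rc=0
  for path in /health /ready; do
    code="$(probe "${base}${path}")"
    echo "${path} ${code}"
    [ "$code" = "200" ] || rc=1
  done
  return "$rc"
}
```

`check_endpoints` prints one line per endpoint; a non-zero exit status means at least one of them did not return 200.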