
flexinfer deploy

Deploy a model to the FlexInfer cluster

Arguments

Model name or path (e.g., microsoft/Phi-3-mini-4k-instruct)

Options

--backend
Inference backend to use

--gpu
Number of GPUs to use

--gpu-type
GPU type

GPU memory per device (e.g., 16Gi, 24Gi)

--replicas
Number of replicas

Service port

--max-concurrent
Maximum concurrent requests

Maximum context length

Quantization method

Examples

Basic deployment
flexinfer deploy microsoft/Phi-3-mini-4k-instruct

Deploy Phi-3 with default settings

Production deployment
flexinfer deploy meta-llama/Llama-3.1-8B-Instruct --backend vllm --gpu 2 --replicas 3 --max-concurrent 128

Deploy Llama 3.1 with high availability

CPU deployment
flexinfer deploy microsoft/Phi-3-mini-4k-instruct --backend llamacpp --gpu-type cpu

Deploy using llama.cpp on CPU
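
The GPU memory, service port, context length, and quantization options follow the same pattern as the flags shown above. The sketch below is illustrative only: the flag names --gpu-memory, --port, --max-context, and --quantization, and the value awq, do not appear in the examples above and are assumptions to be checked against the CLI's help output.

Quantized deployment (illustrative sketch)
# Assumed flag names: --gpu-memory, --port, --max-context, --quantization (not confirmed)
flexinfer deploy meta-llama/Llama-3.1-8B-Instruct --backend vllm --gpu 1 --gpu-memory 24Gi --port 8080 --max-context 8192 --quantization awq

Deploy a quantized Llama 3.1 on a single 24Gi GPU with a reduced context window (flag names assumed, see note above)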
