
flexinfer deploy

Deploy a model to the FlexInfer cluster

Arguments

Model name or path (e.g., microsoft/Phi-3-mini-4k-instruct)

Options

--backend
Inference backend to use

--gpu
Number of GPUs to use

--gpu-type
GPU type

GPU memory per device (e.g., 16Gi, 24Gi)

--replicas
Number of replicas

Service port

--max-concurrent
Maximum concurrent requests

Maximum context length

Quantization method

Examples

Basic deployment
flexinfer deploy microsoft/Phi-3-mini-4k-instruct

Deploy Phi-3 with default settings

Production deployment
flexinfer deploy meta-llama/Llama-3.1-8B-Instruct --backend vllm --gpu 2 --replicas 3 --max-concurrent 128

Deploy Llama 3.1 with high availability

CPU deployment
flexinfer deploy microsoft/Phi-3-mini-4k-instruct --backend llamacpp --gpu-type cpu

Deploy using llama.cpp on CPU
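
The GPU memory, service port, context length, and quantization options follow the same pattern as the flags shown above. The sketch below is illustrative only: the flag names --gpu-memory, --port, --max-context, and --quantization, and the value awq, do not appear in the examples above and are assumptions to be checked against the CLI's help output.

Quantized deployment (illustrative sketch)
# Assumed flag names: --gpu-memory, --port, --max-context, --quantization (not confirmed)
flexinfer deploy meta-llama/Llama-3.1-8B-Instruct --backend vllm --gpu 1 --gpu-memory 24Gi --port 8080 --max-context 8192 --quantization awq

Deploy a quantized Llama 3.1 on a single 24Gi GPU with a reduced context window (flag names assumed, see note above)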
