flexinfer deploy
Deploy a model to the FlexInfer cluster
Arguments
Model name or path (e.g., microsoft/Phi-3-mini-4k-instruct)
Options
Inference backend to use (--backend)
Number of GPUs to use (--gpu)
GPU type (--gpu-type)
GPU memory per device (e.g., 16Gi, 24Gi)
Number of replicas (--replicas)
Service port
Maximum concurrent requests (--max-concurrent)
Maximum context length
Quantization method
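These options can be combined in a single invocation; for longer commands, shell line continuations keep them readable. A minimal sketch, equivalent to the production example under Examples below:

  flexinfer deploy meta-llama/Llama-3.1-8B-Instruct \
    --backend vllm \
    --gpu 2 \
    --replicas 3 \
    --max-concurrent 128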
Examples
Basic deployment
  flexinfer deploy microsoft/Phi-3-mini-4k-instruct
Deploy Phi-3 with default settings
Production deployment
  flexinfer deploy meta-llama/Llama-3.1-8B-Instruct --backend vllm --gpu 2 --replicas 3 --max-concurrent 128
Deploy Llama 3.1 with high availability
CPU deployment
  flexinfer deploy microsoft/Phi-3-mini-4k-instruct --backend llamacpp --gpu-type cpu
Deploy using llama.cpp on CPU
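Deployments can also be scripted with an ordinary shell loop. The following is a minimal sketch, not taken from the FlexInfer docs; the model names and flag values are illustrative and only flags shown in the examples above are used:

  # Deploy each model in the list with the same backend and replica settings
  for model in microsoft/Phi-3-mini-4k-instruct meta-llama/Llama-3.1-8B-Instruct; do
    flexinfer deploy "$model" --backend vllm --gpu 1 --replicas 2
  done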