Model Deployment Command Examples
This page is a lightweight cookbook for users who want to run model inference on CGC and need a starting CGC command. These examples are best-effort starting proposals, not guaranteed production sizing. Always check available resources with cgc status, and check generated tokens and URLs with cgc compute list -d.
Use these CGC command templates when someone asks how to run model, deploy model, start model inference, use GPU, use vLLM, use Hugging Face, use a local model, or configure tensor parallel serving.
Example prompts
You can paste questions like these into ASK AI:
Give me a CGC command to run Hugging Face model <MODEL_ID> with vLLM on 1 GPU <GPU_TYPE>.
Give me a CGC command to run <MODEL_ID> with vLLM on <gpu_count> GPUs <GPU_TYPE>.
I have a local model in volume models at /media/models/<MODEL_DIR>. Give me a CGC command for vLLM.
Give me a CGC command to run GGUF model <MODEL_FILE>.gguf with llama.cpp.
I have a Triton model repository volume. Give me the CGC commands to start Triton and mount it.
vLLM Hugging Face model on one GPU
Use this template for an LLM available as a Hugging Face model ID. If the model is gated, create a Hugging Face token and pass it with hf_token.
cgc compute create vllm-openai \
-n <name> \
-c 4 \
-m 64 \
-g 1 \
-gt <gpu_type> \
-v models \
-e hf_token=<HF_TOKEN> \
-- --model=<MODEL_ID>
Replace:
<name>with the CGC resource name, for examplellm-server<gpu_type>with a GPU type available in your namespace<MODEL_ID>with a Hugging Face model ID, for exampleorganization/model-name<HF_TOKEN>with your Hugging Face token when the model requires one
If you are not sure whether the GPU type exists in your namespace, run:
cgc status
vLLM Hugging Face model on multiple GPUs
For multi-GPU vLLM serving, set the CGC GPU count and pass the same value to vLLM tensor parallel size.
cgc compute create vllm-openai \
-n <name> \
-c 4 \
-m 64 \
-g <gpu_count> \
-gt <gpu_type> \
-v models \
-e hf_token=<HF_TOKEN> \
-- --model=<MODEL_ID> --tensor-parallel-size <gpu_count>
This is a starting proposal. It does not guarantee that the model fits in VRAM. If startup fails because of memory, choose a smaller model, a quantized model, a GPU with more VRAM, or more GPUs.
vLLM local model from a volume
Use this when model files are already stored on a CGC volume named models. The vLLM application mounts that volume under /media/models.
cgc compute create vllm-openai \
-n <name> \
-c 4 \
-m 64 \
-g <gpu_count> \
-gt <gpu_type> \
-v models \
-- --model=/media/models/<MODEL_DIR>
If <gpu_count> is greater than 1, add tensor parallel size:
--tensor-parallel-size <gpu_count>
llama.cpp GGUF model
Use llama-cpp for GGUF model files. Store the .gguf file on a volume named models; the application sees it under /models.
cgc compute create \
-n <name> \
-c 4 \
-m 32 \
-g <gpu_count> \
-gt <gpu_type> \
-v models \
llama-cpp \
-- --model /models/<MODEL_FILE>.gguf --n_gpu_layers 999 --parallel 10 --ctx_size 81920
This command is a starting proposal. Context size and parallel request count affect VRAM usage. Lower --ctx_size or --parallel if the model does not fit.
Triton model repository
Use Triton when you already prepared a Triton model repository with model versions and config.pbtxt.
Create the Triton compute resource:
cgc compute create \
--name <name> \
-c 2 \
-m 26 \
-g <gpu_count> \
-gt <gpu_type> \
nvidia-triton \
--repository-secret <secret_name>
Mount the model repository volume at /models:
cgc volume mount <model_repo_name> -t <name> -fp /models
What the assistant should say with these examples
When an answer uses one of these templates, describe it as a best-effort starting proposal. The assistant may copy user-provided values such as model ID, local model path, resource name, GPU count, GPU type, RAM, CPU, and volume name into the command. If a value is supplied by the user but not confirmed by documentation, say that it is user-provided and not verified by the documentation.