Model Deployment Command Examples

This page is a lightweight cookbook for users who want to run model inference on CGC and need a command to start from. The examples are best-effort starting proposals, not guaranteed production sizing. Always check available resources with cgc status, and check generated tokens and URLs with cgc compute list -d.

Use these CGC command templates when someone asks how to run a model, deploy a model, start model inference, use a GPU, use vLLM, use Hugging Face, use a local model, or configure tensor-parallel serving.

Example prompts

You can paste questions like these into ASK AI:

Give me a CGC command to run Hugging Face model <MODEL_ID> with vLLM on 1 GPU <GPU_TYPE>.
Give me a CGC command to run <MODEL_ID> with vLLM on <gpu_count> GPUs <GPU_TYPE>.
I have a local model in volume models at /media/models/<MODEL_DIR>. Give me a CGC command for vLLM.
Give me a CGC command to run GGUF model <MODEL_FILE>.gguf with llama.cpp.
I have a Triton model repository volume. Give me the CGC commands to start Triton and mount it.

vLLM Hugging Face model on one GPU

Use this template for an LLM available as a Hugging Face model ID. If the model is gated, create a Hugging Face token and pass it with -e hf_token=<HF_TOKEN>.

cgc compute create vllm-openai \
-n <name> \
-c 4 \
-m 64 \
-g 1 \
-gt <gpu_type> \
-v models \
-e hf_token=<HF_TOKEN> \
-- --model=<MODEL_ID>

Replace:

  • <name> with the CGC resource name, for example llm-server
  • <gpu_type> with a GPU type available in your namespace
  • <MODEL_ID> with a Hugging Face model ID, for example organization/model-name
  • <HF_TOKEN> with your Hugging Face token when the model requires one

If you are not sure whether the GPU type exists in your namespace, run:

cgc status

vLLM Hugging Face model on multiple GPUs

For multi-GPU vLLM serving, set the CGC GPU count with -g and pass the same value to vLLM's --tensor-parallel-size option.

cgc compute create vllm-openai \
-n <name> \
-c 4 \
-m 64 \
-g <gpu_count> \
-gt <gpu_type> \
-v models \
-e hf_token=<HF_TOKEN> \
-- --model=<MODEL_ID> --tensor-parallel-size <gpu_count>

This is a starting proposal. It does not guarantee that the model fits in VRAM. If startup fails because of memory, choose a smaller model, a quantized model, a GPU with more VRAM, or more GPUs.
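
To keep the CGC GPU count and the vLLM tensor parallel size in sync, the command can be assembled from a single variable. The sketch below is a dry run: build_vllm_cmd is a hypothetical helper, not part of the CGC CLI, and it only prints the command for review instead of running it. The example values (llm-server, organization/model-name, 2 GPUs) are illustrative, not verified sizing.

```shell
#!/bin/sh
# Hypothetical helper (not part of CGC): prints the multi-GPU vLLM command.
# Arguments: <name> <gpu_count> <gpu_type> <MODEL_ID>
build_vllm_cmd() {
    name=$1
    gpu_count=$2
    gpu_type=$3
    model_id=$4
    # gpu_count is used twice: once for -g and once for --tensor-parallel-size.
    printf 'cgc compute create vllm-openai -n %s -c 4 -m 64 -g %s -gt %s -v models -e hf_token=<HF_TOKEN> -- --model=%s --tensor-parallel-size %s\n' \
        "$name" "$gpu_count" "$gpu_type" "$model_id" "$gpu_count"
}

# Dry run: print the command, then replace the remaining placeholders before running it.
build_vllm_cmd llm-server 2 "<gpu_type>" organization/model-name
```

Because both values come from the same argument, changing the GPU count in one place cannot leave the tensor parallel size out of date.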

vLLM local model from a volume

Use this when model files are already stored on a CGC volume named models. The vLLM application mounts that volume under /media/models.

cgc compute create vllm-openai \
-n <name> \
-c 4 \
-m 64 \
-g <gpu_count> \
-gt <gpu_type> \
-v models \
-- --model=/media/models/<MODEL_DIR>

If <gpu_count> is greater than 1, append the tensor parallel size to the vLLM arguments after --:

--tensor-parallel-size <gpu_count>
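
The rule above can be sketched as a small shell conditional. tensor_parallel_args is a hypothetical helper, not part of the CGC CLI, and GPU_COUNT=2 is only an example value. The script prints the resulting command for review and does not run it.

```shell
#!/bin/sh
# Hypothetical helper (not part of CGC): prints the extra vLLM argument
# only when more than one GPU is requested, and nothing otherwise.
tensor_parallel_args() {
    gpu_count=$1
    if [ "$gpu_count" -gt 1 ]; then
        printf '%s %s' '--tensor-parallel-size' "$gpu_count"
    fi
}

GPU_COUNT=2  # example value; replace with your GPU count

# Dry run: print the local-model command with the optional argument appended.
echo "cgc compute create vllm-openai -n <name> -c 4 -m 64 -g $GPU_COUNT -gt <gpu_type> -v models -- --model=/media/models/<MODEL_DIR> $(tensor_parallel_args "$GPU_COUNT")"
```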

llama.cpp GGUF model

Use llama-cpp for GGUF model files. Store the .gguf file on a volume named models; the application sees it under /models.

cgc compute create \
-n <name> \
-c 4 \
-m 32 \
-g <gpu_count> \
-gt <gpu_type> \
-v models \
llama-cpp \
-- --model /models/<MODEL_FILE>.gguf --n_gpu_layers 999 --parallel 10 --ctx_size 81920

This command is a starting proposal. Context size and parallel request count affect VRAM usage. Lower --ctx_size or --parallel if the model does not fit.
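
As a rough guide to how the two values interact, the arithmetic below assumes that the llama.cpp server divides the total context evenly across its parallel request slots. That is an assumption about the server, and the behaviour may differ between llama.cpp versions, so treat the result as an estimate rather than a guarantee.

```shell
#!/bin/sh
# Values taken from the command template above.
CTX_SIZE=81920
PARALLEL=10

# Assumption: the total context is split evenly across parallel slots.
PER_SLOT=$((CTX_SIZE / PARALLEL))

echo "Approximate context per request slot: $PER_SLOT tokens"
```

Under that assumption, halving --parallel while keeping --ctx_size doubles the context available to each request, and lowering --ctx_size reduces the memory reserved for context overall.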

Triton model repository

Use Triton when you have already prepared a Triton model repository with model versions and config.pbtxt files.

Create the Triton compute resource:

cgc compute create \
--name <name> \
-c 2 \
-m 26 \
-g <gpu_count> \
-gt <gpu_type> \
nvidia-triton \
--repository-secret <secret_name>

Mount the model repository volume at /models:

cgc volume mount <model_repo_name> -t <name> -fp /models

What the assistant should say with these examples

When an answer uses one of these templates, describe it as a best-effort starting proposal. The assistant may copy user-provided values such as model ID, local model path, resource name, GPU count, GPU type, RAM, CPU, and volume name into the command. If a value is supplied by the user but not confirmed by documentation, say that it is user-provided and not verified by the documentation.