Model Deployment Command Examples

This page is a lightweight cookbook for users who want to run model inference on CGC and need a command to start from. The examples are best-effort starting proposals, not guaranteed production sizing. Always check available resources with cgc status, and check generated tokens and URLs with cgc compute list -d.

Use these CGC command templates when someone asks how to run a model, deploy a model, start model inference, use a GPU, use vLLM, use Hugging Face, use a local model, or configure tensor parallel serving.

Example prompts

You can paste questions like these into ASK AI:

Give me a CGC command to run Hugging Face model <MODEL_ID> with vLLM on 1 GPU <GPU_TYPE>.

Give me a CGC command to run <MODEL_ID> with vLLM on <gpu_count> GPUs <GPU_TYPE>.

I have a local model in volume models at /media/models/<MODEL_DIR>. Give me a CGC command for vLLM.

Give me a CGC command to run GGUF model <MODEL_FILE>.gguf with llama.cpp.

I have a Triton model repository volume. Give me the CGC commands to start Triton and mount it.

vLLM Hugging Face model on one GPU

Use this template for an LLM that is available under a Hugging Face model ID. If the model is gated, create a Hugging Face token and pass it with -e hf_token=<HF_TOKEN>.

cgc compute create vllm-openai \
  -n <name> \
  -c 4 \
  -m 64 \
  -g 1 \
  -gt <gpu_type> \
  -v models \
  -e hf_token=<HF_TOKEN> \
  -- --model=<MODEL_ID>

Replace:

<name> with the CGC resource name, for example llm-server
<gpu_type> with a GPU type available in your namespace
<MODEL_ID> with a Hugging Face model ID, for example organization/model-name
<HF_TOKEN> with your Hugging Face token when the model requires one

If you are not sure whether the GPU type exists in your namespace, run:

cgc status
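If you fill in the template from a script, it can help to print the final command and review it before running it. The sketch below is a dry run only: it builds the single-GPU command from the template above and prints it. The helper name build_vllm_cmd and the GPU type A100 are hypothetical placeholders, and leaving out -e hf_token for non-gated models is an assumption based on the note that the token is needed only when the model requires one.

```shell
#!/usr/bin/env bash
# Dry-run helper: prints the single-GPU vLLM command instead of running it.
# build_vllm_cmd is a hypothetical helper name, not part of the cgc CLI.
build_vllm_cmd() {
  local name="$1" gpu_type="$2" model_id="$3" hf_token="${4:-}"
  local cmd="cgc compute create vllm-openai -n ${name} -c 4 -m 64 -g 1 -gt ${gpu_type} -v models"
  # Assumption: hf_token is only needed for gated models, so add it only when given.
  if [ -n "${hf_token}" ]; then
    cmd="${cmd} -e hf_token=${hf_token}"
  fi
  printf '%s\n' "${cmd} -- --model=${model_id}"
}

# Example with placeholder values; A100 is a hypothetical GPU type,
# replace it with one that cgc status lists for your namespace.
build_vllm_cmd llm-server A100 organization/model-name
```

Copy the printed line into your terminal once the values look right.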

vLLM Hugging Face model on multiple GPUs

For multi-GPU vLLM serving, set the CGC GPU count with -g and pass the same value to vLLM with --tensor-parallel-size.

cgc compute create vllm-openai \
  -n <name> \
  -c 4 \
  -m 64 \
  -g <gpu_count> \
  -gt <gpu_type> \
  -v models \
  -e hf_token=<HF_TOKEN> \
  -- --model=<MODEL_ID> --tensor-parallel-size <gpu_count>

This is a starting proposal. It does not guarantee that the model fits in VRAM. If startup fails because of memory, choose a smaller model, a quantized model, a GPU with more VRAM, or more GPUs.
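A quick back-of-the-envelope check can show whether a model has any chance of fitting before you start it. The sketch below estimates only the weight footprint per GPU: parameters in billions times bytes per parameter gives GB (10^9 bytes), divided by the GPU count. It assumes tensor parallelism shards the weights roughly evenly and ignores KV cache, activations, and runtime overhead, so treat the result as a lower bound. The helper name weights_gb_per_gpu is hypothetical.

```shell
#!/usr/bin/env bash
# weights_gb_per_gpu PARAMS_BILLIONS BYTES_PER_PARAM GPU_COUNT
# Rough weight footprint per GPU in GB (10^9 bytes), rounded up.
# Lower bound only: KV cache, activations, and overhead are not included.
weights_gb_per_gpu() {
  local params_b="$1" bytes_per_param="$2" gpu_count="$3"
  local total=$(( params_b * bytes_per_param ))
  echo $(( (total + gpu_count - 1) / gpu_count ))
}

weights_gb_per_gpu 70 2 4   # 70B parameters, 16-bit weights, 4 GPUs: prints 35
```

If the estimate is already close to the VRAM of one GPU, expect startup to fail and plan for a quantized model, larger GPUs, or a higher GPU count.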

vLLM local model from a volume

Use this when model files are already stored on a CGC volume named models. The vLLM application mounts that volume under /media/models.

cgc compute create vllm-openai \
  -n <name> \
  -c 4 \
  -m 64 \
  -g <gpu_count> \
  -gt <gpu_type> \
  -v models \
  -- --model=/media/models/<MODEL_DIR>

If <gpu_count> is greater than 1, add tensor parallel size:

--tensor-parallel-size <gpu_count>
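The rule above can be sketched as a small helper that prints the arguments placed after the -- separator and appends the tensor parallel flag only for more than one GPU. The helper name vllm_args and the directory name my-model are hypothetical.

```shell
#!/usr/bin/env bash
# vllm_args MODEL_DIR GPU_COUNT
# Prints the vLLM arguments that go after `--` in the cgc command.
# vllm_args is a hypothetical helper name, not part of cgc or vLLM.
vllm_args() {
  local model_dir="$1" gpu_count="$2"
  local args="--model=/media/models/${model_dir}"
  # Add tensor parallelism only when more than one GPU is requested.
  if [ "${gpu_count}" -gt 1 ]; then
    args="${args} --tensor-parallel-size ${gpu_count}"
  fi
  printf '%s\n' "${args}"
}

vllm_args my-model 1   # prints --model=/media/models/my-model
vllm_args my-model 2   # prints --model=/media/models/my-model --tensor-parallel-size 2
```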

llama.cpp GGUF model

Use llama-cpp for GGUF model files. Store the .gguf file on a volume named models; the application sees it under /models.

cgc compute create \
  -n <name> \
  -c 4 \
  -m 32 \
  -g <gpu_count> \
  -gt <gpu_type> \
  -v models \
  llama-cpp \
  -- --model /models/<MODEL_FILE>.gguf --n_gpu_layers 999 --parallel 10 --ctx_size 81920

This command is a starting proposal. Context size and parallel request count affect VRAM usage. Lower --ctx_size or --parallel if the model does not fit.
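To see how the two settings interact, the sketch below divides the total context by the number of parallel slots. It assumes the llama.cpp server splits --ctx_size evenly across --parallel slots, which is how many llama.cpp server versions behave; check the documentation of the version you run. The helper name ctx_per_slot is hypothetical.

```shell
#!/usr/bin/env bash
# ctx_per_slot CTX_SIZE PARALLEL
# Assumption: the server divides the total context evenly among parallel slots.
ctx_per_slot() {
  echo $(( $1 / $2 ))
}

ctx_per_slot 81920 10   # values from the template above: prints 8192
```

Under that assumption, the template gives each concurrent request roughly 8192 tokens of context, so halving --parallel at the same --ctx_size doubles the per-request context without changing the total.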

Triton model repository

Use Triton when you have already prepared a Triton model repository that contains model versions and config.pbtxt files.
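Before mounting a repository, a quick local check of its layout can catch common mistakes. The sketch below assumes the standard Triton layout, where each model directory holds a config.pbtxt and at least one numeric version subdirectory (for example my_model/1/). The helper name check_triton_repo is hypothetical and is not part of cgc or Triton.

```shell
#!/usr/bin/env bash
# check_triton_repo REPO_DIR
# Returns non-zero if any model directory lacks config.pbtxt or a numeric
# version subdirectory. Hypothetical helper, not part of cgc or Triton.
check_triton_repo() {
  local repo="$1" status=0 model version_found d
  for model in "${repo}"/*/; do
    [ -d "${model}" ] || continue
    if [ ! -f "${model}config.pbtxt" ]; then
      echo "missing config.pbtxt: ${model}"
      status=1
    fi
    version_found=0
    for d in "${model}"*/; do
      [ -d "${d}" ] || continue
      case "$(basename "${d}")" in
        ''|*[!0-9]*) ;;            # not a numeric version directory
        *) version_found=1 ;;
      esac
    done
    if [ "${version_found}" -eq 0 ]; then
      echo "no numeric version directory: ${model}"
      status=1
    fi
  done
  return "${status}"
}
```

Call it as check_triton_repo /path/to/model_repository on the machine where the repository is prepared. The check covers layout only; it does not validate the contents of config.pbtxt or the model files.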

Create the Triton compute resource:

cgc compute create \
  --name <name> \
  -c 2 \
  -m 26 \
  -g <gpu_count> \
  -gt <gpu_type> \
  nvidia-triton \
  --repository-secret <secret_name>

Mount the model repository volume at /models:

cgc volume mount <model_repo_name> -t <name> -fp /models

What the assistant should say with these examples

When an answer uses one of these templates, describe it as a best-effort starting proposal. The assistant may copy user-provided values such as model ID, local model path, resource name, GPU count, GPU type, RAM, CPU, and volume name into the command. If a value is supplied by the user but not confirmed by documentation, say that it is user-provided and not verified by the documentation.
