vLLM
vLLM is a fast and easy-to-use library for LLM inference and serving.
Choose and load a model
Pick any model you want to serve. In this example we'll use Llama-3.2-3B-Instruct (~7 GiB).
Create a volume
Please be sure to mount a volume that is large enough to accommodate all the files. If you plan to create a model repository volume, provision at least 200 GB of storage. You can expand the volume later if needed.
cgc volume create -s 200 models
Use a Hugging Face repository
If the chosen model is available on the Hugging Face Hub, you can simply pass its repository ID when creating a new vLLM instance.
Please be aware that you need to mount the previously created volume, as ephemeral storage is limited and you don't want to download the model every time you create a new vLLM instance.
For further steps, go to the How to run section.
Put the model in the volume
The quickest way is to do it manually using the filebrowser GUI. You can run:
# Create if it doesn't exist
cgc compute filebrowser create
# Mount the new volume
cgc volume mount models -t filebrowser
To check the app token and URL under which it's available:
cgc compute list -d
When logging in to the filebrowser web interface, use the username admin and the app token as the password.
Download all the model files from Hugging Face and drag them into the volume through the filebrowser interface.
Ensure you download the model in Hugging Face's original PyTorch format, not GGUF.
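If you prefer the command line to drag-and-drop for the download step, a minimal sketch using the Hugging Face CLI on your own machine (the meta-llama repository ID here is an assumption; adjust it to the model you actually want, and log in first if the model is gated):
# Download the original safetensors/PyTorch files (not GGUF) to a local folder
pip install -U "huggingface_hub[cli]"
huggingface-cli login
huggingface-cli download meta-llama/Llama-3.2-3B-Instruct --local-dir Llama-3.2-3B-Instruct
Once downloaded, drag the whole Llama-3.2-3B-Instruct folder into the models volume in the filebrowser.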
How to run
Hugging Face model
To run a Hugging Face model, you need to make sure that you are allowed to download it.
You'll need to generate a Hugging Face access token. Some models are gated, so you'll need to go to the model card page and accept the terms and conditions.
cgc compute create vllm-openai -n <name> -d hf_token=<HF_TOKEN> -d model=<MODEL_NAME> -g <gpu_count> -gt <gpu_type> -c <cpu_count> -m <ram> -v <volume_name>
To run a Bielik example: before running this command, replace the HF token with your own and accept the terms and conditions on the Bielik model card.
cgc compute create vllm-openai -n bielik-26 -d hf_token=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 -d model=speakleash/Bielik-11B-v2.6-Instruct -g 1 -gt a5000 -c 2 -m 24 -v models
CGC automatically uses your volume to download the model.
Your model
cgc compute create -n <name> -c 4 -m 8 -g 1 -gt A5000 -v models vllm-openai -d model=/media/models/Llama-3.2-3B-Instruct
Parameters
-n - instance name that will be used in URLs
-c - CPU cores, no more than 4 is needed in most cases
-m - memory, no more than 8 GiB is needed in most cases
-g, -gt - GPU count and type; be sure that the vRAM is sufficient for the model
-v - volume with models
-d - flag to define arguments
Arguments
All the arguments listed below should be passed using the -d flag, for example -d model=... -d hf_token=...
model: (Required) Path to the model directory (e.g., /media/models/Llama-3.2-3B-Instruct) or a Hugging Face repository ID (e.g., speakleash/Bielik-11B-v2.6-Instruct).
hf_token: (Optional) Your Hugging Face token. This is required if you are downloading a gated model.
trust_remote_code: (Optional) If provided, allows the execution of remote code from the model's repository. By default, remote code is not trusted.
download_dir: (Optional) Specifies the directory where models are downloaded. Defaults to /media/models/huggingface.
max_model_len: (Optional) The maximum sequence length for the model. Defaults to 4096.
tensor_parallel_size: (Optional) The number of GPUs to use for tensor parallelism, allowing you to run models that are too large for a single GPU.
max_num_batched_tokens: (Optional) The maximum number of tokens in a single batch.
vllm_use_modelscope: (Optional) Set to true to use ModelScope for model loading.
vllm_use_precompiled: (Optional) Set to true to use precompiled kernels, which can speed up startup time.
vllm_use_v1: (Optional) Specifies the version of the vLLM engine to use.
uv_torch_backend: (Optional) Defines the backend for UV Torch.
Using -d, you can pass all the arguments for the vllm serve command. A list of all possible arguments can be found here.
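As an illustration of combining several of these arguments, the command below (the instance name and resource sizes are only examples) serves the model from the volume with a longer context window split across two GPUs:
cgc compute create vllm-openai -n llama-32 -c 4 -m 16 -g 2 -gt a5000 -v models \
  -d model=/media/models/Llama-3.2-3B-Instruct \
  -d max_model_len=8192 \
  -d tensor_parallel_size=2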
API usage
Your endpoint lives at:
https://<name>.<NAMESPACE>.cgc-waw-01.comtegra.cloud/
Fetch the API token:
cgc compute list -d
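Export the token so the example below can reference it as $API_TOKEN:
export API_TOKEN=<app_token>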
Example call
curl -H "Authorization: Bearer $API_TOKEN" \
-H "Content-Type: application/json" \
-X POST \
-d '{
"model": "/media/models/Llama-3.2-3B-Instruct",
"messages": [
{"role":"user","content":"Hello!"}
]
}' \
https://<name>.<NAMESPACE>.cgc-waw-01.comtegra.cloud/v1/chat/completions
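The server exposes the OpenAI-compatible API, so you can also list the served models to confirm the exact name expected in the "model" field:
curl -H "Authorization: Bearer $API_TOKEN" \
  https://<name>.<NAMESPACE>.cgc-waw-01.comtegra.cloud/v1/models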