vLLM
vLLM is a fast and easy-to-use library for LLM inference and serving.
Choose and load a model
Pick any model you want to serve. In this example we'll use Llama-3.2-3B-Instruct (~7 GiB).
Create a volume
Please be sure to mount a volume that is large enough to accommodate all the files. If you plan to create a model repository volume, provision at least 200 GB of storage. You can expand the volume later if needed.
cgc volume create -s 200 models
Use a Hugging Face repository
If the chosen model is available on the Hugging Face Hub, you can simply pass its repository ID when creating a new vLLM instance.
Please be aware that you need to mount the previously created volume, as ephemeral storage is limited and you don't want to download the model every time you create a new vLLM instance.
For further steps, go to the How to run section.
Put the model in the volume
The quickest way is to do it manually using the filebrowser GUI. You can run:
# Create if it doesn't exist
cgc compute filebrowser create
# Mount the new volume
cgc volume mount models -t filebrowser
To check the app token and URL under which it's available:
cgc compute list -d
When logging in to the filebrowser web interface, use the username admin and the app token as the password.
Download all the model files from Hugging Face and drag them into the volume through the filebrowser interface.
Ensure you download the model in Hugging Face's original PyTorch format, not GGUF.
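If you prefer the command line to drag-and-drop for the download step, a minimal sketch using the Hugging Face CLI on your own machine (the meta-llama repository ID here is an assumption; adjust it to the model you actually want, and log in first if the model is gated):
# Download the original safetensors/PyTorch files (not GGUF) to a local folder
pip install -U "huggingface_hub[cli]"
huggingface-cli login
huggingface-cli download meta-llama/Llama-3.2-3B-Instruct --local-dir Llama-3.2-3B-Instruct
Once downloaded, drag the whole Llama-3.2-3B-Instruct folder into the models volume in the filebrowser.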
How to run
Hugging Face model
To run a Hugging Face model, you need to make sure that you are allowed to download it.
You'll need to generate a Hugging Face access token. Some models are gated, so you'll need to go to the model card page and accept the terms and conditions.
cgc compute create vllm-openai -n <name> -d hf_token=<HF_TOKEN> -d model=<MODEL_NAME> -g <gpu_count> -gt <gpu_type> -c <cpu_count> -m <ram> -v <volume_name>
To run a Bielik example: before running this command, replace the HF token with your own and accept the terms and conditions on the Bielik model card.
cgc compute create vllm-openai -n bielik-26 -d hf_token=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 -d model=speakleash/Bielik-11B-v2.6-Instruct -g 1 -gt a5000 -c 2 -m 24 -v models
CGC automatically uses your volume to download the model.
Your model
cgc compute create -n <name> -c 4 -m 8 -g 1 -gt A5000 -v models vllm-openai -d model=/media/models/Llama-3.2-3B-Instruct
Parameters
-n - instance name that will be used in URLs
-c - CPU cores, no more than 4 is needed in most cases
-m - memory, no more than 8 GiB is needed in most cases
-g, -gt - GPU count and type; be sure that the vRAM is sufficient for the model
-v - volume with models
-d - flag to define arguments
Arguments
All the arguments listed below should be passed using the -d flag, for example -d model=... -d hf_token=...
model: (Required) Path to the model directory (e.g., /media/models/Llama-3.2-3B-Instruct) or a Hugging Face repository ID (e.g., speakleash/Bielik-11B-v2.6-Instruct).
hf_token: (Optional) Your Hugging Face token. This is required if you are downloading a gated model.
trust_remote_code: (Optional) If provided, allows the execution of remote code from the model's repository. By default, remote code is not trusted.
download_dir: (Optional) Specifies the directory where models are downloaded. Defaults to /media/models/huggingface.
max_model_len: (Optional) The maximum sequence length for the model. Defaults to 4096.
tensor_parallel_size: (Optional) The number of GPUs to use for tensor parallelism, allowing you to run models that are too large for a single GPU.
max_num_batched_tokens: (Optional) The maximum number of tokens in a single batch.
vllm_use_modelscope: (Optional) Set to true to use ModelScope for model loading.
vllm_use_precompiled: (Optional) Set to true to use precompiled kernels, which can speed up startup time.
vllm_use_v1: (Optional) Specifies the version of the vLLM engine to use.
uv_torch_backend: (Optional) Defines the backend for UV Torch.
Using -d, you can pass all the arguments for the vllm serve command. A list of all possible arguments can be found here.
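As an illustration of combining several of these arguments, the command below (the instance name and resource sizes are only examples) serves the model from the volume with a longer context window split across two GPUs:
cgc compute create vllm-openai -n llama-32 -c 4 -m 16 -g 2 -gt a5000 -v models \
  -d model=/media/models/Llama-3.2-3B-Instruct \
  -d max_model_len=8192 \
  -d tensor_parallel_size=2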
API usage
Your endpoint lives at:
https://<name>.<NAMESPACE>.cgc-waw-01.comtegra.cloud/
Fetch the API token:
cgc compute list -d
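Export the token so the example below can reference it as $API_TOKEN:
export API_TOKEN=<app_token>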
Example call
curl -H "Authorization: Bearer $API_TOKEN" \
-H "Content-Type: application/json" \
-X POST \
-d '{
"model": "/media/models/Llama-3.2-3B-Instruct",
"messages": [
{"role":"user","content":"Hello!"}
]
}' \
https://<name>.<NAMESPACE>.cgc-waw-01.comtegra.cloud/v1/chat/completions
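The server exposes the OpenAI-compatible API, so you can also list the served models to confirm the exact name expected in the "model" field:
curl -H "Authorization: Bearer $API_TOKEN" \
  https://<name>.<NAMESPACE>.cgc-waw-01.comtegra.cloud/v1/models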