
vLLM

vLLM is a fast and easy-to-use library for LLM inference and serving.

Choose and load a model

Pick any model you want to serve. In this example we'll use Llama-3.2-3B-Instruct (~7 GiB).

Create a volume

cgc volume create -s 15 models

Put the model in the volume

The quickest way is to upload the files manually through the filebrowser GUI. You can run:

# Create the filebrowser app if it doesn't exist yet
cgc compute filebrowser create

# Mount the models volume to it
cgc volume mount models -t filebrowser

To check the app token and the URL under which it is available, run:

cgc compute list -d

When logging into the filebrowser's web interface, use the username admin and the app token as the password.

Download all the model files from Hugging Face to your machine, then drag them into the mounted models volume in the filebrowser to upload them.

note

Ensure you download the model in Hugging Face's original PyTorch format, not GGUF.
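If you prefer the command line over a browser download, one possible workflow (assuming Python and the Hugging Face CLI are installed on your local machine, and that your Hugging Face account has accepted the Llama license) is to fetch the repository locally and then upload the resulting directory into the models volume through the filebrowser:

# Install the Hugging Face CLI and log in (required for gated Llama models)
pip install -U "huggingface_hub[cli]"
huggingface-cli login

# Download the original safetensors/PyTorch weights to a local directory
huggingface-cli download meta-llama/Llama-3.2-3B-Instruct --local-dir ./Llama-3.2-3B-Instruct

After the upload, the files should sit in a Llama-3.2-3B-Instruct directory at the root of the models volume, matching the path used in the next step.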

How to run

cgc compute create -n <name> -c 4 -m 8 -g 1 -gt A5000 -v models vllm-openai -d model=/media/models/Llama-3.2-3B-Instruct

Parameters:

  • -n - instance name that will be used in URLs
  • -c - CPU cores; no more than 4 are needed in most cases
  • -m - memory in GiB; no more than 8 GiB is needed in most cases
  • -g, -gt - GPU count and type. Make sure the GPU's VRAM is sufficient for the model; Llama-3.2-3B-Instruct needs roughly 7 GiB for its weights plus headroom for the KV cache, so a 24 GB A5000 is comfortable.
  • -v - volume with models
  • -d - flag to define environment variables

Environment variables

  • model - path to the model directory (containing config.json and the weight files). With the models volume mounted, the directory uploaded earlier is available at /media/models/Llama-3.2-3B-Instruct, which is the value used above.

API usage

Your endpoint lives at:

https://<name>.<NAMESPACE>.cgc-waw-01.comtegra.cloud/

Fetch the API token:

cgc compute list -d
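
Before sending chat requests, you can confirm the server is up and check the exact model name it registered via the OpenAI-compatible /v1/models endpoint (here $API_TOKEN stands for the token returned by the command above):

curl -H "Authorization: Bearer $API_TOKEN" \
  https://<name>.<NAMESPACE>.cgc-waw-01.comtegra.cloud/v1/models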

Example call

curl -H "Authorization: Bearer $API_TOKEN" \
-H "Content-Type: application/json" \
-X POST \
-d '{
"model": "/media/models/Llama-3.2-3B-Instruct",
"messages": [
{"role":"user","content":"Hello!"}
]
}' \
https://<name>.<NAMESPACE>.cgc-waw-01.comtegra.cloud/v1/chat/completions
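
The request body accepts the standard OpenAI chat-completion parameters such as max_tokens and temperature. As a sketch, here is the same call with sampling parameters set, piping the response through jq (assuming it is installed) to print only the assistant's reply:

curl -s -H "Authorization: Bearer $API_TOKEN" \
  -H "Content-Type: application/json" \
  -X POST \
  -d '{
    "model": "/media/models/Llama-3.2-3B-Instruct",
    "messages": [
      {"role": "user", "content": "Write a haiku about GPUs."}
    ],
    "max_tokens": 128,
    "temperature": 0.7
  }' \
  https://<name>.<NAMESPACE>.cgc-waw-01.comtegra.cloud/v1/chat/completions \
  | jq -r '.choices[0].message.content'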