Llama.cpp
Llama.cpp is the go-to solution for performant large language model inference. In this guide we'll go through the steps necessary to deploy an OpenAI-compatible service using Comtegra GPU Cloud.
Choose and load a model
First, choose a model to use for text completion. Model choice is out of scope for this guide, so we'll use Meta-Llama-3.1-8B-Instruct-Q8_0 as an example.
Take note of the amount of disk space required for the chosen model. Our example model will fit in a 10 GiB volume. Let's create it.
cgc volume create -s 10 llms
We'll now load the model file into the new volume. A quick way to do that is to download it straight to the volume from a File Browser instance that we'll SSH into.
# Create if it doesn't exist
cgc compute filebrowser create
# Mount the new volume
cgc volume mount -t filebrowser llms
Make sure your SSH Access is configured and connect to filebrowser.
ssh -t cgc@cgc-api.comtegra.cloud -p 2222 filebrowser
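Once connected, you can check that the llms volume is mounted; it should appear under /srv, the path used in the next step.
# The llms volume should be listed in the output
ls /srv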
Change directory to the root directory of the new volume and download the model.
cd /srv/llms
wget 'https://huggingface.co/bartowski/Meta-Llama-3.1-8B-Instruct-GGUF/resolve/main/Meta-Llama-3.1-8B-Instruct-Q8_0.gguf'
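You can verify that the download completed by checking the file size; the Q8_0 quantization of this 8B model is on the order of 8-9 GB.
# The .gguf file should be roughly 8-9 GB
ls -lh /srv/llms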
Run the server
Now, run the OpenAI-compatible server.
cgc compute create -c 4 -m 8 -gt A5000 -g 1 -v llms -n llm-server llama-cpp -d model=Meta-Llama-3.1-8B-Instruct-Q8_0.gguf -d n_gpu_layers=999 -d parallel=10 -d ctx_size=81920
-g, -gt - GPU count and type. Be sure that VRAM is sufficient for the model.
-v - volume with models
-n - instance name that will be used in URLs
-c - CPU cores, no more than 4 is needed in most cases
-m - memory, no more than 8 GiB is needed in most cases
-d - flag to define environment variables
Environment variables
model - path to the model file
n_gpu_layers - number of layers to store in VRAM. 999 means all layers (-1 does not work).
parallel - the total number of slots for processing requests.
ctx_size - total size of the prompt context, for all slots. Should be a multiple of 128.
Please note that with parallel=1 (the default), only one request can be served at a time. If a prediction is in progress, other requests will be accepted but not served until the one in progress completes.
Multiple slots enable concurrent processing but they effectively slice the
context size into equal parts.
E.g. a 32k context with 4 slots results in 8k context for each slot.
The context size must be small enough to fit on the chosen GPU.
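For the example command above, ctx_size=81920 with parallel=10 gives each slot 81920 / 10 = 8192 tokens of context.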
Below we provide a few example configurations.
| GPU | Model | Max. ctx. size |
| --- | --- | --- |
| A100 (80 GB) | Meta-Llama-3.1-70B-Instruct-Q5_K_M | 71680 (70k) |
| A100 (80 GB) | Meta-Llama-3.1-8B-Instruct-Q8_0 | 363520 (355k) |
| A5000 (24 GB) | Meta-Llama-3.1-8B-Instruct-Q8_0 | 79872 (78k) |
| A5000 (24 GB) | Bielik-2.2-11B-Instruct-Q8_0 | 39936 (39k) |
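Once the instance has started, you can confirm it is running with the same command that is used below to retrieve the API key.
cgc compute list -d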
API
The API is available at https://llm-server.NAMESPACE.cgc-waw-01.comtegra.cloud/.
There is also a web UI available.
The API key is shown in cgc compute list -d.
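Any OpenAI-compatible client can then be used against this endpoint. Below is a minimal sketch using curl, assuming the standard /v1/chat/completions route exposed by llama.cpp's server and assuming the key from cgc compute list -d is passed as a Bearer token; NAMESPACE and API_KEY are placeholders to replace with your own values.
# Minimal chat completion request (NAMESPACE and API_KEY are placeholders)
curl https://llm-server.NAMESPACE.cgc-waw-01.comtegra.cloud/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer API_KEY" \
  -d '{"messages": [{"role": "user", "content": "Hello!"}], "max_tokens": 64}'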