Llama.cpp
Llama.cpp is the go-to solution for performant large language model inference. In this guide we'll go through the steps necessary to deploy an OpenAI-compatible service using Comtegra GPU Cloud.
Choose and load a model
First, choose a model to use for text completion. Model choice is out of scope for this guide, so we'll use Meta-Llama-3.1-8B-Instruct-Q8_0 as an example.
Take note of the amount of disk space required for the chosen model. Our example model will fit in a 10 GiB volume. Let's create it.
cgc volume create -s 10 llms
We'll now load the model file into the new volume. A quick way to do that is to download it straight to the volume from a File Browser instance that we'll SSH into.
# Create if it doesn't exist
cgc compute filebrowser create
# Mount the new volume
cgc volume mount -t filebrowser llms
Make sure your SSH Access is configured and connect to filebrowser.
ssh -t cgc@cgc-api.comtegra.cloud -p 2222 filebrowser
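Once connected, you can check that the llms volume is mounted; it should appear under /srv, the path used in the next step.
# The llms volume should be listed in the output
ls /srv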
Change directory to the root directory of the new volume and download the model.
cd /srv/llms
wget 'https://huggingface.co/bartowski/Meta-Llama-3.1-8B-Instruct-GGUF/resolve/main/Meta-Llama-3.1-8B-Instruct-Q8_0.gguf'
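You can verify that the download completed by checking the file size; the Q8_0 quantization of this 8B model is on the order of 8-9 GB.
# The .gguf file should be roughly 8-9 GB
ls -lh /srv/llms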
Run the server
Now, run the OpenAI-compatible server.
cgc compute create -c 4 -m 8 -gt A5000 -g 1 -v llms -n llm-server llama-cpp -d model=Meta-Llama-3.1-8B-Instruct-Q8_0.gguf -d n_gpu_layers=999 -d parallel=10 -d ctx_size=81920
-g, -gt - GPU count and type. Be sure that VRAM is sufficient for the model.
-v - volume with models
-n - instance name that will be used in URLs
-c - CPU cores, no more than 4 is needed in most cases
-m - memory, no more than 8 GiB is needed in most cases
-d - flag to define environment variables
Environment variables
model - path to the model file
n_gpu_layers - number of layers to store in VRAM. 999 means all layers (-1 does not work).
parallel - the total number of slots for processing requests.
ctx_size - total size of the prompt context, for all slots. Should be a multiple of 128.
Please note that with parallel=1 (the default), only one request can be served at a time. If a prediction is in progress, other requests will be accepted but not served until the one in progress completes.
Multiple slots enable concurrent processing but they effectively slice the
context size into equal parts.
E.g. a 32k context with 4 slots results in 8k context for each slot.
The context size must be small enough to fit on the chosen GPU.
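For the example command above, ctx_size=81920 with parallel=10 gives each slot 81920 / 10 = 8192 tokens of context.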
Below we provide a few example configurations.
| GPU | Model | Max. ctx. size |
| --- | --- | --- |
| A100 (80 GB) | Meta-Llama-3.1-70B-Instruct-Q5_K_M | 71680 (70k) |
| A100 (80 GB) | Meta-Llama-3.1-8B-Instruct-Q8_0 | 363520 (355k) |
| A5000 (24 GB) | Meta-Llama-3.1-8B-Instruct-Q8_0 | 79872 (78k) |
| A5000 (24 GB) | Bielik-2.2-11B-Instruct-Q8_0 | 39936 (39k) |
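Once the instance has started, you can confirm it is running with the same command that is used below to retrieve the API key.
cgc compute list -d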
API
The API is available at https://llm-server.NAMESPACE.cgc-waw-01.comtegra.cloud/.
There is also a web UI available.
The API key is shown in cgc compute list -d.
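Any OpenAI-compatible client can then be used against this endpoint. Below is a minimal sketch using curl, assuming the standard /v1/chat/completions route exposed by llama.cpp's server and assuming the key from cgc compute list -d is passed as a Bearer token; NAMESPACE and API_KEY are placeholders to replace with your own values.
# Minimal chat completion request (NAMESPACE and API_KEY are placeholders)
curl https://llm-server.NAMESPACE.cgc-waw-01.comtegra.cloud/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer API_KEY" \
  -d '{"messages": [{"role": "user", "content": "Hello!"}], "max_tokens": 64}'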