Llama.cpp

Llama.cpp is a popular, high-performance engine for large language model inference. In this guide we'll go through the steps needed to deploy an OpenAI-compatible service on Comtegra GPU Cloud.

Choose and load a model

First, choose a model to use for text completion. Model choice is out of scope for this guide, so we'll use Meta-Llama-3.1-8B-Instruct-Q8_0 as an example.

Take note of the amount of disk space required for the chosen model. Our example model will fit in a 10 GiB volume. Let's create it.

cgc volume create -s 10 llms

We'll now load the model file onto the new volume. The quickest way is to download it directly onto the volume from inside a File Browser instance that we'll SSH into.

# Create if it doesn't exist
cgc compute filebrowser create

# Mount the new volume
cgc volume mount -t filebrowser llms

Make sure your SSH Access is configured and connect to filebrowser.

ssh -t cgc@cgc-api.comtegra.cloud -p 2222 filebrowser

Change to the root directory of the new volume and download the model.

cd /srv/llms
wget 'https://huggingface.co/bartowski/Meta-Llama-3.1-8B-Instruct-GGUF/resolve/main/Meta-Llama-3.1-8B-Instruct-Q8_0.gguf'
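
Before disconnecting, a quick listing confirms that the file landed on the volume and that its size looks plausible (the Q8_0 file should fit comfortably in the 10 GiB volume):

# Optional sanity check: list the volume contents and file size
ls -lh /srv/llms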

Run the server

Now, run the OpenAI-compatible server.

cgc compute create -c 4 -m 8 -gt A5000 -g 1 -v llms -n llm-server llama-cpp \
  -d model=Meta-Llama-3.1-8B-Instruct-Q8_0.gguf \
  -d n_gpu_layers=999 \
  -d parallel=10 \
  -d ctx_size=81920

  • -g, -gt - GPU count and type. Be sure that the GPU's VRAM is sufficient for the model.
  • -v - volume with models
  • -n - instance name that will be used in URLs
  • -c - CPU cores, no more than 4 is needed in most cases
  • -m - memory, no more than 8 GiB is needed in most cases
  • -d - flag to define environment variables

Environment variables

  • model - path to the model file
  • n_gpu_layers - number of layers to store in VRAM. 999 means all layers (-1 does not work).
  • parallel - the total number of slots for processing requests.
  • ctx_size - total size of the prompt context, for all slots. Should be a multiple of 128.

Please note that with parallel=1 (the default), only one request can be served at a time. If a prediction is in progress, other requests will remain connected but will not be served until the one in progress completes. Multiple slots enable concurrent processing, but they slice the total context size into equal parts: e.g. a 32k context with 4 slots gives each slot an 8k context. The total context size (together with the model weights) must fit in the chosen GPU's VRAM. Below we provide a few example configurations.

GPU           | Model                              | Max. ctx. size
A100 (80 GB)  | Meta-Llama-3.1-70B-Instruct-Q5_K_M | 71680 (70k)
A100 (80 GB)  | Meta-Llama-3.1-8B-Instruct-Q8_0    | 363520 (355k)
A5000 (24 GB) | Meta-Llama-3.1-8B-Instruct-Q8_0    | 79872 (78k)
A5000 (24 GB) | Bielik-2.2-11B-Instruct-Q8_0       | 39936 (39k)
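
As a concrete example, to serve 4 concurrent slots of 8192 tokens each on an A5000 (well within the 79872-token limit above), the command from the previous section could be adjusted as follows. The numbers are an illustrative sizing under the constraints described above, not a tuned recommendation:

# Illustrative sizing: 4 slots sharing a 32768-token context, i.e. 32768 / 4 = 8192 tokens per slot
cgc compute create -c 4 -m 8 -gt A5000 -g 1 -v llms -n llm-server llama-cpp \
  -d model=Meta-Llama-3.1-8B-Instruct-Q8_0.gguf \
  -d n_gpu_layers=999 \
  -d parallel=4 \
  -d ctx_size=32768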

API

The API is available at https://llm-server.NAMESPACE.cgc-waw-01.comtegra.cloud/ (the hostname begins with the instance name passed via -n). There is also a web UI available. The API key is shown in the output of cgc compute list -d.
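
As a quick smoke test, the service can be queried with any OpenAI-compatible client. The sketch below assumes the standard llama.cpp chat completions endpoint (/v1/chat/completions) and that the key from cgc compute list -d is sent as a Bearer token; replace NAMESPACE and API_KEY with your own values.

# Minimal request with curl (NAMESPACE and API_KEY are placeholders)
curl https://llm-server.NAMESPACE.cgc-waw-01.comtegra.cloud/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer API_KEY" \
  -d '{
        "messages": [{"role": "user", "content": "Hello!"}]
      }'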