Llama.cpp

Llama.cpp is the go-to solution for performant large language model inference. In this guide, we'll go through the steps necessary to deploy an OpenAI-compatible service on Comtegra GPU Cloud.

Choose and load a model

First, choose a model to use for text completion. Model selection is out of scope for this guide, so we'll use Meta-Llama-3.1-8B-Instruct-Q8_0 as an example.

Take note of the amount of disk space required for the chosen model. Our example model will fit in a 10 GiB volume. Let's create it.

cgc volume create -s 10 llms
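
To confirm the volume was created before moving on, you can list your volumes. This is a sketch that assumes your version of the cgc CLI provides the volume list subcommand:

cgc volume list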

We'll now load the model file into the new volume. A quick way of doing that is to download it directly to the volume from a File Browser instance that we'll SSH into.

# Create if it doesn't exist
cgc compute filebrowser create

# Mount the new volume
cgc volume mount -t filebrowser llms

Make sure your SSH Access is configured and connect to filebrowser.

ssh -t cgc@cgc-api.comtegra.cloud -p 2222 filebrowser

Change to the root directory of the new volume and download the model.

cd /srv/llms
wget 'https://huggingface.co/bartowski/Meta-Llama-3.1-8B-Instruct-GGUF/resolve/main/Meta-Llama-3.1-8B-Instruct-Q8_0.gguf'
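
The file is several gigabytes, so it's worth checking that it arrived intact. As a sketch, you can compute the SHA256 checksum of the download and compare it against the value published on the file's Hugging Face page (the reference value is assumed to be taken from there):

sha256sum Meta-Llama-3.1-8B-Instruct-Q8_0.gguf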

Run the server

Now, run the OpenAI-compatible server.

cgc compute create -c 4 -m 8 -gt A5000 -g 1 -v llms -n llm-server llama-cpp \
  -d model=Meta-Llama-3.1-8B-Instruct-Q8_0.gguf \
  -d n_gpu_layers=999 \
  -d parallel=10 \
  -d ctx_size=81920

This requests 4 CPU cores, 8 GiB of RAM, and a single A5000 GPU, mounts the llms volume, and names the instance llm-server. The -d options are passed to llama.cpp: n_gpu_layers=999 offloads all model layers to the GPU, and the 81920-token ctx_size is split evenly across the parallel=10 slots, giving each concurrent request an 8192-token context.

The API is available at https://llm-server.NAMESPACE.cgc-waw-01.comtegra.cloud/, where NAMESPACE is your namespace. There is also a web UI available. The API key is shown in the output of cgc compute list -d.
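
To verify the deployment end to end, you can call the OpenAI-compatible chat completions endpoint. A minimal sketch, assuming NAMESPACE is your namespace and YOUR_API_KEY is the key shown by cgc compute list -d:

curl https://llm-server.NAMESPACE.cgc-waw-01.comtegra.cloud/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -d '{
    "messages": [{"role": "user", "content": "Hello! Who are you?"}]
  }'

Since the server loads a single model, the model field can typically be omitted from the request body.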