# Llama.cpp

Llama.cpp is a go-to engine for fast large language model inference. In this guide we'll go through the steps necessary to deploy an OpenAI-compatible service on Comtegra GPU Cloud.
## Choose and load a model
First, choose a model to use for text completion. Model choice is out of scope for this guide, so we'll use [Meta-Llama-3.1-8B-Instruct-Q8_0](https://huggingface.co/bartowski/Meta-Llama-3.1-8B-Instruct-GGUF) as an example.
Take note of the amount of disk space required for the chosen model. Our example file weighs roughly 8.5 GB, so it will fit comfortably in a 10 GiB volume. Let's create it.
```bash
cgc volume create -s 10 llms
```
We'll now load the model file into the new volume. A quick way of doing that is to download it straight to the volume from inside a File Browser instance that we'll SSH into.
```bash
# Create a File Browser instance if one doesn't exist yet
cgc compute filebrowser create
# Mount the new volume
cgc volume mount -t filebrowser llms
```
Make sure your SSH access is configured, then connect to the filebrowser.
```bash
ssh -t cgc@cgc-api.comtegra.cloud -p 2222 filebrowser
```
Change into the volume's root directory and download the model.
```bash
cd /srv/llms
wget 'https://huggingface.co/bartowski/Meta-Llama-3.1-8B-Instruct-GGUF/resolve/main/Meta-Llama-3.1-8B-Instruct-Q8_0.gguf'
```
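Once the download finishes, it's worth confirming that the whole file arrived; the Q8_0 build should come in at roughly 8.5 GB.

```bash
# Sanity check: the file should be present and roughly 8.5 GB in size
ls -lh /srv/llms/Meta-Llama-3.1-8B-Instruct-Q8_0.gguf
```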
## Run the server
Now, run the OpenAI-compatible server.

```bash
cgc compute create -c 4 -m 8 -gt A5000 -g 1 -v llms -n llm-server llama-cpp \
  -d model=Meta-Llama-3.1-8B-Instruct-Q8_0.gguf \
  -d n_gpu_layers=999 \
  -d parallel=10 \
  -d ctx_size=81920
```
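The `-d` options mirror llama.cpp's server parameters (assuming the llama-cpp template passes them straight through): `n_gpu_layers=999` offloads all model layers to the GPU, `parallel=10` allows up to 10 requests to be processed concurrently, and `ctx_size=81920` is the total context size, which llama.cpp splits evenly across the parallel slots, giving each request 8192 tokens.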
The API is available at `https://llm-server.NAMESPACE.cgc-waw-01.comtegra.cloud/`, where `NAMESPACE` is your namespace.
There is also a web UI available.
The API key is shown in the output of `cgc compute list -d`.
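Since the server exposes the OpenAI API, any OpenAI-compatible client can talk to it. As a quick smoke test, here is a minimal sketch using curl; `NAMESPACE` and the `API_KEY` variable are placeholders for your own namespace and key, and we assume the key is sent as a standard Bearer token.

```bash
# Placeholders: substitute NAMESPACE and set API_KEY to the key
# reported by `cgc compute list -d`.
curl "https://llm-server.NAMESPACE.cgc-waw-01.comtegra.cloud/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $API_KEY" \
  -d '{
        "messages": [
          {"role": "system", "content": "You are a helpful assistant."},
          {"role": "user", "content": "Say hello in one short sentence."}
        ]
      }'
```

If everything is up, the response is a standard OpenAI-style chat completion object, with the generated text under `choices[0].message.content`.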