LLM inference API
We provide an LLM inference API as an additional service.
The API is partially compatible with the OpenAI API.
The base URL is https://llm.comtegra.cloud/v1.
The supported endpoints are:
- /chat/completions
- /embeddings
- /audio/transcriptions
Requests are authenticated using a bearer token. To start using the API, you only need to generate one via the CGC client (see API keys below).
Usage examples
Chat
Python:
from openai import OpenAI
client = OpenAI(base_url="https://llm.comtegra.cloud/v1",
                api_key="YOUR-API-SECRET")
res = client.chat.completions.create(model="llama3-8b", max_completion_tokens=100,
                                     messages=[{"role": "user", "content": "Hi"}])
print(res)
curl:
curl \
  -H 'Authorization: Bearer YOUR-API-SECRET' \
  -H 'Content-Type: application/json' \
  -d '{"model": "llama3-8b", "messages": [{"role": "user", "content": "Hi"}]}' \
  'https://llm.comtegra.cloud/v1/chat/completions'
This and the other curl examples on this page are for the curl program. In Microsoft's PowerShell, curl is an alias for Invoke-WebRequest, which is not compatible with the real curl.
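The chat endpoint follows the OpenAI schema, so streaming may also work through the standard stream parameter. Whether streaming is enabled on this deployment is an assumption on our part; treat the following as a sketch:

from openai import OpenAI

client = OpenAI(base_url="https://llm.comtegra.cloud/v1",
                api_key="YOUR-API-SECRET")
# stream=True asks the server to send the reply incrementally;
# each chunk carries a small delta of the generated text.
stream = client.chat.completions.create(model="llama3-8b", stream=True,
                                        messages=[{"role": "user", "content": "Hi"}])
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
print()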
Embeddings
Python:
from openai import OpenAI
client = OpenAI(base_url="https://llm.comtegra.cloud/v1",
                api_key="YOUR-API-SECRET")
res = client.embeddings.create(model="gte-qwen2-7b", input="Mary had a little lamb")
print(res)
curl:
curl \
  -H 'Authorization: Bearer YOUR-API-SECRET' \
  -H 'Content-Type: application/json' \
  -d '{"model": "gte-qwen2-7b", "input": "Mary had a little lamb", "pooling": "mean"}' \
  'https://llm.comtegra.cloud/v1/embeddings'
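The response contains one vector per input. As a follow-up, here is a minimal sketch of comparing two texts by cosine similarity; it assumes the endpoint accepts a list of inputs in one request, as the OpenAI embeddings API does:

from openai import OpenAI

client = OpenAI(base_url="https://llm.comtegra.cloud/v1",
                api_key="YOUR-API-SECRET")
res = client.embeddings.create(model="gte-qwen2-7b",
                               input=["Mary had a little lamb",
                                      "A girl owned a small sheep"])
a, b = res.data[0].embedding, res.data[1].embedding

# Cosine similarity in plain Python: dot product over the product of norms.
dot = sum(x * y for x, y in zip(a, b))
norm_a = sum(x * x for x in a) ** 0.5
norm_b = sum(x * x for x in b) ** 0.5
print(dot / (norm_a * norm_b))  # close to 1.0 for semantically similar texts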
Transcriptions
Python:
from openai import OpenAI
client = OpenAI(base_url="https://llm.comtegra.cloud/v1",
                api_key="YOUR-API-SECRET")
with open("recording.mp3", "rb") as f:
    res = client.audio.transcriptions.create(model="whisper-1", file=f)
print(res)
curl:
curl \
  -H 'Authorization: Bearer YOUR-API-SECRET' \
  -F file=@'recording.mp3' \
  -F model='whisper-1' \
  'https://llm.comtegra.cloud/v1/audio/transcriptions'
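With the Python client, the plain transcript is available as the text attribute of the returned object, so print(res.text) prints just the transcribed words rather than the whole response object.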
API keys
To generate a new API key with access to the LLM service, run:
cgc api-keys create --level LLM
The output will contain an API secret. Save it somewhere convenient and safe; you'll need to present it with every request you make to the API. There's no way to retrieve it a second time, so if you lose it, delete it and create another one. Keep it secret, as any use of it generates costs for your organization.
You may add a comment to a new API key so that it's easier to identify later.
cgc api-keys create --level LLM --comment "Mike's key"
See cgc api-keys --help for all commands and options.
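Since the secret should not be pasted into scripts, you may prefer to load it from an environment variable. A minimal sketch; the variable name CGC_LLM_API_KEY is our own convention, not something set by the CGC client:

import os

from openai import OpenAI

# Read the secret from the environment instead of hard-coding it.
# Set it once in your shell: export CGC_LLM_API_KEY='...'
client = OpenAI(base_url="https://llm.comtegra.cloud/v1",
                api_key=os.environ["CGC_LLM_API_KEY"])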
Billing
Requests are billed per token. The price per token depends on the model you use and the GPU it runs on: the larger the model and the faster the GPU, the higher the price. Both prompt (input) and completion (output) tokens are billed to your organization's invoice. The price list is available on our pricing page.
You can view token usage and cost using cgc billing status.
For example, suppose you use the Meta-Llama 3.1-70B-Instruct-Q5_K_M model running on an NVIDIA A100 GPU, with a prompt token price of 19,78 zł / 1M tokens and a completion token price of 167,66 zł / 1M tokens. Your prompt is "Write a haiku about ChatGPT", which is 18 tokens long. You get the response "Silicon whispers. ChatGPT's gentle responses. Knowledge at my door", which is 16 tokens long. Your organization will be billed 19,78 zł × 18 / 1 000 000 + 167,66 zł × 16 / 1 000 000 = 0,003039 zł for this request.
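The same arithmetic in a few lines of Python; in practice, the token counts for a request are reported in the usage field of the API response:

# Prices from the example above, in zł per token.
PROMPT_PRICE = 19.78 / 1_000_000
COMPLETION_PRICE = 167.66 / 1_000_000

prompt_tokens, completion_tokens = 18, 16  # as reported in res.usage
cost = prompt_tokens * PROMPT_PRICE + completion_tokens * COMPLETION_PRICE
print(f"{cost:.6f} zł")  # 0.003039 zł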
Self-managed instance
If you'd like to use multiple models in your own namespace, with access control and per-user token logs, you might want to consider running a self-managed instance of the LLM API.
Administration tasks such as user management and usage monitoring require SSH access to your LLM API container. You may configure it by following the instructions on the [SSH Access] page.
[SSH Access]: /Getting Started/ssh-access
Creating a self-managed instance
Available soon.
User management
To allow a new user to access the API, create a new API key for them. Log into your LLM API container and run the following command to create one. It's strongly recommended to add a comment to each key indicating its user or purpose; without comments, keys are hard to tell apart. You may also set an expiration date, after which the key will no longer be valid.
llmproxyctl user create --comment 'John Smith' --expires '2030-01-01 13:37'
# with shorthand argument forms
llmproxyctl user create -t 'John Smith' -e '2030-01-01 13:37'
The program will print the details of the newly created user and the associated randomly generated API key.
User created
Expires: 2030-01-01 12:37:00+00:00
Comment: John Smith
Hash: dd2a60b77cb2
Plain API key: 3FpwuzyXdU-u3bm8hg9ipA3ZB7JFWSqYHPRw1EMeQsB-XuV5cJ_...
Please note that this is the only time you'll see the plain API key. It's not possible to recover it if it's lost, because it's stored in the database only in hashed form.
You may list all keys with the following command.
llmproxyctl user list
If you wish to edit the comment or expiration date of a user, use the llmproxyctl user update command.
To select the user to edit, look at the Hash column (the first column) in the user list and use the first few characters of the corresponding hash, as in the following example.
$ llmproxyctl user list
Hash Expires Status Comment
------------ -------------------- ------- -------
dd2a60b77cb2 2030-01-01 12:37:00 active John Smith
a26a23f52cd8 - active Sam Altman
da7534bd88a0 2022-10-26 22:00:00 expired Jack Dorsey
$ llmproxyctl user update -t "Elon Musk" -e '2030-01-01 00:00' da
User updated
$ llmproxyctl user list
Hash Expires Status Comment
------------ -------------------- ------- -------
dd2a60b77cb2 2030-01-01 12:37:00 active John Smith
a26a23f52cd8 - active Sam Altman
da7534bd88a0 2030-01-01 00:00:00 active Elon Musk
If you wish to disable a user, set their expiration date to the special value now.
llmproxyctl user update -e now da