llama.clj is a Clojure wrapper for the llama.cpp library.
deps.edn dependency:
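A sketch of the coordinate, assuming the artifact is published as com.phronemophobic/llama-clj (the version below is a placeholder; check Clojars for the current release):

```clojure
;; in deps.edn under :deps; the version string is a placeholder
com.phronemophobic/llama-clj {:mvn/version "x.y.z"}
```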
All of the docs assume the following requires:
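Assuming the main namespace is com.phronemophobic.llama, aliased as llama (which matches how functions are referenced throughout):

```clojure
(require '[com.phronemophobic.llama :as llama])
```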
Throughout these docs, we'll be using the Qwen 0.5B Instruct model and the following context based on it.
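For example (the model filename is illustrative; use the path of your local gguf file):

```clojure
;; the path is an assumption; point it at your local copy of the
;; qwen 0.5b instruct gguf model
(def llama-context
  (llama/create-context "models/qwen2-0.5b-instruct-q4_0.gguf"))
```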
The llama.clj API is built around two functions, llama/create-context and llama/generate-tokens. The create-context function builds a context that can be used (and reused) to generate tokens. llama/create-context has two arities:
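In sketch form, the two call shapes are:

```clojure
(llama/create-context model-path)       ;; use default options
(llama/create-context model-path opts)  ;; opts is a map of options, e.g. {:n-ctx 2048}
```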
If no opts are specified, then defaults will be used. The model-path arg should be a string path (relative or absolute) to a gguf or ggml model.
The default context size is 512 tokens, which can be limiting. To increase the context size, provide :n-ctx as an option during context creation. The model's maximum context size can be used by passing 0 for :n-ctx.
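For example (2048 is just an illustrative value):

```clojure
;; a larger fixed context size
(llama/create-context model-path {:n-ctx 2048})

;; use the model's maximum context size
(llama/create-context model-path {:n-ctx 0})
```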
Most chat or instruct models expect a specific prompt format. llama.cpp provides limited support for applying chat templates. The chat-apply-template function offers templates for many popular models and formats; some less popular models may require custom templating, which is not included.
Many newer gguf models include the prompt format they expect in their metadata:
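As a rough sketch, assuming the gguf metadata is exposed as a map (the accessor name llama/metadata is an assumption here), the template would live under the standard tokenizer.chat_template key:

```clojure
;; `llama/metadata` is an assumed accessor name; consult the llama.clj API
;; docs for the actual way to read gguf metadata
(get (llama/metadata llama-context) "tokenizer.chat_template")
```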
If the template is included and llama.cpp recognizes it, then the template can be applied using llama/chat-apply-template.
Typical roles are "assistant", "system", and "user". It is best to check the documentation for your particular model to see which roles are available. Also note that llama.cpp's template detection isn't exact and may guess incorrectly in some cases.
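A sketch of applying the context's template, assuming chat-apply-template accepts the context and a sequence of {:role ... :content ...} maps and returns the formatted prompt string:

```clojure
;; the message shape and argument order are assumptions; see
;; chat-apply-template's doc string for the exact contract
(def prompt
  (llama/chat-apply-template
   llama-context
   [{:role "system" :content "You are a helpful assistant."}
    {:role "user" :content "Write a haiku about Clojure."}]))
```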
Even if a model doesn't include a template in its metadata, many models use one of the popular template formats. In those cases, you can pass in a template name.
See the doc string of chat-apply-template for a list of allowed template names.
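For instance, a ChatML-style model might be handled by naming the template directly (:chatml is an assumed name here; use whichever name the doc string lists):

```clojure
;; :chatml is an assumed template name
(llama/chat-apply-template
 :chatml
 [{:role "user" :content "Write a haiku about Clojure."}])
```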
Once a context is created, it can then be passed to llama/generate-tokens. The llama/generate-tokens function returns a seqable or reducible sequence of tokens given a prompt. That means generated tokens can be processed using all of the normal Clojure sequence and transducer based functions.
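For example, collecting a handful of token ids with a transducer (a sketch, using the prompt built above):

```clojure
;; the result is reducible, so transducers apply directly
(into []
      (take 5)
      (llama/generate-tokens llama-context prompt))
```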
Working with raw tokens is useful in some cases, but most of the time, it's more useful to work with a generated sequence of strings corresponding to those tokens. llama.clj provides a simple wrapper of llama/generate-tokens for that purpose, llama/generate.
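For example, streaming the generated text to stdout as it is produced:

```clojure
;; prints each piece of generated text as it arrives
(run! print (llama/generate llama-context prompt))
```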
If results don't need to be streamed, then llama/generate-string can be used to return a string with all the generated text up to the max context size.
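For example (prompt is the formatted prompt from above):

```clojure
;; blocks until generation finishes and returns the full response
(llama/generate-string llama-context prompt)
```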
By default, llama.cpp's logs are sent to stderr (note: stderr is different from *err* and System/err). The log output can be redirected by setting a log callback.
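As a heavily hedged sketch (set-log-callback is an assumed function name; check the llama.clj API for the actual entry point), redirecting logs might look something like:

```clojure
;; `llama/set-log-callback` is an assumed name, shown only to illustrate the idea
(llama/set-log-callback
 (fn [log-level msg]
   ;; route llama.cpp's log output wherever you like
   (println log-level msg)))
```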
To generate embeddings, contexts must be created with :embedding set to true.
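For example (the model path is illustrative):

```clojure
;; contexts used for embeddings need :embedding true at creation time
(def embedding-context
  (llama/create-context "models/qwen2-0.5b-instruct-q4_0.gguf"
                        {:embedding true}))
```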
This exception means that the maximum number of tokens for a particular context has been generated and that no more tokens can be generated. There are many options for handling generation past the context size, but they are beyond the scope of this documentation. However, one easy option is to increase the context size if it isn't already at its maximum (see :n-ctx). The maximum context size will depend on your hardware and the model. However, there are tradeoffs to larger context sizes that can be mitigated with other techniques. The LocalLLaMA subreddit can be a good resource for practical tips.