Generated with Clerk from notebooks/usage.clj

llama.clj

llama.clj is a clojure wrapper for the llama.cpp library.

Dependency

deps.edn dependency:

com.phronemophobic/llama-clj {:mvn/version "0.8.5"}

Requires

All of the docs assume the following requires:

(require '[com.phronemophobic.llama :as llama])

Throughout these docs, we'll be using the Qwen2 0.5B Instruct model and the following context based on it.

;; downloaded previously from
;; https://huggingface.co/Qwen/Qwen2-0.5B-Instruct-GGUF/resolve/main/qwen2-0_5b-instruct-q4_k_m.gguf?download=true
(def model-path "models/qwen2-0_5b-instruct-q4_k_m.gguf")
;; Use larger context size of 2048.
(def llama-context (llama/create-context model-path {:n-ctx 2048}))

Overview

The llama.clj API is built around two functions, llama/create-context and llama/generate-tokens. llama/create-context builds a context that can be used (and reused) to generate tokens.
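As a minimal sketch of the two-step workflow, reusing the llama-context defined above:

;; Generate a few tokens from a raw prompt using the context defined above.
(take 3 (llama/generate-tokens llama-context "Hello"))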

Context Creation

llama/create-context has two arities:

(llama/create-context model-path)
(llama/create-context model-path opts)

If no opts are specified, then defaults will be used.

The model-path arg should be a string path (relative or absolute) to a gguf or ggml model.

Context Size

The default context size is 512 tokens, which can be limiting. To increase the context size, provide :n-ctx as an option during context creation.

;; Use context size of 2048 tokens
(llama/create-context model-path {:n-ctx 2048})

The max context size of the model can be used by passing 0 for :n-ctx.

;; Use model's max context size.
(llama/create-context model-path {:n-ctx 0})

Prompt Templates

Most chat or instruct models expect a specific prompt format. llama.cpp provides limited support for applying chat templates: llama/chat-apply-template covers many popular models and formats, but some less common models require custom templating that is not included.

Model Provided Templates

Many newer gguf models include the prompt format they expect in their metadata:

(get (llama/metadata llama-context) "tokenizer.chat_template")
"
{% for message in messages %}{% if loop.first and messages[0]['role'] != 'system175 more elided"

If the template is included and llama.cpp recognizes it, then the template can be applied using llama/chat-apply-template.

(llama/chat-apply-template llama-context
                           [{:role "user"
                             :content "What's the best way to code in clojure?"}])
"
<|im_start|>user↩︎What's the best way to code in clojure?<|im_end|>↩︎<|im_start|>a9 more elided"

Typical roles are "assistant", "system", and "user". It is best to check the documentation for your particular model to see which roles are available. Also note that llama.cpp's template detection isn't exact and may guess incorrectly in some cases.
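For example, a hypothetical multi-turn conversation using the "system" and "assistant" roles might be templated like this (a sketch, assuming the model supports those roles):

;; A multi-turn conversation. Roles other than "user" depend on the model.
(llama/chat-apply-template llama-context
                           [{:role "system"
                             :content "You are a terse assistant."}
                            {:role "user"
                             :content "What is a transducer?"}
                            {:role "assistant"
                             :content "A composable transformation of reducing functions."}
                            {:role "user"
                             :content "Show a short example."}])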

Applying Templates By Name

Even if a model doesn't include a particular template, many models use one of the popular template formats. In those cases, you can pass in a template name.

(llama/chat-apply-template "llama3"
                           [{:role "user"
                             :content "What's the best way to code in clojure?"}])
"
<|start_header_id|>user<|end_header_id|>↩︎↩︎What's the best way to code in clojure58 more elided"

See the doc string of chat-apply-template for a list of allowed template names.
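For example, the doc string can be viewed from the REPL:

;; View the doc string, including the list of allowed template names.
(require '[clojure.repl :refer [doc]])
(doc llama/chat-apply-template)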

Token Generation

Once a context is created, it can then be passed to llama/generate-tokens. The llama/generate-tokens function returns a seqable or reducible sequence of tokens given a prompt. That means generated tokens can be processed using all of the normal Clojure sequence- and transducer-based functions.

(def hello-world-prompt
  (llama/chat-apply-template llama-context
                             [{:role "user"
                               :content "Hello World"}]))
"
<|im_start|>user↩︎Hello World<|im_end|>↩︎<|im_start|>assistant"
(first (llama/generate-tokens llama-context hello-world-prompt))
9707
(clojure.string/join
 (eduction
  (llama/decode-token llama-context)
  (take 10)
  (llama/generate-tokens llama-context hello-world-prompt)))
Hello! How
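Since the result is reducible, transduce and friends also work. For example, counting the tokens produced for a prompt (note that this sketch runs the full generation, up to the context size):

;; Count how many tokens the model generates for this prompt.
(transduce (map (constantly 1)) + 0
           (llama/generate-tokens llama-context hello-world-prompt))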

Generating Text

Working with raw tokens is useful in some cases, but most of the time it's more useful to work with a generated sequence of strings corresponding to those tokens. llama.clj provides a simple wrapper of llama/generate-tokens for that purpose, llama/generate.

(def haiku-prompt
  (llama/chat-apply-template
   llama-context
   [{:role "user"
     :content "Write a short poem about documentation."}]))
"
<|im_start|>user↩︎Write a short poem about documentation.<|im_end|>↩︎<|im_start|>a9 more elided"
(into [] (take 5) (llama/generate llama-context haiku-prompt))
[\D \o \c \u \m]
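Because the result is seqable and reducible, output can also be streamed as it's produced, for example by printing each piece as it arrives (a sketch):

;; Print each piece of generated output as it arrives.
(run! print (llama/generate llama-context haiku-prompt))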

If results don't need to be streamed, then llama/generate-string can be used to return a string with all the generated text up to the max context size.

(llama/generate-string llama-context haiku-prompt)
Documentation, a journey through the years,
A compass to navigate, a stream that flows,
A path of records and stories, a silent testimony to history, for time, forever changing.
Documentation, a gift of clarity,
A story of growth and progress, a promise to a better tomorrow,
To share, to celebrate, to give voice to all, we navigate the wilderness.

Log Callback

By default, llama.cpp's logs are sent to stderr (note: stderr is different from *err* and System/err). The log output can be redirected by setting a log callback.

;; disable logging
(llama/set-log-callback (fn [& args]))

;; print to stdout
(llama/set-log-callback
 (fn [log-level msg]
   (let [level-str (case log-level
                     2 "error"
                     3 "warn"
                     4 "info"
                     5 "debug")]
     (println level-str msg))))
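Logs can also be routed to Clojure's *err* writer, using the same two-argument callback shape as above (a sketch):

;; Send llama.cpp logs to Clojure's *err* writer.
(llama/set-log-callback
 (fn [log-level msg]
   (binding [*out* *err*]
     (println log-level msg))))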

Generating Embeddings

To generate embeddings, contexts must be created with :embedding set to true.

(def llama-embedding-context
  (llama/create-context model-path
                        {:embedding true}))
#object[com.phronemophobic.llama.raw_gguf.proxy$com.sun.jna.Pointer$ILookup$ILLamaContext$AutoCloseable$49378b1 0x64fb0c37 "native@0x7fb1ef70f770"]
(vec
(llama/generate-embedding llama-embedding-context "hello world"))
[2.6690612 4.1604795 -5.831996 -3.5878224 1.4617678 -3.605975 1.4035625 2.2367547 3.0567505 -4.382091 1.1975286 -1.0328231 -2.4289412 -1.315618 -0.2870509 -0.8754218 -21.075155 -1.2797574 -7.187202 -2.817467 876 more elided]
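Embeddings are typically compared with a similarity measure such as cosine similarity. Here's a minimal sketch; cosine-similarity is a hypothetical helper and not part of llama.clj:

;; Cosine similarity between two embedding vectors (hypothetical helper).
(defn cosine-similarity [a b]
  (let [dot  (reduce + (map * a b))
        norm (fn [v] (Math/sqrt (reduce + (map * v v))))]
    (/ dot (* (norm a) (norm b)))))

;; Compare the embeddings of two texts.
(cosine-similarity
 (llama/generate-embedding llama-embedding-context "hello world")
 (llama/generate-embedding llama-embedding-context "hello there"))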

FAQ

Context size exceeded

This exception means that the maximum number of tokens for a particular context has been generated and no more tokens can be produced. There are many options for handling generation beyond the context size that are out of scope for this documentation. However, one easy option is to increase the context size (see :n-ctx) if it is not already at the model's maximum. The maximum usable context size depends on your hardware and the model, and larger context sizes come with tradeoffs that can be mitigated with other techniques. The LocalLLaMA subreddit can be a good resource for practical tips.
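As a sketch, recreating the context with the model's maximum context size and rerunning the generation might look like this:

;; Recreate the context with the model's maximum context size and retry.
(def larger-context (llama/create-context model-path {:n-ctx 0}))
(llama/generate-string larger-context haiku-prompt)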