com.phronemophobic.llama

*num-threads*

dynamic

Number of threads used when generating tokens.

-main

(-main model-path prompt)

bos

(bos)(bos ctx)

Returns the llama beginning of sentence token.

Calling bos without a context is deprecated as not all models use the same bos token.

chat-apply-template

(chat-apply-template template messages)(chat-apply-template template messages opts)

Returns a string with the chat messages formatted according to template (the chat template associated with a llama context, or a named template).

Args: template: A llama context or a template name. Template names are one of: #{"chatml", "llama2", "phi3", "zephyr", "monarch", "gemma", "orion", "openchat", "vicuna", "deepseek", "command-r", "llama3"}

messages: a sequence of chat messages. Chat messages are maps with :role and :content. Typical roles are "assistant", "system", and "user".

opts: A map with the following options: :append-start-assistant-message?: Whether to end the prompt with the token(s) that indicate the start of an assistant message. If omitted, defaults to true.

Throws IllegalArgumentException if the template format is unsupported. See: https://github.com/ggerganov/llama.cpp/wiki/Templates-supported-by-llama_chat_apply_template

Throws UnsupportedOperationException for ggml models.

Example: (chat-apply-template ctx [{:role "system" :content "You are a friendly, helpful assistant."} {:role "user" :content "What is clojure?"}])
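
A sketch using a named template and the opts map (the namespace is assumed to be aliased as llama). The returned string can then be passed as the prompt to generate-string or generate.

(llama/chat-apply-template
 "chatml"
 [{:role "system" :content "You are a friendly, helpful assistant."}
  {:role "user" :content "What is clojure?"}]
 {:append-start-assistant-message? true})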

create-context

(create-context model-path)(create-context model-path {:keys [seed n-ctx n-batch n-gpu-layers main-gpu tensor-split rope-freq-base rope-freq-scale low-vram mul_mat_q f16-kv logits-all vocab-only use-mmap use-mlock embedding gqa rms-norm-eps model-format], :as params})

Create and return an opaque llama context.

model-path should be an absolute or relative path to a ggml or gguf model.

An optional map of parameters may be passed for parameterizing the model. The following keys map to their corresponding llama.cpp equivalents:

  • :n-ctx: text context, 0 = from model

  • :n-batch: logical maximum batch size that can be submitted to llama_decode

  • :n-ubatch: physical maximum batch size

  • :n-threads: number of threads to use for generation

  • :n-threads-batch: number of threads to use for batch processing

  • :n-gpu-layers: number of layers to store in VRAM

  • :main-gpu: the GPU that is used for the entire model when split_mode is LLAMA_SPLIT_MODE_NONE

  • :tensor-split: how to split layers across multiple GPUs

  • :vocab-only: only load the vocabulary, no weights

  • :use-mmap: use mmap if possible

  • :use-mlock: force system to keep model in RAM

  • :check-tensors: validate model tensor data

(For the RoPE parameters below, see: https://github.com/ggerganov/llama.cpp/pull/2054)

  • :rope-freq-base: RoPE base frequency, 0 = from model

  • :rope-freq-scale: RoPE frequency scaling factor, 0 = from model

  • :yarn-ext-factor: YaRN extrapolation mix factor, negative = from model

  • :yarn-attn-factor: YaRN magnitude scaling factor

  • :yarn-beta-fast: YaRN low correction dim

  • :yarn-beta-slow: YaRN high correction dim

  • :yarn-orig-ctx: YaRN original context size

  • :defrag-thold: defragment the KV cache if holes/size > thold, < 0 disabled (default)

  • :logits-all: the llama_decode() call computes all logits, not just the last one (DEPRECATED - set llama_batch.logits instead)

  • :embeddings: if true, extract embeddings (together with logits)

  • :offload-kqv: whether to offload the KQV ops (including the KV cache) to GPU

  • :flash-attn: whether to use flash attention EXPERIMENTAL

  • :no-perf: whether to measure performance timings

The :model-format can be specified as either :ggml or :gguf. If not provided, the model format will be guessed by looking at model-path.

Resources can be freed by calling .close on the returned context. Using a closed context is undefined and will probably crash the JVM.

Contexts are not thread-safe. Using the same context on multiple threads is undefined and will probably crash the JVM.
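
A minimal sketch of creating and releasing a context; the model path is hypothetical and the options shown are a small subset of those listed above.

(require '[com.phronemophobic.llama :as llama])

;; hypothetical path to a local gguf model
(def model-path "models/llama-2-7b-chat.Q4_0.gguf")

;; create a context with a couple of the options listed above
(def ctx (llama/create-context model-path {:n-ctx 2048
                                           :n-gpu-layers 0}))

;; ... generate text with ctx ...

;; free the native resources when finished; ctx must not be used afterwards
(.close ctx)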

decode-token

(decode-token ctx)(decode-token ctx opts)

Returns a transducer that expects a stream of llama tokens and outputs a stream of strings.

The transducer will buffer intermediate results until enough bytes to decode a character are available. Also combines surrogate pairs of characters.

decode-token-to-char

(decode-token-to-char ctx)(decode-token-to-char ctx opts)

Returns a transducer that expects a stream of llama tokens and outputs a stream of decoded chars.

The transducer will buffer intermediate results until enough bytes to decode a character are available.
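
A sketch of pairing decode-token-to-char with generate-tokens (documented below) to build up a string one character at a time; it assumes ctx and the llama alias from the create-context example.

(transduce (llama/decode-token-to-char ctx)
           (fn
             ([] (StringBuilder.))        ;; init accumulator
             ([sb] (str sb))              ;; completion: finalize to a string
             ([sb c] (.append sb c)))     ;; step: append each decoded char
           (llama/generate-tokens ctx "Describe Clojure in one sentence."))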

end-of-generation?

(end-of-generation? ctx token)

Check if the token is supposed to end generation (e.g. EOS, EOT).

eos

(eos)(eos ctx)

Returns the llama end of sentence token.

Calling eos without a context is deprecated as not all models use the same eos token.

generate

(generate ctx prompt)(generate ctx prompt opts)

Returns a seqable/reducible sequence of strings generated from ctx with prompt.
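
A sketch of consuming the sequence as it is produced, printing each string (assumes ctx and the llama alias from the create-context example).

;; print each generated string as it arrives
(run! print (llama/generate ctx "Write a haiku about parentheses."))
(flush)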

generate-embedding

(generate-embedding ctx prompt)(generate-embedding ctx prompt opts)

Returns the embedding for a given input prompt.

The context should have been created with the :embedding option set to true.

Note: embeddings are not normalized. See com.phronemophobic.llama.util/normalize-embedding.
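
A sketch (the prompt is illustrative and model-path comes from the create-context example); note the :embedding option when the context is created.

;; a context created specifically for embeddings
(def embedding-ctx
  (llama/create-context model-path {:embedding true}))

;; returns a float array; vec makes it easy to inspect at the REPL
(vec (llama/generate-embedding embedding-ctx "Hello, world!"))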

generate-string

(generate-string ctx prompt)(generate-string ctx prompt opts)

Returns a string with all tokens generated from prompt up until end of sentence or max context size.
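
A sketch of the simplest usage (assumes ctx and the llama alias from the create-context example); the opts map is assumed to accept the same options as generate-tokens, e.g. :seed.

(llama/generate-string ctx "What is clojure?")

;; assumed: opts are forwarded to token generation, e.g. a fixed seed
(llama/generate-string ctx "What is clojure?" {:seed 42})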

generate-tokens

(generate-tokens ctx prompt)(generate-tokens ctx prompt {:keys [samplef num-threads seed], :as opts})

Returns a seqable/reducible sequence of tokens from ctx with prompt.
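
A sketch that collects raw token ids with a fixed seed and then decodes them with the decode-token transducer (assumes ctx and the llama alias from earlier).

;; raw token ids (integers) for a prompt, with a fixed seed
(def token-ids
  (into [] (llama/generate-tokens ctx "What is clojure?" {:seed 1234})))

;; decode the ids back into strings
(into [] (llama/decode-token ctx) token-ids)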

get-embedding

(get-embedding ctx)

Returns a copy of the current context's embedding as a float array.

The context should have been created with the :embedding option set to true.
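
A sketch that mirrors what generate-embedding is assumed to do internally: evaluate a prompt with llama-update on the :embedding context from the generate-embedding example, then read the embedding back.

(llama/llama-update embedding-ctx "Hello, world!")
;; the returned array's length should equal (n-embd embedding-ctx)
(count (llama/get-embedding embedding-ctx))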

get-logits

(get-logits ctx)

Returns a copy of the current context's logits as a float array.

init-mirostat-v2-sampler

(init-mirostat-v2-sampler ctx)(init-mirostat-v2-sampler ctx tau eta)

Given a context, returns a sampling function that uses the llama.cpp mirostat_v2 implementation.
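
A sketch that plugs the sampler in as :samplef (documented under generate-tokens). It assumes generate-string forwards :samplef to token generation; the tau/eta values shown are common mirostat defaults, not values taken from this library.

(llama/generate-string ctx "What is clojure?"
                       {:samplef (llama/init-mirostat-v2-sampler ctx 5.0 0.1)})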

llama-update

(llama-update ctx s)(llama-update ctx s n-past)(llama-update ctx s n-past num-threads)

Adds s to the current context and updates the context's logits (see get-logits).

s: either a string or an integer token. n-past: number of previous tokens to include when updating logits. num-threads: number of threads to use when updating the logits. If not provided, or nil, defaults to *num-threads*.
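
A sketch of a low-level greedy generation loop built from llama-update, get-logits, sample-logits-greedy, and end-of-generation? (all documented in this namespace). It assumes llama-update returns the updated ctx, and that ctx and the llama alias come from the earlier examples.

(loop [token (-> (llama/llama-update ctx "What is clojure?")
                 llama/get-logits
                 llama/sample-logits-greedy)
       tokens []]
  (if (llama/end-of-generation? ctx token)
    tokens
    (recur (-> (llama/llama-update ctx token)
               llama/get-logits
               llama/sample-logits-greedy)
           (conj tokens token))))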

metadata

(metadata ctx)

Returns a map of the metadata associated with ctx.

model-description

(model-description ctx)

Get a string describing the model type.

model-n-params

(model-n-params ctx)

Returns the total number of parameters in the model.

model-size

(model-size ctx)

Returns the total size of all the tensors in the model in bytes.

n-ctx

(n-ctx ctx)

The context size for the associated model.

n-embd

(n-embd ctx)

The length of the embedding vector for the associated model.

n-vocab

(n-vocab ctx)

The number of available tokens for the associated model.

sample-logits-greedy

(sample-logits-greedy logits)

Returns the token with the highest logit value.

logits: a collection of floats representing the logits (see get-logits).

set-log-callback

(set-log-callback cb)

Sets the log callback. The callback should be a function that receives two args: log level and msg. Setting to nil will cause output to be written to stderr. The log callback is global for all contexts.

The log levels are as follows: GGML_LOG_LEVEL_ERROR = 2, GGML_LOG_LEVEL_WARN = 3, GGML_LOG_LEVEL_INFO = 4, GGML_LOG_LEVEL_DEBUG = 5

Only supported for gguf models.

Example: (set-log-callback (fn [level msg] (println level msg)))

set-rng-seed

(set-rng-seed ctx seed)

Manually set the rng seed for a context.