com.phronemophobic.llama
bos
(bos) (bos ctx)
Returns the llama beginning of sentence token.
Calling bos without a context is deprecated as not all models use the same bos token.
chat-apply-template
(chat-apply-template template messages) (chat-apply-template template messages opts)
Returns a string with the chat messages formatted using the chat template associated with template.
Args:
template: A llama context or a template name. Template names
are one of:
#{"chatml", "llama2", "phi3", "zephyr", "monarch", "gemma", "orion", "openchat", "vicuna", "deepseek", "command-r", "llama3"}
messages: a sequence of chat messages. chat messages are maps with :role and :content.
Typical roles are "assistant", "system", and "user".
opts: A map with the following options:
:append-start-assistant-message?: Whether to end the prompt with the token(s) that
indicate the start of an assistant message.
If omitted, defaults to true.
Throws IllegalArgumentException if the template format is unsupported.
See: https://github.com/ggerganov/llama.cpp/wiki/Templates-supported-by-llama_chat_apply_template
Throws UnsupportedOperationException for ggml models.
Example: (chat-apply-template ctx [{:role "system" :content "You are a friendly, helpful assistant."} {:role "user" :content "What is clojure?"}] {:append-start-assistant-message? true})
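Building on the example above, a hedged sketch of formatting a chat transcript and generating a reply; the require alias and model path are illustrative placeholders:

(require '[com.phronemophobic.llama :as llama])

(def ctx (llama/create-context "models/llama-2-7b-chat.gguf")) ; placeholder path

;; Format the chat transcript with the model's template...
(def prompt
  (llama/chat-apply-template
   ctx
   [{:role "system" :content "You are a friendly, helpful assistant."}
    {:role "user" :content "What is clojure?"}]))

;; ...and generate a reply from the formatted prompt.
(llama/generate-string ctx prompt)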
create-context
(create-context model-path)
(create-context model-path {:keys [seed n-ctx n-batch n-gpu-layers main-gpu tensor-split rope-freq-base rope-freq-scale low-vram mul_mat_q f16-kv logits-all vocab-only use-mmap use-mlock embedding gqa rms-norm-eps model-format], :as params})
Create and return an opaque llama context.
model-path should be an absolute or relative path to a ggml or gguf model.
An optional map of parameters may be passed for parameterizing the model. The following keys map to their corresponding llama.cpp equivalents:
- :n-ctx: text context, 0 = from model
- :n-batch: logical maximum batch size that can be submitted to llama_decode
- :n-ubatch: physical maximum batch size
- :n-threads: number of threads to use for generation
- :n-threads-batch: number of threads to use for batch processing
- :n-gpu-layers: number of layers to store in VRAM
- :main-gpu: the GPU that is used for the entire model when split_mode is LLAMA_SPLIT_MODE_NONE
- :tensor-split: how to split layers across multiple GPUs
- :vocab-only: only load the vocabulary, no weights
- :use-mmap: use mmap if possible
- :use-mlock: force system to keep model in RAM
- :check-tensors: validate model tensor data
- :rope-freq-base: RoPE base frequency, 0 = from model (ref: https://github.com/ggerganov/llama.cpp/pull/2054)
- :rope-freq-scale: RoPE frequency scaling factor, 0 = from model
- :yarn-ext-factor: YaRN extrapolation mix factor, negative = from model
- :yarn-attn-factor: YaRN magnitude scaling factor
- :yarn-beta-fast: YaRN low correction dim
- :yarn-beta-slow: YaRN high correction dim
- :yarn-orig-ctx: YaRN original context size
- :defrag-thold: defragment the KV cache if holes/size > thold, < 0 disabled (default)
- :logits-all: the llama_decode() call computes all logits, not just the last one (DEPRECATED - set llama_batch.logits instead)
- :embeddings: if true, extract embeddings (together with logits)
- :offload-kqv: whether to offload the KQV ops (including the KV cache) to GPU
- :flash-attn: whether to use flash attention (EXPERIMENTAL)
- :no-perf: whether to measure performance timings
The :model-format can be specified as either :ggml or :gguf. If not provided,
the model format will be guessed by looking at model-path.
Resources can be freed by calling .close on the returned context. Using a closed context is undefined and will probably crash the JVM.
Contexts are not thread-safe. Using the same context on multiple threads is undefined and will probably crash the JVM.
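A minimal sketch of creating and closing a context; the model path and option values are illustrative:

(require '[com.phronemophobic.llama :as llama])

(def ctx
  (llama/create-context "models/llama-2-7b-chat.gguf" ; placeholder path
                        {:n-ctx 2048
                         :n-gpu-layers 0}))

;; ... use ctx from a single thread ...

;; Free the native resources when finished.
(.close ctx)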
decode-token
(decode-token ctx) (decode-token ctx opts)
Returns a transducer that expects a stream of llama tokens and outputs a stream of strings.
The transducer will buffer intermediate results until enough bytes to decode a character are available. Also combines surrogate pairs of characters.
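A hedged example of piping generated tokens through the transducer; it assumes a ctx from create-context, the llama require alias from above, and an illustrative prompt:

;; Print decoded text as tokens are produced.
(run! print
      (eduction (llama/decode-token ctx)
                (llama/generate-tokens ctx "Describe the moon in one sentence.")))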
decode-token-to-char
(decode-token-to-char ctx) (decode-token-to-char ctx opts)
Returns a transducer that expects a stream of llama tokens and outputs a stream of decoded chars.
The transducer will buffer intermediate results until enough bytes to decode a character are available.
end-of-generation?
(end-of-generation? ctx token)
Checks if the token is supposed to end generation (e.g. EOS, EOT, etc.).
eos
(eos) (eos ctx)
Returns the llama end of sentence token.
Calling eos without a context is deprecated as not all models use the same eos token.
generate
(generate ctx prompt) (generate ctx prompt opts)
Returns a seqable/reducible sequence of strings generated from ctx with prompt.
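A small sketch of streaming the generated strings as they arrive; the ctx and prompt are illustrative:

(run! print (llama/generate ctx "Write a haiku about rain."))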
generate-embedding
(generate-embedding ctx prompt) (generate-embedding ctx prompt opts)
Returns the embedding for a given input prompt.
The context should have been created with the :embedding option set to true.
Note: embeddings are not normalized. See com.phronemophobic.llama.util/normalize-embedding.
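A hedged sketch of producing and normalizing an embedding; the model path and require aliases are illustrative:

(require '[com.phronemophobic.llama :as llama]
         '[com.phronemophobic.llama.util :as llutil])

;; The context must be created with :embedding set to true.
(def embed-ctx
  (llama/create-context "models/embedding-model.gguf" ; placeholder path
                        {:embedding true}))

(def emb (llama/generate-embedding embed-ctx "Hello, world"))

;; Normalize before comparing embeddings, e.g. with cosine similarity.
(def unit-emb (llutil/normalize-embedding emb))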
generate-string
(generate-string ctx prompt) (generate-string ctx prompt opts)
Returns a string with all tokens generated from prompt up until end of sentence or max context size.
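A short sketch; passing :samplef here mirrors the generate-tokens options and is an assumption about the shared option map:

;; Deterministic output via greedy sampling.
(llama/generate-string ctx
                       "Explain transducers in one paragraph." ; illustrative prompt
                       {:samplef llama/sample-logits-greedy})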
generate-tokens
(generate-tokens ctx prompt) (generate-tokens ctx prompt {:keys [samplef num-threads seed], :as opts})
Returns a seqable/reducible sequence of tokens from ctx with prompt.
get-embedding
(get-embedding ctx)
Returns a copy of the current context's embedding as a float array.
The context should have been created with the :embedding option set to true.
get-logits
(get-logits ctx)
Returns a copy of the current context's logits as a float array.
init-mirostat-v2-sampler
(init-mirostat-v2-sampler ctx) (init-mirostat-v2-sampler ctx tau eta)
Given a context, returns a sampling function that uses the llama.cpp mirostat_v2 implementation.
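A hedged sketch of using the sampler as the :samplef option for token generation; the prompt is illustrative:

(into []
      (llama/decode-token ctx)
      (llama/generate-tokens ctx "Tell me a short story."
                             {:samplef (llama/init-mirostat-v2-sampler ctx)}))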
llama-update
(llama-update ctx s) (llama-update ctx s n-past) (llama-update ctx s n-past num-threads)
Adds s to the current context and updates the context's logits (see get-logits).
s: either a string or an integer token.
n-past: number of previous tokens to include when updating logits.
num-threads: number of threads to use when updating the logits.
If not provided, or nil, defaults to *num-threads*.
model-size
(model-size ctx)
Returns the total size of all the tensors in the model in bytes.
sample-logits-greedy
(sample-logits-greedy logits)
Returns the token with the highest value.
logits: a collection of floats representing the logits (see get-logits).
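A minimal sketch of a manual greedy decoding loop built from llama-update, get-logits, sample-logits-greedy, and end-of-generation?. The 50-token cap and prompt are illustrative, and the two-argument llama-update arity is assumed to append to the existing context:

(defn greedy-generate [ctx prompt]
  ;; Start from an empty context (n-past 0) and fill the logits from the prompt.
  (llama/llama-update ctx prompt 0)
  (loop [tokens []]
    (let [token (llama/sample-logits-greedy (llama/get-logits ctx))]
      (if (or (llama/end-of-generation? ctx token)
              (>= (count tokens) 50))        ; illustrative cap
        ;; Decode the collected tokens back into a string.
        (apply str (into [] (llama/decode-token ctx) tokens))
        (do
          ;; Append the sampled token and refresh the logits.
          (llama/llama-update ctx token)
          (recur (conj tokens token)))))))

(greedy-generate ctx "What is the capital of France?")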
set-log-callback
(set-log-callback cb)
Sets the log callback. The callback should be a function that receives two args: log level and msg. Setting to nil will cause output to be written to stderr. The log callback is global for all contexts.
The log levels are as follows: GGML_LOG_LEVEL_ERROR = 2, GGML_LOG_LEVEL_WARN = 3, GGML_LOG_LEVEL_INFO = 4, GGML_LOG_LEVEL_DEBUG = 5
Only supported for gguf models.
Example: (set-log-callback (fn [level msg] (println level msg)))