com.phronemophobic.llama
bos
(bos)
(bos ctx)
Returns the llama beginning of sentence token.
Calling bos without a context is deprecated as not all models use the same bos token.
chat-apply-template
(chat-apply-template template messages)
(chat-apply-template template messages opts)
Returns a string with the chat messages formatted using the chat template associated with template.
Args:
template: A llama context or a template name. Template names are one of:
#{"chatml", "llama2", "phi3", "zephyr", "monarch", "gemma", "orion", "openchat", "vicuna", "deepseek", "command-r", "llama3"}
messages: a sequence of chat messages. Chat messages are maps with :role and :content. Typical roles are "assistant", "system", and "user".
opts: A map with the following options:
:append-start-assistant-message?: Whether to end the prompt with the token(s) that indicate the start of an assistant message. If omitted, defaults to true.
Throws IllegalArgumentException if the template format is unsupported.
See: https://github.com/ggerganov/llama.cpp/wiki/Templates-supported-by-llama_chat_apply_template
Throws UnsupportedOperationException for ggml models.
Example: (chat-apply-template ctx
                              [{:role "system" :content "You are a friendly, helpful assistant."}
                               {:role "user" :content "What is clojure?"}]
                              {:append-start-assistant-message? true})
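The resulting string can be used directly as a prompt. A sketch (assuming ctx is a gguf context created with create-context; see generate-string below):
(generate-string ctx
                 (chat-apply-template ctx [{:role "user" :content "What is clojure?"}]))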
create-context
(create-context model-path)
(create-context model-path {:keys [seed n-ctx n-batch n-gpu-layers main-gpu tensor-split rope-freq-base rope-freq-scale low-vram mul_mat_q f16-kv logits-all vocab-only use-mmap use-mlock embedding gqa rms-norm-eps model-format], :as params})
Create and return an opaque llama context.
model-path should be an absolute or relative path to a ggml or gguf model.
An optional map of parameters may be passed for parameterizing the model. The following keys map to their corresponding llama.cpp equivalents:
- :n-ctx: text context, 0 = from model
- :n-batch: logical maximum batch size that can be submitted to llama_decode
- :n-ubatch: physical maximum batch size
- :n-threads: number of threads to use for generation
- :n-threads-batch: number of threads to use for batch processing
- :n-gpu-layers: number of layers to store in VRAM
- :main-gpu: the GPU that is used for the entire model when split_mode is LLAMA_SPLIT_MODE_NONE
- :tensor-split: how to split layers across multiple GPUs
- :vocab-only: only load the vocabulary, no weights
- :use-mmap: use mmap if possible
- :use-mlock: force system to keep model in RAM
- :check-tensors: validate model tensor data
- :rope-freq-base: RoPE base frequency, 0 = from model (ref: https://github.com/ggerganov/llama.cpp/pull/2054)
- :rope-freq-scale: RoPE frequency scaling factor, 0 = from model
- :yarn-ext-factor: YaRN extrapolation mix factor, negative = from model
- :yarn-attn-factor: YaRN magnitude scaling factor
- :yarn-beta-fast: YaRN low correction dim
- :yarn-beta-slow: YaRN high correction dim
- :yarn-orig-ctx: YaRN original context size
- :defrag-thold: defragment the KV cache if holes/size > thold, < 0 disabled (default)
- :logits-all: the llama_decode() call computes all logits, not just the last one (DEPRECATED - set llama_batch.logits instead)
- :embeddings: if true, extract embeddings (together with logits)
- :offload-kqv: whether to offload the KQV ops (including the KV cache) to GPU
- :flash-attn: whether to use flash attention [EXPERIMENTAL]
- :no-perf: whether to measure performance timings
The :model-format can be specified as either :ggml or :gguf. If not provided, the model format will be guessed by looking at model-path.
Resources can be freed by calling .close on the returned context. Using a closed context is undefined and will probably crash the JVM.
Contexts are not thread-safe. Using the same context on multiple threads is undefined and will probably crash the JVM.
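Example (a minimal sketch; the model path is a placeholder for a local gguf file and the option values are only illustrative):
(create-context "models/llama-2-7b-chat.Q4_0.gguf"
                {:n-ctx 2048
                 :n-gpu-layers 0})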
decode-token
(decode-token ctx)
(decode-token ctx opts)
Returns a transducer that expects a stream of llama tokens and outputs a stream of strings.
The transducer will buffer intermediate results until enough bytes to decode a character are available. Also combines surrogate pairs of characters.
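Example (a sketch, assuming ctx was created with create-context): build a string from a stream of generated tokens.
(transduce (decode-token ctx) str "" (generate-tokens ctx "Hello"))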
decode-token-to-char
(decode-token-to-char ctx)
(decode-token-to-char ctx opts)
Returns a transducer that expects a stream of llama tokens and outputs a stream of decoded chars.
The transducer will buffer intermediate results until enough bytes to decode a character are available.
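Example (a sketch, assuming ctx was created with create-context): collect the decoded characters into a vector.
(into [] (decode-token-to-char ctx) (generate-tokens ctx "Hello"))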
end-of-generation?
(end-of-generation? ctx token)
Check if the token is supposed to end generation (end-of-generation, e.g. EOS, EOT, etc.).
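Example (a sketch, assuming ctx was created with create-context): stop consuming tokens once an end-of-generation token appears.
(into []
      (take-while #(not (end-of-generation? ctx %)))
      (generate-tokens ctx "Hello"))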
eos
(eos)
(eos ctx)
Returns the llama end of sentence token.
Calling eos without a context is deprecated as not all models use the same eos token.
generate
(generate ctx prompt)
(generate ctx prompt opts)
Returns a seqable/reducible sequence of strings generated from ctx with prompt.
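Example (a sketch, assuming ctx was created with create-context): print the generated text as it is produced.
(run! print (generate ctx "Write a haiku about the sea."))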
generate-embedding
(generate-embedding ctx prompt opts)
(generate-embedding ctx prompt)
Returns the embedding for a given input prompt.
The context should have been created with the :embedding option set to true.
Note: embeddings are not normalized. See com.phronemophobic.llama.util/normalize-embedding.
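Example (a sketch; the model path is a placeholder for a local embedding-capable gguf model):
(def embedding-ctx
  (create-context "models/embedding-model.gguf" {:embedding true}))
(generate-embedding embedding-ctx "What is clojure?")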
generate-string
(generate-string ctx prompt)
(generate-string ctx prompt opts)
Returns a string with all tokens generated from prompt up until end of sentence or max context size.
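Example (a sketch, assuming ctx was created with create-context):
(generate-string ctx "Describe the Clojure programming language in one sentence.")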
generate-tokens
(generate-tokens ctx prompt)
(generate-tokens ctx prompt {:keys [samplef num-threads seed], :as opts})
Returns a seqable/reducible sequence of tokens from ctx with prompt.
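Example (a sketch, assuming ctx was created with create-context): realize the token ids for a prompt with a fixed seed.
(into [] (generate-tokens ctx "Hello" {:seed 1234}))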
get-embedding
(get-embedding ctx)
Returns a copy of the current context's embedding as a float array.
The context should have been created with the :embedding option set to true.
get-logits
(get-logits ctx)
Returns a copy of the current context's logits as a float array.
init-mirostat-v2-sampler
(init-mirostat-v2-sampler ctx)
(init-mirostat-v2-sampler ctx tau eta)
Given a context, returns a sampling function that uses the llama.cpp mirostat_v2 implementation.
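Example (a sketch, assuming ctx was created with create-context): use the returned sampling function as the :samplef option of generate-tokens.
(generate-tokens ctx "What is clojure?"
                 {:samplef (init-mirostat-v2-sampler ctx)})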
llama-update
(llama-update ctx s)
(llama-update ctx s n-past)
(llama-update ctx s n-past num-threads)
Adds s to the current context and updates the context's logits (see get-logits).
s: either a string or an integer token.
n-past: number of previous tokens to include when updating logits.
num-threads: number of threads to use when updating the logits. If not provided, or nil, defaults to *num-threads*.
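Example (a sketch of a single greedy decoding step, assuming ctx was created with create-context):
(llama-update ctx "What is clojure?")    ;; feed the prompt and update the logits
(sample-logits-greedy (get-logits ctx))  ;; the most likely next token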
model-size
(model-size ctx)
Returns the total size of all the tensors in the model in bytes.
sample-logits-greedy
(sample-logits-greedy logits)
Returns the token with the highest value.
logits: a collection of floats representing the logits (see get-logits).
set-log-callback
(set-log-callback cb)
Sets the log callback. The callback should be a function that receives two args: log level and msg. Setting the callback to nil will cause output to be written to stderr. The log callback is global for all contexts.
The log levels are as follows: GGML_LOG_LEVEL_ERROR = 2, GGML_LOG_LEVEL_WARN = 3, GGML_LOG_LEVEL_INFO = 4, GGML_LOG_LEVEL_DEBUG = 5
Only supported for gguf models.
Example: (set-log-callback (fn [level msg] (println level msg)))