com.phronemophobic.llama

*num-threads*

dynamic

Number of threads used when generating tokens.

-main

(-main model-path prompt)

bos

(bos)(bos ctx)

Returns the llama beginning of sentence token.

Calling bos without a context is deprecated as not all models use the same bos token.

chat-apply-template

(chat-apply-template template messages)(chat-apply-template template messages opts)

Returns a string with the chat messages formatted according to template (the chat template associated with a llama context, or a named template).

Args: template: A llama context or a template name. Template names are one of: #{"chatml", "llama2", "phi3", "zephyr", "monarch", "gemma", "orion", "openchat", "vicuna", "deepseek", "command-r", "llama3"}

messages: a sequence of chat messages. Chat messages are maps with :role and :content. Typical roles are "assistant", "system", and "user".

opts: A map with the following options: :append-start-assistant-message?: Whether to end the prompt with the token(s) that indicate the start of an assistant message. If omitted, defaults to true.

Throws IllegalArgumentException if the template format is unsupported. See: https://github.com/ggerganov/llama.cpp/wiki/Templates-supported-by-llama_chat_apply_template

Throws UnsupportedOperationException for ggml models.

Example: (chat-apply-template ctx [{:role "system" :content "You are a friendly, helpful assistant."} {:role "user" :content "What is clojure?"}])
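
A sketch using a named template and the opts map (the namespace is assumed to be aliased as llama). The returned string can then be passed as the prompt to generate-string or generate.

(llama/chat-apply-template
 "chatml"
 [{:role "system" :content "You are a friendly, helpful assistant."}
  {:role "user" :content "What is clojure?"}]
 {:append-start-assistant-message? true})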

create-context

(create-context model-path)(create-context model-path {:keys [seed n-ctx n-batch n-gpu-layers main-gpu tensor-split rope-freq-base rope-freq-scale low-vram mul_mat_q f16-kv logits-all vocab-only use-mmap use-mlock embedding gqa rms-norm-eps model-format], :as params})

Create and return an opaque llama context.

model-path should be an absolute or relative path to a ggml or gguf model.

An optional map of parameters may be passed for parameterizing the model. The following keys map to their corresponding llama.cpp equivalents:

  • :n-ctx: text context, 0 = from model

  • :n-batch: logical maximum batch size that can be submitted to llama_decode

  • :n-ubatch: physical maximum batch size

  • :n-threads: number of threads to use for generation

  • :n-threads-batch: number of threads to use for batch processing

  • :n-gpu-layers: number of layers to store in VRAM

  • :main-gpu: the GPU that is used for the entire model when split_mode is LLAMA_SPLIT_MODE_NONE

  • :tensor-split: how to split layers across multiple GPUs

  • :vocab-only: only load the vocabulary, no weights

  • :use-mmap: use mmap if possible

  • :use-mlock: force system to keep model in RAM

  • :check-tensors: validate model tensor data

(For the RoPE parameters below, see: https://github.com/ggerganov/llama.cpp/pull/2054)

  • :rope-freq-base: RoPE base frequency, 0 = from model

  • :rope-freq-scale: RoPE frequency scaling factor, 0 = from model

  • :yarn-ext-factor: YaRN extrapolation mix factor, negative = from model

  • :yarn-attn-factor: YaRN magnitude scaling factor

  • :yarn-beta-fast: YaRN low correction dim

  • :yarn-beta-slow: YaRN high correction dim

  • :yarn-orig-ctx: YaRN original context size

  • :defrag-thold: defragment the KV cache if holes/size > thold, < 0 disabled (default)

  • :logits-all: the llama_decode() call computes all logits, not just the last one (DEPRECATED - set llama_batch.logits instead)

  • :embeddings: if true, extract embeddings (together with logits)

  • :offload-kqv: whether to offload the KQV ops (including the KV cache) to GPU

  • :flash-attn: whether to use flash attention EXPERIMENTAL

  • :no-perf: whether to measure performance timings

The :model-format can be specified as either :ggml or :gguf. If not provided, the model format will be guessed by looking at model-path.

Resources can be freed by calling .close on the returned context. Using a closed context is undefined and will probably crash the JVM.

Contexts are not thread-safe. Using the same context on multiple threads is undefined and will probably crash the JVM.
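
A minimal sketch of creating and releasing a context; the model path is hypothetical and the options shown are a small subset of those listed above.

(require '[com.phronemophobic.llama :as llama])

;; hypothetical path to a local gguf model
(def model-path "models/llama-2-7b-chat.Q4_0.gguf")

;; create a context with a couple of the options listed above
(def ctx (llama/create-context model-path {:n-ctx 2048
                                           :n-gpu-layers 0}))

;; ... generate text with ctx ...

;; free the native resources when finished; ctx must not be used afterwards
(.close ctx)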

decode-token

(decode-token ctx)(decode-token ctx opts)

Returns a transducer that expects a stream of llama tokens and outputs a stream of strings.

The transducer will buffer intermediate results until enough bytes to decode a character are available. Also combines surrogate pairs of characters.

decode-token-to-char

(decode-token-to-char ctx)(decode-token-to-char ctx opts)

Returns a transducer that expects a stream of llama tokens and outputs a stream of decoded chars.

The transducer will buffer intermediate results until enough bytes to decode a character are available.
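
A sketch of pairing decode-token-to-char with generate-tokens (documented below) to build up a string one character at a time; it assumes ctx and the llama alias from the create-context example.

(transduce (llama/decode-token-to-char ctx)
           (fn
             ([] (StringBuilder.))        ;; init accumulator
             ([sb] (str sb))              ;; completion: finalize to a string
             ([sb c] (.append sb c)))     ;; step: append each decoded char
           (llama/generate-tokens ctx "Describe Clojure in one sentence."))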

end-of-generation?

(end-of-generation? ctx token)

Check if the token is supposed to end generation (e.g. EOS, EOT).

eos

(eos)(eos ctx)

Returns the llama end of sentence token.

Calling eos without a context is deprecated as not all models use the same eos token.

generate

(generate ctx prompt)(generate ctx prompt opts)

Returns a seqable/reducible sequence of strings generated from ctx with prompt.
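
A sketch of consuming the sequence as it is produced, printing each string (assumes ctx and the llama alias from the create-context example).

;; print each generated string as it arrives
(run! print (llama/generate ctx "Write a haiku about parentheses."))
(flush)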

generate-embedding

(generate-embedding ctx prompt)(generate-embedding ctx prompt opts)

Returns the embedding for a given input prompt.

The context should have been created with the :embedding option set to true.

Note: embeddings are not normalized. See com.phronemophobic.llama.util/normalize-embedding.
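
A sketch (the prompt is illustrative and model-path comes from the create-context example); note the :embedding option when the context is created.

;; a context created specifically for embeddings
(def embedding-ctx
  (llama/create-context model-path {:embedding true}))

;; returns a float array; vec makes it easy to inspect at the REPL
(vec (llama/generate-embedding embedding-ctx "Hello, world!"))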

generate-string

(generate-string ctx prompt)(generate-string ctx prompt opts)

Returns a string with all tokens generated from prompt up until end of sentence or max context size.
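
A sketch of the simplest usage (assumes ctx and the llama alias from the create-context example); the opts map is assumed to accept the same options as generate-tokens, e.g. :seed.

(llama/generate-string ctx "What is clojure?")

;; assumed: opts are forwarded to token generation, e.g. a fixed seed
(llama/generate-string ctx "What is clojure?" {:seed 42})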

generate-tokens

(generate-tokens ctx prompt)(generate-tokens ctx prompt {:keys [samplef num-threads seed], :as opts})

Returns a seqable/reducible sequence of tokens from ctx with prompt.
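
A sketch that collects raw token ids with a fixed seed and then decodes them with the decode-token transducer (assumes ctx and the llama alias from earlier).

;; raw token ids (integers) for a prompt, with a fixed seed
(def token-ids
  (into [] (llama/generate-tokens ctx "What is clojure?" {:seed 1234})))

;; decode the ids back into strings
(into [] (llama/decode-token ctx) token-ids)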

get-embedding

(get-embedding ctx)

Returns a copy of the current context's embedding as a float array.

The context should have been created with the :embedding option set to true.
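
A sketch that mirrors what generate-embedding is assumed to do internally: evaluate a prompt with llama-update on the :embedding context from the generate-embedding example, then read the embedding back.

(llama/llama-update embedding-ctx "Hello, world!")
;; the returned array's length should equal (n-embd embedding-ctx)
(count (llama/get-embedding embedding-ctx))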

get-logits

(get-logits ctx)

Returns a copy of the current context's logits as a float array.

init-mirostat-v2-sampler

(init-mirostat-v2-sampler ctx)(init-mirostat-v2-sampler ctx tau eta)

Given a context, returns a sampling function that uses the llama.cpp mirostat_v2 implementation.
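
A sketch that plugs the sampler in as :samplef (documented under generate-tokens). It assumes generate-string forwards :samplef to token generation; the tau/eta values shown are common mirostat defaults, not values taken from this library.

(llama/generate-string ctx "What is clojure?"
                       {:samplef (llama/init-mirostat-v2-sampler ctx 5.0 0.1)})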

llama-update

(llama-update ctx s)(llama-update ctx s n-past)(llama-update ctx s n-past num-threads)

Adds s to the current context and updates the context's logits (see get-logits).

s: either a string or an integer token. n-past: number of previous tokens to include when updating logits. num-threads: number of threads to use when updating the logits. If not provided, or nil, defaults to *num-threads*.
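
A sketch of a low-level greedy generation loop built from llama-update, get-logits, sample-logits-greedy, and end-of-generation? (all documented in this namespace). It assumes llama-update returns the updated ctx, and that ctx and the llama alias come from the earlier examples.

(loop [token (-> (llama/llama-update ctx "What is clojure?")
                 llama/get-logits
                 llama/sample-logits-greedy)
       tokens []]
  (if (llama/end-of-generation? ctx token)
    tokens
    (recur (-> (llama/llama-update ctx token)
               llama/get-logits
               llama/sample-logits-greedy)
           (conj tokens token))))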

metadata

(metadata ctx)

Returns a map of the metadata associated with ctx.

model-description

(model-description ctx)

Get a string describing the model type.

model-n-params

(model-n-params ctx)

Returns the total number of parameters in the model.

model-size

(model-size ctx)

Returns the total size of all the tensors in the model in bytes.

n-ctx

(n-ctx ctx)

The context size for the associated model.

n-embd

(n-embd ctx)

The length of the embedding vector for the associated model.

n-vocab

(n-vocab ctx)

The number of available tokens for the associated model.

sample-logits-greedy

(sample-logits-greedy logits)

Returns the token with the highest logit value.

logits: a collection of floats representing the logits (see get-logits).

set-log-callback

(set-log-callback cb)

Sets the log callback. The callback should be a function that receives two args: log level and msg. Setting to nil will cause output to be written to stderr. The log callback is global for all contexts.

The log levels are as follows: GGML_LOG_LEVEL_ERROR = 2, GGML_LOG_LEVEL_WARN = 3, GGML_LOG_LEVEL_INFO = 4, GGML_LOG_LEVEL_DEBUG = 5

Only supported for gguf models.

Example: (set-log-callback (fn [level msg] (println level msg)))

set-rng-seed

(set-rng-seed ctx seed)

Manually set the rng seed for a context.