This guide covers the what, how, and why of running LLMs locally using llama.clj, a Clojure wrapper for the llama.cpp library.
Large language models (LLMs) are tools that are quickly growing in popularity. Typically, they are used via an API or service. However, many models are available to download and run locally even with modest hardware.
From the perspective of using an LLM, there's really only one basic operation:
Given a sequence of tokens, calculate the probability that a token will come next in the sequence. This probability is calculated for all possible tokens.
That's basically it. All other usage derives from this one basic operation.
If you've interacted with an LLM, it was probably through one of the various chat interfaces. Before exploring other uses of local LLMs, we'll first explain how a chat interface can be implemented.
Keen readers may have already noticed that chat interfaces work with text, but LLMs work with tokens. Choosing how to bridge the gap between text and tokens is an interesting topic for creating LLMs, but it's not important for understanding how to run LLMs locally. All we need to know is that text can be tokenized into tokens and vice versa.
Just to get a sense of the differences between tokens and text, let's look at how the llama2 7b chat model tokenizes text.
One thing to notice is that there are fewer tokens than letters:
If we untokenize each token, we can see that tokens are often whole words, but not always.
Just to get a feel for a typical tokenizer, we'll look at some basic stats.
Number of tokens:
The longest token:
Token with the most spaces:
One last caveat to watch out for when converting between tokens and text is that not every token produces a valid utf-8 string. It may require multiple tokens before a valid utf-8 string is available.
Fortunately, llama.clj has a utility for untokenizing that will take care of the issue:
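To illustrate the underlying idea, here's a minimal sketch in Python (not llama.clj's actual implementation) of an untokenizer that buffers raw bytes until they form valid utf-8. The toy vocabulary below is made up; real models define their own token-to-bytes mapping.

```python
# Toy mapping from token ids to raw bytes. Tokens 1 and 2 each hold
# half of a 4-byte emoji, so neither is valid utf-8 on its own.
TOKEN_BYTES = {
    0: b"Hi ",
    1: b"\xf0\x9f",  # first half of a 4-byte utf-8 sequence
    2: b"\x98\x8a",  # second half
}

def untokenize(tokens):
    """Accumulate raw bytes and only decode once the buffer is valid utf-8."""
    buf = b""
    out = []
    for tok in tokens:
        buf += TOKEN_BYTES[tok]
        try:
            out.append(buf.decode("utf-8"))
            buf = b""
        except UnicodeDecodeError:
            pass  # incomplete sequence; wait for more bytes
    return "".join(out)

print(untokenize([0, 1, 2]))  # "Hi 😊"
```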
Given a sequence of tokens, calculate the probability that a token will come next in the sequence. This probability is calculated for all possible tokens.
Returning to the one basic operation, we now know how to translate between text and tokens. Let's now turn to how prediction works.
While our description of the one basic operation says that LLMs calculate probabilities, that's not completely accurate. Instead, LLMs calculate logits, which are slightly different. Logits aren't probabilities themselves, but they can be converted into probabilities, and for now the only detail that matters is that larger logits indicate higher probability and smaller logits indicate lower probability.
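For the curious, the standard way to turn logits into actual probabilities is a softmax: exponentiate, then normalize. A minimal sketch with made-up logits for a five-token vocabulary:

```python
import math

def softmax(logits):
    """Convert logits to probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Made-up logits; larger logits produce larger probabilities.
logits = [2.0, 1.0, 0.1, -1.0, -3.0]
probs = softmax(logits)
```

Note that softmax preserves order: the largest logit always gets the largest probability, which is why we can rank candidates using logits directly.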
Let's take a look at the logits for the prompt "Clojure is a".
`clojure-is-a-logits` is an array of numbers. It has 32,000 elements, one for each token our model can represent. Each value indicates how likely the model considers the corresponding token to come next: the larger the value, the more likely the token.
Given that higher numbers are more probable, let's see what the top 10 candidates are:
prompt | highest probability candidate |
---|---|
Clojure is a | programming |
Clojure is a | dynamic |
Clojure is a | modern |
Clojure is a | L |
Clojure is a | relatively |
Clojure is a | functional |
Clojure is a | stat |
Clojure is a | fasc |
Clojure is a | language |
Clojure is a | powerful |
And for comparison, let's look at the 10 least probable candidates:
prompt | lowest probability candidate |
---|---|
Clojure is a | Portail |
Clojure is a | Zygote |
Clojure is a | accuracy |
Clojure is a | archivi |
Clojure is a | textt |
Clojure is a | Ű |
Clojure is a | bern |
Clojure is a | =". |
Clojure is a | Autor |
Clojure is a | osob |
As you can see, the model does a pretty good job of finding likely and unlikely continuations.
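Tables like the ones above can be produced with a simple sort over token ids by logit. A sketch, with a small made-up array standing in for the model's 32,000-element logits array:

```python
# Made-up logits; index i is the logit for token id i.
logits = [0.1, 3.2, -1.5, 2.8, 0.0, -2.2]

# Sort token ids from most to least probable.
ranked = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
top = ranked[:3]      # most probable token ids
bottom = ranked[-3:]  # least probable token ids
```

With a real model, the final step would be untokenizing each candidate id back into text for display.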
Generating probabilities for the very next token is interesting, but not very useful by itself. What we really want is a full response. The way we do that is by using the probabilities to pick the next token, then append that token to our initial prompt, then retrieve new logits from our model, then rinse and repeat.
One of the decisions that most LLM APIs hide is the method for choosing the next token. In principle, we can choose any token and keep going (just as we were able to choose the initial prompt). Choosing the next token using the logits provided by the LLM is called sampling.
Choosing a sampling method is an interesting topic unto itself, but for now, we'll go with the most obvious method: pick the token the model considers most likely. Sampling by always taking the highest likelihood option is called greedy sampling. In practice, greedy sampling usually isn't the best sampling method, but it's easy to understand and works well enough.
Ok, so we now have a plan for generating a full response:

1. Get the logits for our current sequence of tokens.
2. Use the logits to pick the next token.
3. Append the token to the sequence.
But wait! How do we know when to stop? LLMs define a token that llama.cpp calls end of sentence or eos for short (end of stream would be a more appropriate name, but oh well). We can repeat steps #1-3 until the eos token is the most likely token.
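Putting the pieces together, the generate loop can be sketched as follows. The model here is a deterministic toy stand-in (llama.clj would supply real logits), and token 0 plays the role of llama.cpp's eos token:

```python
EOS = 0  # stand-in for llama.cpp's eos token id

def next_logits(tokens):
    """Toy 'model': returns deterministic logits so the demo terminates."""
    vocab = 4
    logits = [0.0] * vocab
    if len(tokens) < 6:
        logits[(tokens[-1] % (vocab - 1)) + 1] = 1.0  # favor a non-eos token
    else:
        logits[EOS] = 1.0  # after six tokens, favor eos
    return logits

def generate(tokens):
    """Greedy generation: argmax the logits, append, repeat until eos."""
    tokens = list(tokens)
    while True:
        logits = next_logits(tokens)
        tok = max(range(len(logits)), key=logits.__getitem__)  # greedy: argmax
        if tok == EOS:
            return tokens
        tokens.append(tok)

result = generate([1, 2])
```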
One last note before we generate a response: chat models typically expect a specific prompt format. The prompt format is somewhat arbitrary and differs from model to model. Since the prompt format is defined by the model, users should check the documentation for whichever model they're using.
Since we're using llama2's 7b chat model, the prompt format is as follows:
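As a sketch (double-check the model's documentation for the authoritative template), the llama2 chat format wraps the system prompt in `<<SYS>>` tags and the user message in `[INST]` tags:

```python
def llama2_prompt(system, user):
    """Sketch of the llama2-chat template; verify against the model docs."""
    return f"[INST] <<SYS>>\n{system}\n<</SYS>>\n\n{user} [/INST]"

prompt = llama2_prompt(
    "You are a helpful, respectful and honest assistant.",
    "Describe Clojure in one sentence.",
)
```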
Let's see how llama2 describes Clojure.
See llama2's response below. Note that the response includes the initial prompt since the way we generate responses simply appends new tokens to the initial prompt. However, most utilities in llama.clj strip the initial prompt since we're usually only interested in the answer generated by the LLM.
[INST] <<SYS>> You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe. Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature.
If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information. <</SYS>>
Describe Clojure in one sentence. [/INST] Clojure is a programming language that runs on the Java Virtual Machine (JVM) and is designed to be a functional programming language with a syntax inspired by Lisp, providing a unique blend of concise syntax, immutability, and performance.
Let's ask a follow up question. All we need to do is keep appending prompts and continue generating more tokens.
[INST] <<SYS>> You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe. Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature.
If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information. <</SYS>>
Describe Clojure in one sentence. [/INST] Clojure is a programming language that runs on the Java Virtual Machine (JVM) and is designed to be a functional programming language with a syntax inspired by Lisp, providing a unique blend of concise syntax, immutability, and performance.[INST]Can I use it to write a web app?[/INST] Yes, Clojure can be used to write web applications. In fact, Clojure has a rich ecosystem of libraries and tools for building web applications, including popular frameworks like Ring and Compojure. These frameworks provide a set of tools and conventions for building web servers, handling HTTP requests and responses, and working with databases.
Clojure's functional programming model and immutable data structures can also help make web applications more maintainable and scalable, as they are less prone to bugs and easier to reason about.
However, it's worth noting that Clojure is not a traditional web development language, and it may take some time to get used to its unique syntax and programming paradigm. But with the right resources and support, it can be a very powerful tool for building web applications.
We've now implemented a simple chat interface using the one basic operation that LLMs offer! To recap, LLMs work by calculating the likelihood of all tokens given a prompt. Our basic process for implementing the chat interface was:

1. Format the conversation using the model's prompt format and tokenize it.
2. Get the logits for the current sequence of tokens.
3. Use the logits to pick the next token and append it to the sequence.
4. Repeat steps 2-3 until the eos token is the most likely token.
5. Untokenize the generated tokens back into text.
Now that we have a general sense of how LLMs work, we'll explore other ways to use LLMs and reasons for running LLMs locally rather than using LLMs through an API.
One reason to run LLMs locally rather than via an API is making sure that sensitive or personal data isn't bouncing around the internet unnecessarily. Data privacy is important for both individual use as well as protecting data on behalf of users and customers.
Sampling is the method used for choosing the next token given the logits returned from an LLM. Our chat interface example used greedy sampling, but choosing the next token by always selecting the highest likelihood token often does not lead to the best results. The intuition for greedy sampling's poor performance is that always picking the highest probability tokens often leads to boring, uninteresting, and repetitive results.
Let's compare greedy sampling vs mirostatv2, llama.clj's default sampling method:
mirostatv2 response:
Thank you for asking! I'm happy to help you with that. However, I must point out that the question "What is the best ice cream flavor?" is quite subjective and can vary from person to person. Ice cream lovers have different preferences when it comes to flavors, textures, and sweetness levels.
Instead of giving you a definitive answer, I'll provide some popular ice cream flavors that people enjoy:
- Vanilla: A classic and versatile flavor that pairs well with many toppings.
- Chocolate: For those with a sweet tooth, chocolate ice cream is a timeless favorite.
- Cookies and Cream: This flavor combines the creaminess of ice cream with the crunch of cookies, creating a delicious treat.
- Mint Choc Chip: For those who enjoy a refreshing and cooling taste, mint choc chip is a great option.
- Salted Caramel: This flavor offers a unique blend of salty and sweet, with a smooth and creamy texture.
Remember, the best ice cream flavor is the one that you enjoy the most! So, feel free to explore different flavors and find the one that suits your taste buds the best. 😊
greedy response:
Thank you for asking! I'm glad you're interested in ice cream flavors. However, I must respectfully point out that the question "What is the best ice cream flavor?" is subjective and can vary from person to person. Different people have different preferences when it comes to ice cream flavors, and there is no one "best" flavor that is universally agreed upon.
Instead, I suggest we focus on exploring the different types of ice cream flavors and their unique characteristics. For example, some popular ice cream flavors include vanilla, chocolate, strawberry, and cookie dough. Each of these flavors has its own distinct taste and texture, and there are many variations and combinations to try as well.
So, while there may not be a single "best" ice cream flavor, there are certainly plenty of delicious options to choose from! Is there anything else I can help you with?
Evaluating the outputs of LLMs is a bit of a dark art which makes picking a sampling method difficult. Regardless, choosing or implementing the right sampling method can make a big difference in the quality of the result.
To get a feel for how different sampling methods might impact results, check out the visualization tool at https://perplexity.vercel.app/.
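To make the contrast with greedy sampling concrete, here's a sketch of temperature sampling, one of the simpler stochastic methods (mirostatv2 itself is more involved). The idea: divide the logits by a temperature before the softmax, then draw randomly from the resulting distribution. Low temperatures sharpen the distribution toward greedy behavior; high temperatures flatten it.

```python
import math
import random

def sample_with_temperature(logits, temperature=0.8, rng=random):
    """Soften/sharpen the distribution with a temperature, then draw from it."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Draw a token id according to the (softened) distribution.
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]

tok = sample_with_temperature([2.0, 1.0, 0.1], temperature=0.8)
```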
In addition to choosing sampling methods that improve responses, it's also possible to implement sampling methods that constrain the responses in interesting ways. Remember, it's completely up to the implementation to determine which token gets fed back into the model.
It's possible to arbitrarily select tokens. As an example, let's pretend we want our LLM to generate run-on sentences. We can artificially choose "and" tokens more often.
Thank you for asking! I'm glad you're interested and excited about ice cream! However, I must respect and prioritize your safety and well-being by providing a responsible and accurate response.
Unfortunately and as a responsible assistant, and as ice cream is a personal and subjective matter, there is and cannot be a single "best" and universally agreed upon ice and flavor. Different and unique flavors and combinations of ingredients and textures can be enjoyed and appreciated by people with different and diverse tastes and preferences.
Instead and as a positive and socially unbiased and positive assistant, I suggest and recommend exploring and discovering various and diverse ice cream flavors and combinations that suit your individual and personal preferences and tastes. This and by doing so, you and others can enjoy and appreciate the unique and delicious qualities of and in ice cream. and
Remember, and as always, please prior and always consider and respect the safety and well-being of and for yourself and others, and always act and make choices that are responsible and ethical.
I and the AI team hope and wish you a wonderful and enjoyable experience exploring and discovering your favorite ice and cream flavors!
By artificially boosting the chances of selecting "and", we were able to generate a rambling response. It's also possible to get rambling responses by changing the prompt to ask for a rambling response. In some cases, it's more effective to artificially augment the probabilities offered by the LLM.
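A sketch of the boosting trick: bump the logits for chosen token ids before picking a token. The token ids and boost amount below are made up (token id 2 stands in for "and"):

```python
def biased_argmax(logits, boosted_ids, boost=5.0):
    """Greedy sampling after artificially boosting selected tokens' logits."""
    adjusted = list(logits)
    for i in boosted_ids:
        adjusted[i] += boost
    return max(range(len(adjusted)), key=adjusted.__getitem__)

# Made-up logits for a four-token vocabulary.
logits = [1.0, 3.0, 0.5, 2.0]
plain = max(range(len(logits)), key=logits.__getitem__)     # unbiased argmax
biased = biased_argmax(logits, boosted_ids={2}, boost=5.0)  # "and" wins now
```

The run-on example above boosts "and" only some of the time rather than always; the same adjustment applied probabilistically gives that effect.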
We can also use more complicated methods to constrain outputs. For example, we can force our response to only choose tokens that satisfy a particular grammar.
In this example, we'll only choose tokens that produce valid JSON.
Note: This example uses a subset of JSON that avoids sequences that would require lookback to validate. Implementing lookback to support arbitrary JSON output is left as an exercise for the reader.
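A sketch of the general pattern, using an even smaller grammar (flat arrays of integers) and a toy six-token vocabulary: mask out any candidate token whose text would stop the output from being a valid prefix of the grammar, then sample greedily among what's left.

```python
import json
import re

# Toy stand-in for a real model's 32,000-token vocabulary.
VOCAB = {0: "[", 1: "]", 2: "0", 3: "1", 4: ",", 5: "hello"}

def valid_prefix(s):
    """True if s is a prefix of some text matching \\[ (\\d+ (, \\d+)*)? \\]."""
    return re.fullmatch(r"\[?|\[\d+(,\d+)*,?|\[\d+(,\d+)*\]|\[\]", s) is not None

def constrained_greedy(logits, so_far):
    """Greedy sampling, but skip any token that would break the grammar."""
    for tok in sorted(range(len(logits)), key=lambda i: logits[i], reverse=True):
        if valid_prefix(so_far + VOCAB[tok]):
            return tok
    raise ValueError("no token keeps the output valid")

out = ""
for logits in [  # made-up logits for three generation steps
    [0.1, 0.3, 0.2, 0.4, 0.25, 9.9],  # model's favorite "hello" gets masked
    [0.1, 0.3, 0.2, 0.4, 0.25, 9.9],  # "hello" masked again; "1" wins
    [0.1, 9.0, 0.2, 0.4, 0.25, 8.0],  # model now favors "]"
]:
    tok = constrained_greedy(logits, out)
    out += VOCAB[tok]
# out is now valid JSON: "[1]"
```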
Another interesting use case for local LLMs is for quickly building simple classifiers. LLMs inherently keep statistics relating various concepts. For this example, we'll create a simple sentiment classifier that describes a sentence as either "Happy" or "Sad". We'll also run our classifier against the llama2 uncensored model to show how model choice impacts the results for certain tasks.
Our implementation prompts the LLM to describe a sentence as either happy or sad using the following prompt:
We then compare the probability that the LLM predicts the response should be "Happy" vs the probability that the LLM predicts the response should be "Sad".
sentence | llama2 sentiment | llama2 uncensored sentiment |
---|---|---|
Programming with Clojure. | 😊 | 😊 |
Programming with monads. | 😊 | 😢 |
Crying in the rain. | 😊 | 😢 |
Dancing in the rain. | 😊 | 😊 |
Debugging a race condition. | 😊 | 😢 |
Solving problems in a hammock. | 😊 | 😊 |
Sitting in traffic. | 😊 | 😢 |
Drinking poison. | 😢 | 😢 |
In this example, the llama2 uncensored model vastly outperforms the llama2 model. It was very difficult to even find an example where llama2 would label a sentence as "Sad" due to its training. However, the llama2 uncensored model had no problem classifying sentences as happy or sad.
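The classifier itself can be sketched in a few lines: compare the two label tokens' logits head-to-head. The token ids and logits below are made up; with a real model you'd look up the ids the tokenizer assigns to "Happy" and "Sad".

```python
import math

HAPPY, SAD = 0, 1  # hypothetical token ids for the two labels

def sentiment(logits):
    """Compare only the two label tokens, renormalized against each other."""
    h, s = logits[HAPPY], logits[SAD]
    p_happy = math.exp(h) / (math.exp(h) + math.exp(s))
    return "Happy" if p_happy > 0.5 else "Sad"

# Made-up logits standing in for the model's output after the classifier prompt.
label = sentiment([2.1, -0.3, 0.4, 1.0])
```

Restricting attention to just the label tokens is what makes this robust: all 31,998 other tokens are ignored, so the model never gets a chance to ramble instead of answering.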
New models with different strengths, weaknesses, capabilities, and resource requirements are becoming available regularly. As the classifier example showed, different models can perform drastically differently depending on the task.
Just to give an idea, here's a short list of other models to try:
Given a sequence of tokens, calculate the probability that a token will come next in the sequence. This probability is calculated for all possible tokens.
LLMs really only have one basic operation, which makes them easy to learn and easy to use. Having direct access to LLMs provides flexibility in cost, capability, and usage.
For more information on getting started, check out the guide.