This guide covers the what, how, and why of running LLMs locally using llama.clj, a Clojure wrapper for the llama.cpp library.
Large language models (LLMs) are tools that are quickly growing in popularity. Typically, they are used via an API or service. However, many models are available to download and run locally even with modest hardware.
From the perspective of using an LLM, there's really only one basic operation:
Given a sequence of tokens, calculate the probability that a token will come next in the sequence. This probability is calculated for all possible tokens.
That's basically it. All other usage derives from this one basic operation.
If you've interacted with an LLM, it was probably through one of the various chat interfaces. Before exploring other uses of local LLMs, we'll first explain how a chat interface can be implemented.
Keen readers may have already noticed that chat interfaces work with text, but LLMs work with tokens. Choosing how to bridge the gap between text and tokens is an interesting topic for creating LLMs, but it's not important for understanding how to run LLMs locally. All we need to know is that text can be tokenized into tokens and vice versa.
Just to get a sense of the differences between tokens and text, let's look at how the llama2 7b chat model tokenizes text.
One thing to notice is that there are fewer tokens than letters:
If we untokenize each token, we can see that tokens are often whole words, but not always.
Just to get a feel for a typical tokenizer, we'll look at some basic stats.
Number of tokens:
The longest token:
Token with the most spaces:
One last caveat to watch out for when converting between tokens and text is that not every token produces a valid utf-8 string. It may require multiple tokens before a valid utf-8 string is available.
Fortunately, llama.clj has a utility for untokenizing that will take care of the issue:
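To illustrate the underlying idea, here's a minimal sketch in Python (not llama.clj's actual implementation) of an untokenizer that buffers raw bytes until they form valid utf-8. The toy vocabulary below is made up; real models define their own token-to-bytes mapping.

```python
# Toy mapping from token ids to raw bytes. Tokens 1 and 2 each hold
# half of a 4-byte emoji, so neither is valid utf-8 on its own.
TOKEN_BYTES = {
    0: b"Hi ",
    1: b"\xf0\x9f",  # first half of a 4-byte utf-8 sequence
    2: b"\x98\x8a",  # second half
}

def untokenize(tokens):
    """Accumulate raw bytes and only decode once the buffer is valid utf-8."""
    buf = b""
    out = []
    for tok in tokens:
        buf += TOKEN_BYTES[tok]
        try:
            out.append(buf.decode("utf-8"))
            buf = b""
        except UnicodeDecodeError:
            pass  # incomplete sequence; wait for more bytes
    return "".join(out)

print(untokenize([0, 1, 2]))  # "Hi 😊"
```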
Given a sequence of tokens, calculate the probability that a token will come next in the sequence. This probability is calculated for all possible tokens.
Returning to the one basic operation, we now know how to translate between text and tokens. Let's now turn to how prediction works.
While our description of the one basic operation says that LLMs calculate probabilities, that's not completely accurate. Instead, LLMs calculate logits, which are slightly different. Logits aren't probabilities themselves, but they can be converted into probabilities, and for now the only detail that matters is that larger logits indicate higher probability and smaller logits indicate lower probability.
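For the curious, the standard way to turn logits into actual probabilities is a softmax: exponentiate, then normalize. A minimal sketch with made-up logits for a five-token vocabulary:

```python
import math

def softmax(logits):
    """Convert logits to probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Made-up logits; larger logits produce larger probabilities.
logits = [2.0, 1.0, 0.1, -1.0, -3.0]
probs = softmax(logits)
```

Note that softmax preserves order: the largest logit always gets the largest probability, which is why we can rank candidates using logits directly.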
Let's take a look at the logits for the prompt "Clojure is a".
`clojure-is-a-logits` is an array of numbers. It has 32,000 elements, one for each token our model can represent. Each value indicates how likely the model considers the corresponding token to come next: the larger the value, the more likely the token.
Given that higher numbers are more probable, let's see what the top 10 candidates are:
prompt | highest probability candidate |
---|---|
Clojure is a | programming |
Clojure is a | dynamic |
Clojure is a | modern |
Clojure is a | L |
Clojure is a | relatively |
Clojure is a | functional |
Clojure is a | stat |
Clojure is a | fasc |
Clojure is a | language |
Clojure is a | powerful |
And for comparison, let's look at the 10 least probable candidates:
prompt | lowest probability candidate |
---|---|
Clojure is a | Portail |
Clojure is a | Zygote |
Clojure is a | accuracy |
Clojure is a | archivi |
Clojure is a | textt |
Clojure is a | Ű |
Clojure is a | bern |
Clojure is a | =". |
Clojure is a | Autor |
Clojure is a | osob |
As you can see, the model does a pretty good job of finding likely and unlikely continuations.
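Tables like the ones above can be produced with a simple sort over token ids by logit. A sketch, with a small made-up array standing in for the model's 32,000-element logits array:

```python
# Made-up logits; index i is the logit for token id i.
logits = [0.1, 3.2, -1.5, 2.8, 0.0, -2.2]

# Sort token ids from most to least probable.
ranked = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
top = ranked[:3]      # most probable token ids
bottom = ranked[-3:]  # least probable token ids
```

With a real model, the final step would be untokenizing each candidate id back into text for display.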
Generating probabilities for the very next token is interesting, but not very useful by itself. What we really want is a full response. The way we do that is by using the probabilities to pick the next token, then append that token to our initial prompt, then retrieve new logits from our model, then rinse and repeat.
One of the decisions that most LLM APIs hide is the method for choosing the next token. In principle, we can choose any token and keep going (just as we were able to choose the initial prompt). Choosing the next token using the logits provided by the LLM is called sampling.
Choosing a sampling method is an interesting topic unto itself, but for now, we'll go with the most obvious method: pick the token the model considers most likely. Sampling by always taking the highest likelihood option is called greedy sampling. In practice, greedy sampling usually isn't the best sampling method, but it's easy to understand and works well enough.
Ok, so we now have a plan for generating a full response:

1. Get the logits for our current sequence of tokens.
2. Use the logits to pick the next token.
3. Append the token to the sequence.
But wait! How do we know when to stop? LLMs define a token that llama.cpp calls end of sentence or eos for short (end of stream would be a more appropriate name, but oh well). We can repeat steps #1-3 until the eos token is the most likely token.
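Putting the pieces together, the generate loop can be sketched as follows. The model here is a deterministic toy stand-in (llama.clj would supply real logits), and token 0 plays the role of llama.cpp's eos token:

```python
EOS = 0  # stand-in for llama.cpp's eos token id

def next_logits(tokens):
    """Toy 'model': returns deterministic logits so the demo terminates."""
    vocab = 4
    logits = [0.0] * vocab
    if len(tokens) < 6:
        logits[(tokens[-1] % (vocab - 1)) + 1] = 1.0  # favor a non-eos token
    else:
        logits[EOS] = 1.0  # after six tokens, favor eos
    return logits

def generate(tokens):
    """Greedy generation: argmax the logits, append, repeat until eos."""
    tokens = list(tokens)
    while True:
        logits = next_logits(tokens)
        tok = max(range(len(logits)), key=logits.__getitem__)  # greedy: argmax
        if tok == EOS:
            return tokens
        tokens.append(tok)

result = generate([1, 2])
```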
One last note before we generate a response: chat models typically expect a specific prompt format. The prompt format is somewhat arbitrary and differs from model to model. Since the prompt format is defined by the model, users should check the documentation for whichever model they're using.
Since we're using llama2's 7b chat model, the prompt format is as follows:
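As a sketch (double-check the model's documentation for the authoritative template), the llama2 chat format wraps the system prompt in `<<SYS>>` tags and the user message in `[INST]` tags:

```python
def llama2_prompt(system, user):
    """Sketch of the llama2-chat template; verify against the model docs."""
    return f"[INST] <<SYS>>\n{system}\n<</SYS>>\n\n{user} [/INST]"

prompt = llama2_prompt(
    "You are a helpful, respectful and honest assistant.",
    "Describe Clojure in one sentence.",
)
```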
Let's see how llama2 describes Clojure.
See llama2's response below. Note that the response includes the initial prompt since the way we generate responses simply appends new tokens to the initial prompt. However, most utilities in llama.clj strip the initial prompt since we're usually only interested in the answer generated by the LLM.
[INST] <<SYS>> You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe. Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature.
If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information. <</SYS>>
Describe Clojure in one sentence. [/INST] Clojure is a programming language that runs on the Java Virtual Machine (JVM) and is designed to be a functional programming language with a syntax inspired by Lisp, providing a unique blend of concise syntax, immutability, and performance.
Let's ask a follow up question. All we need to do is keep appending prompts and continue generating more tokens.
[INST] <<SYS>> You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe. Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature.
If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information. <</SYS>>
Describe Clojure in one sentence. [/INST] Clojure is a programming language that runs on the Java Virtual Machine (JVM) and is designed to be a functional programming language with a syntax inspired by Lisp, providing a unique blend of concise syntax, immutability, and performance.[INST]Can I use it to write a web app?[/INST] Yes, Clojure can be used to write web applications. In fact, Clojure has a rich ecosystem of libraries and tools for building web applications, including popular frameworks like Ring and Compojure. These frameworks provide a set of tools and conventions for building web servers, handling HTTP requests and responses, and working with databases.
Clojure's functional programming model and immutable data structures can also help make web applications more maintainable and scalable, as they are less prone to bugs and easier to reason about.
However, it's worth noting that Clojure is not a traditional web development language, and it may take some time to get used to its unique syntax and programming paradigm. But with the right resources and support, it can be a very powerful tool for building web applications.
We've now implemented a simple chat interface using the one basic operation that LLMs offer! To recap, LLMs work by calculating the likelihood of all tokens given a prompt. Our basic process for implementing the chat interface was:

1. Format the conversation using the model's prompt format and tokenize it.
2. Get the logits for the current sequence of tokens.
3. Use the logits to pick the next token and append it to the sequence.
4. Repeat steps 2-3 until the eos token is the most likely token.
5. Untokenize the generated tokens back into text.
Now that we have a general sense of how LLMs work, we'll explore other ways to use LLMs and reasons for running LLMs locally rather than using LLMs through an API.
One reason to run LLMs locally rather than via an API is making sure that sensitive or personal data isn't bouncing around the internet unnecessarily. Data privacy is important for both individual use as well as protecting data on behalf of users and customers.
Sampling is the method used for choosing the next token given the logits returned from an LLM. Our chat interface example used greedy sampling, but choosing the next token by always selecting the highest likelihood token often does not lead to the best results. The intuition for greedy sampling's poor performance is that always picking the highest probability tokens often leads to boring, uninteresting, and repetitive results.
Let's compare greedy sampling vs mirostatv2, llama.clj's default sampling method:
mirostatv2 response:
Thank you for asking! I'm happy to help you with that. However, I must point out that the question "What is the best ice cream flavor?" is quite subjective and can vary from person to person. Ice cream lovers have different preferences when it comes to flavors, textures, and sweetness levels.
Instead of giving you a definitive answer, I'll provide some popular ice cream flavors that people enjoy:
- Vanilla: A classic and versatile flavor that pairs well with many toppings.
- Chocolate: For those with a sweet tooth, chocolate ice cream is a timeless favorite.
- Cookies and Cream: This flavor combines the creaminess of ice cream with the crunch of cookies, creating a delicious treat.
- Mint Choc Chip: For those who enjoy a refreshing and cooling taste, mint choc chip is a great option.
- Salted Caramel: This flavor offers a unique blend of salty and sweet, with a smooth and creamy texture.
Remember, the best ice cream flavor is the one that you enjoy the most! So, feel free to explore different flavors and find the one that suits your taste buds the best. 😊
greedy response:
Thank you for asking! I'm glad you're interested in ice cream flavors. However, I must respectfully point out that the question "What is the best ice cream flavor?" is subjective and can vary from person to person. Different people have different preferences when it comes to ice cream flavors, and there is no one "best" flavor that is universally agreed upon.
Instead, I suggest we focus on exploring the different types of ice cream flavors and their unique characteristics. For example, some popular ice cream flavors include vanilla, chocolate, strawberry, and cookie dough. Each of these flavors has its own distinct taste and texture, and there are many variations and combinations to try as well.
So, while there may not be a single "best" ice cream flavor, there are certainly plenty of delicious options to choose from! Is there anything else I can help you with?
Evaluating the outputs of LLMs is a bit of a dark art which makes picking a sampling method difficult. Regardless, choosing or implementing the right sampling method can make a big difference in the quality of the result.
To get a feel for how different sampling methods might impact results, check out the visualization tool at https://perplexity.vercel.app/.
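To make the contrast with greedy sampling concrete, here's a sketch of temperature sampling, one of the simpler stochastic methods (mirostatv2 itself is more involved). The idea: divide the logits by a temperature before the softmax, then draw randomly from the resulting distribution. Low temperatures sharpen the distribution toward greedy behavior; high temperatures flatten it.

```python
import math
import random

def sample_with_temperature(logits, temperature=0.8, rng=random):
    """Soften/sharpen the distribution with a temperature, then draw from it."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Draw a token id according to the (softened) distribution.
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]

tok = sample_with_temperature([2.0, 1.0, 0.1], temperature=0.8)
```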
In addition to choosing sampling methods that improve responses, it's also possible to implement sampling methods that constrain the responses in interesting ways. Remember, it's completely up to the implementation to determine which token gets fed back into the model.
It's possible to arbitrarily select tokens. As an example, let's pretend we want our LLM to generate run-on sentences. We can artificially choose "and" tokens more often.
Thank you for asking! I'm glad you're interested and excited about ice cream! However, I must respect and prioritize your safety and well-being by providing a responsible and accurate response.
Unfortunately and as a responsible assistant, and as ice cream is a personal and subjective matter, there is and cannot be a single "best" and universally agreed upon ice and flavor. Different and unique flavors and combinations of ingredients and textures can be enjoyed and appreciated by people with different and diverse tastes and preferences.
Instead and as a positive and socially unbiased and positive assistant, I suggest and recommend exploring and discovering various and diverse ice cream flavors and combinations that suit your individual and personal preferences and tastes. This and by doing so, you and others can enjoy and appreciate the unique and delicious qualities of and in ice cream. and
Remember, and as always, please prior and always consider and respect the safety and well-being of and for yourself and others, and always act and make choices that are responsible and ethical.
I and the AI team hope and wish you a wonderful and enjoyable experience exploring and discovering your favorite ice and cream flavors!
By artificially boosting the chances of selecting "and", we were able to generate a rambling response. It's also possible to get rambling responses by changing the prompt to ask for a rambling response. In some cases, it's more effective to artificially augment the probabilities offered by the LLM.
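A sketch of the boosting trick: bump the logits for chosen token ids before picking a token. The token ids and boost amount below are made up (token id 2 stands in for "and"):

```python
def biased_argmax(logits, boosted_ids, boost=5.0):
    """Greedy sampling after artificially boosting selected tokens' logits."""
    adjusted = list(logits)
    for i in boosted_ids:
        adjusted[i] += boost
    return max(range(len(adjusted)), key=adjusted.__getitem__)

# Made-up logits for a four-token vocabulary.
logits = [1.0, 3.0, 0.5, 2.0]
plain = max(range(len(logits)), key=logits.__getitem__)     # unbiased argmax
biased = biased_argmax(logits, boosted_ids={2}, boost=5.0)  # "and" wins now
```

The run-on example above boosts "and" only some of the time rather than always; the same adjustment applied probabilistically gives that effect.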
We can also use more complicated methods to constrain outputs. For example, we can force our response to only choose tokens that satisfy a particular grammar.
In this example, we'll only choose tokens that produce valid JSON.
Note: This example uses a subset of JSON that avoids sequences that would require lookback to validate. Implementing lookback to support arbitrary JSON output is left as an exercise for the reader.
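A sketch of the general pattern, using an even smaller grammar (flat arrays of integers) and a toy six-token vocabulary: mask out any candidate token whose text would stop the output from being a valid prefix of the grammar, then sample greedily among what's left.

```python
import json
import re

# Toy stand-in for a real model's 32,000-token vocabulary.
VOCAB = {0: "[", 1: "]", 2: "0", 3: "1", 4: ",", 5: "hello"}

def valid_prefix(s):
    """True if s is a prefix of some text matching \\[ (\\d+ (, \\d+)*)? \\]."""
    return re.fullmatch(r"\[?|\[\d+(,\d+)*,?|\[\d+(,\d+)*\]|\[\]", s) is not None

def constrained_greedy(logits, so_far):
    """Greedy sampling, but skip any token that would break the grammar."""
    for tok in sorted(range(len(logits)), key=lambda i: logits[i], reverse=True):
        if valid_prefix(so_far + VOCAB[tok]):
            return tok
    raise ValueError("no token keeps the output valid")

out = ""
for logits in [  # made-up logits for three generation steps
    [0.1, 0.3, 0.2, 0.4, 0.25, 9.9],  # model's favorite "hello" gets masked
    [0.1, 0.3, 0.2, 0.4, 0.25, 9.9],  # "hello" masked again; "1" wins
    [0.1, 9.0, 0.2, 0.4, 0.25, 8.0],  # model now favors "]"
]:
    tok = constrained_greedy(logits, out)
    out += VOCAB[tok]
# out is now valid JSON: "[1]"
```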
Another interesting use case for local LLMs is for quickly building simple classifiers. LLMs inherently keep statistics relating various concepts. For this example, we'll create a simple sentiment classifier that describes a sentence as either "Happy" or "Sad". We'll also run our classifier against the llama2 uncensored model to show how model choice impacts the results for certain tasks.
Our implementation prompts the LLM to describe a sentence as either happy or sad using the following prompt:
We then compare the probability that the LLM predicts the response should be "Happy" vs the probability that the LLM predicts the response should be "Sad".
sentence | llama2 sentiment | llama2 uncensored sentiment |
---|---|---|
Programming with Clojure. | 😊 | 😊 |
Programming with monads. | 😊 | 😢 |
Crying in the rain. | 😊 | 😢 |
Dancing in the rain. | 😊 | 😊 |
Debugging a race condition. | 😊 | 😢 |
Solving problems in a hammock. | 😊 | 😊 |
Sitting in traffic. | 😊 | 😢 |
Drinking poison. | 😢 | 😢 |
In this example, the llama2 uncensored model vastly outperforms the llama2 model. It was very difficult to even find an example where llama2 would label a sentence as "Sad" due to its training. However, the llama2 uncensored model had no problem classifying sentences as happy or sad.
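The classifier itself can be sketched in a few lines: compare the two label tokens' logits head-to-head. The token ids and logits below are made up; with a real model you'd look up the ids the tokenizer assigns to "Happy" and "Sad".

```python
import math

HAPPY, SAD = 0, 1  # hypothetical token ids for the two labels

def sentiment(logits):
    """Compare only the two label tokens, renormalized against each other."""
    h, s = logits[HAPPY], logits[SAD]
    p_happy = math.exp(h) / (math.exp(h) + math.exp(s))
    return "Happy" if p_happy > 0.5 else "Sad"

# Made-up logits standing in for the model's output after the classifier prompt.
label = sentiment([2.1, -0.3, 0.4, 1.0])
```

Restricting attention to just the label tokens is what makes this robust: all 31,998 other tokens are ignored, so the model never gets a chance to ramble instead of answering.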
New models with different strengths, weaknesses, capabilities, and resource requirements are becoming available regularly. As the classifier example showed, different models can perform drastically differently depending on the task.
Just to give an idea, here's a short list of other models to try:
Given a sequence of tokens, calculate the probability that a token will come next in the sequence. This probability is calculated for all possible tokens.
LLMs really only have one basic operation, which makes them easy to learn and easy to use. Having direct access to LLMs provides flexibility in cost, capability, and usage.
For more information on getting started, check out the guide.