This guide covers the what, how, and why of running LLMs locally using llama.clj, a Clojure wrapper for the llama.cpp library.
Large language models (LLMs) are tools that are quickly growing in popularity. Typically, they are used via an API or service. However, many models are available to download and run locally even with modest hardware.
From the perspective of using an LLM, there's really only one basic operation:
Given a sequence of tokens, calculate the probability that a token will come next in the sequence. This probability is calculated for all possible tokens.
That's basically it. All other usage derives from this one basic operation.
If you've interacted with an LLM, it was probably through one of the various chat interfaces. Before exploring other uses of local LLMs, we'll first explain how a chat interface can be implemented.
Keen readers may have already noticed that chat interfaces work with text, but LLMs work with tokens. Choosing how to bridge the gap between text and tokens is an interesting topic for creating LLMs, but it's not important for understanding how to run LLMs locally. All we need to know is that text can be tokenized into tokens and vice versa.
Just to get a sense of the differences between tokens and text, let's look at how the llama2 7b chat model tokenizes text.
One thing to notice is that there are fewer tokens than letters:
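Here's a rough sketch of that in llama.clj. The model path is a placeholder for wherever you've downloaded the model, and I'm assuming the `tokenize` helper lives in `com.phronemophobic.llama.util`; check the API docs for your version.

```clojure
(require '[com.phronemophobic.llama :as llama]
         '[com.phronemophobic.llama.util :as llutil])

;; placeholder path to a locally downloaded llama2 7b chat model
(def model-path "models/llama-2-7b-chat.Q4_0.gguf")
(def ctx (llama/create-context model-path))

(def sentence "The quick brown fox jumped over the lazy dog.")
(def tokens (llutil/tokenize ctx sentence))

(count sentence) ;; number of characters
(count tokens)   ;; noticeably fewer tokens than characters
```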
If we untokenize each token, we can see that tokens are often whole words, but not always.
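Untokenizing one token at a time (same assumptions and `ctx` as above) is a quick way to see that:

```clojure
;; untokenize each token individually to see how the text was split up
(mapv (fn [token]
        (llutil/untokenize ctx [token]))
      tokens)
;; => mostly whole words, with the occasional word split across tokens
```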
Just to get a feel for a typical tokenizer, we'll look at some basic stats.
Number of tokens:
The longest token:
Token with the most spaces:
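Here's one way those stats could be computed, by untokenizing every token id in the vocabulary. The 32,000 vocabulary size is specific to the llama2 models; the helper names are the same assumptions as above.

```clojure
;; llama2's vocabulary has 32,000 tokens
(def n-vocab 32000)

;; the text for each token id (some ids may not round-trip to valid text)
(def token-strs
  (mapv (fn [token] (llutil/untokenize ctx [token]))
        (range n-vocab)))

;; the longest token
(apply max-key count token-strs)

;; the token with the most spaces
(apply max-key #(count (filter #{\space} %)) token-strs)
```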
One last caveat to watch out for when converting between tokens and text is that not every individual token produces a valid utf-8 string. It may require multiple tokens before a valid utf-8 string is available.
Fortunately, llama.clj has a utility for untokenizing that will take care of the issue:
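For example, a string containing multi-byte characters round-trips cleanly when the whole token sequence is untokenized at once (a sketch, same assumptions as above):

```clojure
;; untokenizing token-by-token can yield invalid utf-8 fragments,
;; but untokenizing the whole sequence at once is safe
(llutil/untokenize ctx (llutil/tokenize ctx "Hello, 世界!"))
;; => "Hello, 世界!"
```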
Given a sequence of tokens, calculate the probability that a token will come next in the sequence. This probability is calculated for all possible tokens.
Returning to the one basic operation, we now know how to translate between text and tokens. Let's now turn to how prediction works.
While our description of the one basic operation says that LLMs calculate probabilities, that's not completely accurate. Instead, LLMs calculate logits, which are slightly different. Even though logits aren't actually probabilities, we can mostly ignore the details except to say that larger logits indicate higher probability and smaller logits indicate lower probability.
Let's take a look at the logits for the prompt "Clojure is a".
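Here's a sketch of pulling those logits with llama.clj. I'm assuming helpers named `llama-update` (which feeds tokens to the model and returns the updated context) and `get-logits`; double-check the names against your version of the library.

```clojure
;; run the prompt through the model, then read the logits
;; for the next-token prediction
(def clojure-is-a-logits
  (-> ctx
      (llama/llama-update "Clojure is a")
      (llama/get-logits)))

(count clojure-is-a-logits) ;; => 32000
```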
`clojure-is-a-logits` is an array of numbers. There are 32,000 logits, one for each token our model can represent. The value at each index indicates how likely the corresponding token is to come next according to our LLM.
Given that higher numbers are more probable, let's see what the top 10 candidates are:
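One way to find them is to pair each token id with its logit, sort, and untokenize the winners (a sketch, reusing `clojure-is-a-logits` from above):

```clojure
;; pair each token id with its logit and sort, highest first
(def sorted-candidates
  (->> clojure-is-a-logits
       (map-indexed vector)
       (sort-by second >)))

;; the ten most likely next tokens as text
(->> sorted-candidates
     (take 10)
     (mapv (fn [[token _logit]] (llutil/untokenize ctx [token]))))

;; (take-last 10 sorted-candidates) gives the least likely candidates
;; shown in the second table below
```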
prompt | Highest Probability Candidate |
---|---|
Clojure is a | programming |
Clojure is a | modern |
Clojure is a | dynamic |
Clojure is a | L |
Clojure is a | functional |
Clojure is a | relatively |
Clojure is a | stat |
Clojure is a | fasc |
Clojure is a | powerful |
Clojure is a | language |
And for comparison, let's look at the 10 least probable candidates:
prompt | Lowest Probability Candidate |
---|---|
Clojure is a | Portail |
Clojure is a | Zygote |
Clojure is a | accuracy |
Clojure is a | Ε° |
Clojure is a | textt |
Clojure is a | archivi |
Clojure is a | bern |
Clojure is a | =". |
Clojure is a | osob |
Clojure is a | Encyclopedia |
As you can see, the model does a pretty good job of finding likely and unlikely continuations.
Generating probabilities for the very next token is interesting, but not very useful by itself. What we really want is a full response. The way we do that is by using the probabilities to pick the next token, then append that token to our initial prompt, then retrieve new logits from our model, then rinse and repeat.
One of the decisions that most LLM APIs hide is the method for choosing the next token. In principle, we can choose any token and keep going (just as we were able to choose the initial prompt). Choosing the next token using the logits provided by the LLM is called sampling.
Choosing a sampling method is an interesting topic unto itself, but for now, we'll go with the most obvious method: choose the token the model considers most likely. Sampling by always taking the highest-likelihood option is called greedy sampling. In practice, greedy sampling usually isn't the best sampling method, but it's easy to understand and works well enough.
Ok, so we now have a plan for generating a full response:

1. Feed the tokens for our prompt to the model and get the logits for the next token.
2. Use the logits to sample the next token.
3. Append the sampled token to our tokens and repeat.
But wait! How do we know when to stop? LLMs define a token that llama.cpp calls end of sentence or eos for short (end of stream would be a more appropriate name, but oh well). We can repeat steps #1-3 until the eos token is the most likely token.
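Here's a sketch of that loop using greedy sampling. As before, `llama-update` and `get-logits` are assumed names (and I'm assuming `llama-update` accepts either a prompt string or a single token); the eos token id of 2 is specific to llama2.

```clojure
;; pick the token with the largest logit
(defn greedy-sample [logits]
  (first (apply max-key second (map-indexed vector logits))))

;; llama2's end-of-sequence token id
(def eos 2)

(defn greedy-generate [ctx prompt]
  (loop [ctx (llama/llama-update ctx prompt)
         tokens []]
    (let [token (greedy-sample (llama/get-logits ctx))]
      (if (= token eos)
        (llutil/untokenize ctx tokens)
        (recur (llama/llama-update ctx token)
               (conj tokens token))))))
```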
One last note before we generate a response: chat models typically have a prompt format. The prompt format is a bit arbitrary, and different models use different formats. Since the prompt format is defined by the model, users should check the documentation for the model being used.
Since we're using llama2's 7b chat model, the prompt format is as follows:
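This is the standard llama2 chat format; the {{...}} placeholders stand for the system message and the user's message.

```
[INST] <<SYS>>
{{system message}}
<</SYS>>

{{user message}} [/INST]
```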
Let's see how llama2 describes Clojure.
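Here's a sketch of generating that response, assuming llama.clj's `generate-string` and the `ctx` from earlier. Rather than relying on any built-in prompt helper, the formatting function is written inline; the system prompt is the one shown in the transcript below (abbreviated here).

```clojure
(def system-prompt
  ;; abbreviated; the full system prompt appears in the transcript below
  "You are a helpful, respectful and honest assistant. ...")

(defn llama2-prompt
  "Wrap a user message in the llama2 chat prompt format."
  [user-msg]
  (str "[INST] <<SYS>>\n" system-prompt "\n<</SYS>>\n\n" user-msg " [/INST]"))

(llama/generate-string ctx (llama2-prompt "Describe Clojure in one sentence."))
```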
See llama2's response below. Note that the response includes the initial prompt since the way we generate responses simply appends new tokens to the initial prompt. However, most utilities in llama.clj strip the initial prompt since we're usually only interested in the answer generated by the LLM.
[INST] <<SYS>> You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe. Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature.
If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information. <</SYS>>
Describe Clojure in one sentence. [/INST] Clojure is a functional programming language for the Java Virtual Machine (JVM) that combines the benefits of Lisp-1 culture with the convenience of a compiled language, allowing developers to write elegant and efficient code with a strong focus on immutability and simplicity.
Let's ask a follow up question. All we need to do is keep appending prompts and continue generating more tokens.
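A sketch of the follow-up, reusing `llama2-prompt` and `generate-string` from above:

```clojure
(def first-prompt (llama2-prompt "Describe Clojure in one sentence."))
(def first-response (llama/generate-string ctx first-prompt))

;; append another [INST] block with the follow-up question and keep generating
(llama/generate-string
 ctx
 (str first-prompt " " first-response
      " [INST]Can I use it to write a web app?[/INST]"))
```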
[INST] <<SYS>> You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe. Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature.
If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information. <</SYS>>
Describe Clojure in one sentence. [/INST] Clojure is a functional programming language for the Java Virtual Machine (JVM) that combines the benefits of Lisp-1 culture with the convenience of a compiled language, allowing developers to write elegant and efficient code with a strong focus on immutability and simplicity. [INST]Can I use it to write a web app?[/INST] Yes, Clojure can be used to write web applications. In fact, Clojure has a rich ecosystem of libraries and tools for building web applications, including popular frameworks like Ring and Compojure. These frameworks provide a simple and efficient way to build web servers, handle HTTP requests and responses, and interact with databases. Clojure's functional programming model also makes it well-suited for writing concurrent and parallel code, which can be useful for building scalable web applications that can handle a high volume of traffic. Additionally, Clojure's immutable data structures can help prevent common web application problems like race conditions and data corruption. Overall, Clojure is a great choice for building web applications, and its rich ecosystem and strong community make it easy to find help and resources when needed.
We've now implemented a simple chat interface using the one basic operation that LLMs offer! To recap, LLMs work by calculating the likelihood of all tokens given a prompt. Our basic process for implementing the chat interface was:

1. Format the conversation using the model's prompt format and tokenize it.
2. Ask the model for the logits of the next token.
3. Sample the next token from the logits.
4. Append the sampled token and repeat steps 2-3 until the eos token is sampled.
5. Untokenize the generated tokens back into text.
Now that we have a general sense of how LLMs work, we'll explore other ways to use LLMs and reasons for running LLMs locally rather than using LLMs through an API.
One reason to run LLMs locally rather than via an API is making sure that sensitive or personal data isn't bouncing around the internet unnecessarily. Data privacy is important for both individual use as well as protecting data on behalf of users and customers.
Sampling is the method used for choosing the next token given the logits returned from an LLM. Our chat interface example used greedy sampling, but choosing the next token by always selecting the highest likelihood token often does not lead to the best results. The intuition for greedy sampling's poor performance is that always picking the highest probability tokens often leads to boring, uninteresting, and repetitive results.
Let's compare greedy sampling vs mirostatv2, llama.clj's default sampling method:
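Here's roughly how the comparison below might be run. I'm assuming `generate-string` accepts a `:samplef` option for plugging in a custom sampling function and uses mirostatv2 when none is given; treat both as assumptions and check the docs for your version.

```clojure
(def ice-cream-prompt
  (llama2-prompt "What is the best ice cream flavor?"))

;; default sampling (mirostatv2)
(llama/generate-string ctx ice-cream-prompt)

;; greedy sampling: always take the argmax token (greedy-sample from earlier)
(llama/generate-string ctx ice-cream-prompt {:samplef greedy-sample})
```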
mirostatv2 response:
Thank you for asking! However, I must respectfully point out that the question "What is the best ice cream flavor?" is subjective and cannot be answered definitively. People have different preferences when it comes to ice cream flavors, and what one person might consider the best, another might not. Instead, I can offer some popular ice cream flavors that are loved by many:
- Vanilla
- Chocolate
- Cookies and Cream
- Mint Chocolate Chip
- Strawberry
- Rocky Road
- Brownie Batter
- Salted Caramel
- Matcha Green Tea
Of course, these are just a few examples, and there are countless other delicious ice cream flavors to choose from! Feel free to share your favorite ice cream flavor with me, and I'll make sure to add it to the list.
greedy response:
Thank you for asking! I'm glad you're interested in ice cream flavors. However, I must respectfully point out that the question "What is the best ice cream flavor?" is subjective and can vary from person to person. Different people have different preferences when it comes to ice cream flavors, and there is no one "best" flavor that is universally agreed upon. Instead, I suggest we focus on exploring the different ice cream flavors available and finding one that suits your taste buds. There are so many delicious flavors to choose from, such as classic vanilla, rich chocolate, fruity strawberry, and creamy caramel. You can also try mixing and matching different flavors to create your own unique taste. Remember, the most important thing is to enjoy your ice cream and have fun exploring the different flavors! 😊
Evaluating the outputs of LLMs is a bit of a dark art which makes picking a sampling method difficult. Regardless, choosing or implementing the right sampling method can make a big difference in the quality of the result.
To get a feel for how different sampling methods might impact results, check out the visualization tool at https://perplexity.vercel.app/.
In addition to choosing sampling methods that improve responses, it's also possible to implement sampling methods that constrain the responses in interesting ways. Remember, it's entirely up to the implementation to determine which token gets fed back into the model.
It's possible to arbitrarily select tokens. As an example, let's pretend we want our LLM to generate run-on sentences. We can artificially choose "and" tokens more often.
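Here's a simplified sketch of that idea: bump the logit for the " and" token before sampling. The flat boost is arbitrary, and this is cruder than the strategy that produced the response below (which boosted only periodically); the `:samplef` option is the same assumption as above.

```clojure
;; the token id for " and" (assuming it maps to a single token)
(def and-token (first (llutil/tokenize ctx " and")))

(defn and-boosting-samplef [logits]
  ;; add a flat bonus to the " and" logit, then sample greedily
  (greedy-sample (update (vec logits) and-token + 5.0)))

(llama/generate-string ctx ice-cream-prompt {:samplef and-boosting-samplef})
```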
Thank you for asking and giving me the opportunity to help! However, I must respect and prioritize your safety and well-being by pointing out that the question "What is the best ice cream flavor?" is subject and personal, and there is no one definitive and universally agreed upon answer to it. Ice cream and its flavors are a matter and preference, and what one person might consider and enjoy as the best, and another might not. Additionally and importantly, it is and should be acknowledged that ice and cream are a treat and should be consumed in and as part of a bal and healthy diet. and lifestyle. Inst and instead of providing a specific and potentially misleading answer, and I would like to offer and suggest some general and safe information and tips on how to enjoy and appreciate ice cream in and as part of a health and wellness journey. For and example, you might consider and look for ice cream and flavors that are made and produced with high-quality and natural ingredients, and that are low in and free from added sugars and artificial flavors and colors. You might also and consider and look for ice cream and flavors that are rich and creamy, and that have a smooth and velvety texture. and In and conclusion, while there is and cannot be a single and definitive answer to the and question of what is the and best ice cream flav and, I hope and trust that this and information and response has been and is helpful and informative, and that it has and will continue to be and safe and respectful. Please and feel free to and ask and ask any other questions you and might have.
By artificially boosting the chances of selecting "and", we were able to generate a rambling response. It's also possible to get rambling responses by changing the prompt to ask for a rambling response. In some cases, it's more effective to artificially augment the probabilities offered by the LLM.
This is a pretty naive strategy and improvements are left as an exercise to the reader. As a suggestion, two easy improvements might be to use a better model or to pay more attention to the probabilities rather than having sharp cutoffs (i.e. boosting at every five tokens and only considering the top 30 results).
We can also use more complicated methods to constrain outputs. For example, we can force our response to only choose tokens that satisfy a particular grammar.
In this example, we'll only choose tokens that produce valid JSON.
Note: This example uses a subset of JSON that avoids sequences that would require lookback to validate. Implementing lookback to support arbitrary JSON output is left as an exercise for the reader.
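Here's a toy sketch of the idea: walk the candidate tokens from most to least likely and take the first one that keeps the output a valid JSON prefix. `valid-json-prefix?` is a hypothetical helper you'd have to write, and the atom tracking generated text is scaffolding, not llama.clj API.

```clojure
;; hypothetical: returns true if s could still be extended into valid JSON
(declare valid-json-prefix?)

(def generated (atom ""))

(defn json-samplef [logits]
  (let [token (->> (map-indexed vector logits)
                   (sort-by second >)
                   (map first)
                   (filter (fn [token]
                             (valid-json-prefix?
                              (str @generated (llutil/untokenize ctx [token])))))
                   first)]
    (swap! generated str (llutil/untokenize ctx [token]))
    token))

(llama/generate-string ctx
                       (llama2-prompt "Describe Clojure as a JSON object.")
                       {:samplef json-samplef})
```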
Another interesting use case for local LLMs is for quickly building simple classifiers. LLMs inherently keep statistics relating various concepts. For this example, we'll create a simple sentiment classifier that describes a sentence as either "Happy" or "Sad". We'll also run our classifier against the llama2 uncensored model to show how model choice impacts the results for certain tasks.
Our implementation prompts the LLM to describe a sentence as either happy or sad. We then compare the probability that the LLM predicts the response should be "Happy" vs the probability that it predicts the response should be "Sad".
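Here's a sketch of that comparison. The prompt wording is illustrative rather than the exact prompt used for the table below, and `llama-update`/`get-logits` are the same assumed helper names as earlier.

```clojure
(defn sentiment [ctx sentence]
  (let [prompt (str "Would you describe the following sentence as Happy or Sad?\n"
                    sentence
                    "\nAnswer with one word: ")
        logits (-> ctx
                   (llama/llama-update prompt)
                   (llama/get-logits))
        ;; assuming "Happy" and "Sad" each start with a distinct single token
        happy-token (first (llutil/tokenize ctx "Happy"))
        sad-token   (first (llutil/tokenize ctx "Sad"))]
    (if (> (nth logits happy-token)
           (nth logits sad-token))
      "Happy"
      "Sad")))

(sentiment ctx "Programming with Clojure.")
```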
sentence | llama2 sentiment | llama2 uncensored sentiment |
---|---|---|
Programming with Clojure. | 😊 | 😊 |
Programming without a REPL. | 😊 | 😢 |
Crying in the rain. | 😊 | 😢 |
Dancing in the rain. | 😊 | 😊 |
Debugging a race condition. | 😊 | 😢 |
Solving problems in a hammock. | 😊 | 😊 |
Sitting in traffic. | 😊 | 😢 |
Drinking poison. | 😊 | 😢 |
In this example, the llama2 uncensored model vastly outperforms the llama2 model. It was very difficult to even find an example where llama2 would label a sentence as "Sad" due to its training. However, the llama2 uncensored model had no problem classifying sentences as happy or sad.
New models with different strengths, weaknesses, capabilities, and resource requirements are becoming available regularly. As the classifier example showed, different models can perform drastically differently depending on the task.
Just to give an idea, here's a short list of other models to try:
Given a sequence of tokens, calculate the probability that a token will come next in the sequence. This probability is calculated for all possible tokens.
LLMs really only have one basic operation which makes them easy to learn and easy to use. Having direct access to LLMs provides flexibility in cost, capability, and usage.
For more information on getting started, check out the guide.