
Intro to Running LLMs Locally

This guide covers the what, how, and why of running LLMs locally using llama.clj, a Clojure wrapper for the llama.cpp library.

Large language models (LLMs) are tools that are quickly growing in popularity. Typically, they are used via an API or service. However, many models are available to download and run locally even with modest hardware.

The One Basic Operation

From the perspective of using an LLM, there's really only one basic operation:

Given a sequence of tokens, calculate the probability that a token will come next in the sequence. This probability is calculated for all possible tokens.

That's basically it. All other usage derives from this one basic operation.

Recreating the Chat Interface

If you've interacted with an LLM, it's probably while using one of the various chat interfaces. Before exploring other usages of local LLMs, we'll first explain how a chat interface can be implemented.

Tokens

Keen readers may have already noticed that chat interfaces work with text, but LLMs work with tokens. Choosing how to bridge the gap between text and tokens is an interesting topic for creating LLMs, but it's not important for understanding how to run LLMs locally. All we need to know is that text can be tokenized into tokens and vice versa.

Just to get a sense of the differences between tokens and text, let's look at how the llama2 7b chat model tokenizes text.

(def sentence "The quick brown fox jumped over the lazy dog.")
(def tokens
  (llutil/tokenize llama-context sentence))
[450 4996 17354 1701 29916 12500 287 975 278 17366 11203 29889]

One thing to notice is that there are fewer tokens than characters:

(count tokens)
12
(count sentence)
45

If we untokenize each token, we can see that tokens are often whole words, but not always.

(llutil/untokenize llama-context tokens)
"The quick brown fox jumped over the lazy dog."

Just to get a feel for a typical tokenizer, we'll look at some basic stats.

Number of tokens in the vocabulary:

32000

The longest token:

27097 "
________________"

Token with the most spaces:

462 "
"

One last caveat to watch out for when converting between tokens and text is that not every individual token produces a valid UTF-8 string. It may take multiple tokens before a valid UTF-8 string can be produced.

(def smiley-tokens (llutil/tokenize llama-context "😊"))
[29871 243 162 155 141]
(def smiley-untokens
  (into []
        (map (fn [token]
               [token (llutil/untokenize llama-context [token])]))
        smiley-tokens))
[[29871 " "] [243 "�"] [162 "�"] [155 "�"] [141 "�"]]

Fortunately, llama.clj has a utility for untokenizing that will take care of the issue:

(llutil/untokenize llama-context smiley-tokens)
"
😊"

Prediction

Given a sequence of tokens, calculate the probability that a token will come next in the sequence. This probability is calculated for all possible tokens.

Returning to the one basic operation, we now know how to translate between text and tokens. Let's now turn to how prediction works.

While our description of the one basic operation says that LLMs calculate probabilities, that's not completely accurate. Instead, LLMs calculate logits, which are slightly different. Even though logits aren't actually probabilities, we can mostly ignore the details except to say that larger logits indicate higher probability and smaller logits indicate lower probability.
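The examples below use a small get-logits helper that feeds a prompt into the model and returns the logits for the next token. Its definition isn't shown in this guide, so here is a minimal sketch of how such a helper could be written from the llama.clj functions used later on (llama/llama-update, llama/bos, and llama/get-logits); treat it as an illustration rather than the notebook's exact implementation.

(defn get-logits
  "Feed a prompt (a string or a sequence of token ids) into the model and
  return the logits for the next token. A hedged sketch, not the notebook's
  actual definition."
  [llama-context prompt]
  (let [tokens (if (string? prompt)
                 (llutil/tokenize llama-context prompt)
                 prompt)]
    ;; start from the beginning-of-stream token, then feed each prompt token
    (llama/llama-update llama-context (llama/bos) 0)
    (doseq [token tokens]
      (llama/llama-update llama-context token))
    (llama/get-logits llama-context)))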

Let's take a look at the logits for the prompt "Clojure is a".

(def clojure-is-a-logits
  (get-logits llama-context "Clojure is a"))
[-4.450607 -7.2189293 3.0855832 -2.5999398 -2.272163 -0.7203618 -3.7306588 -4.0318084 -2.1686573 0.22644448 -2.1080518 -2.9358313 0.49489853 5.3835487 -2.1657727 -3.3467236 -4.448245 -1.9910841 -1.6820588 -3.0362215 ... 31980 more elided]

clojure-is-a-logits is an array of numbers with 32,000 entries, one for each token our model can represent. The value at each index indicates how likely the corresponding token is to come next, according to our LLM.
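As a quick sanity check (count works on both Clojure vectors and Java arrays):

;; one logit per entry in the model's vocabulary
(count clojure-is-a-logits)
;; => 32000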

Given that higher numbers are more probable, let's see what the top 10 candidates are:

(def highest-probability-candidates
  (->> clojure-is-a-logits
       ;; keep track of index
       (map-indexed (fn [idx p]
                      [idx p]))
       ;; take the top 10
       (sort-by second >)
       (take 10)
       (map (fn [[idx _p]]
              (llutil/untokenize llama-context [idx])))))
("
programming"
"
modern"
"
dynamic"
"
L"
"
functional"
"
relatively"
"
stat"
"
fasc"
"
powerful"
"
language")
Highest Probability Candidates
Clojure is aprogramming
Clojure is amodern
Clojure is adynamic
Clojure is aL
Clojure is afunctional
Clojure is arelatively
Clojure is astat
Clojure is afasc
Clojure is apowerful
Clojure is alanguage

And for comparison, let's look at the 10 least probable candidates:

(def lowest-probability-candidates
  (->> clojure-is-a-logits
       ;; keep track of index
       (map-indexed (fn [idx p]
                      [idx p]))
       ;; take the bottom 10
       (sort-by second)
       (take 10)
       (map (fn [[idx _p]]
              (llutil/untokenize llama-context [idx])))))
("
Portail"
"
Zygote"
"
accuracy"
"
Ε°"
"
textt"
"
archivi"
"
bern"
"
="."
"
osob"
"
Encyclopedia")
Lowest Probability Candidates
Clojure is aPortail
Clojure is aZygote
Clojure is aaccuracy
Clojure is aΕ°
Clojure is atextt
Clojure is aarchivi
Clojure is abern
Clojure is a=".
Clojure is aosob
Clojure is aEncyclopedia

As you can see, the model does a pretty good job of finding likely and unlikely continuations.

Full Response Generation

Generating probabilities for the very next token is interesting, but not very useful by itself. What we really want is a full response. To get one, we use the probabilities to pick the next token, append that token to our initial prompt, retrieve new logits from our model, and then rinse and repeat.

One of the decisions that most LLM APIs hide is the method for choosing the next token. In principle, we can choose any token and keep going (just as we were able to choose the initial prompt). Choosing the next token using the logits provided by the LLM is called sampling.

Choosing a sampling method is an interesting topic unto itself, but for now, we'll go with the most obvious method: picking the token with the highest likelihood given by the model. Sampling the highest-likelihood option is called greedy sampling. In practice, greedy sampling usually isn't the best sampling method, but it's easy to understand and works well enough.
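To make that concrete, here is a minimal sketch of greedy sampling over a sequence of logits (llama.clj also ships a ready-made llama/sample-logits-greedy, which we'll use later); the function name greedy-sample is just illustrative.

;; greedy sampling sketch: return the token id (the index) of the largest logit
(defn greedy-sample [logits]
  (->> logits
       (map-indexed vector)    ;; pair each logit with its token id
       (apply max-key second)  ;; keep the pair with the largest logit
       first))                 ;; return just the token id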

Ok, so we now have a plan for generating a full response:

  1. Feed our initial prompt into our model
  2. Sample the next token using greedy sampling
  3. Return to step #1 with the sampled token appended to our previous prompt

But wait! How do we know when to stop? LLMs define a token that llama.cpp calls end of sentence or eos for short (end of stream would be a more appropriate name, but oh well). We can repeat steps #1-3 until the eos token is the most likely token.

Finally, one last note before we generate a response: chat models typically have a prompt format. The prompt format is somewhat arbitrary, and different models use different formats. Since the prompt format is defined by the model, check the documentation for whichever model you're using.

Since we're using llama2's 7b chat model, the prompt format is as follows:

(defn llama2-prompt
  "Meant to work with llama-2-7b-chat"
  [prompt]
  (str
   "[INST] <<SYS>>
You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe. Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature.
If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information.
<</SYS>>
" prompt " [/INST]
"))
#object[intro$llama2_prompt 0x44704ce8 "intro$llama2_prompt@44704ce8"]

Let's see how llama2 describes Clojure.

(def response-tokens
  (loop [tokens (llutil/tokenize llama-context
                                 (llama2-prompt "Describe Clojure in one sentence."))]
    (let [logits (get-logits llama-context tokens)
          ;; greedy sampling
          token (->> logits
                     (map-indexed (fn [idx p]
                                    [idx p]))
                     (apply max-key second)
                     first)]
      (if (= token (llama/eos))
        tokens
        (recur (conj tokens token))))))
[518 25580 29962 3532 14816 29903 6778 13 3492 526 263 8444 29892 3390 1319 322 15993 20255 29889 29849 ... 181 more elided]

(def response
  (llutil/untokenize llama-context response-tokens))
"[INST] <<SYS>>↩︎You are a helpful, respectful and honest assistant. Always answer... 773 more elided"

See llama2's response below. Note that the response includes the initial prompt since the way we generate responses simply appends new tokens to the initial prompt. However, most utilities in llama.clj strip the initial prompt since we're usually only interested in the answer generated by the LLM.

[INST] <<SYS>> You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe. Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature.

If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information. <</SYS>>

Describe Clojure in one sentence. [/INST] Clojure is a functional programming language for the Java Virtual Machine (JVM) that combines the benefits of Lisp-1 culture with the convenience of a compiled language, allowing developers to write elegant and efficient code with a strong focus on immutability and simplicity.

Let's ask a follow up question. All we need to do is keep appending prompts and continue generating more tokens.

(def response-tokens2
  (loop [tokens
         (into response-tokens
               (llutil/tokenize llama-context
                                (str
                                 "[INST]"
                                 "Can I use it to write a web app?"
                                 "[/INST]")))]
    (let [logits (get-logits llama-context tokens)
          ;; greedy sampling
          token (->> logits
                     (map-indexed (fn [idx p]
                                    [idx p]))
                     (apply max-key second)
                     first)]
      (if (= token (llama/eos))
        tokens
        (recur (conj tokens token))))))
[518 25580 29962 3532 14816 29903 6778 13 3492 526 263 8444 29892 3390 1319 322 15993 20255 29889 29849 ... 367 more elided]

(def response2
  (llutil/untokenize llama-context response-tokens2))
"[INST] <<SYS>>↩︎You are a helpful, respectful and honest assistant. Always answer... 1669 more elided"

[INST] <<SYS>> You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe. Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature.

If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information. <</SYS>>

Describe Clojure in one sentence. [/INST] Clojure is a functional programming language for the Java Virtual Machine (JVM) that combines the benefits of Lisp-1 culture with the convenience of a compiled language, allowing developers to write elegant and efficient code with a strong focus on immutability and simplicity. [INST]Can I use it to write a web app?[/INST] Yes, Clojure can be used to write web applications. In fact, Clojure has a rich ecosystem of libraries and tools for building web applications, including popular frameworks like Ring and Compojure. These frameworks provide a simple and efficient way to build web servers, handle HTTP requests and responses, and interact with databases. Clojure's functional programming model also makes it well-suited for writing concurrent and parallel code, which can be useful for building scalable web applications that can handle a high volume of traffic. Additionally, Clojure's immutable data structures can help prevent common web application problems like race conditions and data corruption. Overall, Clojure is a great choice for building web applications, and its rich ecosystem and strong community make it easy to find help and resources when needed.

We've now implemented a simple chat interface using the one basic operation that LLMs offer! To recap, LLMs work by calculating the likelihood of all tokens given a prompt. Our basic process for implementing the chat interface was:

  1. Feed our prompt into the LLM using the prompt structure specified by our chosen LLM.
  2. Sample the next token greedily and feed it back into the LLM.
  3. Repeat the process until we reach the end of sentence (eos) token.
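Putting the recap together, here is a compact, hedged sketch of a single chat turn built from the helpers above (llama2-prompt, get-logits, and llama.clj's greedy sampler); the name chat-once is just illustrative.

(defn chat-once
  "Generate a greedy response for a single user message. Returns the full text,
  prompt included, stopping once the eos token is sampled. A sketch, not a
  llama.clj API."
  [llama-context user-message]
  (loop [tokens (llutil/tokenize llama-context (llama2-prompt user-message))]
    (let [logits (get-logits llama-context tokens)
          token (llama/sample-logits-greedy logits)]
      (if (= token (llama/eos llama-context))
        (llutil/untokenize llama-context tokens)
        (recur (conj tokens token))))))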

Reasons for Running LLMs Locally

Now that we have a general sense of how LLMs work, we'll explore other ways to use LLMs and reasons for running LLMs locally rather than using LLMs through an API.

Privacy

One reason to run LLMs locally rather than via an API is making sure that sensitive or personal data isn't bouncing around the internet unnecessarily. Data privacy is important both for individual use and for protecting data on behalf of users and customers.

Alternative Sampling Methods

Sampling is the method used for choosing the next token given the logits returned from an LLM. Our chat interface example used greedy sampling, but choosing the next token by always selecting the highest likelihood token often does not lead to the best results. The intuition for greedy sampling's poor performance is that always picking the highest probability tokens often leads to boring, uninteresting, and repetitive results.

Let's compare greedy sampling vs mirostatv2, llama.clj's default sampling method:

(def prompt
  (llama2-prompt "What is the best ice cream flavor?"))
"[INST] <<SYS>>↩︎You are a helpful, respectful and honest assistant. Always answer... 497 more elided"

(def mirostat-response
  (llama/generate-string llama-context
                         prompt
                         {:seed 1234}))
"Thank you for asking! However, I must respectfully point out that the question "... 657 more elided"

mirostatv2 response:

Thank you for asking! However, I must respectfully point out that the question "What is the best ice cream flavor?" is subjective and cannot be answered definitively. People have different preferences when it comes to ice cream flavors, and what one person might consider the best, another might not. Instead, I can offer some popular ice cream flavors that are loved by many:

  • Vanilla
  • Chocolate
  • Cookies and Cream
  • Mint Chocolate Chip
  • Strawberry
  • Rocky Road
  • Brownie Batter
  • Salted Caramel
  • Matcha Green Tea

Of course, these are just a few examples, and there are countless other delicious ice cream flavors to choose from! Feel free to share your favorite ice cream flavor with me, and I'll make sure to add it to the list.

(def greedy-response
  (llama/generate-string llama-context
                         prompt
                         {:samplef llama/sample-logits-greedy}))
"Thank you for asking! I'm glad you're interested in ice cream flavors. However, ... 739 more elided"

greedy response:

Thank you for asking! I'm glad you're interested in ice cream flavors. However, I must respectfully point out that the question "What is the best ice cream flavor?" is subjective and can vary from person to person. Different people have different preferences when it comes to ice cream flavors, and there is no one "best" flavor that is universally agreed upon. Instead, I suggest we focus on exploring the different ice cream flavors available and finding one that suits your taste buds. There are so many delicious flavors to choose from, such as classic vanilla, rich chocolate, fruity strawberry, and creamy caramel. You can also try mixing and matching different flavors to create your own unique taste. Remember, the most important thing is to enjoy your ice cream and have fun exploring the different flavors! 😊

Evaluating the outputs of LLMs is a bit of a dark art which makes picking a sampling method difficult. Regardless, choosing or implementing the right sampling method can make a big difference in the quality of the result.

To get a feel for how different sampling methods might impact results, check out the visualization tool at https://perplexity.vercel.app/.
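As another illustration of supplying a custom sampling method, here is a hedged sketch of simple temperature sampling written as a :samplef function. It relies only on the softmax helper defined later in the Classifiers section; the function name and the temperature value of 0.8 are illustrative, not part of llama.clj.

(defn sample-with-temperature
  "Return a :samplef function that samples a token id from the softmax of the
  temperature-scaled logits. Lower temperatures sharpen the distribution and
  approach greedy sampling; higher temperatures flatten it. A sketch only."
  [temperature]
  (fn [logits]
    (let [probs (softmax (map #(/ % temperature) logits))
          r (rand)]
      ;; walk the cumulative distribution until it exceeds the random draw
      (loop [i 0
             cum 0.0]
        (let [cum (+ cum (nth probs i))]
          (if (or (>= cum r) (= i (dec (count probs))))
            i
            (recur (inc i) cum)))))))

;; usage sketch:
;; (llama/generate-string llama-context prompt
;;                        {:samplef (sample-with-temperature 0.8)})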

Constrained Sampling Methods

In addition to choosing sampling methods that improve responses, it's also possible to implement sampling methods that constrain the responses in interesting ways. Remember, it's completely up to the implementation to determine which token gets fed back into the model.

Run On Sentences

It's possible to arbitrarily select tokens. As an example, let's pretend we want our LLM to generate run-on sentences. We can artificially choose "and" tokens more often.

(def run-on-response
  (let [and-token (first (llutil/tokenize llama-context "and"))
        prev-tokens (volatile! [])]
    (llama/generate-string
     llama-context
     prompt
     {:samplef
      (fn [logits]
        (let [greedy-token (llama/sample-logits-greedy logits)
              ;; sort the logits in descending order with indexes
              top (->> logits
                       (map-indexed vector)
                       (sort-by second >))
              ;; find the index of the and token
              idx (->> top
                       (map first)
                       (map-indexed vector)
                       (some (fn [[i token]]
                               (when (= token and-token)
                                 i))))
              next-token
              ;; pick the and token if we haven't used it in the last
              ;; 5 tokens and if it's in the top 30 results
              (if (and (not (some #{and-token} (take-last 5 @prev-tokens)))
                       (< idx 30)
                       (not= (llama/eos llama-context) greedy-token))
                and-token
                greedy-token)]
          (vswap! prev-tokens conj next-token)
          next-token))})))
"Thank you for asking and giving me the opportunity to help! However, I must resp... 1516 more elided"

Thank you for asking and giving me the opportunity to help! However, I must respect and prioritize your safety and well-being by pointing out that the question "What is the best ice cream flavor?" is subject and personal, and there is no one definitive and universally agreed upon answer to it. Ice cream and its flavors are a matter and preference, and what one person might consider and enjoy as the best, and another might not. Additionally and importantly, it is and should be acknowledged that ice and cream are a treat and should be consumed in and as part of a bal and healthy diet. and lifestyle. Inst and instead of providing a specific and potentially misleading answer, and I would like to offer and suggest some general and safe information and tips on how to enjoy and appreciate ice cream in and as part of a health and wellness journey. For and example, you might consider and look for ice cream and flavors that are made and produced with high-quality and natural ingredients, and that are low in and free from added sugars and artificial flavors and colors. You might also and consider and look for ice cream and flavors that are rich and creamy, and that have a smooth and velvety texture. and In and conclusion, while there is and cannot be a single and definitive answer to the and question of what is the and best ice cream flav and, I hope and trust that this and information and response has been and is helpful and informative, and that it has and will continue to be and safe and respectful. Please and feel free to and ask and ask any other questions you and might have.

By artificially boosting the chances of selecting "and", we were able to generate a rambling response. It's also possible to get rambling responses by changing the prompt to ask for one, but in some cases it's more effective to artificially augment the probabilities offered by the LLM.

This is a pretty naive strategy, and improvements are left as an exercise for the reader. As a suggestion, two easy improvements might be to use a better model or to pay more attention to the probabilities rather than relying on sharp cutoffs (i.e. only boosting when "and" hasn't appeared in the last five tokens and only considering the top 30 results). One softer variant is sketched below.
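Here is a hedged sketch of such a variant: rather than hard-coding the choice of "and", add a fixed bonus to its logit and let greedy selection decide. The helper name and the bonus value are illustrative, not part of llama.clj.

(defn boost-token-samplef
  "Return a :samplef function that adds `bonus` to the logit of `token-id`
  before picking the highest-scoring token. A sketch; the bonus needs tuning."
  [token-id bonus]
  (fn [logits]
    (->> logits
         (map-indexed (fn [i logit]
                        [i (if (= i token-id) (+ logit bonus) logit)]))
         ;; greedy selection over the adjusted [token-id logit] pairs
         (apply max-key second)
         first)))

;; usage sketch:
;; (llama/generate-string llama-context prompt
;;                        {:samplef (boost-token-samplef
;;                                   (first (llutil/tokenize llama-context "and"))
;;                                   3.0)})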

JSON Output

We can also use more complicated methods to constrain outputs. For example, we can force our response to only choose tokens that satisfy a particular grammar.

In this example, we'll only choose tokens that produce valid JSON.

Note: This example uses a subset of JSON that avoids sequences that would require lookback to validate. Implementing lookback to support arbitrary JSON output is left as an exercise for the reader.

(def json-parser
  (insta/parser (slurp
                 (io/resource "resources/json.peg"))))
{:grammar {:NUMBER {...} :STRING {...} :WS {...} :jsonArray {...} :jsonNumber {...} :jsonObject {...} :jsonString {...} :jsonText {...} :jsonValue {...} :member {...}}
 :output-format :hiccup
 :start-production :jsonText}
(def json-response
  (let [prev-tokens (volatile! [])
        done? (volatile! false)]
    (llama/generate-string
     llama-context
     (llama2-prompt "Describe some pizza toppings using JSON.")
     {:samplef
      (fn [logits]
        (if @done?
          (llama/eos llama-context)
          (let [sorted-logits (->> logits
                                   (map-indexed vector)
                                   (sort-by second >))
                first-jsonable
                (->> sorted-logits
                     (map first)
                     (some (fn [token]
                             (when-let [s (try
                                            (llutil/untokenize llama-context (conj @prev-tokens token))
                                            (catch Exception e))]
                               (let [parse (insta/parse json-parser s)
                                     tokens (llutil/untokenize llama-context [token])]
                                 (cond
                                   ;; ignore whitespace
                                   (re-matches #"\s+" tokens) false
                                   (insta/failure? parse)
                                   (let [{:keys [index]} parse]
                                     (if (= index (count s))
                                       ;; potentially parseable
                                       token
                                       ;; return false to keep searching
                                       false))
                                   :else (do
                                           (vreset! done? true)
                                           token)))))))]
            (vswap! prev-tokens conj first-jsonable)
            (if (Thread/interrupted)
              (llama/eos llama-context)
              first-jsonable))))})))
"{ "toppings": [ { "name": "Pepperoni", "description": "A classic topping made fr... 720 more elided"
{ "toppings": [ { "name": "Pepperoni", "description": "A classic topping made from cured and smoked pork sausage" }, { "name": "Mushrooms", "description": "Sliced or whole mushrooms that add a meaty texture and earthy flavor to the pizza" }, { "name": "Onions", "description": "Thinly sliced or diced onions that add a pungent flavor and crunchy texture" }, { "name": "Green peppers", "description": "Sliced or diced green peppers that add a crunchy texture and slightly sweet flavor" }, { "name": "Bacon", "description": "Crispy bacon bits that add a smoky flavor and satisfying crunch" }, { "name": "Ham", "description": "Thinly sliced or diced ham that adds a salty flavor and meaty texture" }, { "name": "Pineapple", "description": "Fresh pineapple chunks that add a sweet and tangy flavor" } ] }

Classifiers

Another interesting use case for local LLMs is for quickly building simple classifiers. LLMs inherently keep statistics relating various concepts. For this example, we'll create a simple sentiment classifier that describes a sentence as either "Happy" or "Sad". We'll also run our classifier against the llama2 uncensored model to show how model choice impacts the results for certain tasks.

(defn llama2-uncensored-prompt
  "Meant to work with models/llama2_7b_chat_uncensored"
  [prompt]
  (str "### HUMAN:
" prompt "
### RESPONSE:
"))
#object[intro$llama2_uncensored_prompt 0x6777bf32 "intro$llama2_uncensored_prompt@6777bf32"]

(defn softmax
  "Converts logits to probabilities. More optimal softmax implementations exist that avoid overflow."
  [values]
  (let [exp-values (mapv #(Math/exp %) values)
        sum-exp-values (reduce + exp-values)]
    (mapv #(/ % sum-exp-values) exp-values)))
#object[intro$softmax 0x2126a838 "intro$softmax@2126a838"]
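For intuition, a quick (hypothetical) example of softmax in action: it turns arbitrary scores into probabilities that sum to 1, with larger scores receiving larger shares.

(softmax [1.0 2.0 3.0])
;; => approximately [0.09003057 0.24472847 0.66524096] (the values sum to 1)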

Our implementation prompts the LLM to describe a sentence as either happy or sad using the following prompt:

(str "Give a one word answer of \"Happy\" or \"Sad\" for describing the following sentence: " sentence)

We then compare the probability that the LLM predicts the response should be "Happy" vs the probability that the LLM predicts the response should be "Sad".

(defn happy-or-sad? [llama-context format-prompt sentence]
  (let [;; two tokens each
        happy-token (first (llutil/tokenize llama-context "Happy"))
        sad-token (first (llutil/tokenize llama-context "Sad"))
        prompt (format-prompt
                (str "Give a one word answer of \"Happy\" or \"Sad\" for describing the following sentence: " sentence " "))
        prompt-tokens (llutil/tokenize llama-context prompt)
        _ (llama/llama-update llama-context (llama/bos) 0)
        _ (doseq [token prompt-tokens]
            (llama/llama-update llama-context token))
        ;; check happy and sad probabilities for first tokens
        logits (llama/get-logits llama-context)
        probs (softmax logits)
        happy-prob (nth probs happy-token)
        sad-prob (nth probs sad-token)]
    {:emoji (if (> happy-prob sad-prob)
              "😊"
              "😒")
     ;; :response (llama/generate-string llama-context prompt {:samplef llama/sample-logits-greedy})
     :happy happy-prob
     :sad sad-prob
     :happy-prob happy-prob
     :sad-prob sad-prob}))
#object[intro$happy_or_sad_QMARK_ 0x75f12f98 "intro$happy_or_sad_QMARK_@75f12f98"]
(def queries
  ["Programming with Clojure."
   "Programming without a REPL."
   "Crying in the rain."
   "Dancing in the rain."
   "Debugging a race condition."
   "Solving problems in a hammock."
   "Sitting in traffic."
   "Drinking poison."])
["Programming with Clojure." "Programming without a REPL." "Crying in the rain." "Dancing in the rain." "Debugging a race condition." "Solving problems in a hammock." "Sitting in traffic." "Drinking poison."]
sentence                         llama2 sentiment   llama2 uncensored sentiment
Programming with Clojure.        😊                 😊
Programming without a REPL.      😊                 😒
Crying in the rain.              😊                 😒
Dancing in the rain.             😊                 😊
Debugging a race condition.      😊                 😒
Solving problems in a hammock.   😊                 😊
Sitting in traffic.              😊                 😒
Drinking poison.                 😊                 😒

In this example, the llama2 uncensored model vastly outperforms the llama2 model. It was very difficult to even find an example where llama2 would label a sentence as "Sad" due to its training. However, the llama2 uncensored model had no problem classifying sentences as happy or sad.

More Model Options

New models with different strengths, weaknesses, capabilities, and resource requirements are becoming available regularly. As the classifier example showed, different models can perform drastically differently depending on the task.

Just to give an idea, here's a short list of other models to try:

  • metharme-7b: This is an experiment to try and get a model that is usable for conversation, roleplaying and storywriting, but which can be guided using natural language like other instruct models.
  • GPT4All: GPT4All is an ecosystem to train and deploy powerful and customized large language models that run locally on consumer grade CPUs.
  • OpenLLaMA: a public preview of OpenLLaMA, a permissively licensed open source reproduction of Meta AI's LLaMA.
  • ALMA: ALMA (Advanced Language Model-based trAnslator) is an LLM-based translation model, which adopts a new translation model paradigm: it begins with fine-tuning on monolingual data and is further optimized using high-quality parallel data. This two-step fine-tuning process ensures strong translation performance.
  • LlaMa-2 Coder: LlaMa-2 7b fine-tuned on the CodeAlpaca 20k instructions dataset using the QLoRA method with the PEFT library.

Conclusion

Given a sequence of tokens, calculate the probability that a token will come next in the sequence. This probability is calculated for all possible tokens.

LLMs really only have one basic operation which makes them easy to learn and easy to use. Having direct access to LLMs provides flexibility in cost, capability, and usage.

Next Steps

For more information on getting started, check out the guide.