Generated with Clerk from notebooks/intro.clj

Intro to Running LLMs Locally

This guide covers the what, how, and why of running LLMs locally using llama.clj, a Clojure wrapper for the llama.cpp library.

Large language models (LLMs) are tools that are quickly growing in popularity. Typically, they are used via an API or service. However, many models are available to download and run locally even with modest hardware.

The One Basic Operation

From the perspective of using an LLM, there's really only one basic operation:

Given a sequence of tokens, calculate the probability that a token will come next in the sequence. This probability is calculated for all possible tokens.

That's basically it. All other usage derives from this one basic operation.

Recreating the Chat Interface

If you've interacted with an LLM, it's probably while using one of the various chat interfaces. Before exploring other usages of local LLMs, we'll first explain how a chat interface can be implemented.

Tokens

Keen readers may have already noticed that chat interfaces work with text, but LLMs work with tokens. Choosing how to bridge the gap between text and tokens is an interesting topic for creating LLMs, but it's not important for understanding how to run LLMs locally. All we need to know is that text can be tokenized into tokens and vice versa.
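Before we look at any code, note that the snippets in this guide assume a few aliased namespaces and an already-created llama-context. Here is a rough setup sketch, not the notebook's exact setup: the namespace aliases match the ones used below, but the model path is a placeholder for wherever you've downloaded a model.

(ns intro
  (:require [com.phronemophobic.llama :as llama]
            [com.phronemophobic.llama.util :as llutil]
            [com.phronemophobic.llama.raw :as raw]
            [instaparse.core :as insta]
            [clojure.java.io :as io]))

;; Placeholder path to a locally downloaded llama2 7b chat model.
(def model-path "models/llama-2-7b-chat.ggmlv3.q4_0.bin")

;; Create a context that the rest of the examples reuse.
(def llama-context (llama/create-context model-path))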

Just to get a sense of the differences between tokens and text, let's look at how the llama2 7b chat model tokenizes text.

(def sentence "The quick brown fox jumped over the lazy dog.")
(def tokens
  (llutil/tokenize llama-context sentence))
[1576 4996 17354 1701 29916 12500 287 975 278 17366 11203 29889]

One thing to notice is that there are fewer tokens than letters:

(count tokens)
12
(count sentence)
45

If we untokenize each token, we can see that tokens are often whole words, but not always.

(mapv #(raw/llama_token_to_str llama-context %)
      tokens)

["The" " quick" " brown" " fo" "x" " jump" "ed" " over" " the" " lazy" " dog" "."]

Just to get a feel for a typical tokenizer, we'll look at some basic stats.

Number of tokens:

32000

The longest token:

29304 "
transformations"

Token with the most spaces:

462 "
"

One last caveat to watch out for when converting between tokens and text is that not every token produces a valid UTF-8 string. It may require multiple tokens before a valid UTF-8 string is available.

(def smiley-tokens (llutil/tokenize llama-context "😊"))
[243 162 155 141]
(def smiley-untokens
  (mapv #(raw/llama_token_to_str llama-context %)
        smiley-tokens))

["�" "�" "�" "�"]

Fortunately, llama.clj has a utility for untokenizing that will take care of the issue:

(llutil/untokenize llama-context smiley-tokens)
"
😊"

Prediction

Given a sequence of tokens, calculate the probability that a token will come next in the sequence. This probability is calculated for all possible tokens.

Returning to the one basic operation, we now know how to translate between text and tokens. Let's now turn to how prediction works.

While our description of the one basic operation says that LLMs calculate probabilities, that's not completely accurate. Instead, LLMs calculate logits, which are slightly different. Even though logits aren't actually probabilities, we can mostly ignore the details except to say that larger logits indicate higher probability and smaller logits indicate lower probability.
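If you do want actual probabilities, the usual trick is to normalize the logits with the softmax function. Here's a minimal sketch in plain Clojure (a similar softmax helper appears again in the classifier section later in this guide); the function name is just for illustration.

(defn logits->probs
  "Minimal softmax sketch: exponentiate each logit and normalize so the values
  sum to 1. Subtracting the max logit first avoids numeric overflow."
  [logits]
  (let [mx (apply max logits)
        exps (mapv #(Math/exp (- % mx)) logits)
        total (reduce + exps)]
    (mapv #(/ % total) exps)))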

Let's take a look at the logits for the prompt "Clojure is a".

(def clojure-is-a-logits
  (get-logits llama-context "Clojure is a"))
[-4.18457 -7.10481 3.4147239 -2.617646 -2.370878 -0.47363472 -3.822948 -4.237583 -1.9167244 0.09818697 -1.9291447 -3.0881236 0.70356405 5.117277 -1.96538 -3.3311884 -4.1881514 -2.1489651 -1.8183191 -2.8390856 31980 more elided]

clojure-is-a-logits is an array of numbers. There are 32,000 logits, one for each token our model can represent. The value at each index indicates how likely the model considers the corresponding token to come next: larger values mean more likely.
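By the way, the get-logits helper used above is a small helper defined in the notebook rather than part of llama.clj's public API. A plausible, purely illustrative sketch of such a helper, built from llama-update and llama/get-logits (both of which appear later in this guide) and shown here only for a string prompt:

;; Hypothetical sketch; the notebook defines its own version off-screen.
(defn get-logits* [llama-context prompt]
  ;; feed the prompt into the model starting at position 0
  (llama/llama-update llama-context prompt 0)
  ;; read back the logits for the next token
  (llama/get-logits llama-context))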

Given that higher numbers are more probable, let's see what the top 10 candidates are:

(def highest-probability-candidates
  (->> clojure-is-a-logits
       ;; keep track of index
       (map-indexed (fn [idx p]
                      [idx p]))
       ;; take the top 10
       (sort-by second >)
       (take 10)
       (map (fn [[idx _p]]
              (llutil/untokenize llama-context [idx])))))

(" programming" " dynamic" " modern" " L" " relatively" " functional" " stat" " fasc" " language" " powerful")
Highest Probability Candidates
Clojure is a programming
Clojure is a dynamic
Clojure is a modern
Clojure is a L
Clojure is a relatively
Clojure is a functional
Clojure is a stat
Clojure is a fasc
Clojure is a language
Clojure is a powerful

And for comparison, let's look at the 10 least probable candidates:

(def lowest-probability-candidates
  (->> clojure-is-a-logits
       ;; keep track of index
       (map-indexed (fn [idx p]
                      [idx p]))
       ;; take the bottom 10
       (sort-by second)
       (take 10)
       (map (fn [[idx _p]]
              (llutil/untokenize llama-context [idx])))))

("Portail" "Zygote" " accuracy" "archivi" "textt" "Ű" "bern" "=\"." " Autor" " osob")
Lowest Probability Candidates
Clojure is aPortail
Clojure is aZygote
Clojure is a accuracy
Clojure is aarchivi
Clojure is atextt
Clojure is aŰ
Clojure is abern
Clojure is a=".
Clojure is a Autor
Clojure is a osob

As you can see, the model does a pretty good job of finding likely and unlikely continuations.

Full Response Generation

Generating probabilities for the very next token is interesting, but not very useful by itself. What we really want is a full response. The way we do that is to use the probabilities to pick the next token, append that token to our prompt, feed the extended sequence back into the model for new logits, and rinse and repeat.

One of the decisions that most LLM APIs hide is the method for choosing the next token. In principle, we can choose any token and keep going (just as we were able to choose the initial prompt). Choosing the next token based on the logits provided by the LLM is called sampling.

Choosing a sampling method is an interesting topic unto itself, but for now, we'll go with the most obvious method: pick the token the model considers most likely. Sampling the highest-likelihood option is called greedy sampling. In practice, greedy sampling usually isn't the best sampling method, but it's easy to understand and works well enough.
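Greedy sampling is just an argmax over the logits. llama.clj ships a ready-made greedy sampler, llama/sample-logits-greedy, which we'll use later; hand-rolled, a minimal sketch looks like this (the function name is just for illustration):

(defn greedy-sample
  "Return the token id whose logit is largest."
  [logits]
  (->> logits
       (map-indexed vector)   ; pair each token id with its logit
       (apply max-key second) ; find the [id logit] pair with the largest logit
       first))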

Ok, so we now have a plan for generating a full response:

  1. Feed our initial prompt into our model
  2. Sample the next token using greedy sampling
  3. Return to step #1 with the sampled token appended to our previous prompt

But wait! How do we know when to stop? LLMs define a token that llama.cpp calls end of sentence or eos for short (end of stream would be a more appropriate name, but oh well). We can repeat steps #1-3 until the eos token is the most likely token.

Finally, one last note before we generate a response: chat models typically have a prompt format. The prompt format is somewhat arbitrary and differs from model to model. Since the prompt format is defined by the model, users should check the documentation for the model being used.

Since we're using llama2's 7b chat model, the prompt format is as follows:

(defn llama2-prompt
  "Meant to work with llama-2-7b-chat.ggmlv3.q4_0.bin"
  [prompt]
  (str
   "[INST] <<SYS>>
You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe. Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature.
If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information.
<</SYS>>
" prompt " [/INST]
"))

#object[intro$llama2_prompt 0x426f3596 "intro$llama2_prompt@426f3596"]

Let's see how llama2 describes Clojure.

(def response-tokens
  (loop [tokens (llutil/tokenize llama-context
                                 (llama2-prompt "Describe Clojure in one sentence."))]
    (let [logits (get-logits llama-context tokens)
          ;; greedy sampling
          token (->> logits
                     (map-indexed (fn [idx p]
                                    [idx p]))
                     (apply max-key second)
                     first)]
      (if (= token (llama/eos))
        tokens
        (recur (conj tokens token))))))
[29961 25580 29962 3532 14816 29903 6778 13 3492 526 263 8444 29892 3390 1319 322 15993 20255 29889 29849 178 more elided]
(def response
  (llutil/untokenize llama-context response-tokens))

"[INST] <<SYS>>↩︎You are a helpful, respectful and honest assistant. Always answer… (736 more elided)"

See llama2's response below. Note that the response includes the initial prompt since the way we generate responses simply appends new tokens to the initial prompt. However, most utilities in llama.clj strip the initial prompt since we're usually only interested in the answer generated by the LLM.

[INST] <<SYS>> You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe. Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature.

If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information. <</SYS>>

Describe Clojure in one sentence. [/INST] Clojure is a programming language that runs on the Java Virtual Machine (JVM) and is designed to be a functional programming language with a syntax inspired by Lisp, providing a unique blend of concise syntax, immutability, and performance.

Let's ask a follow up question. All we need to do is keep appending prompts and continue generating more tokens.

(def response-tokens2
  (loop [tokens
         (into response-tokens
               (llutil/tokenize llama-context
                                (str
                                 "[INST]"
                                 "Can I use it to write a web app?"
                                 "[/INST]")))]
    (let [logits (get-logits llama-context tokens)
          ;; greedy sampling
          token (->> logits
                     (map-indexed (fn [idx p]
                                    [idx p]))
                     (apply max-key second)
                     first)]
      (if (= token (llama/eos))
        tokens
        (recur (conj tokens token))))))
[29961 25580 29962 3532 14816 29903 6778 13 3492 526 263 8444 29892 3390 1319 322 15993 20255 29889 29849 366 more elided]
(def response2
  (llutil/untokenize llama-context response-tokens2))

"[INST] <<SYS>>↩︎You are a helpful, respectful and honest assistant. Always answer… (1602 more elided)"

[INST] <<SYS>> You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe. Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature.

If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information. <</SYS>>

Describe Clojure in one sentence. [/INST] Clojure is a programming language that runs on the Java Virtual Machine (JVM) and is designed to be a functional programming language with a syntax inspired by Lisp, providing a unique blend of concise syntax, immutability, and performance.[INST]Can I use it to write a web app?[/INST] Yes, Clojure can be used to write web applications. In fact, Clojure has a rich ecosystem of libraries and tools for building web applications, including popular frameworks like Ring and Compojure. These frameworks provide a set of tools and conventions for building web servers, handling HTTP requests and responses, and working with databases.

Clojure's functional programming model and immutable data structures can also help make web applications more maintainable and scalable, as they are less prone to bugs and easier to reason about.

However, it's worth noting that Clojure is not a traditional web development language, and it may take some time to get used to its unique syntax and programming paradigm. But with the right resources and support, it can be a very powerful tool for building web applications.

We've now implemented a simple chat interface using the one basic operation that LLMs offer! To recap, LLMs work by calculating the likelihood of all tokens given a prompt. Our basic process for implementing the chat interface was:

  1. Feed our prompt into the LLM using the prompt structure specified by our chosen LLM.
  2. Sample the next token greedily and feed it back into the LLM.
  3. Repeat the process until we reach the end of sentence (eos) token.

Reasons for Running LLMs Locally

Now that we have a general sense of how LLMs work, we'll explore other ways to use LLMs and reasons for running LLMs locally rather than using LLMs through an API.

Privacy

One reason to run LLMs locally rather than via an API is making sure that sensitive or personal data isn't bouncing around the internet unnecessarily. Data privacy is important for both individual use as well as protecting data on behalf of users and customers.

Alternative Sampling Methods

Sampling is the method used for choosing the next token given the logits returned from an LLM. Our chat interface example used greedy sampling, but always selecting the highest-likelihood token often does not lead to the best results: it tends to produce boring, uninteresting, and repetitive output.
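mirostatv2 itself is beyond the scope of this intro, but to get a flavor of what non-greedy sampling looks like, here's a minimal sketch of one common alternative, temperature sampling, in plain Clojure. This is not llama.clj's implementation, just an illustration; the function name is hypothetical.

(defn sample-with-temperature
  "Scale the logits by a temperature, convert them to probabilities with
  softmax, then draw a token id at random according to those probabilities.
  Higher temperatures flatten the distribution and produce more varied text."
  [logits temperature]
  (let [scaled (map #(/ % temperature) logits)
        mx (apply max scaled)
        exps (mapv #(Math/exp (- % mx)) scaled)
        total (reduce + exps)
        probs (mapv #(/ % total) exps)
        r (rand)]
    ;; walk the cumulative distribution until we pass the random draw
    (loop [idx 0
           acc 0.0]
      (let [acc (+ acc (nth probs idx))]
        (if (or (>= acc r) (= idx (dec (count probs))))
          idx
          (recur (inc idx) acc))))))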

Let's compare greedy sampling vs mirostatv2, llama.clj's default sampling method:

(def prompt
  (llama2-prompt "What is the best ice cream flavor?"))

"[INST] <<SYS>>↩︎You are a helpful, respectful and honest assistant. Always answer… (497 more elided)"

(def mirostat-response
  (llama/generate-string llama-context
                         prompt
                         {:seed 1234}))

"Thank you for asking! I'm happy to help you with that. However, I must point out… (1000 more elided)"

mirostatv2 response:

Thank you for asking! I'm happy to help you with that. However, I must point out that the question "What is the best ice cream flavor?" is quite subjective and can vary from person to person. Ice cream lovers have different preferences when it comes to flavors, textures, and sweetness levels.

Instead of giving you a definitive answer, I'll provide some popular ice cream flavors that people enjoy:

  1. Vanilla: A classic and versatile flavor that pairs well with many toppings.
  2. Chocolate: For those with a sweet tooth, chocolate ice cream is a timeless favorite.
  3. Cookies and Cream: This flavor combines the creaminess of ice cream with the crunch of cookies, creating a delicious treat.
  4. Mint Choc Chip: For those who enjoy a refreshing and cooling taste, mint choc chip is a great option.
  5. Salted Caramel: This flavor offers a unique blend of salty and sweet, with a smooth and creamy texture.

Remember, the best ice cream flavor is the one that you enjoy the most! So, feel free to explore different flavors and find the one that suits your taste buds the best. πŸ˜‹

(def greedy-response
  (llama/generate-string llama-context
                         prompt
                         {:samplef llama/sample-logits-greedy}))

"Thank you for asking! I'm glad you're interested in ice cream flavors. However, … (795 more elided)"

greedy response:

Thank you for asking! I'm glad you're interested in ice cream flavors. However, I must respectfully point out that the question "What is the best ice cream flavor?" is subjective and can vary from person to person. Different people have different preferences when it comes to ice cream flavors, and there is no one "best" flavor that is universally agreed upon.

Instead, I suggest we focus on exploring the different types of ice cream flavors and their unique characteristics. For example, some popular ice cream flavors include vanilla, chocolate, strawberry, and cookie dough. Each of these flavors has its own distinct taste and texture, and there are many variations and combinations to try as well.

So, while there may not be a single "best" ice cream flavor, there are certainly plenty of delicious options to choose from! Is there anything else I can help you with?

Evaluating the outputs of LLMs is a bit of a dark art which makes picking a sampling method difficult. Regardless, choosing or implementing the right sampling method can make a big difference in the quality of the result.

To get a feel for how different sampling methods might impact results, check out the visualization tool at https://perplexity.vercel.app/.

Constrained Sampling Methods

In addition to choosing sampling methods that improve responses, it's also possible to implement sampling methods that constrain the responses in interesting ways. Remember, it's completely up to the implementation to determine which token gets fed back into the model.

Run On Sentences

It's possible to arbitrarily select tokens. As an example, let's pretend we want our LLM to generate run-on sentences. We can artificially choose "and" tokens more often.

(def run-on-response
  (let [and-token (first (llutil/tokenize llama-context " and"))
        prev-tokens (volatile! [])]
    (llama/generate-string
     llama-context
     prompt
     {:samplef
      (fn [logits]
        (let [greedy-token (llama/sample-logits-greedy logits)
              ;; sort the logits in descending order with indexes
              top (->> logits
                       (map-indexed vector)
                       (sort-by second >))
              ;; find the index of the and token
              idx (->> top
                       (map first)
                       (map-indexed vector)
                       (some (fn [[i token]]
                               (when (= token and-token)
                                 i))))
              next-token
              ;; pick the and token if we haven't used it in the last
              ;; 5 tokens and if it's in the top 30 results
              (if (and (not (some #{and-token} (take-last 5 @prev-tokens)))
                       (< idx 30)
                       (not= (llama/eos) greedy-token))
                and-token
                greedy-token)]
          (vswap! prev-tokens conj next-token)
          next-token))})))

"Thank you for asking! I'm glad you're interested and excited about ice cream! Ho… (1158 more elided)"

Thank you for asking! I'm glad you're interested and excited about ice cream! However, I must respect and prioritize your safety and well-being by providing a responsible and accurate response.

Unfortunately and as a responsible assistant, and as ice cream is a personal and subjective matter, there is and cannot be a single "best" and universally agreed upon ice and flavor. Different and unique flavors and combinations of ingredients and textures can be enjoyed and appreciated by people with different and diverse tastes and preferences.

Instead and as a positive and socially unbiased and positive assistant, I suggest and recommend exploring and discovering various and diverse ice cream flavors and combinations that suit your individual and personal preferences and tastes. This and by doing so, you and others can enjoy and appreciate the unique and delicious qualities of and in ice cream. and

Remember, and as always, please prior and always consider and respect the safety and well-being of and for yourself and others, and always act and make choices that are responsible and ethical.

I and the AI team hope and wish you a wonderful and enjoyable experience exploring and discovering your favorite ice and cream flavors!

By artificially boosting the chances of selecting "and", we were able to generate a rambling response. It's also possible to get rambling responses by changing the prompt to ask for a rambling response. In some cases, it's more effective to artificially augment the probabilities offered by the LLM.
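An alternative to overriding the sampler's choice outright is to nudge the logits themselves before sampling, often called logit biasing. Here's a minimal sketch; the function name and the bias value are illustrative, not tuned numbers.

(defn bias-token
  "Add `bias` to the logit for `token-id` before sampling. A positive bias
  makes the token more likely; a large negative bias effectively bans it."
  [logits token-id bias]
  (-> (vec logits)
      (update token-id + bias)))

;; Usage sketch inside a :samplef function, greedily sampling the biased logits:
;; (->> (bias-token logits and-token 5.0)
;;      (map-indexed vector)
;;      (apply max-key second)
;;      first)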

JSON Output

We can also use more complicated methods to constrain outputs. For example, we can force the sampler to only choose tokens that keep the output consistent with a particular grammar.

In this example, we'll only choose tokens that produce valid JSON.

Note: This example uses a subset of JSON that avoids sequences that would require lookback to validate. Implementing lookback to support arbitrary JSON output is left as an exercise for the reader.

(def json-parser
  (insta/parser (slurp
                 (io/resource "resources/json.peg"))))

{:grammar {... an instaparse grammar with productions :NUMBER, :STRING, :WS, :jsonArray, :jsonNumber, :jsonObject, :jsonString, :jsonText, :jsonValue, and :member ...} :output-format :hiccup :start-production :jsonText}
(def json-response
  (let [prev-tokens (volatile! [])]
    (llama/generate-string
     llama-context
     (llama2-prompt "Describe some pizza toppings using JSON.")
     {:samplef
      (fn [logits]
        (let [sorted-logits (->> logits
                                 (map-indexed vector)
                                 (sort-by second >))
              first-jsonable
              (->> sorted-logits
                   (map first)
                   (some (fn [token]
                           (when-let [s (try
                                          (llutil/untokenize llama-context (conj @prev-tokens token))
                                          (catch Exception e))]
                             (let [parse (insta/parse json-parser s)
                                   tokens (raw/llama_token_to_str llama-context token)]
                               (cond
                                 ;; ignore whitespace
                                 (re-matches #"\s+" tokens) false
                                 (insta/failure? parse)
                                 (let [{:keys [index]} parse]
                                   (if (= index (count s))
                                     ;; potentially parseable
                                     token
                                     ;; return false to keep searching
                                     false))
                                 :else token))))))]
          (vswap! prev-tokens conj first-jsonable)
          (if (Thread/interrupted)
            (llama/eos)
            first-jsonable)))})))

"{ "toppings": [ { "name": "Pepperoni", "description": "A classic topping made fr… (427 more elided)"
{ "toppings": [ { "name": "Pepperoni", "description": "A classic topping made from cured and smoked pork sausage" }, { "name": "Mushrooms", "description": "A savory topping made from fresh mushrooms" }, { "name": "Onions", "description": "A sweet and savory topping made from thinly sliced onions" }, { "name": "Green peppers", "description": "A crunchy and slightly sweet topping made from green peppers" }, { "name": "Black olives", "description": "A salty and savory topping made from black olives" } ] }

Classifiers

Another interesting use case for local LLMs is for quickly building simple classifiers. LLMs inherently keep statistics relating various concepts. For this example, we'll create a simple sentiment classifier that describes a sentence as either "Happy" or "Sad". We'll also run our classifier against the llama2 uncensored model to show how model choice impacts the results for certain tasks.

(defn llama2-uncensored-prompt
  "Meant to work with models/llama2_7b_chat_uncensored.ggmlv3.q4_0.bin"
  [prompt]
  (str "### HUMAN:
" prompt "
### RESPONSE:
"))

#object[intro$llama2_uncensored_prompt 0x22af80b "intro$llama2_uncensored_prompt@22af80b"]

(defn softmax
  "Converts logits to probabilities. More optimal softmax implementations exist that avoid overflow."
  [values]
  (let [exp-values (mapv #(Math/exp %) values)
        sum-exp-values (reduce + exp-values)]
    (mapv #(/ % sum-exp-values) exp-values)))

#object[intro$softmax 0x3952be5d "intro$softmax@3952be5d"]

Our implementation prompts the LLM to describe a sentence as either happy or sad using the following prompt:

(str "Give a one word answer of \"Happy\" or \"Sad\" for describing the following sentence: " sentence)

We then compare the probability that the LLM predicts the response should be "Happy" with the probability that it predicts the response should be "Sad". Since both "Happy" and "Sad" tokenize to two tokens, each probability is the product of the probabilities of its first and second tokens.

(defn happy-or-sad? [llama-context format-prompt sentence]
  (let [;; two tokens each
        [h1 h2] (llutil/tokenize llama-context "Happy")
        [s1 s2] (llutil/tokenize llama-context "Sad")
        prompt (format-prompt
                (str "Give a one word answer of \"Happy\" or \"Sad\" for describing the following sentence: " sentence))
        _ (llama/llama-update llama-context prompt 0)
        ;; check happy and sad probabilities for first tokens
        logits (llama/get-logits llama-context)
        probs (softmax logits)
        hp1 (nth probs h1)
        sp1 (nth probs s1)
        ;; check happy second token
        _ (llama/llama-update llama-context h1)
        logits (llama/get-logits llama-context)
        probs (softmax logits)
        hp2 (nth probs h2)
        ;; check sad second token
        _ (llama/llama-update llama-context s1
                              ;; ignore h1
                              (dec (raw/llama_get_kv_cache_token_count llama-context)))
        logits (llama/get-logits llama-context)
        probs (softmax logits)
        sp2 (nth probs s2)
        happy-prob (* hp1 hp2)
        sad-prob (* sp1 sp2)]
    {:emoji (if (> happy-prob sad-prob)
              "😊"
              "😒")
     ;; :response (llama/generate-string llama-context prompt {:samplef llama/sample-logits-greedy})
     :happy happy-prob
     :sad sad-prob
     :hps [hp1 hp2]
     :sps [sp1 sp2]}))

#object[intro$happy_or_sad_QMARK_ 0x7a324e97 "intro$happy_or_sad_QMARK_@7a324e97"]
(def queries
  ["Programming with Clojure."
   "Programming with monads."
   "Crying in the rain."
   "Dancing in the rain."
   "Debugging a race condition."
   "Solving problems in a hammock."
   "Sitting in traffic."
   "Drinking poison."])

["Programming with Clojure." "Programming with monads." "Crying in the rain." "Dancing in the rain." "Debugging a race condition." "Solving problems in a hammock." "Sitting in traffic." "Drinking poison."]
sentence                       | llama2 sentiment | llama2 uncensored sentiment
Programming with Clojure.      | 😊 | 😊
Programming with monads.       | 😊 | 😒
Crying in the rain.            | 😊 | 😒
Dancing in the rain.           | 😊 | 😊
Debugging a race condition.    | 😊 | 😒
Solving problems in a hammock. | 😊 | 😊
Sitting in traffic.            | 😊 | 😒
Drinking poison.               | 😒 | 😒

In this example, the llama2 uncensored model vastly outperforms the llama2 model. It was very difficult to even find an example where llama2 would label a sentence as "Sad" due to its training. However, the llama2 uncensored model had no problem classifying sentences as happy or sad.

More Model Options

New models with different strengths, weaknesses, capabilities, and resource requirements are becoming available regularly. As the classifier example showed, different models can perform drastically differently depending on the task.

Just to give an idea, here's a short list of other models to try:

  • metharme-7b: This is an experiment to try and get a model that is usable for conversation, roleplaying and storywriting, but which can be guided using natural language like other instruct models.
  • GPT4All: GPT4All is an ecosystem to train and deploy powerful and customized large language models that run locally on consumer grade CPUs.
  • OpenLLaMA: a permissively licensed open source reproduction of Meta AI's LLaMA, released as a public preview.
  • ALMA: ALMA (Advanced Language Model-based trAnslator) is an LLM-based translation model, which adopts a new translation model paradigm: it begins with fine-tuning on monolingual data and is further optimized using high-quality parallel data. This two-step fine-tuning process ensures strong translation performance.
  • LlaMa-2 Coder: LlaMa-2 7b fine-tuned on the CodeAlpaca 20k instructions dataset by using the method QLoRA with PEFT library.

Conclusion

Given a sequence of tokens, calculate the probability that a token will come next in the sequence. This probability is calculated for all possible tokens.

LLMs really only have one basic operation, which makes them easy to learn and easy to use. Having direct access to LLMs provides flexibility in cost, capability, and usage.

Next Steps

For more information on getting started, check out the guide.