AI-generated Key Takeaways
- 
          Large language models (LLMs) predict sequences of tokens and outperform previous models due to their vast number of parameters and broader context gathering capabilities. 
- 
          Transformers, the leading architecture for LLMs, utilize encoders to process input and decoders to generate output, often for tasks like translation. 
- 
          Self-attention, a core concept in Transformers, allows the model to weigh the importance of different words in relation to each other, enhancing context understanding. 
- 
          LLMs are trained using masked predictions on massive datasets, enabling them to learn patterns and generate text based on probabilities. 
- 
          While LLMs offer benefits like clear text generation, they also present challenges like hallucinations, computational costs, and potential biases. 
A newer technology, large language models (LLMs) predict a token or sequence of tokens, sometimes many paragraphs worth of predicted tokens. Remember that a token can be a word, a subword (a subset of a word), or even a single character. LLMs make much better predictions than N-gram language models or recurrent neural networks because:
- LLMs contain far more parameters than recurrent models.
- LLMs gather far more context.
This section introduces the most successful and widely used architecture for building LLMs: the Transformer.
What's a Transformer?
Transformers are the state-of-the-art architecture for a wide variety of language model applications, such as translation:
 
  
Full transformers consist of an encoder and a decoder:
- An encoder converts input text into an intermediate representation. An encoder is an enormous neural net.
- A decoder converts that intermediate representation into useful text. A decoder is also an enormous neural net.
For example, in a translator:
- The encoder processes the input text (for example, an English sentence) into some intermediate representation.
- The decoder converts that intermediate representation into output text (for example, the equivalent French sentence).
 
  
What is self-attention?
To enhance context, Transformers rely heavily on a concept called self-attention. Effectively, on behalf of each token of input, self-attention asks the following question:
"How much does each other token of input affect the interpretation of this token?"
The "self" in "self-attention" refers to the input sequence. Some attention mechanisms weigh relations of input tokens to tokens in an output sequence like a translation or to tokens in some other sequence. But self-attention only weighs the importance of relations between tokens in the input sequence.
To simplify matters, assume that each token is a word and the complete context is only a single sentence. Consider the following sentence:
The animal didn't cross the street because it was too tired.
The preceding sentence contains eleven words. Each of the eleven words is paying attention to the other ten, wondering how much each of those ten words matters to itself. For example, notice that the sentence contains the pronoun it. Pronouns are often ambiguous. The pronoun it typically refers to a recent noun or noun phrase, but in the example sentence, which recent noun does it refer to—the animal or the street?
The self-attention mechanism determines the relevance of each nearby word to the pronoun it. Figure 3 shows the results—the bluer the line, the more important that word is to the pronoun it. That is, animal is more important than street to the pronoun it.
 
  
Conversely, suppose the final word in the sentence changes as follows:
The animal didn't cross the street because it was too wide.
In this revised sentence, self-attention would hopefully rate street as more relevant than animal to the pronoun it.
Some self-attention mechanisms are bidirectional, meaning that they calculate relevance scores for tokens preceding and following the word being attended to. For example, in Figure 3, notice that words on both sides of it are examined. So, a bidirectional self-attention mechanism can gather context from words on either side of the word being attended to. By contrast, a unidirectional self-attention mechanism can only gather context from words on one side of the word being attended to. Bidirectional self-attention is especially useful for generating representations of whole sequences, while applications that generate sequences token-by-token require unidirectional self-attention. For this reason, encoders use bidirectional self-attention, while decoders use unidirectional.
What is multi-head multi-layer self-attention?
Each self-attention layer is typically comprised of multiple self-attention heads. The output of a layer is a mathematical operation (for example, weighted average or dot product) of the output of the different heads.
Since the parameters of each head are initialized to random values, different heads can learn different relationships between each word being attended to and the nearby words. For example, the self-attention head described in the previous section focused on determining which noun the pronoun it referred to. However, other self-attention heads within the same layer might learn the grammatical relevance of each word to every other word, or learn other interactions.
A complete transformer model stacks multiple self-attention layers on top of one another. The output from the previous layer becomes the input for the next. This stacking allows the model to build progressively more complex and abstract understandings of the text. While earlier layers might focus on basic syntax, deeper layers can integrate that information to grasp more nuanced concepts like sentiment, context, and thematic links across the entire input.
Why are Transformers so large?
Transformers contain hundreds of billion or even trillions of parameters. This course has generally recommended building models with a smaller number of parameters over those with a larger number of parameters. After all, a model with a smaller number of parameters uses fewer resources to make predictions than a model with a larger number of parameters. However, research shows that Transformers with more parameters consistently outperform Transformers with fewer parameters.
But how does an LLM generate text?
You've seen how researchers train LLMs to predict a missing word or two, and you might be unimpressed. After all, predicting a word or two is essentially the autocomplete feature built into various text, email, and authoring software. You might be wondering how LLMs can generate sentences or paragraphs or haikus about arbitrage.
In fact, LLMs are essentially autocomplete mechanisms that can automatically predict (complete) thousands of tokens. For example, consider a sentence followed by a masked sentence:
My dog, Max, knows how to perform many traditional dog tricks. ___ (masked sentence)
An LLM can generate probabilities for the masked sentence, including:
| Probability | Word(s) | 
|---|---|
| 3.1% | For example, he can sit, stay, and roll over. | 
| 2.9% | For example, he knows how to sit, stay, and roll over. | 
A sufficiently large LLM can generate probabilities for paragraphs and entire essays. You can think of a user's questions to an LLM as the "given" sentence followed by an imaginary mask. For example:
User's question: What is the easiest trick to teach a dog? LLM's response: ___
The LLM generates probabilities for various possible responses.
As another example, an LLM trained on a massive number of mathematical "word problems" can give the appearance of doing sophisticated mathematical reasoning. However, those LLMs are basically just autocompleting a word problem prompt.
Benefits of LLMs
LLMs can generate clear, easy-to-understand text for a wide variety of target audiences. LLMs can make predictions on tasks they are explicitly trained on. Some researchers claim that LLMs can also make predictions for input they were not explicitly trained on, but other researchers have refuted this claim.
Problems with LLMs
Training an LLM entails many problems, including:
- Gathering an enormous training set.
- Consuming multiple months and enormous computational resources and electricity.
- Solving parallelism challenges.
Using LLMs to infer predictions causes the following problems:
- LLMs hallucinate, meaning their predictions often contain mistakes.
- LLMs consume enormous amounts of computational resources and electricity. Training LLMs on larger datasets typically reduces the amount of resources required for inference, though the larger training sets incur more training resources.
- Like all ML models, LLMs can exhibit all sorts of bias.