Transformers & Large Language Models
The transformer architecture revolutionized NLP in 2017. Since then, it has given rise to increasingly powerful language models that can understand and generate human-like text.
The Transformer Revolution
The Transformer was introduced in the 2017 paper "Attention Is All You Need" by Vaswani et al. at Google. It replaced recurrent neural networks (RNNs) with a self-attention mechanism that processes all positions in a sequence simultaneously, enabling massive parallelization and better handling of long-range dependencies.
Key innovations of the Transformer:
- Self-Attention: Each token attends to every other token in the sequence, learning which words are most relevant to each other.
- Multi-Head Attention: Multiple attention heads capture different types of relationships (syntactic, semantic, positional).
- Positional Encoding: Since there is no recurrence, position information is added via sinusoidal or learned embeddings.
- Parallelization: Unlike RNNs, all positions are processed simultaneously during training, making transformers much faster.
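The scaled dot-product attention at the heart of these innovations can be sketched in plain Python: each query is scored against every key, the scores are normalized with softmax, and the values are mixed according to those weights. This is a minimal illustration only; real implementations use batched matrix multiplications on GPUs, and multi-head attention runs several such computations in parallel.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of scores
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def self_attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V.
    Q, K, V are lists of vectors (one per token)."""
    d_k = len(K[0])
    out = []
    for q in Q:
        # Score this query against every key, scaled by sqrt(d_k)
        scores = [dot(q, k) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        # Output is the attention-weighted mix of the value vectors
        out.append([sum(w * v[i] for w, v in zip(weights, V))
                    for i in range(len(V[0]))])
    return out
```

With `Q = K = V`, each token's output is pulled most strongly toward its own value vector, since a query scores highest against the matching key.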
BERT: Bidirectional Understanding
BERT (Bidirectional Encoder Representations from Transformers) was released by Google in 2018. It uses only the encoder part of the transformer, so each token's representation is conditioned on both its left and right context simultaneously.
How BERT Is Trained
- Masked Language Modeling (MLM): Randomly masks 15% of tokens and trains the model to predict them from context.
- Next Sentence Prediction (NSP): Trains the model to understand relationships between sentence pairs.
```python
from transformers import pipeline

unmasker = pipeline("fill-mask", model="bert-base-uncased")
result = unmasker("NLP is a branch of [MASK] intelligence.")
for r in result[:3]:
    print(f"{r['token_str']:15s} score: {r['score']:.3f}")

# artificial      score: 0.892
# human           score: 0.034
# military        score: 0.012
```
GPT: Autoregressive Generation
GPT (Generative Pre-trained Transformer) by OpenAI uses only the decoder part of the transformer. It generates text left-to-right, predicting one token at a time based on all previous tokens.
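This left-to-right loop is the essence of decoder-only generation: predict the next token, append it, and repeat until a stop token. The sketch below uses a hypothetical bigram lookup table in place of the neural network, purely to make the autoregressive loop concrete; a real GPT conditions each prediction on the entire prefix, not just the last token.

```python
# Toy stand-in for a language model: maps the previous token
# to the most likely next token (a real model scores a whole vocabulary).
BIGRAMS = {
    "<s>": "the", "the": "cat", "cat": "sat",
    "sat": "on", "on": "a", "a": "mat", "mat": "</s>",
}

def generate(prompt_tokens, max_new_tokens=10):
    """Autoregressive decoding: append one predicted token per step."""
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        nxt = BIGRAMS.get(tokens[-1], "</s>")
        if nxt == "</s>":  # stop token ends generation
            break
        tokens.append(nxt)
    return tokens

print(generate(["<s>"]))  # ['<s>', 'the', 'cat', 'sat', 'on', 'a', 'mat']
```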
| Model | Architecture | Direction | Best For |
|---|---|---|---|
| BERT | Encoder only | Bidirectional | Understanding: classification, NER, QA |
| GPT | Decoder only | Left-to-right | Generation: text, code, conversation |
| T5 | Encoder-decoder | Both | Text-to-text: translation, summarization |
T5: Text-to-Text Framework
T5 (Text-to-Text Transfer Transformer) by Google treats every NLP task as converting one text string to another. Classification becomes "classify: [text]" producing "positive" or "negative." Translation becomes "translate English to French: [text]" producing the French translation.
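The text-to-text framing amounts to prepending a task prefix to the input string, so one model and one loss function cover every task. A small sketch of that input construction (the task keys here are hypothetical names for this example; the prefix strings follow the convention described above):

```python
def t5_input(task: str, text: str) -> str:
    """Cast any task as text-to-text by prepending a task prefix."""
    prefixes = {
        "classify": "classify: ",
        "en_to_fr": "translate English to French: ",
        "summarize": "summarize: ",
    }
    return prefixes[task] + text

print(t5_input("en_to_fr", "The weather is nice."))
# translate English to French: The weather is nice.
```

The model then produces the answer as plain text output: a class label, a translation, or a summary, depending only on the prefix.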
Large Language Models (LLMs)
LLMs are transformer models with billions of parameters, trained on massive text datasets. They demonstrate emergent abilities that smaller models lack, such as in-context learning and multi-step reasoning. Prominent examples:
| Model | Creator | Key Features |
|---|---|---|
| GPT-4 | OpenAI | Multimodal (text + images), strong reasoning, 128K context |
| Claude | Anthropic | Safety-focused, 200K context, strong instruction following |
| Gemini | Google | Multimodal, long context, integrated with Google services |
| Llama | Meta | Open-source, various sizes, strong community support |
Fine-Tuning vs Prompting
There are two main ways to adapt a pretrained model to your specific task:
Fine-Tuning
Train the model further on your specific dataset. This updates the model's weights and creates a specialized model.
- Best when you have labeled task-specific data
- Produces the highest task-specific performance
- Requires GPU resources and training infrastructure
Prompting
Guide the model's behavior through carefully crafted instructions without changing the model's weights.
- No training required — works immediately
- Flexible and easy to iterate
- May not match fine-tuned performance on specialized tasks
Few-Shot and Zero-Shot Learning
LLMs can perform tasks they were never explicitly trained for:
- Zero-shot: The model performs a task with no examples, guided only by instructions. Example: "Classify this review as positive or negative: [review]"
- Few-shot: The model receives a few examples in the prompt before performing the task. This often significantly improves performance.
- One-shot: A special case of few-shot with exactly one example.
```python
# Few-shot sentiment classification prompt
prompt = """
Classify the sentiment of each review.

Review: "This restaurant has the best pasta I've ever had!"
Sentiment: Positive

Review: "Waited 2 hours for cold food. Never again."
Sentiment: Negative

Review: "The service was friendly but the food was mediocre."
Sentiment: Mixed

Review: "Absolutely delightful experience from start to finish."
Sentiment:
"""
# The model will predict: Positive
```
RAG (Retrieval Augmented Generation)
RAG combines the power of LLMs with external knowledge retrieval. Instead of relying solely on what the model learned during training, RAG retrieves relevant documents from a knowledge base and includes them in the prompt.
- Query: The user asks a question or provides a prompt.
- Retrieve: A retriever searches a document database (using vector similarity) to find relevant passages.
- Augment: The retrieved passages are added to the prompt as context.
- Generate: The LLM generates an answer grounded in the retrieved context.
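The retrieve-and-augment steps can be sketched end to end in plain Python. This toy uses bag-of-words vectors and cosine similarity in place of a dense embedding model, and the sample documents are invented for illustration; production RAG systems use learned embeddings and a vector database.

```python
import math
from collections import Counter

# Hypothetical knowledge base for this sketch
DOCS = [
    "The transformer was introduced in 2017 by Vaswani et al.",
    "BERT is an encoder-only model trained with masked language modeling.",
    "RAG adds retrieved passages to the prompt as context.",
]

def embed(text):
    # Toy bag-of-words "embedding"; real systems use dense vector models
    return Counter(text.lower().split())

def cosine(a, b):
    num = sum(a[t] * b[t] for t in a)
    den = (math.sqrt(sum(v * v for v in a.values()))
           * math.sqrt(sum(v * v for v in b.values())))
    return num / den if den else 0.0

def retrieve(query, docs, k=1):
    """Rank documents by similarity to the query; return the top k."""
    q = embed(query)
    ranked = sorted(docs, key=lambda d: cosine(q, embed(d)), reverse=True)
    return ranked[:k]

def build_prompt(query, docs):
    """Augment: prepend retrieved passages so the answer is grounded."""
    context = "\n".join(retrieve(query, docs))
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"

print(build_prompt("When was the transformer introduced?", DOCS))
```

The final prompt (context plus question) is what gets sent to the LLM, which answers from the retrieved passages rather than from its parameters alone.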