Intermediate

Transformers & Large Language Models

The transformer architecture revolutionized NLP in 2017. Since then, it has given rise to increasingly powerful language models that can understand and generate human-like text.

The Transformer Revolution

The Transformer was introduced in the 2017 paper "Attention Is All You Need" by Vaswani et al. at Google. It replaced recurrent neural networks (RNNs) with a self-attention mechanism that processes all positions in a sequence simultaneously, enabling massive parallelization and better handling of long-range dependencies.

Key innovations of the Transformer:

  • Self-Attention: Each token attends to every other token in the sequence, learning which words are most relevant to each other.
  • Multi-Head Attention: Multiple attention heads capture different types of relationships (syntactic, semantic, positional).
  • Positional Encoding: Since there is no recurrence, position information is added via sinusoidal or learned embeddings.
  • Parallelization: Unlike RNNs, all positions are processed simultaneously during training, making transformers much faster.
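The heart of self-attention is scaled dot-product attention: queries are compared against keys, the resulting scores are softmaxed into weights, and those weights mix the value vectors. A minimal NumPy sketch (the shapes and toy inputs here are illustrative, not from any real model):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Each query attends to every key; output is a weighted sum of values."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                      # similarity of queries to keys
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)       # softmax over the key axis
    return weights @ V, weights

# Toy example: 3 tokens, each a 4-dimensional vector
rng = np.random.default_rng(0)
x = rng.normal(size=(3, 4))
out, w = scaled_dot_product_attention(x, x, x)           # self-attention: Q = K = V = x
print(out.shape)          # (3, 4) — one updated vector per token
print(w.sum(axis=-1))     # each row of attention weights sums to 1
```

In a real transformer, Q, K, and V are separate learned projections of the input, and multi-head attention runs several of these in parallel on lower-dimensional projections.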

BERT: Bidirectional Understanding

BERT (Bidirectional Encoder Representations from Transformers) was released by Google in 2018. It uses only the encoder part of the transformer, so each token's representation is conditioned on context from both its left and its right at once.

How BERT Is Trained

  • Masked Language Modeling (MLM): Randomly masks 15% of tokens and trains the model to predict them from context.
  • Next Sentence Prediction (NSP): Trains the model to understand relationships between sentence pairs.
Python - BERT Fill-Mask
from transformers import pipeline

unmasker = pipeline("fill-mask", model="bert-base-uncased")

result = unmasker("NLP is a branch of [MASK] intelligence.")
for r in result[:3]:
    print(f"{r['token_str']:15s} score: {r['score']:.3f}")
# artificial      score: 0.892
# human           score: 0.034
# military        score: 0.012

GPT: Autoregressive Generation

GPT (Generative Pre-trained Transformer) by OpenAI uses only the decoder part of the transformer. It generates text left-to-right, predicting one token at a time based on all previous tokens.
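The left-to-right loop can be shown with a toy stand-in for the model: here a hypothetical bigram table plays the role of GPT's next-token distribution (a real GPT scores the entire vocabulary with a transformer at every step, but the decoding loop is the same):

```python
# Hypothetical next-token probabilities standing in for an LLM's output distribution
probs = {
    "<s>":  {"the": 0.9, "a": 0.1},
    "the":  {"cat": 0.6, "mat": 0.4},
    "a":    {"dog": 1.0},
    "cat":  {"sat": 0.8, "ran": 0.2},
    "sat":  {"down": 0.7, "</s>": 0.3},
    "mat":  {"</s>": 1.0},
    "ran":  {"</s>": 1.0},
    "dog":  {"</s>": 1.0},
    "down": {"</s>": 1.0},
}

def generate(start="<s>", max_tokens=10):
    """Autoregressive decoding: predict one token, append it, repeat."""
    tokens, current = [], start
    for _ in range(max_tokens):
        dist = probs[current]
        current = max(dist, key=dist.get)   # greedy: pick the most likely next token
        if current == "</s>":
            break
        tokens.append(current)
    return " ".join(tokens)

print(generate())  # the cat sat down
```

Greedy decoding is the simplest strategy; real systems often sample from the distribution (with temperature, top-k, or nucleus sampling) to get more varied text.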

Model | Architecture    | Direction     | Best For
BERT  | Encoder only    | Bidirectional | Understanding: classification, NER, QA
GPT   | Decoder only    | Left-to-right | Generation: text, code, conversation
T5    | Encoder-decoder | Both          | Text-to-text: translation, summarization

T5: Text-to-Text Framework

T5 (Text-to-Text Transfer Transformer) by Google treats every NLP task as converting one text string to another. Classification becomes "classify: [text]" producing "positive" or "negative." Translation becomes "translate English to French: [text]" producing the French translation.
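The interface this implies is just string in, string out, with a task prefix selecting the behavior. A small hypothetical helper (the prefixes mirror those described in the T5 paper; the model would then emit the answer as plain text):

```python
def to_text_to_text(task: str, text: str) -> str:
    """Format an input the way T5 expects: every task becomes string -> string."""
    prefixes = {
        "translate": "translate English to French: ",
        "summarize": "summarize: ",
        "sentiment": "sst2 sentence: ",  # classification prefix from the T5 paper
    }
    return prefixes[task] + text

print(to_text_to_text("translate", "The house is wonderful."))
# translate English to French: The house is wonderful.
```

Because inputs and outputs are always text, the same model, loss, and decoding procedure serve every task; only the prefix changes.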

Large Language Models (LLMs)

LLMs are transformer models with billions of parameters, trained on massive text corpora. At sufficient scale they demonstrate emergent abilities that smaller models lack, such as in-context learning and multi-step reasoning:

Model  | Creator   | Key Features
GPT-4  | OpenAI    | Multimodal (text + images), strong reasoning, 128K context
Claude | Anthropic | Safety-focused, 200K context, strong instruction following
Gemini | Google    | Multimodal, long context, integrated with Google services
Llama  | Meta      | Open-source, various sizes, strong community support

Fine-Tuning vs Prompting

There are two main ways to adapt a pretrained model to your specific task:

Fine-Tuning

Train the model further on your specific dataset. This updates the model's weights and creates a specialized model.

  • Best when you have labeled task-specific data
  • Produces the highest task-specific performance
  • Requires GPU resources and training infrastructure

Prompting

Guide the model's behavior through carefully crafted instructions without changing the model's weights.

  • No training required — works immediately
  • Flexible and easy to iterate
  • May not match fine-tuned performance on specialized tasks

Few-Shot and Zero-Shot Learning

LLMs can perform tasks they were never explicitly trained for:

  • Zero-shot: The model performs a task with no examples, guided only by instructions. Example: "Classify this review as positive or negative: [review]"
  • Few-shot: The model receives a few examples in the prompt before performing the task. This often significantly improves performance.
  • One-shot: A special case of few-shot with exactly one example.
Few-Shot Prompt Example
# Few-shot sentiment classification prompt
"""
Classify the sentiment of each review.

Review: "This restaurant has the best pasta I've ever had!"
Sentiment: Positive

Review: "Waited 2 hours for cold food. Never again."
Sentiment: Negative

Review: "The service was friendly but the food was mediocre."
Sentiment: Mixed

Review: "Absolutely delightful experience from start to finish."
Sentiment:
"""
# The model will predict: Positive

RAG (Retrieval Augmented Generation)

RAG combines the power of LLMs with external knowledge retrieval. Instead of relying solely on what the model learned during training, RAG retrieves relevant documents from a knowledge base and includes them in the prompt.

  1. Query

    The user asks a question or provides a prompt.

  2. Retrieve

    A retriever searches a document database (using vector similarity) to find relevant passages.

  3. Augment

    The retrieved passages are added to the prompt as context.

  4. Generate

    The LLM generates an answer grounded in the retrieved context.
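The four steps above can be sketched end to end. In this toy version, a bag-of-words cosine similarity stands in for the vector database (real systems use dense embeddings from a neural encoder), and the final prompt would be sent to an LLM; the documents and helper names are illustrative:

```python
import math
from collections import Counter

def embed(text):
    """Toy 'embedding': a bag-of-words count vector."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

docs = [
    "The transformer was introduced in 2017 by Vaswani et al.",
    "BERT uses only the encoder and is trained with masked language modeling.",
    "GPT generates text left-to-right, one token at a time.",
]

def retrieve(query, k=1):
    """Step 2: rank documents by similarity to the query."""
    q = embed(query)
    return sorted(docs, key=lambda d: cosine(q, embed(d)), reverse=True)[:k]

def build_prompt(query):
    """Steps 3-4: augment the prompt with retrieved context for the LLM."""
    context = "\n".join(retrieve(query))
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"

print(build_prompt("How is BERT trained?"))
```

Production RAG systems swap in a learned embedding model and an approximate nearest-neighbor index, but the query → retrieve → augment → generate flow is exactly this.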

💡
Why RAG? RAG reduces hallucinations by grounding responses in actual documents. It also allows models to access up-to-date information beyond their training data cutoff.
Key takeaway: Transformers unified NLP under a single architecture. BERT excels at understanding, GPT at generation, and T5 at text-to-text tasks. Modern LLMs combine these capabilities at massive scale, and techniques like RAG extend their usefulness even further.