Beginner

Introduction to Large Language Models

Understand what LLMs are, trace their evolution from GPT-1 to today's frontier models, and learn about their capabilities and limitations.

What are LLMs?

Large Language Models (LLMs) are neural networks trained on massive text datasets that can understand and generate human language. They are built on the Transformer architecture and contain billions of parameters, enabling them to perform a wide range of language tasks.

At their core, LLMs are next-token prediction machines: given a sequence of text, they predict the most likely next word (or token). Despite this simple objective, scaling this approach to billions of parameters and trillions of tokens of training data produces models with remarkable emergent abilities.
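The next-token objective can be sketched in a few lines of plain Python. This is an illustration only, not a real model: the logits are hard-coded stand-ins for what a trained network would produce, and the tiny vocabulary is invented for the example.

```python
import math

# Toy illustration: an LLM maps a context to raw scores (logits) over its
# vocabulary, converts them to probabilities, and picks the next token.
vocab = ["cat", "mat", "sat", "the"]

def softmax(logits):
    # Turn raw scores into a probability distribution that sums to 1.
    exps = [math.exp(x - max(logits)) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Pretend the model produced these logits for the context "the cat sat on the".
logits = [1.2, 3.5, 0.3, 0.9]
probs = softmax(logits)

# Greedy decoding: take the highest-probability token.
next_token = vocab[probs.index(max(probs))]
print(next_token)  # "mat"
```

Real models repeat this step autoregressively, appending each predicted token to the context before predicting the next one; sampling strategies (temperature, top-p) replace the greedy `max` here.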

A Brief History

| Year | Model | Parameters | Significance |
|------|-------|------------|--------------|
| 2017 | Transformer | N/A | Google's "Attention Is All You Need" paper introduced the architecture |
| 2018 | GPT-1 | 117M | OpenAI's first generative pre-trained transformer |
| 2018 | BERT | 340M | Google's bidirectional model, dominated NLP benchmarks |
| 2019 | GPT-2 | 1.5B | Showed coherent text generation, initially withheld for safety |
| 2019 | T5 | 11B | Google's text-to-text framework, unified NLP tasks |
| 2020 | GPT-3 | 175B | Demonstrated in-context learning, few-shot capabilities |
| 2023 | GPT-4 | ~1.8T* | Multimodal, near-human reasoning on many benchmarks |
| 2023 | LLaMA | 7-65B | Meta's open-weight models, sparked open-source LLM movement |
| 2023 | Claude 2 | N/A | Anthropic's model with 100K context window |
| 2024 | LLaMA 3 | 8-405B | Meta's most capable open model to date |
| 2024 | Claude 3.5 | N/A | Anthropic's Sonnet became leading model on many benchmarks |
| 2025 | Claude 4 | N/A | Anthropic's frontier model with extended thinking and agentic capabilities |

* Estimated; OpenAI has not confirmed GPT-4's parameter count.

Scale: Understanding Parameters

LLMs are categorized by their parameter count, which roughly correlates with capability:

Small (1-8B)

Phi-3 Mini, Gemma 2B, LLaMA 3 8B. Can run on consumer GPUs. Good for specific tasks, code completion, simple chat.

Medium (13-34B)

LLaMA 2 13B, Mixtral 8x7B, CodeLlama 34B. Need high-end GPU. Better reasoning, multi-turn conversations.

Large (70B+)

LLaMA 3 70B, Qwen 72B. Need multi-GPU setup. Near-frontier quality, strong at complex tasks.

Frontier (400B+)

GPT-4, Claude 4, Gemini Ultra, LLaMA 3 405B. Require large clusters. State-of-the-art across all tasks.
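The hardware requirements above follow directly from parameter count. A rough back-of-the-envelope estimate: at fp16 precision each parameter takes 2 bytes, so the weights alone need `params × 2` bytes of GPU memory (this ignores activations, the KV cache, and any optimizer state, which add more).

```python
def weight_memory_gb(params_billions, bytes_per_param=2):
    """Rough memory needed just to hold the model weights.

    fp16/bf16 uses 2 bytes per parameter; quantized formats like
    int4 use less. Ignores activations, KV cache, optimizer state.
    """
    return params_billions * 1e9 * bytes_per_param / 1e9

print(weight_memory_gb(7))    # 14.0 GB: fits on a 24 GB consumer GPU
print(weight_memory_gb(70))   # 140.0 GB: needs a multi-GPU setup
print(weight_memory_gb(405))  # 810.0 GB: needs a cluster node
```

Quantization (e.g. 4-bit weights, 0.5 bytes per parameter) is why some 70B models can nonetheless be squeezed onto a single high-end GPU.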

Capabilities

  • Text generation: Writing articles, stories, emails, documentation, marketing copy.
  • Reasoning: Math problems, logic puzzles, multi-step analysis, planning.
  • Coding: Writing, debugging, and explaining code in dozens of programming languages.
  • Translation: High-quality translation between languages, including low-resource languages.
  • Summarization: Condensing long documents, papers, and conversations into concise summaries.
  • Question answering: Responding to questions based on provided context or general knowledge.
  • Instruction following: Performing complex multi-step tasks described in natural language.

Limitations

  • Hallucinations: LLMs can generate confident but factually incorrect information. They don't "know" facts — they predict plausible text.
  • Bias: Models reflect biases present in their training data, including cultural, gender, and racial biases.
  • Context limits: Each model has a maximum context window (4K to 1M+ tokens). Information beyond this limit is lost.
  • Knowledge cutoff: Models only know information from their training data, which has a cutoff date.
  • No real understanding: LLMs manipulate patterns in text without true comprehension or consciousness.
  • Cost: Running large models is expensive in terms of compute, energy, and API costs.

The Transformer Revolution

The Transformer architecture, introduced in 2017, was the breakthrough that made LLMs possible. Key innovations:

  • Self-attention: Allows the model to weigh the importance of every word relative to every other word, regardless of distance.
  • Parallelization: Unlike RNNs, Transformers process all tokens simultaneously, enabling massive parallelism on GPUs.
  • Scaling: Performance improves predictably with more parameters, more data, and more compute (scaling laws).
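The self-attention mechanism above can be sketched in a few lines of NumPy. This is a deliberately minimal single-head version with no learned query/key/value projections (a real Transformer layer adds those): each token's output is a weighted average of all tokens, with weights derived from pairwise similarity.

```python
import numpy as np

def self_attention(X):
    """Scaled dot-product self-attention (single head, no learned
    projections, illustration only). X has shape (tokens, dims)."""
    d = X.shape[-1]
    scores = X @ X.T / np.sqrt(d)  # similarity of every token pair
    # Row-wise softmax: each token's attention weights sum to 1.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ X  # mix token representations by attention weight

# Four "tokens", each represented by an 8-dimensional vector.
X = np.random.default_rng(0).normal(size=(4, 8))
out = self_attention(X)
print(out.shape)  # (4, 8): every token attended to every other token
```

Note that nothing in this computation depends on sequence order or distance, which is why Transformers handle long-range dependencies so well, and why the whole thing is a handful of matrix multiplications that parallelize cleanly on GPUs.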
💡 Key insight: The Transformer didn't just improve NLP — it enabled a new paradigm where a single pre-trained model can be adapted to virtually any language task through prompting or fine-tuning, replacing dozens of task-specific models.