Introduction to Large Language Models
Understand what LLMs are, trace their evolution from GPT-1 to today's frontier models, and learn about their capabilities and limitations.
What are LLMs?
Large Language Models (LLMs) are neural networks trained on massive text datasets that can understand and generate human language. They are built on the Transformer architecture and contain billions of parameters, enabling them to perform a wide range of language tasks.
At their core, LLMs are next-token prediction machines: given a sequence of text, they predict the most likely next word (or token). Despite this simple objective, scaling this approach to billions of parameters and trillions of tokens of training data produces models with remarkable emergent abilities.
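The next-token objective can be sketched in a few lines. This is a toy illustration, not a real model: the vocabulary and logits are made up by hand, but the mechanics (softmax over scores, then pick or sample a token) are the same ones a real LLM uses at each decoding step.

```python
import numpy as np

# Toy vocabulary and hand-picked logits for illustration only --
# a real model produces logits from billions of learned parameters.
vocab = ["the", "cat", "sat", "on", "mat"]
logits = np.array([0.5, 1.0, 3.0, 0.2, 0.1])  # model's raw scores for the next token

# Softmax turns raw logits into a probability distribution over the vocabulary.
probs = np.exp(logits - logits.max())
probs /= probs.sum()

# Greedy decoding: pick the single most likely token.
next_token = vocab[int(np.argmax(probs))]
```

Generation is just this step in a loop: append the chosen token to the input and predict again. Sampling from `probs` instead of taking the argmax is what "temperature" and similar settings control.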
A Brief History
| Year | Model | Parameters | Significance |
|---|---|---|---|
| 2017 | Transformer | N/A | Google's "Attention Is All You Need" paper introduced the architecture |
| 2018 | GPT-1 | 117M | OpenAI's first generative pre-trained transformer |
| 2018 | BERT | 340M | Google's bidirectional model, dominated NLP benchmarks |
| 2019 | GPT-2 | 1.5B | Showed coherent text generation, initially withheld for safety |
| 2019 | T5 | 11B | Google's text-to-text framework, unified NLP tasks |
| 2020 | GPT-3 | 175B | Demonstrated in-context learning, few-shot capabilities |
| 2023 | GPT-4 | ~1.8T* | Multimodal, near-human reasoning on many benchmarks |
| 2023 | LLaMA | 7-65B | Meta's open-weight models, sparked open-source LLM movement |
| 2023 | Claude 2 | N/A | Anthropic's model with 100K context window |
| 2024 | LLaMA 3 | 8-405B | Meta's most capable open model to date |
| 2024 | Claude 3.5 | N/A | Anthropic's Claude 3.5 Sonnet became a leading model on many benchmarks |

| 2025 | Claude 4 | N/A | Anthropic's frontier model with extended thinking and agentic capabilities |
* Estimated; OpenAI has not confirmed GPT-4's parameter count.
Scale: Understanding Parameters
LLMs are categorized by their parameter count, which roughly correlates with capability:
Small (1-8B)
Phi-3 Mini, Gemma 2B, LLaMA 3 8B. Can run on consumer GPUs. Good for specific tasks, code completion, simple chat.
Medium (13-34B)
LLaMA 2 13B, Mixtral 8x7B, CodeLlama 34B. Need high-end GPU. Better reasoning, multi-turn conversations.
Large (70B+)
LLaMA 3 70B, Qwen 72B. Need multi-GPU setup. Near-frontier quality, strong at complex tasks.
Frontier (400B+)
GPT-4, Claude 4, Gemini Ultra, LLaMA 3 405B. Require large clusters. State-of-the-art across most tasks.
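A quick way to see why these tiers map to hardware classes is to estimate weight memory: parameter count times bytes per parameter. The function below is a rough back-of-the-envelope sketch (my own helper, not from any library) that ignores KV cache and activation overhead, which add further memory on top.

```python
def vram_gb(params_billions: float, bytes_per_param: float = 2.0) -> float:
    """Rough memory needed just to hold the model weights, in GiB.

    bytes_per_param: 4.0 for fp32, 2.0 for fp16/bf16,
                     1.0 for int8, 0.5 for 4-bit quantization.
    Real usage is higher: KV cache and activations are not counted here.
    """
    return params_billions * 1e9 * bytes_per_param / 1024**3

# An 8B model in fp16 needs roughly 15 GiB of weights -- at the edge of a
# single 24 GB consumer GPU, while a 70B model in fp16 (~130 GiB) does not fit.
```

This is why 4-bit quantization matters in practice: it cuts the 8B weight footprint to under 4 GiB, and brings 70B-class models within reach of a two-GPU workstation.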
Capabilities
- Text generation: Writing articles, stories, emails, documentation, marketing copy.
- Reasoning: Math problems, logic puzzles, multi-step analysis, planning.
- Coding: Writing, debugging, and explaining code in dozens of programming languages.
- Translation: High-quality translation between languages, including low-resource languages.
- Summarization: Condensing long documents, papers, and conversations into concise summaries.
- Question answering: Responding to questions based on provided context or general knowledge.
- Instruction following: Performing complex multi-step tasks described in natural language.
Limitations
- Hallucinations: LLMs can generate confident but factually incorrect information. They don't "know" facts — they predict plausible text.
- Bias: Models reflect biases present in their training data, including cultural, gender, and racial biases.
- Context limits: Each model has a maximum context window (4K to 1M+ tokens). Information beyond this limit is lost.
- Knowledge cutoff: Models only know information from their training data, which has a cutoff date.
- No real understanding: LLMs manipulate patterns in text without true comprehension or consciousness.
- Cost: Running large models is expensive in terms of compute, energy, and API costs.
The Transformer Revolution
The Transformer architecture, introduced in 2017, was the breakthrough that made LLMs possible. Key innovations:
- Self-attention: Allows the model to weigh the importance of every word relative to every other word, regardless of distance.
- Parallelization: Unlike RNNs, Transformers process all tokens simultaneously, enabling massive parallelism on GPUs.
- Scaling: Performance improves predictably with more parameters, more data, and more compute (scaling laws).
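The self-attention idea above can be written out in a few lines of NumPy. This is a minimal single-head sketch of scaled dot-product attention as described in the 2017 paper, without the multi-head splitting, masking, or learned-parameter training that a real Transformer layer adds; the weight matrices here are random placeholders.

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention.

    Every token's query is compared against every token's key,
    regardless of distance -- that is the 'self-attention' bullet above.
    """
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])          # pairwise token affinities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over each row
    return weights @ V                                # weighted mix of value vectors

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))                          # 4 tokens, 8-dim embeddings
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
out = self_attention(X, Wq, Wk, Wv)                  # shape (4, 8): one vector per token
```

Note that nothing in the computation is sequential across tokens: all four rows are produced by a handful of matrix multiplications, which is the parallelization advantage over RNNs.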