Introduction to Large Language Models
Understand what LLMs are, trace their evolution from GPT-1 to today's frontier models, and learn about their capabilities and limitations.
What are LLMs?
Large Language Models (LLMs) are neural networks trained on massive text datasets that can understand and generate human language. They are built on the Transformer architecture and contain billions of parameters, enabling them to perform a wide range of language tasks.
At their core, LLMs are next-token prediction machines: given a sequence of text, they predict the most likely next word (or token). Despite this simple objective, scaling this approach to billions of parameters and trillions of tokens of training data produces models with remarkable emergent abilities.
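The next-token objective can be sketched in a few lines. This is a toy illustration, not a real model: the vocabulary and logits are made up by hand, but the mechanics (softmax over scores, then pick or sample a token) are the same ones a real LLM uses at each decoding step.

```python
import numpy as np

# Toy vocabulary and hand-picked logits for illustration only --
# a real model produces logits from billions of learned parameters.
vocab = ["the", "cat", "sat", "on", "mat"]
logits = np.array([0.5, 1.0, 3.0, 0.2, 0.1])  # model's raw scores for the next token

# Softmax turns raw logits into a probability distribution over the vocabulary.
probs = np.exp(logits - logits.max())
probs /= probs.sum()

# Greedy decoding: pick the single most likely token.
next_token = vocab[int(np.argmax(probs))]
```

Generation is just this step in a loop: append the chosen token to the input and predict again. Sampling from `probs` instead of taking the argmax is what "temperature" and similar settings control.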
A Brief History
| Year | Model | Parameters | Significance |
|---|---|---|---|
| 2017 | Transformer | N/A | Google's "Attention Is All You Need" paper introduced the architecture |
| 2018 | GPT-1 | 117M | OpenAI's first generative pre-trained transformer |
| 2018 | BERT | 340M | Google's bidirectional model, dominated NLP benchmarks |
| 2019 | GPT-2 | 1.5B | Showed coherent text generation, initially withheld for safety |
| 2019 | T5 | 11B | Google's text-to-text framework, unified NLP tasks |
| 2020 | GPT-3 | 175B | Demonstrated in-context learning, few-shot capabilities |
| 2023 | GPT-4 | ~1.8T* | Multimodal, near-human reasoning on many benchmarks |
| 2023 | LLaMA | 7-65B | Meta's open-weight models, sparked open-source LLM movement |
| 2023 | Claude 2 | N/A | Anthropic's model with 100K context window |
| 2024 | LLaMA 3 | 8-405B | Meta's most capable open model to date |
| 2024 | Claude 3.5 | N/A | Anthropic's Claude 3.5 Sonnet became a leading model on many benchmarks |

| 2025 | Claude 4 | N/A | Anthropic's frontier model with extended thinking and agentic capabilities |
* Estimated; OpenAI has not confirmed GPT-4's parameter count.
Scale: Understanding Parameters
LLMs are categorized by their parameter count, which roughly correlates with capability:
Small (1-8B)
Phi-3 Mini, Gemma 2B, LLaMA 3 8B. Can run on consumer GPUs. Good for specific tasks, code completion, simple chat.
Medium (13-34B)
LLaMA 2 13B, Mixtral 8x7B, CodeLlama 34B. Need high-end GPU. Better reasoning, multi-turn conversations.
Large (70B+)
LLaMA 3 70B, Qwen 72B. Need multi-GPU setup. Near-frontier quality, strong at complex tasks.
Frontier (400B+)
GPT-4, Claude 4, Gemini Ultra, LLaMA 3 405B. Require large clusters. State-of-the-art across most tasks.
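A quick way to see why these tiers map to hardware classes is to estimate weight memory: parameter count times bytes per parameter. The function below is a rough back-of-the-envelope sketch (my own helper, not from any library) that ignores KV cache and activation overhead, which add further memory on top.

```python
def vram_gb(params_billions: float, bytes_per_param: float = 2.0) -> float:
    """Rough memory needed just to hold the model weights, in GiB.

    bytes_per_param: 4.0 for fp32, 2.0 for fp16/bf16,
                     1.0 for int8, 0.5 for 4-bit quantization.
    Real usage is higher: KV cache and activations are not counted here.
    """
    return params_billions * 1e9 * bytes_per_param / 1024**3

# An 8B model in fp16 needs roughly 15 GiB of weights -- at the edge of a
# single 24 GB consumer GPU, while a 70B model in fp16 (~130 GiB) does not fit.
```

This is why 4-bit quantization matters in practice: it cuts the 8B weight footprint to under 4 GiB, and brings 70B-class models within reach of a two-GPU workstation.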
Capabilities
- Text generation: Writing articles, stories, emails, documentation, marketing copy.
- Reasoning: Math problems, logic puzzles, multi-step analysis, planning.
- Coding: Writing, debugging, and explaining code in dozens of programming languages.
- Translation: High-quality translation between languages, including low-resource languages.
- Summarization: Condensing long documents, papers, and conversations into concise summaries.
- Question answering: Responding to questions based on provided context or general knowledge.
- Instruction following: Performing complex multi-step tasks described in natural language.
Limitations
- Hallucinations: LLMs can generate confident but factually incorrect information. They don't "know" facts — they predict plausible text.
- Bias: Models reflect biases present in their training data, including cultural, gender, and racial biases.
- Context limits: Each model has a maximum context window (4K to 1M+ tokens). Information beyond this limit is lost.
- Knowledge cutoff: Models only know information from their training data, which has a cutoff date.
- No real understanding: LLMs manipulate patterns in text without true comprehension or consciousness.
- Cost: Running large models is expensive in terms of compute, energy, and API costs.
The Transformer Revolution
The Transformer architecture, introduced in 2017, was the breakthrough that made LLMs possible. Key innovations:
- Self-attention: Allows the model to weigh the importance of every word relative to every other word, regardless of distance.
- Parallelization: Unlike RNNs, Transformers process all tokens simultaneously, enabling massive parallelism on GPUs.
- Scaling: Performance improves predictably with more parameters, more data, and more compute (scaling laws).
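The self-attention idea above can be written out in a few lines of NumPy. This is a minimal single-head sketch of scaled dot-product attention as described in the 2017 paper, without the multi-head splitting, masking, or learned-parameter training that a real Transformer layer adds; the weight matrices here are random placeholders.

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention.

    Every token's query is compared against every token's key,
    regardless of distance -- that is the 'self-attention' bullet above.
    """
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])          # pairwise token affinities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over each row
    return weights @ V                                # weighted mix of value vectors

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))                          # 4 tokens, 8-dim embeddings
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
out = self_attention(X, Wq, Wk, Wv)                  # shape (4, 8): one vector per token
```

Note that nothing in the computation is sequential across tokens: all four rows are produced by a handful of matrix multiplications, which is the parallelization advantage over RNNs.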