The LLM Landscape Overview
Before diving into individual models, it helps to understand the categories, history, and vocabulary of the LLM ecosystem. This section explains how to read model specs and what benchmarks actually measure.
Three Categories of LLMs
Every major LLM falls into one of three categories based on how it is released and licensed:
1. Commercial Closed-Source
These models are only accessible through APIs or proprietary interfaces. The model weights, training data, and architecture details are not publicly available.
- Examples: GPT-4o, Claude Opus 4, Gemini 2.5 Pro, Grok-2
- Advantages: Typically highest performance, managed infrastructure, regular updates
- Disadvantages: Vendor lock-in, data privacy concerns, ongoing costs, no customization of weights
2. Commercial Open-Weight
The model weights are released publicly, but with restrictions on commercial use or modifications. The training data and full training process are typically not disclosed.
- Examples: Llama 3.1 405B, Gemma 2 27B, Mistral Large
- Advantages: Can self-host, fine-tune, inspect weights, run offline
- Disadvantages: License restrictions may apply, requires infrastructure expertise, less support
3. True Open-Source
Model weights, training code, and often training data are released under permissive licenses (Apache 2.0, MIT). Full transparency.
- Examples: OLMo, Pythia, BLOOM, Falcon (some variants)
- Advantages: Complete freedom, full reproducibility, community-driven improvements
- Disadvantages: Often smaller scale, fewer resources behind development
Timeline of Major LLM Releases
The LLM landscape has evolved rapidly. Here are the landmark releases:
| Date | Model | Provider | Significance |
|---|---|---|---|
| Jun 2020 | GPT-3 | OpenAI | 175B parameters; demonstrated few-shot learning at scale |
| Mar 2022 | Chinchilla | DeepMind | Showed that, at a fixed compute budget, smaller models trained on more data outperform larger, under-trained models |
| Nov 2022 | ChatGPT (GPT-3.5) | OpenAI | Brought LLMs to the mainstream; RLHF breakthrough |
| Feb 2023 | LLaMA | Meta | Open-weight models ignited the open-source LLM movement |
| Mar 2023 | GPT-4 | OpenAI | First multimodal GPT; major quality leap |
| Mar 2023 | Claude 1 | Anthropic | Constitutional AI approach to safety |
| Jul 2023 | Claude 2 | Anthropic | 100K context window |
| Jul 2023 | Llama 2 | Meta | Commercially licensable open-weight models |
| Dec 2023 | Gemini 1.0 | Google | Google's unified multimodal model |
| Dec 2023 | Mixtral 8x7B | Mistral | Mixture-of-experts approach for efficiency |
| Mar 2024 | Claude 3 | Anthropic | Opus/Sonnet/Haiku tiers; 200K context |
| Apr 2024 | Llama 3 | Meta | 8B and 70B models with strong benchmarks |
| May 2024 | GPT-4o | OpenAI | Omni model: text, vision, audio natively |
| Jun 2024 | Claude 3.5 Sonnet | Anthropic | Sonnet-class outperforming prior Opus |
| Jul 2024 | Llama 3.1 405B | Meta | Largest open-weight model at release |
| Sep 2024 | o1 | OpenAI | Chain-of-thought reasoning model |
| Dec 2024 | Gemini 2.0 Flash | Google | Multimodal with agentic capabilities |
| Jan 2025 | DeepSeek V3 | DeepSeek | Competitive open model at low training cost |
| Jan 2025 | o3-mini | OpenAI | Efficient reasoning model |
| Mar 2025 | Gemini 2.5 Pro | Google | Thinking model with 1M context |
| Apr 2025 | Llama 4 Scout/Maverick | Meta | Next-gen Llama with MoE architecture |
| May 2025 | Claude Opus 4 / Sonnet 4 | Anthropic | Latest Claude generation |
Model Size Ranges
LLM "size" is measured in parameters — the learnable weights in the neural network. Larger models generally have more capacity but require more compute:
| Size Category | Parameters | Typical Use | Hardware Needed |
|---|---|---|---|
| Tiny | 1B – 3B | On-device, mobile, edge, simple tasks | Phone, Raspberry Pi, laptop CPU |
| Small | 7B – 9B | Local development, specific tasks, chatbots | Single consumer GPU (8-16GB VRAM) |
| Medium | 13B – 34B | General purpose, good quality trade-off | 1-2 high-end GPUs (24-48GB VRAM) |
| Large | 65B – 90B | High quality, complex reasoning | Multi-GPU or cloud instance |
| Frontier | 200B – 1T+ | State-of-the-art, all capabilities | Multi-node GPU clusters; API access |
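The hardware requirements in the table follow from a simple rule of thumb: weights stored in fp16 take about 2 bytes per parameter, and quantizing to 4-bit cuts that to roughly 0.5 bytes. A minimal sketch of that estimate (the 1.2x overhead factor for activations and KV cache is an assumption, not a fixed constant):

```python
def estimate_vram_gb(params_billion: float, bytes_per_param: float = 2.0,
                     overhead: float = 1.2) -> float:
    """Rough inference VRAM estimate: weight storage times an assumed
    overhead factor for activations and the KV cache."""
    return params_billion * bytes_per_param * overhead

# A 7B model: fp16 (2 bytes/param) vs. 4-bit quantized (~0.5 bytes/param)
print(round(estimate_vram_gb(7), 1))        # ~16.8 GB -> needs a 24GB GPU
print(round(estimate_vram_gb(7, 0.5), 1))   # ~4.2 GB  -> fits an 8GB GPU
```

This is why a "Small" 7B-9B model lands in the single-consumer-GPU row: it fits comfortably in fp16 on 24GB cards, and quantized versions fit in 8GB.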
How to Read Model Specs
When evaluating an LLM, these are the key specifications to look at:
- Parameters: Total number of learnable weights. Indicates model capacity but not quality alone.
- Context window: Maximum number of tokens (input + output) the model can process at once. Ranges from 4K to 2M tokens across models.
- Max output tokens: Maximum tokens the model can generate in a single response. Often much smaller than the context window.
- Multimodal: Whether the model can process images, audio, video, or other modalities beyond text.
- Training cutoff: The date up to which the model's training data extends. Important for questions about recent events.
- Pricing: Usually measured per million tokens for both input and output. Output tokens typically cost more.
- License: Determines how you can use the model — commercial use, modification, redistribution rights.
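Since pricing is quoted per million tokens with separate input and output rates, estimating the cost of a request is a small calculation. A sketch, using hypothetical prices (real rates vary by provider and model):

```python
def request_cost(input_tokens: int, output_tokens: int,
                 price_in_per_m: float, price_out_per_m: float) -> float:
    """Cost in USD of one API call, given per-million-token prices."""
    return (input_tokens * price_in_per_m
            + output_tokens * price_out_per_m) / 1_000_000

# Hypothetical pricing: $3.00/M input tokens, $15.00/M output tokens
cost = request_cost(input_tokens=2_000, output_tokens=500,
                    price_in_per_m=3.00, price_out_per_m=15.00)
print(f"${cost:.4f}")  # $0.0135
```

Note how the 500 output tokens cost more than the 2,000 input tokens here; the asymmetric rates are why long generations dominate API bills.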
Understanding Benchmarks
Benchmarks provide standardized ways to compare model capabilities. Here are the most commonly cited ones:
MMLU (Massive Multitask Language Understanding)
Tests knowledge across 57 subjects including STEM, humanities, social sciences, and professional domains. Scores range from 0-100%. A score of 90%+ indicates strong broad knowledge. Human expert average is roughly 89%.
HumanEval
Measures code generation ability. The model is given function signatures and docstrings and must generate correct implementations. Scored by pass@1 (percentage of problems solved correctly on the first attempt). Scores above 80% indicate strong coding ability.
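The pass@k metric is usually computed with the unbiased estimator introduced alongside HumanEval: generate n samples per problem, count the c correct ones, and estimate the chance that at least one of k draws is correct. For k = 1 this reduces to c / n:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: n samples per problem, c correct.
    Returns 1 - C(n-c, k) / C(n, k); pass@1 simplifies to c / n."""
    if n - c < k:
        return 1.0  # fewer incorrect samples than k: some draw must pass
    return 1.0 - comb(n - c, k) / comb(n, k)

# 10 samples generated, 3 passed the tests: pass@1 is simply 3/10
print(round(pass_at_k(n=10, c=3, k=1), 4))  # 0.3
```

Averaging this value over all 164 HumanEval problems gives the headline pass@1 score.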
GPQA (Graduate-Level Google-Proof Q&A)
Expert-level science questions that are difficult even for domain experts with internet access. Designed to test deep reasoning. Human expert accuracy is around 65%. Scores above 50% are impressive.
MT-Bench
Multi-turn conversation benchmark scored by GPT-4 as a judge. Tests the model's ability to have coherent multi-turn dialogues across categories like writing, reasoning, math, coding, extraction, STEM, and humanities. Scored 1-10.
MATH
Competition-level mathematics problems from AMC, AIME, and other math competitions. Tests genuine mathematical reasoning, not just pattern matching. Scores above 50% indicate strong math capability.
Token Counting Basics
LLMs process text as tokens, not characters or words. Understanding tokens is essential for estimating costs and context window usage:
- 1 token is roughly 3/4 of a word in English
- 1,000 tokens is approximately 750 words
- A typical page of text is about 400-500 tokens
- Code tends to use more tokens per "word" than natural language
- Non-English languages often require more tokens per word
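The rules of thumb above are enough for quick cost and context-window estimates. A minimal sketch of the word-based heuristic (exact counts require the model's own tokenizer, so treat this as an approximation only):

```python
def estimate_tokens(text: str) -> int:
    """Rough English-text heuristic: ~4/3 tokens per word.
    Code and non-English text will typically need more tokens than this."""
    return round(len(text.split()) * 4 / 3)

sample = "the quick brown fox " * 250   # 1,000 words
print(estimate_tokens(sample))          # ~1333 tokens
```

This matches the rule above in reverse: 1,000 tokens is approximately 750 words, so 1,000 words is approximately 1,333 tokens.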