Start Here

The LLM Landscape Overview

Before diving into individual models, understand the categories, history, and vocabulary of the LLM ecosystem. This section explains how to read model specs and what benchmarks actually measure.

Three Categories of LLMs

Every major LLM falls into one of three categories based on how it is released and licensed:

1. Commercial Closed-Source

These models are only accessible through APIs or proprietary interfaces. The model weights, training data, and architecture details are not publicly available.

  • Examples: GPT-4o, Claude Opus 4, Gemini 2.5 Pro, Grok-2
  • Advantages: Typically highest performance, managed infrastructure, regular updates
  • Disadvantages: Vendor lock-in, data privacy concerns, ongoing costs, no customization of weights

2. Commercial Open-Weight

The model weights are released publicly, but with restrictions on commercial use or modifications. The training data and full training process are typically not disclosed.

  • Examples: Llama 3.1 405B, Gemma 2 27B, Mistral Large
  • Advantages: Can self-host, fine-tune, inspect weights, run offline
  • Disadvantages: License restrictions may apply, requires infrastructure expertise, less support

3. True Open-Source

Model weights, training code, and often training data are released under permissive licenses (Apache 2.0, MIT), giving full transparency.

  • Examples: OLMo, Pythia, BLOOM, Falcon (some variants)
  • Advantages: Complete freedom, full reproducibility, community-driven improvements
  • Disadvantages: Often smaller scale, fewer resources behind development
💡 Important distinction: "Open-weight" and "open-source" are not the same. Many models marketed as "open source" (like Llama) are actually open-weight with custom licenses that impose usage restrictions. True open-source means the full stack is available under a recognized open-source license.

Timeline of Major LLM Releases

The LLM landscape has evolved rapidly. Here are the landmark releases:

Date     | Model                    | Provider  | Significance
Jun 2020 | GPT-3                    | OpenAI    | 175B parameters; demonstrated few-shot learning at scale
Mar 2022 | Chinchilla               | DeepMind  | Showed that, at a fixed compute budget, smaller models trained on more data beat larger under-trained ones
Nov 2022 | ChatGPT (GPT-3.5)        | OpenAI    | Brought LLMs to the mainstream; RLHF breakthrough
Feb 2023 | LLaMA                    | Meta      | Research-licensed open weights that ignited the open LLM movement
Mar 2023 | GPT-4                    | OpenAI    | First multimodal GPT; major quality leap
Mar 2023 | Claude 1                 | Anthropic | Constitutional AI approach to safety
Jul 2023 | Claude 2                 | Anthropic | 100K context window
Jul 2023 | Llama 2                  | Meta      | Commercially licensable open-weight models
Dec 2023 | Gemini 1.0               | Google    | Google's unified multimodal model
Dec 2023 | Mixtral 8x7B             | Mistral   | Mixture-of-experts approach for efficiency
Mar 2024 | Claude 3                 | Anthropic | Opus/Sonnet/Haiku tiers; 200K context
Apr 2024 | Llama 3                  | Meta      | 8B and 70B models with strong benchmarks
May 2024 | GPT-4o                   | OpenAI    | Omni model: text, vision, audio natively
Jun 2024 | Claude 3.5 Sonnet        | Anthropic | Sonnet-class model outperforming the prior Opus
Jul 2024 | Llama 3.1 405B           | Meta      | Largest open-weight model at release
Sep 2024 | o1                       | OpenAI    | Chain-of-thought reasoning model
Dec 2024 | Gemini 2.0 Flash         | Google    | Multimodal with agentic capabilities
Dec 2024 | DeepSeek V3              | DeepSeek  | Competitive open-weight model at low training cost
Jan 2025 | o3-mini                  | OpenAI    | Efficient reasoning model
Mar 2025 | Gemini 2.5 Pro           | Google    | Thinking model with 1M context
Apr 2025 | Llama 4 Scout/Maverick   | Meta      | Next-gen Llama with MoE architecture
May 2025 | Claude Opus 4 / Sonnet 4 | Anthropic | Latest Claude generation

Model Size Ranges

LLM "size" is measured in parameters — the learnable weights in the neural network. Larger models generally have more capacity but require more compute:

Size Category | Parameters | Typical Use                                 | Hardware Needed
Tiny          | 1B - 3B    | On-device, mobile, edge, simple tasks       | Phone, Raspberry Pi, laptop CPU
Small         | 7B - 9B    | Local development, specific tasks, chatbots | Single consumer GPU (8-16GB VRAM)
Medium        | 13B - 34B  | General purpose, good quality trade-off     | 1-2 high-end GPUs (24-48GB VRAM)
Large         | 65B - 90B  | High quality, complex reasoning             | Multi-GPU or cloud instance
Frontier      | 200B - 1T+ | State-of-the-art, all capabilities          | Multi-node GPU clusters; API access
Pro tip: Parameter count alone does not determine quality. Training data quality, training approach (e.g., RLHF, DPO), and architecture choices matter enormously. A well-trained 8B model can outperform a poorly trained 70B model on many tasks.
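
The hardware column follows from a simple rule of thumb: weight memory is roughly parameter count times bytes per parameter, plus runtime overhead for activations and the KV cache. A minimal sketch of that estimate (the 1.2x overhead factor is an assumption; real usage depends on context length and batch size):

```python
def estimate_vram_gb(params_billions: float, bytes_per_param: float = 2.0,
                     overhead: float = 1.2) -> float:
    """Rough VRAM (in GB) needed to run a model's weights.

    bytes_per_param: 2.0 for fp16/bf16, 1.0 for int8, 0.5 for 4-bit quantization.
    overhead: assumed multiplier for activations, KV cache, and runtime buffers.
    """
    return params_billions * bytes_per_param * overhead

# A 7B model in fp16 needs roughly 7 * 2 * 1.2 ≈ 16.8 GB,
# but only about 4.2 GB at 4-bit, which is why quantized small
# models fit on a single consumer GPU.
print(round(estimate_vram_gb(7), 1))        # 16.8
print(round(estimate_vram_gb(7, 0.5), 1))   # 4.2
```

The same arithmetic explains the jump between tiers: a 70B model at fp16 is around 168 GB, already multi-GPU territory.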

How to Read Model Specs

When evaluating an LLM, these are the key specifications to look at:

  • Parameters: Total number of learnable weights. Indicates model capacity but not quality alone.
  • Context window: Maximum number of tokens (input + output) the model can process at once. Ranges from 4K to 2M tokens across models.
  • Max output tokens: Maximum tokens the model can generate in a single response. Often much smaller than the context window.
  • Multimodal: Whether the model can process images, audio, video, or other modalities beyond text.
  • Training cutoff: The date up to which the model's training data extends. Important for questions about recent events.
  • Pricing: Usually measured per million tokens for both input and output. Output tokens typically cost more.
  • License: Determines how you can use the model — commercial use, modification, redistribution rights.
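
Per-million-token pricing translates directly into per-request arithmetic. A minimal sketch (the prices below are hypothetical, for illustration only):

```python
def request_cost(input_tokens: int, output_tokens: int,
                 input_price_per_m: float, output_price_per_m: float) -> float:
    """Dollar cost of one API call, given per-million-token prices."""
    return (input_tokens / 1_000_000 * input_price_per_m
            + output_tokens / 1_000_000 * output_price_per_m)

# Hypothetical prices: $3/M input, $15/M output.
# Note the output side dominates even though far fewer tokens are generated.
cost = request_cost(input_tokens=2_000, output_tokens=500,
                    input_price_per_m=3.0, output_price_per_m=15.0)
print(f"${cost:.4f}")  # $0.0135
```

Multiplying by expected request volume gives a quick monthly budget estimate before committing to a provider.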

Understanding Benchmarks

Benchmarks provide standardized ways to compare model capabilities. Here are the most commonly cited ones:

MMLU (Massive Multitask Language Understanding)

Tests knowledge across 57 subjects including STEM, humanities, social sciences, and professional domains. Scores range from 0-100%. A score of 90%+ indicates strong broad knowledge. Human expert average is roughly 89%.

HumanEval

Measures code generation ability. The model is given function signatures and docstrings and must generate correct implementations. Scored by pass@1 (percentage of problems solved correctly on the first attempt). Scores above 80% indicate strong coding ability.
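
In practice, pass@1 is usually computed with the unbiased pass@k estimator introduced in the HumanEval paper: generate n samples per problem, count the c that pass the tests, and estimate the chance that at least one of k drawn samples passes. A sketch:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (from the HumanEval paper):
    n samples per problem, c of which pass the unit tests;
    returns the probability that at least one of k drawn samples passes."""
    if n - c < k:
        return 1.0  # too few failures to fill a k-sample draw
    return 1.0 - comb(n - c, k) / comb(n, k)

# With 10 samples and 3 correct, pass@1 reduces to c/n = 0.3:
print(round(pass_at_k(10, 3, 1), 4))  # 0.3
```

Averaging this value over all problems in the benchmark gives the reported score.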

GPQA (Graduate-Level Google-Proof Q&A)

Expert-level science questions that are difficult even for domain experts with internet access. Designed to test deep reasoning. Human expert accuracy is around 65%. Scores above 50% are impressive.

MT-Bench

Multi-turn conversation benchmark scored by GPT-4 as a judge. Tests the model's ability to have coherent multi-turn dialogues across categories like writing, reasoning, math, coding, extraction, STEM, and humanities. Scored 1-10.

MATH

Competition-level mathematics problems from AMC, AIME, and other math competitions. Tests genuine mathematical reasoning, not just pattern matching. Scores above 50% indicate strong math capability.

Benchmark caveats: Benchmarks are useful but imperfect. Models can be overfitted to benchmark datasets. Real-world performance can differ significantly from benchmark scores. Always test models on your specific use case. Self-reported benchmark scores from model providers should be taken with appropriate skepticism.

Token Counting Basics

LLMs process text as tokens, not characters or words. Understanding tokens is essential for estimating costs and context window usage:

  • 1 token is roughly 3/4 of a word in English
  • 1,000 tokens is approximately 750 words
  • A typical page of text is about 400-500 tokens
  • Code tends to use more tokens per "word" than natural language
  • Non-English languages often require more tokens per word
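
The rules of thumb above are enough for quick cost and context-window estimates; for exact counts, use the model's own tokenizer (e.g. OpenAI's tiktoken library). A back-of-envelope sketch using the 3/4-word-per-token rule:

```python
def estimate_tokens(text: str, tokens_per_word: float = 4 / 3) -> int:
    """Heuristic token estimate: ~4/3 tokens per whitespace-separated word.

    Calibrated for English prose; code and non-English text usually
    tokenize less efficiently, so treat the result as a lower bound.
    """
    return round(len(text.split()) * tokens_per_word)

# ~750 words should come out near 1,000 tokens:
print(estimate_tokens("word " * 750))  # 1000
```

This kind of estimate is useful for deciding whether a document will fit in a model's context window before paying for an API call.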