Start Here

The LLM Landscape Overview

Before diving into individual models, understand the categories, history, and vocabulary of the LLM ecosystem. This section explains how to read model specs and what benchmarks actually measure.

Three Categories of LLMs

Every major LLM falls into one of three categories based on how it is released and licensed:

1. Commercial Closed-Source

These models are only accessible through APIs or proprietary interfaces. The model weights, training data, and architecture details are not publicly available.

  • Examples: GPT-4o, Claude Opus 4, Gemini 2.5 Pro, Grok-2
  • Advantages: Typically highest performance, managed infrastructure, regular updates
  • Disadvantages: Vendor lock-in, data privacy concerns, ongoing costs, no customization of weights

2. Commercial Open-Weight

The model weights are released publicly, but with restrictions on commercial use or modifications. The training data and full training process are typically not disclosed.

  • Examples: Llama 3.1 405B, Gemma 2 27B, Mistral Large
  • Advantages: Can self-host, fine-tune, inspect weights, run offline
  • Disadvantages: License restrictions may apply, requires infrastructure expertise, less support

3. True Open-Source

Model weights, training code, and often training data are released under permissive licenses (Apache 2.0, MIT), giving full transparency.

  • Examples: OLMo, Pythia, BLOOM, Falcon (some variants)
  • Advantages: Complete freedom, full reproducibility, community-driven improvements
  • Disadvantages: Often smaller scale, fewer resources behind development
💡 Important distinction: "Open-weight" and "open-source" are not the same. Many models marketed as "open source" (like Llama) are actually open-weight with custom licenses that impose usage restrictions. True open-source means the full stack is available under a recognized open-source license.

Timeline of Major LLM Releases

The LLM landscape has evolved rapidly. Here are the landmark releases:

Date     | Model                    | Provider  | Significance
Jun 2020 | GPT-3                    | OpenAI    | 175B parameters; demonstrated few-shot learning at scale
Mar 2022 | Chinchilla               | DeepMind  | Showed that, at a fixed compute budget, smaller models trained on more data beat larger under-trained ones
Nov 2022 | ChatGPT (GPT-3.5)        | OpenAI    | Brought LLMs to the mainstream; RLHF breakthrough
Feb 2023 | LLaMA                    | Meta      | Research-licensed open weights that ignited the open LLM movement
Mar 2023 | GPT-4                    | OpenAI    | First multimodal GPT; major quality leap
Mar 2023 | Claude 1                 | Anthropic | Constitutional AI approach to safety
Jul 2023 | Claude 2                 | Anthropic | 100K context window
Jul 2023 | Llama 2                  | Meta      | Commercially licensable open-weight models
Dec 2023 | Gemini 1.0               | Google    | Google's unified multimodal model
Dec 2023 | Mixtral 8x7B             | Mistral   | Mixture-of-experts approach for efficiency
Mar 2024 | Claude 3                 | Anthropic | Opus/Sonnet/Haiku tiers; 200K context
Apr 2024 | Llama 3                  | Meta      | 8B and 70B models with strong benchmarks
May 2024 | GPT-4o                   | OpenAI    | Omni model: text, vision, audio natively
Jun 2024 | Claude 3.5 Sonnet        | Anthropic | Sonnet-class model outperforming the prior Opus
Jul 2024 | Llama 3.1 405B           | Meta      | Largest open-weight model at release
Sep 2024 | o1                       | OpenAI    | Chain-of-thought reasoning model
Dec 2024 | Gemini 2.0 Flash         | Google    | Multimodal with agentic capabilities
Dec 2024 | DeepSeek V3              | DeepSeek  | Competitive open-weight model at low training cost
Jan 2025 | o3-mini                  | OpenAI    | Efficient reasoning model
Mar 2025 | Gemini 2.5 Pro           | Google    | Thinking model with 1M context
Apr 2025 | Llama 4 Scout/Maverick   | Meta      | Next-gen Llama with MoE architecture
May 2025 | Claude Opus 4 / Sonnet 4 | Anthropic | Latest Claude generation

Model Size Ranges

LLM "size" is measured in parameters — the learnable weights in the neural network. Larger models generally have more capacity but require more compute:

Size Category | Parameters | Typical Use                                 | Hardware Needed
Tiny          | 1B - 3B    | On-device, mobile, edge, simple tasks       | Phone, Raspberry Pi, laptop CPU
Small         | 7B - 9B    | Local development, specific tasks, chatbots | Single consumer GPU (8-16GB VRAM)
Medium        | 13B - 34B  | General purpose, good quality trade-off     | 1-2 high-end GPUs (24-48GB VRAM)
Large         | 65B - 90B  | High quality, complex reasoning             | Multi-GPU or cloud instance
Frontier      | 200B - 1T+ | State-of-the-art, all capabilities          | Multi-node GPU clusters; API access
Pro tip: Parameter count alone does not determine quality. Training data quality, training approach (e.g., RLHF, DPO), and architecture choices matter enormously. A well-trained 8B model can outperform a poorly trained 70B model on many tasks.
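
The hardware column follows from a simple rule of thumb: weight memory is roughly parameter count times bytes per parameter, plus runtime overhead for activations and the KV cache. A minimal sketch of that estimate (the 1.2x overhead factor is an assumption; real usage depends on context length and batch size):

```python
def estimate_vram_gb(params_billions: float, bytes_per_param: float = 2.0,
                     overhead: float = 1.2) -> float:
    """Rough VRAM (in GB) needed to run a model's weights.

    bytes_per_param: 2.0 for fp16/bf16, 1.0 for int8, 0.5 for 4-bit quantization.
    overhead: assumed multiplier for activations, KV cache, and runtime buffers.
    """
    return params_billions * bytes_per_param * overhead

# A 7B model in fp16 needs roughly 7 * 2 * 1.2 ≈ 16.8 GB,
# but only about 4.2 GB at 4-bit, which is why quantized small
# models fit on a single consumer GPU.
print(round(estimate_vram_gb(7), 1))        # 16.8
print(round(estimate_vram_gb(7, 0.5), 1))   # 4.2
```

The same arithmetic explains the jump between tiers: a 70B model at fp16 is around 168 GB, already multi-GPU territory.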

How to Read Model Specs

When evaluating an LLM, these are the key specifications to look at:

  • Parameters: Total number of learnable weights. Indicates model capacity but not quality alone.
  • Context window: Maximum number of tokens (input + output) the model can process at once. Ranges from 4K to 2M tokens across models.
  • Max output tokens: Maximum tokens the model can generate in a single response. Often much smaller than the context window.
  • Multimodal: Whether the model can process images, audio, video, or other modalities beyond text.
  • Training cutoff: The date up to which the model's training data extends. Important for questions about recent events.
  • Pricing: Usually measured per million tokens for both input and output. Output tokens typically cost more.
  • License: Determines how you can use the model — commercial use, modification, redistribution rights.
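
Per-million-token pricing translates directly into per-request arithmetic. A minimal sketch (the prices below are hypothetical, for illustration only):

```python
def request_cost(input_tokens: int, output_tokens: int,
                 input_price_per_m: float, output_price_per_m: float) -> float:
    """Dollar cost of one API call, given per-million-token prices."""
    return (input_tokens / 1_000_000 * input_price_per_m
            + output_tokens / 1_000_000 * output_price_per_m)

# Hypothetical prices: $3/M input, $15/M output.
# Note the output side dominates even though far fewer tokens are generated.
cost = request_cost(input_tokens=2_000, output_tokens=500,
                    input_price_per_m=3.0, output_price_per_m=15.0)
print(f"${cost:.4f}")  # $0.0135
```

Multiplying by expected request volume gives a quick monthly budget estimate before committing to a provider.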

Understanding Benchmarks

Benchmarks provide standardized ways to compare model capabilities. Here are the most commonly cited ones:

MMLU (Massive Multitask Language Understanding)

Tests knowledge across 57 subjects including STEM, humanities, social sciences, and professional domains. Scores range from 0-100%. A score of 90%+ indicates strong broad knowledge. Human expert average is roughly 89%.

HumanEval

Measures code generation ability. The model is given function signatures and docstrings and must generate correct implementations. Scored by pass@1 (percentage of problems solved correctly on the first attempt). Scores above 80% indicate strong coding ability.
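
In practice, pass@1 is usually computed with the unbiased pass@k estimator introduced in the HumanEval paper: generate n samples per problem, count the c that pass the tests, and estimate the chance that at least one of k drawn samples passes. A sketch:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (from the HumanEval paper):
    n samples per problem, c of which pass the unit tests;
    returns the probability that at least one of k drawn samples passes."""
    if n - c < k:
        return 1.0  # too few failures to fill a k-sample draw
    return 1.0 - comb(n - c, k) / comb(n, k)

# With 10 samples and 3 correct, pass@1 reduces to c/n = 0.3:
print(round(pass_at_k(10, 3, 1), 4))  # 0.3
```

Averaging this value over all problems in the benchmark gives the reported score.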

GPQA (Graduate-Level Google-Proof Q&A)

Expert-level science questions that are difficult even for domain experts with internet access. Designed to test deep reasoning. Human expert accuracy is around 65%. Scores above 50% are impressive.

MT-Bench

Multi-turn conversation benchmark scored by GPT-4 as a judge. Tests the model's ability to have coherent multi-turn dialogues across categories like writing, reasoning, math, coding, extraction, STEM, and humanities. Scored 1-10.

MATH

Competition-level mathematics problems from AMC, AIME, and other math competitions. Tests genuine mathematical reasoning, not just pattern matching. Scores above 50% indicate strong math capability.

Benchmark caveats: Benchmarks are useful but imperfect. Models can be overfitted to benchmark datasets. Real-world performance can differ significantly from benchmark scores. Always test models on your specific use case. Self-reported benchmark scores from model providers should be taken with appropriate skepticism.

Token Counting Basics

LLMs process text as tokens, not characters or words. Understanding tokens is essential for estimating costs and context window usage:

  • 1 token is roughly 3/4 of a word in English
  • 1,000 tokens is approximately 750 words
  • A typical page of text is about 400-500 tokens
  • Code tends to use more tokens per "word" than natural language
  • Non-English languages often require more tokens per word
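
The rules of thumb above are enough for quick cost and context-window estimates; for exact counts, use the model's own tokenizer (e.g. OpenAI's tiktoken library). A back-of-envelope sketch using the 3/4-word-per-token rule:

```python
def estimate_tokens(text: str, tokens_per_word: float = 4 / 3) -> int:
    """Heuristic token estimate: ~4/3 tokens per whitespace-separated word.

    Calibrated for English prose; code and non-English text usually
    tokenize less efficiently, so treat the result as a lower bound.
    """
    return round(len(text.split()) * tokens_per_word)

# ~750 words should come out near 1,000 tokens:
print(estimate_tokens("word " * 750))  # 1000
```

This kind of estimate is useful for deciding whether a document will fit in a model's context window before paying for an API call.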