# Meta Llama Models

Meta's Llama family is among the most influential open-weight model series, powering thousands of applications and fine-tunes. The lineage runs from the original Llama through to Llama 4, the latest generation, which adopts a mixture-of-experts architecture.
## Llama 4 Series
Meta's latest generation introduces a Mixture-of-Experts (MoE) architecture, enabling massive total parameter counts while keeping inference efficient by only activating a subset of parameters per token.
| Model | Released | Total Params | Active Params | Context | License |
|---|---|---|---|---|---|
| Llama 4 Scout | Apr 2025 | 109B (16 experts) | 17B | 10M | Llama 4 Community License |
| Llama 4 Maverick | Apr 2025 | 400B (128 experts) | 17B | 1M | Llama 4 Community License |
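The sparse-activation idea behind these numbers can be illustrated with a toy top-k router. This is a simplified sketch, not Llama 4's actual routing code (real MoE layers route every token inside each transformer block, and the function and variable names here are illustrative):

```python
import math
import random

def moe_forward(x, experts, gate, top_k=1):
    """Route token vector x to the top_k highest-scoring experts.

    experts: one tiny weight matrix per expert (stand-in for an expert FFN).
    gate:    one scoring vector per expert.
    Only the chosen experts run, so per-token compute grows with top_k,
    not with the total expert count -- the same reason a 400B-total model
    can have only 17B active parameters per token.
    """
    scores = [sum(g_i * x_i for g_i, x_i in zip(g, x)) for g in gate]
    chosen = sorted(range(len(scores)), key=lambda i: scores[i])[-top_k:]
    # Softmax over the selected experts' scores (shifted for stability).
    z = max(scores[i] for i in chosen)
    weights = [math.exp(scores[i] - z) for i in chosen]
    total = sum(weights)
    out = [0.0] * len(x)
    for w, i in zip(weights, chosen):
        y = [sum(w_rc * x_c for w_rc, x_c in zip(row, x)) for row in experts[i]]
        out = [o + (w / total) * y_r for o, y_r in zip(out, y)]
    return out, chosen

random.seed(0)
d, n_experts = 8, 16
experts = [[[random.gauss(0, 1) for _ in range(d)] for _ in range(d)]
           for _ in range(n_experts)]
gate = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n_experts)]
x = [random.gauss(0, 1) for _ in range(d)]

out, chosen = moe_forward(x, experts, gate, top_k=1)
print(len(chosen), "of", n_experts, "experts active for this token")
```

The key property is the last line: however many experts exist in total, only `top_k` of them execute per token, which is why total and active parameter counts diverge so sharply in the table above.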
### Llama 4 Scout
A 16-expert MoE model with an industry-leading 10 million token context window. Designed for applications requiring massive context, such as processing entire repositories or long document collections.
- Best for: Long-context applications, document analysis, codebase processing
- Key features: 10M context window, multimodal (text + images), MoE architecture, fits on a single H100 GPU with Int4 quantization
- Where to run: Cloud providers (AWS, Azure, GCP), Together AI, Fireworks AI, Groq
### Llama 4 Maverick
A larger 128-expert MoE model offering stronger performance on reasoning and coding tasks. Despite 400B total parameters, the active parameter count per token is only 17B, enabling efficient inference.
- Best for: Complex reasoning, coding, multilingual tasks, high-quality generation
- Key features: 128 experts, 1M context, multimodal, strong multilingual performance
- Where to run: Cloud providers, dedicated inference endpoints
## Llama 3.3
| Model | Released | Parameters | Context | License |
|---|---|---|---|---|
| Llama 3.3 70B | Dec 2024 | 70B | 128K | Llama 3.3 Community License |
A refined 70B model that approaches the performance of the much larger Llama 3.1 405B on many benchmarks. This represents a significant efficiency improvement: near-405B quality at a fraction of the serving cost.
- Best for: High-quality open-weight deployment where 405B is too expensive to run
- Where to run: 2x A100 80GB, cloud providers, Ollama, vLLM
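The hardware guidance above follows from a weights-only back-of-envelope estimate. This is a rough sketch (the helper name is ours, and real deployments also need memory for the KV cache and activations on top of the weights):

```python
def weight_gb(params_billions: float, bytes_per_param: float) -> float:
    """Approximate memory needed just to hold the model weights, in GB."""
    return params_billions * 1e9 * bytes_per_param / 1e9

# Llama 3.3 70B at FP16/BF16: 2 bytes per parameter.
fp16 = weight_gb(70, 2)    # 140 GB of weights -> spans 2x A100 80GB
# The same model quantized to 4-bit: 0.5 bytes per parameter.
int4 = weight_gb(70, 0.5)  # 35 GB of weights -> fits a single 48GB GPU
print(fp16, int4)
```

The same arithmetic explains the rest of the catalog: an 8B model at FP16 is about 16 GB, and the 1B/3B models quantized to 4-bit shrink to well under 2 GB, which is what makes on-device deployment feasible.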
## Llama 3.2 Series
Introduced lightweight and multimodal models to the Llama family.
| Model | Released | Parameters | Context | Multimodal |
|---|---|---|---|---|
| Llama 3.2 90B Vision | Sep 2024 | 90B | 128K | Text + Vision |
| Llama 3.2 11B Vision | Sep 2024 | 11B | 128K | Text + Vision |
| Llama 3.2 3B | Sep 2024 | 3B | 128K | Text only |
| Llama 3.2 1B | Sep 2024 | 1B | 128K | Text only |
### Llama 3.2 Vision Models (90B, 11B)
The first multimodal Llama models. Can process images alongside text for tasks like image captioning, visual Q&A, and document understanding.
- Best for: Multimodal applications, image understanding, document OCR
- 11B: Runs on a single GPU, good for development and lightweight deployment
- 90B: The strongest open-weight multimodal model in the Llama family at release
### Llama 3.2 Lightweight Models (3B, 1B)
Tiny models designed for on-device and edge deployment. Can run on mobile phones and IoT devices.
- Best for: Mobile apps, edge computing, IoT, privacy-sensitive on-device inference
- Where to run: Phones, laptops, Raspberry Pi, any device with 2-4GB RAM
## Llama 3.1 Series
| Model | Released | Parameters | Context | License |
|---|---|---|---|---|
| Llama 3.1 405B | Jul 2024 | 405B | 128K | Llama 3.1 Community License |
| Llama 3.1 70B | Jul 2024 | 70B | 128K | Llama 3.1 Community License |
| Llama 3.1 8B | Jul 2024 | 8B | 128K | Llama 3.1 Community License |
### Llama 3.1 405B
The largest open-weight dense model released up to that point, and competitive with GPT-4o and Claude 3.5 Sonnet at launch. A landmark for the open model community.
- Best for: Synthetic data generation, model distillation, highest quality open-weight inference
- Where to run: a single 8x H100 node with FP8 quantization (BF16 requires two nodes), or cloud providers
## Llama 3 Series
| Model | Released | Parameters | Context |
|---|---|---|---|
| Llama 3 70B | Apr 2024 | 70B | 8K |
| Llama 3 8B | Apr 2024 | 8B | 8K |
The Llama 3 base models with 8K context. Superseded by Llama 3.1 which extended context to 128K and improved quality.
## Llama 2 Series (Legacy)
| Model | Released | Parameters | Context |
|---|---|---|---|
| Llama 2 70B | Jul 2023 | 70B | 4K |
| Llama 2 13B | Jul 2023 | 13B | 4K |
| Llama 2 7B | Jul 2023 | 7B | 4K |
The first commercially licensable Llama models. Sparked a massive wave of fine-tuning and community development. Still widely used as a baseline for research.
## Code Llama
| Model | Released | Parameters | Context | Specialization |
|---|---|---|---|---|
| Code Llama | Aug 2023 (70B: Jan 2024) | 7B / 13B / 34B / 70B | 16K trained, usable up to 100K | General code |
| Code Llama Python | Aug 2023 (70B: Jan 2024) | 7B / 13B / 34B / 70B | 16K trained, usable up to 100K | Python-specific |
| Code Llama Instruct | Aug 2023 (70B: Jan 2024) | 7B / 13B / 34B / 70B | 16K trained, usable up to 100K | Instruction-tuned for chat |
Code-specialized models fine-tuned from Llama 2. Support fill-in-the-middle (code completion) and long context for code. Largely superseded by Llama 3+ models which have strong native coding abilities.
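Fill-in-the-middle works by rearranging the prompt with sentinel tokens so the model sees the code before and after a hole, then generates the missing span. A minimal sketch of the prompt layout described in the Code Llama release (the exact sentinel spellings vary by tokenizer, so verify them against your model's special tokens before relying on this format):

```python
def fim_prompt(prefix: str, suffix: str) -> str:
    """Build a Code Llama-style fill-in-the-middle prompt.

    Prefix-suffix-middle ordering: the model conditions on the code
    surrounding the hole and generates the middle until an
    end-of-text sentinel. Sentinel spellings here follow the Code
    Llama release; check your tokenizer before use.
    """
    return f"<PRE> {prefix} <SUF>{suffix} <MID>"

before = "def add(a, b):\n    "
after = "\n    return result"
prompt = fim_prompt(before, after)
print(prompt)
```

The generated completion would then be spliced back between `before` and `after`; editor plugins use exactly this pattern for cursor-position code completion.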
## Where to Run Llama Models
- Local: Ollama, llama.cpp, LM Studio, GPT4All
- Cloud inference: Together AI, Fireworks AI, Groq, Anyscale, Replicate
- Cloud hosting: AWS SageMaker, Azure ML, Google Cloud Vertex AI
- Fine-tuning: Hugging Face, Axolotl, Unsloth, Modal