Open Source LLM Models
A guide to the best open-weight and open-source LLMs you can download, run locally, fine-tune, and deploy. Covers model rankings, licensing, quantization formats, and the community fine-tuning ecosystem.
Top Open-Weight LLMs Ranked
Based on the Hugging Face Open LLM Leaderboard and community consensus, these are the strongest open-weight models by category:
General Purpose
| Rank | Model | Parameters | License | Strengths |
|---|---|---|---|---|
| 1 | Llama 4 Maverick | 400B MoE | Llama 4 License | Strongest open model overall |
| 2 | DeepSeek V3 | 671B MoE | MIT | Best cost-efficiency, strong reasoning |
| 3 | Qwen 2.5 72B | 72B | Apache 2.0 | Best multilingual, strong coding |
| 4 | Llama 3.3 70B | 70B | Llama 3.3 License | 405B quality in 70B size |
| 5 | Mixtral 8x22B | 141B MoE | Apache 2.0 | Efficient MoE, permissive license |
Code-Specialized
| Model | Parameters | License | Strengths |
|---|---|---|---|
| Qwen 2.5 Coder 32B | 32B | Apache 2.0 | Best open code model, multi-language |
| DeepSeek Coder V2 | 236B MoE | MIT | Strong code generation and understanding |
| Codestral | 22B | MNPL (non-commercial) | Fast, 80+ languages, fill-in-the-middle |
| CodeGemma 7B | 7B | Gemma License | Lightweight code model |
Multimodal (Vision + Text)
| Model | Parameters | License | Strengths |
|---|---|---|---|
| Llama 3.2 90B Vision | 90B | Llama 3.2 License | Strongest open multimodal model |
| Llama 3.2 11B Vision | 11B | Llama 3.2 License | Efficient multimodal, single GPU |
| Pixtral 12B | 12B | Apache 2.0 | Permissive license, good quality |
| LLaVA 1.6 | 7B / 13B / 34B | Apache 2.0 | Pioneer in open multimodal |
Small / Edge Models
| Model | Parameters | License | Strengths |
|---|---|---|---|
| Phi-4 | 14B | MIT | Best reasoning for size |
| Gemma 2 9B | 9B | Gemma License | Strong overall quality |
| Llama 3.2 3B | 3B | Llama 3.2 License | Great for mobile/edge |
| Qwen 2.5 7B | 7B | Apache 2.0 | Best 7B model, multilingual |
| Phi-3 Mini | 3.8B | MIT | Extremely capable for 3.8B |
How to Download and Run Open Models
There are several ways to run open-weight models locally or on your own infrastructure:
Desktop/Laptop Tools
- Ollama: The easiest way to run models locally. Simple CLI: `ollama run llama3.2`. Handles downloading, quantization, and serving. macOS, Linux, Windows.
- LM Studio: GUI application for running local models. Drag-and-drop model loading, built-in chat interface, OpenAI-compatible API server.
- GPT4All: Desktop application focused on privacy. Runs models entirely offline. Simple interface for non-technical users.
- llama.cpp: The foundational C++ library for efficient CPU and GPU inference. Powers Ollama, LM Studio, and many other tools. Supports GGUF format.
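To show how these tools are driven programmatically: Ollama exposes an OpenAI-compatible HTTP API (on port 11434 by default), so a locally running model can be queried with nothing but the Python standard library. A minimal sketch, assuming the model has already been pulled with `ollama run llama3.2`:

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/v1/chat/completions"  # Ollama's default port

def build_chat_request(model: str, prompt: str) -> dict:
    """Build an OpenAI-style chat payload, which Ollama's API accepts as-is."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
    }

def chat(model: str, prompt: str) -> str:
    """POST the prompt to a locally running Ollama server and return the reply text."""
    data = json.dumps(build_chat_request(model, prompt)).encode("utf-8")
    req = urllib.request.Request(
        OLLAMA_URL, data=data, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        body = json.loads(resp.read())
    return body["choices"][0]["message"]["content"]

# Usage (requires a running Ollama server with the model pulled):
# print(chat("llama3.2", "Explain GGUF in one sentence."))
```

Because the endpoint follows the OpenAI schema, the same request shape works against LM Studio's local API server as well.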
Server/Production Tools
- vLLM: High-throughput serving engine. PagedAttention for efficient memory. Best for production deployments with multiple concurrent users.
- TGI (Text Generation Inference): Hugging Face's production serving solution. Supports continuous batching, quantization, and tensor parallelism.
- TensorRT-LLM: NVIDIA's optimized inference engine. Maximum performance on NVIDIA GPUs.
- SGLang: Fast serving framework with RadixAttention for efficient prefix caching.
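For the server-side engines, batch generation is usually a few lines. A sketch of vLLM's offline inference API (requires `pip install vllm` and a GPU; the model ID here is just an example):

```python
def run_batch(model_id: str, prompts: list[str]) -> list[str]:
    """Generate completions for a batch of prompts with vLLM's offline API."""
    from vllm import LLM, SamplingParams  # imported lazily so the sketch parses without vLLM installed

    llm = LLM(model=model_id)  # loads weights onto the GPU
    params = SamplingParams(temperature=0.7, max_tokens=256)
    outputs = llm.generate(prompts, params)  # continuous batching happens internally
    return [o.outputs[0].text for o in outputs]

# Usage (downloads the model on first run):
# replies = run_batch("Qwen/Qwen2.5-7B-Instruct", ["What is PagedAttention?"])
```

vLLM can also be launched as an OpenAI-compatible server (`vllm serve <model>`) for multi-client deployments, which is the mode most production setups use.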
Cloud Inference Providers
- Together AI: Largest selection of open models, competitive pricing
- Fireworks AI: Fast inference, good developer experience
- Groq: Custom LPU hardware for extremely fast inference
- Replicate: Simple API, pay-per-use, many model options
Licensing Comparison
Understanding licenses is critical before using open-weight models in production:
| License | Commercial Use | Modify | Redistribute | Key Restrictions |
|---|---|---|---|---|
| Apache 2.0 | Yes | Yes | Yes | None significant. Most permissive. |
| MIT | Yes | Yes | Yes | None significant. Very permissive. |
| Llama License | Yes* | Yes | Yes | *700M+ MAU companies must request license from Meta |
| Gemma License | Yes | Yes | Yes | Cannot use outputs to train competing models |
| MNPL (Mistral) | No | Yes | Yes | Non-commercial only (used for Codestral) |
| CC BY-NC 4.0 | No | Yes | Yes | Non-commercial only |
Quantization Formats Explained
Quantization reduces model size and memory requirements by using lower-precision numbers for weights. This makes it possible to run large models on consumer hardware.
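The core mechanic can be shown with a toy example. This is a minimal sketch of symmetric 8-bit quantization with one shared scale; real formats such as GGUF's K-quants use per-block scales and mixed precisions, but the size/accuracy trade-off is the same idea:

```python
def quantize_8bit(weights: list[float]) -> tuple[list[int], float]:
    """Symmetric 8-bit quantization: one shared scale, integers in [-127, 127]."""
    scale = max(abs(w) for w in weights) / 127.0
    return [round(w / scale) for w in weights], scale

def dequantize(q: list[int], scale: float) -> list[float]:
    """Recover approximate float weights from the integers and the scale."""
    return [v * scale for v in q]

weights = [0.12, -0.5, 0.033, 0.27]
q, scale = quantize_8bit(weights)
restored = dequantize(q, scale)
# Each weight now costs 1 byte instead of 4 (fp32) or 2 (fp16),
# at the price of a small rounding error per weight.
```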
GGUF (GPT-Generated Unified Format)
The standard format for llama.cpp, Ollama, and LM Studio. Supports CPU and GPU inference. Most popular for local running.
- Q4_K_M: 4-bit quantization, good balance of quality and size. ~4.5 GB for a 7B model. Recommended starting point.
- Q5_K_M: 5-bit, slightly better quality, ~5 GB for 7B. Good when you have the RAM.
- Q8_0: 8-bit, near-original quality, ~7.5 GB for 7B.
- Q2_K: 2-bit, significant quality loss but smallest size. Only for constrained environments.
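The sizes above follow directly from bits per weight. A rough estimator, using approximate effective bit widths for each quant type (these are ballpark figures; actual GGUF files vary by architecture because some tensors are kept at higher precision):

```python
# Approximate effective bits per weight for common GGUF quant types.
BITS_PER_WEIGHT = {"Q2_K": 3.35, "Q4_K_M": 4.85, "Q5_K_M": 5.69, "Q8_0": 8.5, "F16": 16.0}

def estimate_size_gb(params_billion: float, quant: str) -> float:
    """Rough GGUF file size: parameters x bits / 8, ignoring metadata overhead."""
    return params_billion * BITS_PER_WEIGHT[quant] / 8

for q in ("Q4_K_M", "Q5_K_M", "Q8_0"):
    print(f"7B at {q}: ~{estimate_size_gb(7, q):.1f} GB")
```

For a 7B model this gives roughly 4.2, 5.0, and 7.4 GB, in line with the figures listed above; add 1–2 GB of headroom for the KV cache when sizing RAM.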
GPTQ
GPU-focused quantization format. Requires NVIDIA GPU. Fast inference via exllama/exllamav2 kernels. Available in 3-bit, 4-bit, and 8-bit variants.
AWQ (Activation-Aware Weight Quantization)
Newer GPU quantization method that preserves salient weights based on activation patterns. Often slightly better quality than GPTQ at the same bit width. Supported by vLLM and TGI.
BitsAndBytes (bnb)
Quantization library integrated with Hugging Face Transformers. Supports 4-bit (NF4) and 8-bit quantization. Easy to use but slower than GPTQ/AWQ for inference.
Community Fine-Tunes
The open-source community produces thousands of fine-tuned variants of base models, optimized for specific tasks:
- Chat/Instruct fine-tunes: Models tuned for conversational use (e.g., Hermes, OpenChat, Neural-Chat)
- Code fine-tunes: Models specialized for programming (e.g., WizardCoder, Phind-CodeLlama)
- Creative writing: Models tuned for storytelling and creative content (e.g., Nous-Hermes, Mythomax)
- Domain-specific: Medical (Meditron), legal (SaulLM), finance (FinMA), science (SciGLM)
- Merged models: Community-created merges combining strengths of multiple fine-tunes using techniques like TIES, DARE, and SLERP
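To make the merging idea concrete, here is a toy SLERP (spherical linear interpolation) over two flattened weight vectors. Real merge tooling operates tensor-by-tensor across whole checkpoints with per-layer interpolation factors; this sketch shows only the core interpolation step:

```python
import math

def slerp(a: list[float], b: list[float], t: float) -> list[float]:
    """Spherically interpolate between weight vectors a and b; t=0 gives a, t=1 gives b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    omega = math.acos(max(-1.0, min(1.0, dot / (norm_a * norm_b))))  # angle between vectors
    if omega < 1e-8:  # nearly parallel: fall back to plain linear interpolation
        return [(1 - t) * x + t * y for x, y in zip(a, b)]
    s = math.sin(omega)
    return [
        (math.sin((1 - t) * omega) / s) * x + (math.sin(t * omega) / s) * y
        for x, y in zip(a, b)
    ]
```

SLERP is preferred over plain averaging because it follows the arc between the two weight vectors rather than cutting through the interior, which tends to preserve weight magnitudes better.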