Reference

Open Source LLM Models

A guide to the best open-weight and open-source LLMs you can download, run locally, fine-tune, and deploy. Covers model rankings, licensing, quantization formats, and community ecosystem.

Top Open-Weight LLMs Ranked

Based on the Hugging Face Open LLM Leaderboard and community consensus, these are the strongest open-weight models by category:

General Purpose

| Rank | Model | Parameters | License | Strengths |
|---|---|---|---|---|
| 1 | Llama 4 Maverick | 400B MoE | Llama 4 License | Strongest open model overall |
| 2 | DeepSeek V3 | 671B MoE | MIT | Best cost-efficiency, strong reasoning |
| 3 | Qwen 2.5 72B | 72B | Apache 2.0 | Best multilingual, strong coding |
| 4 | Llama 3.3 70B | 70B | Llama 3.3 License | 405B quality in 70B size |
| 5 | Mixtral 8x22B | 141B MoE | Apache 2.0 | Efficient MoE, permissive license |

Code-Specialized

| Model | Parameters | License | Strengths |
|---|---|---|---|
| Qwen 2.5 Coder 32B | 32B | Apache 2.0 | Best open code model, multi-language |
| DeepSeek Coder V2 | 236B MoE | MIT | Strong code generation and understanding |
| Codestral | 22B | MNPL (non-commercial) | Fast, 80+ languages, fill-in-the-middle |
| CodeGemma 7B | 7B | Gemma License | Lightweight code model |

Multimodal (Vision + Text)

| Model | Parameters | License | Strengths |
|---|---|---|---|
| Llama 3.2 90B Vision | 90B | Llama License | Strongest open multimodal model |
| Llama 3.2 11B Vision | 11B | Llama License | Efficient multimodal, single GPU |
| Pixtral 12B | 12B | Apache 2.0 | Permissive license, good quality |
| LLaVA 1.6 | 7B / 13B / 34B | Apache 2.0 | Pioneer in open multimodal |

Small / Edge Models

| Model | Parameters | License | Strengths |
|---|---|---|---|
| Phi-4 | 14B | MIT | Best reasoning for size |
| Gemma 2 9B | 9B | Gemma License | Strong overall quality |
| Llama 3.2 3B | 3B | Llama License | Great for mobile/edge |
| Qwen 2.5 7B | 7B | Apache 2.0 | Best 7B model, multilingual |
| Phi-3 Mini | 3.8B | MIT | Extremely capable for 3.8B |

How to Download and Run Open Models

There are several ways to run open-weight models locally or on your own infrastructure:

Desktop/Laptop Tools

  • Ollama: The easiest way to run models locally. Simple CLI: `ollama run llama3.2`. Handles downloading, quantization, and serving. macOS, Linux, Windows.
  • LM Studio: GUI application for running local models. Drag-and-drop model loading, built-in chat interface, OpenAI-compatible API server.
  • GPT4All: Desktop application focused on privacy. Runs models entirely offline. Simple interface for non-technical users.
  • llama.cpp: The foundational C++ library for efficient CPU and GPU inference. Powers Ollama, LM Studio, and many other tools. Supports GGUF format.
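
All of these tools can expose an OpenAI-compatible HTTP API (Ollama natively, LM Studio and llama.cpp via their server modes), so a local model can be queried with a few lines of standard-library Python. A minimal sketch, assuming Ollama is serving on its default port 11434 and `llama3.2` has already been pulled:

```python
import json
import urllib.request

# Chat request in the OpenAI-compatible format that Ollama,
# LM Studio, and llama.cpp's server all accept.
payload = {
    "model": "llama3.2",  # assumes `ollama pull llama3.2` was run first
    "messages": [{"role": "user", "content": "Say hello in one word."}],
    "stream": False,
}
body = json.dumps(payload).encode("utf-8")

req = urllib.request.Request(
    "http://localhost:11434/v1/chat/completions",  # Ollama's default endpoint
    data=body,
    headers={"Content-Type": "application/json"},
)

try:
    with urllib.request.urlopen(req, timeout=30) as resp:
        reply = json.loads(resp.read())
        print(reply["choices"][0]["message"]["content"])
except OSError:
    print("No local server found; start one with `ollama serve`.")
```

Because the request shape is the standard OpenAI one, swapping the base URL is usually all it takes to move the same client code between local tools and cloud providers.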

Server/Production Tools

  • vLLM: High-throughput serving engine. PagedAttention for efficient memory. Best for production deployments with multiple concurrent users.
  • TGI (Text Generation Inference): Hugging Face's production serving solution. Supports continuous batching, quantization, and tensor parallelism.
  • TensorRT-LLM: NVIDIA's optimized inference engine. Maximum performance on NVIDIA GPUs.
  • SGLang: Fast serving framework with RadixAttention for efficient prefix caching.

Cloud Inference Providers

  • Together AI: Largest selection of open models, competitive pricing
  • Fireworks AI: Fast inference, good developer experience
  • Groq: Custom LPU hardware for extremely fast inference
  • Replicate: Simple API, pay-per-use, many model options

Licensing Comparison

Understanding licenses is critical before using open-weight models in production:

| License | Commercial Use | Modify | Redistribute | Key Restrictions |
|---|---|---|---|---|
| Apache 2.0 | Yes | Yes | Yes | None significant. Most permissive. |
| MIT | Yes | Yes | Yes | None significant. Very permissive. |
| Llama License | Yes* | Yes | Yes* | Companies with 700M+ MAU must request a license from Meta |
| Gemma License | Yes | Yes | Yes | Cannot use outputs to train competing models |
| MNPL (Mistral) | No | Yes | Yes | Non-commercial only (used for Codestral) |
| CC BY-NC 4.0 | No | Yes | Yes | Non-commercial only |

Legal note: Always read the full license text before deploying a model commercially. "Open-weight" does not mean "open-source." Many popular models (Llama, Gemma) have custom licenses with specific restrictions. When in doubt, consult legal counsel.

Quantization Formats Explained

Quantization reduces model size and memory requirements by using lower-precision numbers for weights. This makes it possible to run large models on consumer hardware.

GGUF (GPT-Generated Unified Format)

The standard format for llama.cpp, Ollama, and LM Studio. Supports CPU and GPU inference, and is the most popular choice for running models locally.

  • Q4_K_M: 4-bit quantization, good balance of quality and size. ~4.5 GB for a 7B model. Recommended starting point.
  • Q5_K_M: 5-bit, slightly better quality, ~5 GB for 7B. Good when you have the RAM.
  • Q8_0: 8-bit, near-original quality, ~7.5 GB for 7B.
  • Q2_K: 2-bit, significant quality loss but smallest size. Only for constrained environments.
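
The size figures above follow directly from bits-per-weight arithmetic. A back-of-the-envelope helper (my own sketch, not part of any tool; the effective bits-per-weight values are approximations, since K-quants store scale metadata alongside the weights):

```python
def quantized_size_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate on-disk size of a quantized model in GB.

    bits_per_weight is the *effective* bits per parameter: K-quant
    formats store per-block scales, so e.g. Q4_K_M lands closer to
    ~4.8 bits than exactly 4.0.
    """
    bytes_total = params_billion * 1e9 * bits_per_weight / 8
    return bytes_total / 1e9

# Rough effective bits for common GGUF quant levels (assumed values)
for name, bits in [("Q2_K", 2.6), ("Q4_K_M", 4.8), ("Q5_K_M", 5.7), ("Q8_0", 8.5)]:
    print(f"7B at {name}: ~{quantized_size_gb(7, bits):.1f} GB")
```

The same arithmetic explains why a 70B model at Q4_K_M needs roughly 40+ GB: ten times the parameters means ten times the file size at the same quant level.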

GPTQ

GPU-focused quantization format. Requires NVIDIA GPU. Fast inference via exllama/exllamav2 kernels. Available in 3-bit, 4-bit, and 8-bit variants.

AWQ (Activation-Aware Weight Quantization)

Newer GPU quantization method that preserves salient weights based on activation patterns. Often slightly better quality than GPTQ at the same bit width. Supported by vLLM and TGI.

BitsAndBytes (bnb)

Quantization library integrated with Hugging Face Transformers. Supports 4-bit (NF4) and 8-bit quantization. Easy to use but slower than GPTQ/AWQ for inference.
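
As a sketch of what this looks like in practice: the snippet below loads a model in 4-bit NF4, the configuration popularized by QLoRA. The model name is illustrative, a GPU is required, and the exact keyword arguments are worth checking against the current Transformers documentation.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# 4-bit NF4 quantization config (the QLoRA-style setup)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # normal-float 4-bit data type
    bnb_4bit_compute_dtype=torch.bfloat16,  # do matmuls in bf16 for quality
    bnb_4bit_use_double_quant=True,         # also quantize the quant constants
)

model_id = "meta-llama/Llama-3.2-3B-Instruct"  # illustrative model choice
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",  # spread layers across available GPUs
)
```

Unlike GGUF/GPTQ/AWQ, no pre-quantized checkpoint is needed: the full-precision weights are quantized on the fly at load time, which is convenient for experimentation but adds load-time cost.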

Community Fine-Tunes

The open-source community produces thousands of fine-tuned variants of base models, optimized for specific tasks:

  • Chat/Instruct fine-tunes: Models tuned for conversational use (e.g., Hermes, OpenChat, Neural-Chat)
  • Code fine-tunes: Models specialized for programming (e.g., WizardCoder, Phind-CodeLlama)
  • Creative writing: Models tuned for storytelling and creative content (e.g., Nous-Hermes, Mythomax)
  • Domain-specific: Medical (Meditron), legal (SaulLM), finance (FinMA), science (SciGLM)
  • Merged models: Community-created merges combining strengths of multiple fine-tunes using techniques like TIES, DARE, and SLERP

Finding models: Hugging Face Hub is the primary source for open models. Use the leaderboard, filter by task, and check download counts and community ratings to find the best model for your needs.
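
Of the merge techniques above, SLERP is the easiest to see in miniature: it interpolates along the arc between two weight vectors instead of the straight line, which better preserves the norm of the merged weights. A toy sketch on plain Python lists (real merges operate per tensor across full checkpoints, e.g. with tools like mergekit):

```python
import math

def slerp(v0, v1, t):
    """Spherical linear interpolation between two weight vectors.

    t=0 returns v0, t=1 returns v1; intermediate t follows the arc
    on the sphere through both vectors.
    """
    dot = sum(a * b for a, b in zip(v0, v1))
    norm0 = math.sqrt(sum(a * a for a in v0))
    norm1 = math.sqrt(sum(b * b for b in v1))
    cos_theta = max(-1.0, min(1.0, dot / (norm0 * norm1)))
    theta = math.acos(cos_theta)
    if theta < 1e-6:
        # Nearly parallel vectors: fall back to plain linear interpolation
        return [(1 - t) * a + t * b for a, b in zip(v0, v1)]
    s0 = math.sin((1 - t) * theta) / math.sin(theta)
    s1 = math.sin(t * theta) / math.sin(theta)
    return [s0 * a + s1 * b for a, b in zip(v0, v1)]

merged = slerp([1.0, 0.0], [0.0, 1.0], 0.5)  # midpoint on the arc
```

TIES and DARE are more involved (they trim or randomly drop task-vector deltas before merging), but the per-parameter interpolation idea is the same starting point.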