Reference

Open Source LLM Models

A guide to the best open-weight and open-source LLMs you can download, run locally, fine-tune, and deploy. Covers model rankings, licensing, quantization formats, and community ecosystem.

Top Open-Weight LLMs Ranked

Based on the Hugging Face Open LLM Leaderboard and community consensus, these are the strongest open-weight models by category:

General Purpose

| Rank | Model | Parameters | License | Strengths |
|---|---|---|---|---|
| 1 | Llama 4 Maverick | 400B MoE | Llama 4 License | Strongest open model overall |
| 2 | DeepSeek V3 | 671B MoE | MIT | Best cost-efficiency, strong reasoning |
| 3 | Qwen 2.5 72B | 72B | Apache 2.0 | Best multilingual, strong coding |
| 4 | Llama 3.3 70B | 70B | Llama 3.3 License | 405B quality in 70B size |
| 5 | Mixtral 8x22B | 141B MoE | Apache 2.0 | Efficient MoE, permissive license |

Code-Specialized

| Model | Parameters | License | Strengths |
|---|---|---|---|
| Qwen 2.5 Coder 32B | 32B | Apache 2.0 | Best open code model, multi-language |
| DeepSeek Coder V2 | 236B MoE | MIT | Strong code generation and understanding |
| Codestral | 22B | MNPL (non-commercial) | Fast, 80+ languages, fill-in-the-middle |
| CodeGemma 7B | 7B | Gemma License | Lightweight code model |

Multimodal (Vision + Text)

| Model | Parameters | License | Strengths |
|---|---|---|---|
| Llama 3.2 90B Vision | 90B | Llama License | Strongest open multimodal model |
| Llama 3.2 11B Vision | 11B | Llama License | Efficient multimodal, single GPU |
| Pixtral 12B | 12B | Apache 2.0 | Permissive license, good quality |
| LLaVA 1.6 | 7B / 13B / 34B | Apache 2.0 | Pioneer in open multimodal |

Small / Edge Models

| Model | Parameters | License | Strengths |
|---|---|---|---|
| Phi-4 | 14B | MIT | Best reasoning for size |
| Gemma 2 9B | 9B | Gemma License | Strong overall quality |
| Llama 3.2 3B | 3B | Llama License | Great for mobile/edge |
| Qwen 2.5 7B | 7B | Apache 2.0 | Best 7B model, multilingual |
| Phi-3 Mini | 3.8B | MIT | Extremely capable for 3.8B |

How to Download and Run Open Models

There are several ways to run open-weight models locally or on your own infrastructure:

Desktop/Laptop Tools

  • Ollama: The easiest way to run models locally. Simple CLI: `ollama run llama3.2`. Handles downloading, quantization, and serving. macOS, Linux, Windows.
  • LM Studio: GUI application for running local models. Drag-and-drop model loading, built-in chat interface, OpenAI-compatible API server.
  • GPT4All: Desktop application focused on privacy. Runs models entirely offline. Simple interface for non-technical users.
  • llama.cpp: The foundational C++ library for efficient CPU and GPU inference. Powers Ollama, LM Studio, and many other tools. Supports GGUF format.
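
All of these tools can expose an OpenAI-compatible HTTP API (Ollama natively, LM Studio and llama.cpp via their server modes), so a local model can be queried with a few lines of standard-library Python. A minimal sketch, assuming Ollama is serving on its default port 11434 and `llama3.2` has already been pulled:

```python
import json
import urllib.request

# Chat request in the OpenAI-compatible format that Ollama,
# LM Studio, and llama.cpp's server all accept.
payload = {
    "model": "llama3.2",  # assumes `ollama pull llama3.2` was run first
    "messages": [{"role": "user", "content": "Say hello in one word."}],
    "stream": False,
}
body = json.dumps(payload).encode("utf-8")

req = urllib.request.Request(
    "http://localhost:11434/v1/chat/completions",  # Ollama's default endpoint
    data=body,
    headers={"Content-Type": "application/json"},
)

try:
    with urllib.request.urlopen(req, timeout=30) as resp:
        reply = json.loads(resp.read())
        print(reply["choices"][0]["message"]["content"])
except OSError:
    print("No local server found; start one with `ollama serve`.")
```

Because the request shape is the standard OpenAI one, swapping the base URL is usually all it takes to move the same client code between local tools and cloud providers.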

Server/Production Tools

  • vLLM: High-throughput serving engine. PagedAttention for efficient memory. Best for production deployments with multiple concurrent users.
  • TGI (Text Generation Inference): Hugging Face's production serving solution. Supports continuous batching, quantization, and tensor parallelism.
  • TensorRT-LLM: NVIDIA's optimized inference engine. Maximum performance on NVIDIA GPUs.
  • SGLang: Fast serving framework with RadixAttention for efficient prefix caching.

Cloud Inference Providers

  • Together AI: Largest selection of open models, competitive pricing
  • Fireworks AI: Fast inference, good developer experience
  • Groq: Custom LPU hardware for extremely fast inference
  • Replicate: Simple API, pay-per-use, many model options

Licensing Comparison

Understanding licenses is critical before using open-weight models in production:

| License | Commercial Use | Modify | Redistribute | Key Restrictions |
|---|---|---|---|---|
| Apache 2.0 | Yes | Yes | Yes | None significant. Most permissive. |
| MIT | Yes | Yes | Yes | None significant. Very permissive. |
| Llama License | Yes* | Yes | Yes* | Companies with 700M+ MAU must request a license from Meta |
| Gemma License | Yes | Yes | Yes | Cannot use outputs to train competing models |
| MNPL (Mistral) | No | Yes | Yes | Non-commercial only (used for Codestral) |
| CC BY-NC 4.0 | No | Yes | Yes | Non-commercial only |

Legal note: Always read the full license text before deploying a model commercially. "Open-weight" does not mean "open-source." Many popular models (Llama, Gemma) have custom licenses with specific restrictions. When in doubt, consult legal counsel.

Quantization Formats Explained

Quantization reduces model size and memory requirements by using lower-precision numbers for weights. This makes it possible to run large models on consumer hardware.

GGUF (GPT-Generated Unified Format)

The standard format for llama.cpp, Ollama, and LM Studio. Supports CPU and GPU inference, and is the most popular choice for running models locally.

  • Q4_K_M: 4-bit quantization, good balance of quality and size. ~4.5 GB for a 7B model. Recommended starting point.
  • Q5_K_M: 5-bit, slightly better quality, ~5 GB for 7B. Good when you have the RAM.
  • Q8_0: 8-bit, near-original quality, ~7.5 GB for 7B.
  • Q2_K: 2-bit, significant quality loss but smallest size. Only for constrained environments.
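
The size figures above follow directly from bits-per-weight arithmetic. A back-of-the-envelope helper (my own sketch, not part of any tool; the effective bits-per-weight values are approximations, since K-quants store scale metadata alongside the weights):

```python
def quantized_size_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate on-disk size of a quantized model in GB.

    bits_per_weight is the *effective* bits per parameter: K-quant
    formats store per-block scales, so e.g. Q4_K_M lands closer to
    ~4.8 bits than exactly 4.0.
    """
    bytes_total = params_billion * 1e9 * bits_per_weight / 8
    return bytes_total / 1e9

# Rough effective bits for common GGUF quant levels (assumed values)
for name, bits in [("Q2_K", 2.6), ("Q4_K_M", 4.8), ("Q5_K_M", 5.7), ("Q8_0", 8.5)]:
    print(f"7B at {name}: ~{quantized_size_gb(7, bits):.1f} GB")
```

The same arithmetic explains why a 70B model at Q4_K_M needs roughly 40+ GB: ten times the parameters means ten times the file size at the same quant level.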

GPTQ

GPU-focused quantization format. Requires NVIDIA GPU. Fast inference via exllama/exllamav2 kernels. Available in 3-bit, 4-bit, and 8-bit variants.

AWQ (Activation-Aware Weight Quantization)

Newer GPU quantization method that preserves salient weights based on activation patterns. Often slightly better quality than GPTQ at the same bit width. Supported by vLLM and TGI.

BitsAndBytes (bnb)

Quantization library integrated with Hugging Face Transformers. Supports 4-bit (NF4) and 8-bit quantization. Easy to use but slower than GPTQ/AWQ for inference.
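
As a sketch of what this looks like in practice: the snippet below loads a model in 4-bit NF4, the configuration popularized by QLoRA. The model name is illustrative, a GPU is required, and the exact keyword arguments are worth checking against the current Transformers documentation.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# 4-bit NF4 quantization config (the QLoRA-style setup)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # normal-float 4-bit data type
    bnb_4bit_compute_dtype=torch.bfloat16,  # do matmuls in bf16 for quality
    bnb_4bit_use_double_quant=True,         # also quantize the quant constants
)

model_id = "meta-llama/Llama-3.2-3B-Instruct"  # illustrative model choice
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",  # spread layers across available GPUs
)
```

Unlike GGUF/GPTQ/AWQ, no pre-quantized checkpoint is needed: the full-precision weights are quantized on the fly at load time, which is convenient for experimentation but adds load-time cost.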

Community Fine-Tunes

The open-source community produces thousands of fine-tuned variants of base models, optimized for specific tasks:

  • Chat/Instruct fine-tunes: Models tuned for conversational use (e.g., Hermes, OpenChat, Neural-Chat)
  • Code fine-tunes: Models specialized for programming (e.g., WizardCoder, Phind-CodeLlama)
  • Creative writing: Models tuned for storytelling and creative content (e.g., Nous-Hermes, Mythomax)
  • Domain-specific: Medical (Meditron), legal (SaulLM), finance (FinMA), science (SciGLM)
  • Merged models: Community-created merges combining strengths of multiple fine-tunes using techniques like TIES, DARE, and SLERP

Finding models: Hugging Face Hub is the primary source for open models. Use the leaderboard, filter by task, and check download counts and community ratings to find the best model for your needs.
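
Of the merge techniques above, SLERP is the easiest to see in miniature: it interpolates along the arc between two weight vectors instead of the straight line, which better preserves the norm of the merged weights. A toy sketch on plain Python lists (real merges operate per tensor across full checkpoints, e.g. with tools like mergekit):

```python
import math

def slerp(v0, v1, t):
    """Spherical linear interpolation between two weight vectors.

    t=0 returns v0, t=1 returns v1; intermediate t follows the arc
    on the sphere through both vectors.
    """
    dot = sum(a * b for a, b in zip(v0, v1))
    norm0 = math.sqrt(sum(a * a for a in v0))
    norm1 = math.sqrt(sum(b * b for b in v1))
    cos_theta = max(-1.0, min(1.0, dot / (norm0 * norm1)))
    theta = math.acos(cos_theta)
    if theta < 1e-6:
        # Nearly parallel vectors: fall back to plain linear interpolation
        return [(1 - t) * a + t * b for a, b in zip(v0, v1)]
    s0 = math.sin((1 - t) * theta) / math.sin(theta)
    s1 = math.sin(t * theta) / math.sin(theta)
    return [s0 * a + s1 * b for a, b in zip(v0, v1)]

merged = slerp([1.0, 0.0], [0.0, 1.0], 0.5)  # midpoint on the arc
```

TIES and DARE are more involved (they trim or randomly drop task-vector deltas before merging), but the per-parameter interpolation idea is the same starting point.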