Introduction to Tiktokenizer
Understand what tokens are, why they matter for AI development, and how Tiktokenizer helps you visualize and count them.
What is Tiktokenizer?
Tiktokenizer (tiktokenizer.vercel.app) is a free, web-based tool that lets you visualize how AI models break text into tokens. You paste in any text, select a tokenizer, and instantly see each token highlighted in a different color with a total token count.
It is built on top of tiktoken, the tokenizer library created by OpenAI, and supports multiple tokenization schemes used by different AI models.
What Are Tokens?
Tokens are the fundamental units that AI language models work with. They are not the same as words. Instead, tokens are subword units — pieces of text that the model's tokenizer has learned to recognize as meaningful chunks.
```
# Input text: "Hello, how are you doing today?"
# Tokenized (GPT-4 / cl100k_base):
["Hello", ",", " how", " are", " you", " doing", " today", "?"]
# Result: 8 tokens for 6 words

# A longer word gets split into sub-tokens:
"tokenization" → ["token", "ization"]   # 2 tokens

# Common words are single tokens:
"the" → ["the"]                         # 1 token
```
Key things to understand about tokens:
- Common English words are usually one token (`the`, `hello`, `code`).
- Less common words may be split into multiple tokens (`tokenization` = `token` + `ization`).
- Spaces are often included at the beginning of tokens (`" how"` vs `"how"`).
- Punctuation and special characters are typically their own tokens.
- Numbers can be split in unexpected ways (`2024` might be `202` + `4`).
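The splitting behavior above can be sketched with a toy greedy longest-match tokenizer. This is a simplification (real models apply learned BPE merge rules, covered below), and the mini-vocabulary is invented purely for illustration:

```python
def tokenize(text, vocab):
    """Greedy longest-match tokenization against a fixed vocabulary.

    Long known chunks win; anything unknown falls back to single
    characters. Real BPE tokenizers differ in detail but give the
    same intuition about subword splitting.
    """
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest remaining substring first, then shrink.
        for j in range(len(text), i, -1):
            piece = text[i:j]
            if piece in vocab or len(piece) == 1:
                tokens.append(piece)
                i = j
                break
    return tokens

# Hypothetical mini-vocabulary for illustration only.
vocab = {"token", "ization", "the", "hello"}
print(tokenize("tokenization", vocab))  # -> ['token', 'ization']
print(tokenize("the", vocab))           # -> ['the']
```

To see how a production tokenizer splits your own text, paste it into Tiktokenizer rather than trusting a toy like this.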
Why Token Counting Matters
Understanding tokens is essential for three key reasons:
1. Context Windows
Every AI model has a maximum number of tokens it can process in a single request (its context window). This includes both your input and the model's output. If your input exceeds the context window, the API call will fail.
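A minimal pre-flight check might look like the sketch below. The 4-characters-per-token ratio is a common rule of thumb for English text, not an exact count; for billing-accurate numbers, use the model's real tokenizer (e.g. via tiktoken or Tiktokenizer itself):

```python
# Assumed limits for a GPT-4o-class model (see the table below).
CONTEXT_WINDOW = 128_000
MAX_OUTPUT = 16_000

def estimate_tokens(text: str) -> int:
    """Rough heuristic: ~4 characters per token for English prose."""
    return max(1, len(text) // 4)

def fits_in_context(prompt: str, reserved_output: int = MAX_OUTPUT) -> bool:
    """The window is shared: leave room for the model's response."""
    return estimate_tokens(prompt) + reserved_output <= CONTEXT_WINDOW

print(fits_in_context("Hello, how are you doing today?"))  # -> True
```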
2. Pricing
AI APIs charge per token. You pay for both input tokens (your prompt) and output tokens (the model's response). Knowing your token count lets you estimate costs before making API calls.
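The cost arithmetic is simple once you have token counts. The prices below are hypothetical placeholders; check your provider's current pricing page before relying on any numbers:

```python
# Hypothetical prices in USD per 1M tokens, for illustration only.
INPUT_PRICE_PER_M = 2.50
OUTPUT_PRICE_PER_M = 10.00

def estimate_cost(input_tokens: int, output_tokens: int) -> float:
    """Input and output tokens are billed at different rates."""
    return (input_tokens * INPUT_PRICE_PER_M
            + output_tokens * OUTPUT_PRICE_PER_M) / 1_000_000

# A 2,000-token prompt with a 500-token reply:
print(f"${estimate_cost(2_000, 500):.4f}")  # -> $0.0100
```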
3. Prompt Optimization
By understanding how text is tokenized, you can write more token-efficient prompts. This saves money and lets you fit more useful content within the context window.
How Different Models Tokenize
Different AI models use different tokenization methods. The two main approaches are:
BPE (Byte Pair Encoding)
Used by OpenAI models (GPT-4, GPT-4o) and many others. BPE starts with individual characters and iteratively merges the most frequent pairs into new tokens. This creates a vocabulary of common subword units.
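The merge loop can be sketched in a few lines. This toy version trains on a three-word corpus and is only meant to show the mechanism; real BPE vocabularies are learned from huge corpora with byte-level details this sketch omits:

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across all words, weighted by frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0] if pairs else None

def merge_pair(words, pair):
    """Replace every occurrence of the pair with one merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Toy corpus: each word starts as a tuple of characters, with a frequency.
words = {tuple("lower"): 5, tuple("lowest"): 2, tuple("low"): 7}
for _ in range(3):  # three merge steps
    words = merge_pair(words, most_frequent_pair(words))
print(words)
```

After three merges, "low" has become a single token and "lower"/"lowest" share the learned prefix, which is exactly the subword reuse the paragraph above describes.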
SentencePiece
Used by models like LLaMA 2, Mistral, and Gemma. SentencePiece works directly on raw text (including spaces) and can use BPE or unigram language models internally. It often handles multilingual text differently from tiktoken.
Token Limits by Model
| Model | Context Window | Max Output | Tokenizer |
|---|---|---|---|
| GPT-4o | 128K tokens | 16K tokens | o200k_base |
| GPT-4o mini | 128K tokens | 16K tokens | o200k_base |
| Claude Sonnet 4 | 200K tokens | 64K tokens | Claude tokenizer |
| Claude Opus 4 | 200K tokens | 32K tokens | Claude tokenizer |
| Gemini 2.5 Pro | 1M tokens | 65K tokens | Gemini tokenizer |
| Llama 4 Scout | 512K tokens | Variable | SentencePiece |
| Mistral Large | 128K tokens | Variable | SentencePiece |
Lilly Tech Systems