Text Processing Fundamentals
These 12 questions cover the foundational text processing concepts that appear in nearly every NLP interview. Even if the role focuses on LLMs, interviewers test these to gauge your depth of understanding.
Q1: What is tokenization and why is it the first step in any NLP pipeline?
Tokenization is the process of converting raw text into a sequence of discrete units (tokens) that a model can process. It is the first step because neural networks cannot operate on raw strings — they need numerical input.
There are three levels of tokenization:
- Word-level: Split on whitespace/punctuation. Simple but creates massive vocabularies (100K+ words) and cannot handle unseen words (OOV problem).
- Character-level: Each character is a token. Tiny vocabulary (~256) but sequences become very long, making it hard for models to learn word-level semantics.
- Subword-level: The modern standard. Splits words into frequent subword units (e.g., "unhappiness" becomes ["un", "happi", "ness"]). Balances vocabulary size (32K–64K) with coverage of rare words.
The choice of tokenizer directly affects model performance, vocabulary size, sequence length, and the ability to handle multilingual text.
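The three levels above can be illustrated with a toy sketch. The subword split shown is hand-picked for illustration; a real BPE/WordPiece tokenizer learns its splits from corpus statistics:

```python
text = "unhappiness strikes"

# Word-level: split on whitespace (vocabulary explodes, OOV words fail)
word_tokens = text.split()            # ['unhappiness', 'strikes']

# Character-level: every character is a token (tiny vocab, long sequences)
char_tokens = list(text)              # ['u', 'n', 'h', ...]

# Subword-level: frequent pieces stay intact, rare words are split.
# This particular split is illustrative, not from a trained tokenizer.
subword_tokens = ["un", "happi", "ness", "strikes"]

print(len(word_tokens), len(char_tokens), len(subword_tokens))  # 2 19 4
```

Note how the subword sequence is only twice the word-level length while still covering the rare word "unhappiness" with reusable pieces.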
Q2: Explain BPE (Byte Pair Encoding). How does it work step by step?
BPE is a subword tokenization algorithm used by GPT-2, GPT-3, GPT-4, and LLaMA. It builds a vocabulary by iteratively merging the most frequent pair of adjacent tokens.
Training algorithm:
1. Start with a vocabulary of all individual characters in the corpus
2. Count the frequency of every adjacent pair of tokens across the corpus
3. Merge the most frequent pair into a single new token (e.g., "t" + "h" becomes "th")
4. Add the new token to the vocabulary
5. Repeat steps 2–4 until the desired vocabulary size is reached (e.g., 50,257 for GPT-2)
Encoding (at inference): Given a new word, apply the learned merge rules in order. For example, "lowest" might become ["low", "est"] if those merges were learned.
Key advantage: Common words remain single tokens ("the", "and"), while rare words are split into meaningful subwords. This gives a good balance between vocabulary size and sequence length.
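The training loop can be sketched in a few dozen lines. This is a simplified Sennrich-style trainer over a word-frequency dictionary, not a production tokenizer (no pre-tokenization, no byte fallback), and the toy corpus is made up:

```python
from collections import Counter

def get_pair_counts(vocab):
    """Count adjacent symbol pairs across a {tokenized word: frequency} vocab."""
    pairs = Counter()
    for symbols, freq in vocab.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    """Replace every occurrence of `pair` with its concatenation."""
    merged = {}
    for symbols, freq in vocab.items():
        out, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

def train_bpe(word_freqs, num_merges):
    """Learn `num_merges` merge rules from a {word: frequency} corpus."""
    vocab = {tuple(w): f for w, f in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(vocab)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # step 3: most frequent pair
        vocab = merge_pair(best, vocab)
        merges.append(best)
    return merges

# Toy corpus: "low" x5, "lower" x2, "lowest" x6
merges = train_bpe({"low": 5, "lower": 2, "lowest": 6}, num_merges=4)
print(merges)  # first merges: ('l', 'o'), then ('lo', 'w'), ...
```

Encoding a new word then means replaying these merge rules in the order they were learned.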
Q3: How does WordPiece differ from BPE? Why did BERT use WordPiece?
WordPiece and BPE are both subword tokenization algorithms, but they differ in how they select which pair to merge:
- BPE: Merges the pair with the highest raw frequency count
- WordPiece: Merges the pair that maximizes the likelihood of the training corpus. Specifically, it picks the pair (A, B) where score = freq(AB) / (freq(A) * freq(B)) is highest. This score is essentially pointwise mutual information (before taking the log).
The practical difference is that WordPiece tends to prefer merging rare subwords that always appear together (high mutual information), while BPE prefers merging common pairs regardless of context.
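The difference is easy to show with hypothetical counts (the frequencies below are invented for illustration):

```python
# A frequent pair vs. a rare pair whose parts always co-occur.
freq = {"a": 1000, "b": 950, "x": 12, "y": 12}
pair_freq = {("a", "b"): 400, ("x", "y"): 12}

def wordpiece_score(pair):
    a, b = pair
    return pair_freq[pair] / (freq[a] * freq[b])

bpe_choice = max(pair_freq, key=pair_freq.get)   # highest raw count
wp_choice = max(pair_freq, key=wordpiece_score)  # highest likelihood gain

print(bpe_choice)  # ('a', 'b'): common pair wins on raw frequency
print(wp_choice)   # ('x', 'y'): rare-but-inseparable pair wins
```

Here "x" and "y" appear only together (12 of 12 times), so WordPiece merges them first even though ("a", "b") is 33 times more frequent.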
BERT used WordPiece (vocabulary of 30,522 tokens), an algorithm Google originally developed for segmenting Japanese and Korean voice-search queries and later adopted in its neural machine translation system (GNMT), where maximizing corpus likelihood worked well. WordPiece tokens are prefixed with "##" when they are continuations of a word (e.g., "playing" becomes ["play", "##ing"]).
SentencePiece is a related tool (not an algorithm) that implements both BPE and a unigram language model approach. It treats the input as a raw stream of Unicode characters, whitespace included (encoded as "▁"), so it needs no language-specific pre-tokenization. This makes it language-agnostic and ideal for multilingual models like T5 and mBART.
Q4: What is TF-IDF and when would you still use it in 2025?
TF-IDF (Term Frequency – Inverse Document Frequency) is a numerical statistic that reflects how important a word is to a document within a corpus.
- TF(t, d) = (count of term t in document d) / (total terms in d)
- IDF(t) = log(N / df(t)), where N is total documents and df(t) is documents containing term t
- TF-IDF(t, d) = TF(t, d) * IDF(t)
Words that appear frequently in one document but rarely across the corpus get high TF-IDF scores (discriminative). Words like "the" get low scores because high DF crushes the IDF component.
When to use TF-IDF in 2025:
- BM25 search: Most production search engines (Elasticsearch, Solr) still use BM25, which is a refined version of TF-IDF. Even RAG systems use BM25 as a first-stage retriever alongside vector search.
- Feature engineering: TF-IDF features combined with gradient boosted trees often beat fine-tuned transformers on tabular + text datasets with limited training data.
- Baselines: Always establish a TF-IDF + logistic regression baseline before using transformers. It trains in seconds and often achieves 85–90% of transformer performance.
- Interpretability: When stakeholders need to understand why a document was retrieved or classified, TF-IDF provides transparent feature importance.
Q5: Explain Word2Vec. What are CBOW and Skip-gram?
Word2Vec (Mikolov et al., 2013) learns dense vector representations of words such that semantically similar words have similar vectors. It introduced the idea that word relationships can be captured as vector arithmetic: king - man + woman ≈ queen.
Two architectures:
- CBOW (Continuous Bag of Words): Predicts the center word given surrounding context words. Faster to train, works better on frequent words.
- Skip-gram: Predicts surrounding context words given the center word. Slower but works better on rare words and smaller datasets.
Training trick — Negative Sampling: Instead of computing softmax over the entire vocabulary (expensive), sample 5–20 random "negative" words and train a binary classifier to distinguish real context words from random ones. This makes training O(k) instead of O(|V|).
Limitations: Each word gets exactly one vector regardless of context ("bank" has the same embedding whether it means riverbank or financial bank). This polysemy problem is solved by contextualized embeddings (ELMo, BERT).
Q6: How do GloVe embeddings differ from Word2Vec?
GloVe (Global Vectors, Pennington et al., 2014) takes a different approach from Word2Vec:
- Word2Vec: Uses local context windows. Trains on (center, context) word pairs using stochastic gradient descent on a predictive objective.
- GloVe: First builds a global word co-occurrence matrix from the entire corpus, then factorizes this matrix to produce embeddings. The loss function ensures that the dot product of two word vectors equals the log of their co-occurrence probability.
Key formula: w_i · w̃_j + b_i + b̃_j = log(X_ij), where X_ij is the co-occurrence count, w̃_j is a separate context vector, and b_i, b̃_j are learned biases. The training loss weights each squared error by a function f(X_ij) that downweights rare co-occurrences.
Practical difference: GloVe leverages global corpus statistics (the full co-occurrence matrix), while Word2Vec only sees local windows. In practice, both produce similar quality embeddings. GloVe is often faster to train because the co-occurrence matrix is computed once and then factorized.
FastText (Bojanowski et al., 2017) extends Word2Vec by representing each word as a bag of character n-grams. "where" becomes {"<wh", "whe", "her", "ere", "re>"}. This means FastText can generate embeddings for out-of-vocabulary words by summing their n-gram vectors — a major advantage for morphologically rich languages.
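The n-gram extraction is simple to reproduce (n=3 shown; FastText actually uses n-grams of length 3–6 plus the whole word itself):

```python
def char_ngrams(word, n=3):
    """Character n-grams with boundary markers, as in FastText."""
    w = "<" + word + ">"
    return [w[i:i + n] for i in range(len(w) - n + 1)]

print(char_ngrams("where"))
# ['<wh', 'whe', 'her', 'ere', 're>']
```

An OOV word's embedding is then just the sum of the vectors of its n-grams, which is why FastText never needs an UNK embedding.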
Q7: What is the difference between static and contextualized embeddings?
Static embeddings (Word2Vec, GloVe, FastText) assign a single fixed vector to each word regardless of context. The word "bank" has the same representation in "river bank" and "bank account."
Contextualized embeddings (ELMo, BERT, GPT) generate different vectors for the same word depending on surrounding context. "Bank" gets different representations in different sentences because the embedding is computed by the model at inference time based on the full input.
| Property | Static (Word2Vec/GloVe) | Contextualized (BERT/GPT) |
|---|---|---|
| Polysemy handling | One vector per word | Different vector per usage |
| Computation | Lookup table (O(1)) | Forward pass through transformer (O(n^2)) |
| Storage | ~1 GB for 1M words | Model weights (110M–175B params) |
| Quality | Good for word similarity | Superior for all downstream tasks |
| Training data needed | Works with 1M+ tokens | Needs billions of tokens |
When to use static embeddings: Resource-constrained environments, very large vocabulary lookups (e.g., product search with millions of items), or as features for non-neural models.
Q8: Walk me through a complete text preprocessing pipeline. What steps do you include and which do you skip for transformer models?
Classical preprocessing pipeline (for TF-IDF / classical ML):
- Lowercasing: Convert to lowercase for case-insensitive matching
- HTML/URL removal: Strip tags, URLs, email addresses
- Unicode normalization: Apply NFKC to unify compatibility characters and composed/decomposed accented forms
- Punctuation removal: Remove or replace with spaces
- Stopword removal: Remove "the", "is", "and" (high frequency, low information)
- Stemming/Lemmatization: Reduce words to root form ("running" to "run")
- Tokenization: Split into word tokens
For transformer models, skip most of these:
- Do NOT lowercase: BERT uses cased models that distinguish "Apple" (company) from "apple" (fruit)
- Do NOT remove stopwords: Transformers need full context for attention
- Do NOT stem/lemmatize: The subword tokenizer handles morphological variation
- DO clean HTML/URLs unless the task requires them
- DO handle encoding issues (UTF-8, special characters)
- DO truncate/pad to model's max sequence length
Key insight: Over-preprocessing for transformers actually hurts performance because you remove signal the model can learn from. The tokenizer and the model handle most normalization internally.
Q9: What is the vocabulary size trade-off in subword tokenization?
Vocabulary size creates a direct tension between two goals:
- Smaller vocabulary (e.g., 8K tokens): Fewer embedding parameters, smaller model size. But common words get split into many subwords ("international" might become ["inter", "na", "tion", "al"]), making sequences longer and increasing compute cost.
- Larger vocabulary (e.g., 128K tokens): More words stay as single tokens (shorter sequences, faster inference). But the embedding matrix grows (128K * d_model * 4 bytes), and rare tokens get poor representations due to limited training examples.
Common choices:
- GPT-2: 50,257 tokens
- BERT: 30,522 tokens (WordPiece)
- LLaMA: 32,000 tokens
- GPT-4: ~100,000 tokens
- Gemini: 256,000 tokens
Multilingual models need larger vocabularies because they must cover multiple scripts and languages. Models like mBERT (120K) and XLM-R (250K) use significantly larger vocabularies.
The fertility metric measures average tokens per word. For English, BPE with 32K vocabulary typically gives fertility ~1.3. For morphologically rich languages like Turkish, fertility can be 2.0+, meaning sequences are much longer.
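Fertility is trivial to compute once you have tokenizations. The tokenizations below are hypothetical, chosen only to show the arithmetic:

```python
def fertility(words, tokenized):
    """Average subword tokens per word."""
    total_tokens = sum(len(t) for t in tokenized)
    return total_tokens / len(words)

# Hypothetical subword splits for illustration
words = ["the", "internationalization", "cat"]
tokenized = [["the"], ["intern", "ational", "ization"], ["cat"]]
print(fertility(words, tokenized))  # 5 tokens / 3 words ≈ 1.67
```

Comparing fertility on a held-out sample in each target language is a quick way to evaluate whether a tokenizer's vocabulary serves all your languages, not just English.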
Q10: How does BPE handle unknown or rare words? What about misspellings?
This is one of the key advantages of subword tokenization over word-level tokenization:
- Unknown words: BPE essentially never produces an UNK token. Any word, no matter how rare, can be decomposed into known subword tokens, falling back to individual characters in the worst case. (A character never seen in training can still be unknown; byte-level BPE closes even that gap.)
- Rare words: Rare words get split into more subwords. "Deoxyribonucleic" might become ["De", "oxy", "rib", "on", "uc", "le", "ic"]. The model can compose meaning from these known subwords.
- Misspellings: "recieve" (misspelling of "receive") would be tokenized differently from the correct spelling, but the overlapping subwords still carry partial meaning. However, models are not robust to misspellings by design — this is a known limitation.
Byte-level BPE (used by GPT-2+) operates on raw bytes (256 base tokens) instead of Unicode characters. This guarantees that any byte sequence can be tokenized, including emojis, code, and arbitrary binary data. The downside is slightly longer sequences for non-ASCII text.
Follow-up defense: If asked "how would you make a model robust to typos?", mention data augmentation with synthetic misspellings during training, character-level models as a pre-filter, or spell-check preprocessing.
Q11: What is the embedding layer in a neural network? How is it different from Word2Vec?
An embedding layer is a trainable lookup table that maps token IDs to dense vectors. Given vocabulary size V and embedding dimension d, it is a matrix of shape (V, d). Looking up a token is just indexing into this matrix — no computation required.
Key differences from Word2Vec:
- Word2Vec: Trained as a standalone objective (predict context words). The resulting vectors are then used as fixed features in downstream models.
- Embedding layer: Initialized randomly (or from Word2Vec) and then fine-tuned end-to-end with the downstream task. The embeddings adapt to the specific task's needs.
In transformers like BERT, the embedding layer has three components summed together:
- Token embeddings: The standard lookup table (30,522 * 768 for BERT-base)
- Position embeddings: Learned vectors for each position (512 * 768 for BERT-base)
- Segment embeddings: Distinguish sentence A from sentence B (2 * 768 for BERT-base)
The final input embedding = token_emb + position_emb + segment_emb, followed by LayerNorm and dropout.
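The three-way sum can be sketched in numpy with toy dimensions (BERT-base uses V=30522, max_len=512, d=768; LayerNorm and dropout are omitted here):

```python
import numpy as np

# Toy dimensions for illustration
V, max_len, n_segments, d = 100, 16, 2, 8
rng = np.random.default_rng(0)
token_emb = rng.normal(size=(V, d))           # (V, d) lookup table
position_emb = rng.normal(size=(max_len, d))  # learned position vectors
segment_emb = rng.normal(size=(n_segments, d))

def embed(token_ids, segment_ids):
    """input = token_emb + position_emb + segment_emb (pre-LayerNorm)."""
    positions = np.arange(len(token_ids))
    return token_emb[token_ids] + position_emb[positions] + segment_emb[segment_ids]

x = embed(token_ids=np.array([5, 17, 42]), segment_ids=np.array([0, 0, 1]))
print(x.shape)  # (3, 8): one d-dimensional vector per input token
```

Note that every component is just fancy indexing into a trainable matrix; the real computation only begins in the transformer layers that consume this sum.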
Q12: How would you build a simple text classifier from scratch? Walk through the full pipeline.
I would approach this in two ways depending on constraints:
Approach 1: Classical (fast, interpretable, small data):
# 1. Preprocess (clean() stands in for the steps from Q8)
texts = [clean(t) for t in raw_texts]  # lowercase, remove HTML, normalize
# 2. Split before vectorizing, so the vectorizer only sees training data
from sklearn.model_selection import train_test_split
train_texts, test_texts, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, random_state=42, stratify=labels
)
# 3. Vectorize with TF-IDF
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(max_features=50000, ngram_range=(1, 2))
X_train = vectorizer.fit_transform(train_texts)
X_test = vectorizer.transform(test_texts)
# 4. Train classifier
from sklearn.linear_model import LogisticRegression
model = LogisticRegression(C=1.0, max_iter=1000)
model.fit(X_train, y_train)
# 5. Evaluate
from sklearn.metrics import classification_report
print(classification_report(y_test, model.predict(X_test)))
Approach 2: Transformer-based (higher accuracy, more data):
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from transformers import Trainer, TrainingArguments

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=num_classes
)

# Tokenize
def tokenize(batch):
    return tokenizer(batch["text"], padding=True, truncation=True, max_length=512)

dataset = dataset.map(tokenize, batched=True)

# Fine-tune
trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="./results",
        num_train_epochs=3,
        per_device_train_batch_size=16,
        learning_rate=2e-5,
        evaluation_strategy="epoch",
    ),
    train_dataset=train_set,
    eval_dataset=val_set,
)
trainer.train()
Decision framework: Start with Approach 1 as a baseline. If accuracy is insufficient and you have 5K+ labeled examples, switch to Approach 2. If you have fewer than 1K examples, consider few-shot learning with an LLM instead.
Key Takeaways
- Know the differences between BPE, WordPiece, and SentencePiece — these are high-frequency interview questions
- Understand the static vs contextualized embedding distinction and when each is appropriate
- TF-IDF is not outdated — it powers BM25 in production search and serves as a strong baseline
- For transformers, skip most classical preprocessing (stopwords, stemming, lowercasing)
- Always be ready to implement a text classifier both ways: TF-IDF + LogReg and fine-tuned transformer
Lilly Tech Systems