Category 3: NLP & Sequences

The NLP category tests your ability to build text classification models using tokenization, word embeddings, and recurrent neural networks. You must convert raw text into sequences, pad them to uniform length, and train LSTM models to achieve target accuracy.

The NLP Pipeline for the Exam

Every NLP exam task follows the same pipeline: tokenize text, convert to sequences, pad sequences, and feed into an Embedding + LSTM model. Master this pipeline and you can handle any NLP task on the exam.

# The standard NLP pipeline for the TensorFlow exam

# Step 1: Tokenize text -> convert words to integer IDs
# Step 2: Convert sentences to sequences of integers
# Step 3: Pad sequences to uniform length
# Step 4: Feed into Embedding -> LSTM -> Dense model

# Raw text -> [Tokenizer] -> Integer sequences -> [pad_sequences] -> Padded arrays
# Example: "I love AI" -> [2, 15, 8] -> [0, 0, 0, 2, 15, 8]  (padded to length 6
# with the default padding='pre'; the exam examples below use padding='post')
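The four comment steps above can be run end to end in a few lines. This is a minimal sketch with an invented two-sentence corpus; the token IDs you see will depend on word frequency in your own data.

```python
import tensorflow as tf

# Toy corpus, made up for illustration only
sentences = ["I love AI", "AI can be hard to learn"]

# Steps 1-2: fit a tokenizer, then map words to integer IDs
tokenizer = tf.keras.preprocessing.text.Tokenizer(num_words=100, oov_token="<OOV>")
tokenizer.fit_on_texts(sentences)
sequences = tokenizer.texts_to_sequences(sentences)

# Step 3: pad every sequence to the same length
# (default padding='pre' puts the zeros at the front)
padded = tf.keras.utils.pad_sequences(sequences, maxlen=6)

print(padded.shape)  # (2, 6) -> ready for an Embedding layer (step 4)
```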

Practice Model 1: Text Classification with LSTM

Build a sentiment classification model using the IMDB dataset. This pattern is directly applicable to exam tasks.

import tensorflow as tf
import numpy as np

# ---- Load IMDB dataset ----
VOCAB_SIZE = 10000
MAX_LENGTH = 120
EMBEDDING_DIM = 16

(x_train, y_train), (x_test, y_test) = tf.keras.datasets.imdb.load_data(
    num_words=VOCAB_SIZE
)

# ---- Pad sequences to uniform length ----
x_train = tf.keras.utils.pad_sequences(
    x_train,
    maxlen=MAX_LENGTH,
    padding='post',      # Pad at the end
    truncating='post'    # Truncate at the end
)
x_test = tf.keras.utils.pad_sequences(
    x_test,
    maxlen=MAX_LENGTH,
    padding='post',
    truncating='post'
)

print(f"Training shape: {x_train.shape}")  # (25000, 120)

# ---- Build LSTM model ----
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(VOCAB_SIZE, EMBEDDING_DIM, input_length=MAX_LENGTH),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(32)),
    tf.keras.layers.Dense(24, activation='relu'),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(1, activation='sigmoid')  # Binary sentiment
])

model.compile(
    optimizer='adam',
    loss='binary_crossentropy',
    metrics=['accuracy']
)

history = model.fit(
    x_train, y_train,
    validation_data=(x_test, y_test),
    epochs=10,
    batch_size=64,
    callbacks=[
        tf.keras.callbacks.EarlyStopping(
            monitor='val_accuracy', patience=2,
            restore_best_weights=True
        )
    ]
)

model.save('imdb_sentiment.h5')
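It is worth verifying that a saved model reloads with identical weights before submitting. The sketch below uses a tiny untrained stand-in model and the hypothetical filename demo_model.h5, so it runs in seconds; the same round trip applies to the trained IMDB model above.

```python
import numpy as np
import tensorflow as tf

# Tiny untrained stand-in model (layer sizes are arbitrary for this demo)
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(100, 8),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(4)),
    tf.keras.layers.Dense(1, activation='sigmoid')
])
model.compile(optimizer='adam', loss='binary_crossentropy')

x = np.random.randint(0, 100, size=(3, 10))
before = model.predict(x, verbose=0)

model.save('demo_model.h5')  # HDF5 format, as in the exam
restored = tf.keras.models.load_model('demo_model.h5')
after = restored.predict(x, verbose=0)

print(np.allclose(before, after))  # True: weights survive the round trip
```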

Practice Model 2: Tokenizing Raw Text

When the exam gives you raw text data (not pre-tokenized like IMDB), you must tokenize it yourself. This is a critical skill.

import tensorflow as tf
import numpy as np

# ---- Simulated raw text dataset ----
sentences = [
    "I love machine learning and artificial intelligence",
    "Deep learning is a subset of machine learning",
    "Neural networks can learn complex patterns",
    "TensorFlow makes building models easy",
    "Natural language processing is fascinating",
    "Computer vision uses convolutional networks",
    "Recurrent networks handle sequential data",
    "Transfer learning saves training time",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]  # Binary labels

# ---- Step 1: Create and fit tokenizer ----
VOCAB_SIZE = 1000
OOV_TOKEN = "<OOV>"  # Out-of-vocabulary token

tokenizer = tf.keras.preprocessing.text.Tokenizer(
    num_words=VOCAB_SIZE,
    oov_token=OOV_TOKEN
)
tokenizer.fit_on_texts(sentences)

# Check vocabulary
word_index = tokenizer.word_index
print(f"Vocabulary size: {len(word_index)}")
print(f"Sample: {list(word_index.items())[:10]}")

# ---- Step 2: Convert text to sequences ----
sequences = tokenizer.texts_to_sequences(sentences)
print(f"\nOriginal: {sentences[0]}")
print(f"Sequence: {sequences[0]}")

# ---- Step 3: Pad sequences ----
MAX_LENGTH = 10
PADDING_TYPE = 'post'
TRUNCATING_TYPE = 'post'

padded = tf.keras.utils.pad_sequences(
    sequences,
    maxlen=MAX_LENGTH,
    padding=PADDING_TYPE,
    truncating=TRUNCATING_TYPE
)

print(f"\nPadded shape: {padded.shape}")  # (8, 10)
print(f"Padded[0]: {padded[0]}")

# ---- Step 4: Build model ----
EMBEDDING_DIM = 16

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(VOCAB_SIZE, EMBEDDING_DIM, input_length=MAX_LENGTH),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(32)),
    tf.keras.layers.Dense(16, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')
])

model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

# ---- CRITICAL: Tokenize test data with the SAME tokenizer ----
# test_sequences = tokenizer.texts_to_sequences(test_sentences)
# test_padded = tf.keras.utils.pad_sequences(test_sequences, maxlen=MAX_LENGTH)
# Never fit a new tokenizer on test data!
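The model above is built but never trained. A minimal sketch of the remaining fit-and-predict steps, reusing the eight toy sentences from this section (the test sentence "machine learning is great" is invented, and with so little data the prediction itself is meaningless; the point is the shape of the pipeline):

```python
import numpy as np
import tensorflow as tf

sentences = [
    "I love machine learning and artificial intelligence",
    "Deep learning is a subset of machine learning",
    "Neural networks can learn complex patterns",
    "TensorFlow makes building models easy",
    "Natural language processing is fascinating",
    "Computer vision uses convolutional networks",
    "Recurrent networks handle sequential data",
    "Transfer learning saves training time",
]
labels = np.array([1, 1, 1, 1, 0, 0, 0, 0])

tokenizer = tf.keras.preprocessing.text.Tokenizer(num_words=1000, oov_token="<OOV>")
tokenizer.fit_on_texts(sentences)
padded = tf.keras.utils.pad_sequences(
    tokenizer.texts_to_sequences(sentences),
    maxlen=10, padding='post', truncating='post'
)

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(1000, 16),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(32)),
    tf.keras.layers.Dense(16, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')
])
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
model.fit(padded, labels, epochs=5, verbose=0)

# New text goes through the SAME tokenizer and the SAME maxlen
test_padded = tf.keras.utils.pad_sequences(
    tokenizer.texts_to_sequences(["machine learning is great"]),
    maxlen=10, padding='post', truncating='post'
)
probs = model.predict(test_padded, verbose=0)
print(probs.shape)  # (1, 1): one sigmoid probability per input sentence
```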

Practice Model 3: Multi-Class Text Classification

Classify text into multiple categories, such as news topics or document types. The exam may include tasks like this.

import tensorflow as tf
import numpy as np

# ---- Multi-class text classification pattern ----
# Example: BBC News dataset with 5 categories
# sport, business, politics, tech, entertainment

VOCAB_SIZE = 5000
MAX_LENGTH = 200
EMBEDDING_DIM = 64
NUM_CLASSES = 5

# Assume: sentences (list of strings), labels (list of ints 0-4)
# tokenizer = tf.keras.preprocessing.text.Tokenizer(
#     num_words=VOCAB_SIZE, oov_token="<OOV>"
# )
# tokenizer.fit_on_texts(train_sentences)
# train_sequences = tokenizer.texts_to_sequences(train_sentences)
# train_padded = tf.keras.utils.pad_sequences(train_sequences, maxlen=MAX_LENGTH)

# ---- Model with stacked LSTM ----
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(VOCAB_SIZE, EMBEDDING_DIM, input_length=MAX_LENGTH),

    # Stacked Bidirectional LSTM
    tf.keras.layers.Bidirectional(
        tf.keras.layers.LSTM(64, return_sequences=True)  # return_sequences for stacking
    ),
    tf.keras.layers.Bidirectional(
        tf.keras.layers.LSTM(32)  # Final LSTM does NOT return sequences
    ),

    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(NUM_CLASSES, activation='softmax')  # Multi-class
])

model.compile(
    optimizer='adam',
    loss='sparse_categorical_crossentropy',  # Integer labels
    metrics=['accuracy']
)

model.summary()

# ---- Alternative: 1D Convolution for text (often faster to train) ----
model_conv = tf.keras.Sequential([
    tf.keras.layers.Embedding(VOCAB_SIZE, EMBEDDING_DIM, input_length=MAX_LENGTH),
    tf.keras.layers.Conv1D(128, 5, activation='relu'),
    tf.keras.layers.GlobalMaxPooling1D(),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(NUM_CLASSES, activation='softmax')
])

model_conv.compile(
    optimizer='adam',
    loss='sparse_categorical_crossentropy',
    metrics=['accuracy']
)

Tokenizer Cheat Sheet

# Key Tokenizer parameters and methods

tokenizer = tf.keras.preprocessing.text.Tokenizer(
    num_words=10000,        # Max vocabulary size
    oov_token="<OOV>",      # Token for unknown words
    filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n',  # Chars to remove
    lower=True,             # Convert to lowercase
    split=' '               # Split on spaces
)

# Fit on training text ONLY
tokenizer.fit_on_texts(train_sentences)

# Convert text to sequences
train_seqs = tokenizer.texts_to_sequences(train_sentences)
test_seqs = tokenizer.texts_to_sequences(test_sentences)  # Uses same vocab

# Pad sequences
train_padded = tf.keras.utils.pad_sequences(train_seqs, maxlen=100, padding='post')
test_padded = tf.keras.utils.pad_sequences(test_seqs, maxlen=100, padding='post')

# Common mistakes:
# 1. Fitting tokenizer on test data (data leakage!)
# 2. Using different maxlen for train and test
# 3. Forgetting oov_token (unknown words are silently DROPPED from sequences)
# 4. Setting num_words too low (losing important vocabulary)
# 5. Not matching VOCAB_SIZE in Embedding layer to num_words
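Mistake 3 is easy to demonstrate: without an oov_token, texts_to_sequences silently drops any word not in the fitted vocabulary, shortening the sequence. A sketch with a one-sentence invented corpus:

```python
import tensorflow as tf

train = ["deep learning is fun"]
test = ["quantum learning is fun"]  # "quantum" never appeared in training

with_oov = tf.keras.preprocessing.text.Tokenizer(oov_token="<OOV>")
with_oov.fit_on_texts(train)

no_oov = tf.keras.preprocessing.text.Tokenizer()
no_oov.fit_on_texts(train)

seq_with = with_oov.texts_to_sequences(test)
seq_without = no_oov.texts_to_sequences(test)

print(seq_with[0])     # 4 tokens: "quantum" mapped to 1, the <OOV> index
print(seq_without[0])  # only 3 tokens: "quantum" silently dropped
```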

Key Takeaways

💡
  • Always use oov_token in the Tokenizer — without it, unknown words are silently dropped from sequences
  • Fit the tokenizer only on training data, then use it to transform both train and test
  • Use padding='post' and truncating='post' for consistency
  • Bidirectional LSTM generally outperforms unidirectional LSTM on exam tasks
  • For stacked LSTMs, set return_sequences=True on all layers except the last
  • Start with a small embedding dimension (16-64) and increase only if needed
  • Conv1D + GlobalMaxPooling1D is a fast alternative to LSTM that often works well on exam tasks
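The return_sequences rule from the takeaways can be checked directly by inspecting layer output shapes (the batch size, sequence length, and unit counts below are arbitrary):

```python
import numpy as np
import tensorflow as tf

x = np.random.randint(0, 100, size=(2, 12))   # batch of 2 sequences, length 12
emb = tf.keras.layers.Embedding(100, 8)(x)    # -> (2, 12, 8)

seq_out = tf.keras.layers.LSTM(16, return_sequences=True)(emb)
last_out = tf.keras.layers.LSTM(16)(seq_out)  # return_sequences=False (default)

print(seq_out.shape)   # (2, 12, 16): one vector per timestep -> stackable
print(last_out.shape)  # (2, 16): final state only -> feed into Dense
```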