Fine-Tuning

Fine-tuning adapts a pre-trained model to your specific task and dataset. This lesson covers the Trainer API, TrainingArguments, custom dataset preparation, LoRA/PEFT for parameter-efficient fine-tuning, and evaluation strategies, with practice questions at the end.

Trainer API

The Trainer API is the primary way to fine-tune models in Hugging Face. It handles the training loop, gradient accumulation, mixed precision, distributed training, and logging automatically.

# Complete fine-tuning workflow with Trainer API
from transformers import (
    AutoTokenizer, AutoModelForSequenceClassification,
    Trainer, TrainingArguments
)
from datasets import load_dataset
import numpy as np
from evaluate import load as load_metric

# 1. Load dataset
dataset = load_dataset("imdb")

# 2. Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2
)

# 3. Tokenize the dataset
def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length",
                     truncation=True, max_length=512)

tokenized_datasets = dataset.map(tokenize_function, batched=True)

# 4. Define evaluation metrics
metric = load_metric("accuracy")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)

# 5. Configure training
training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=3,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=64,
    warmup_steps=500,
    weight_decay=0.01,
    learning_rate=2e-5,
    evaluation_strategy="epoch",       # Evaluate after each epoch ("eval_strategy" in transformers >= 4.41)
    save_strategy="epoch",             # Save after each epoch
    load_best_model_at_end=True,       # Load best checkpoint at end
    metric_for_best_model="accuracy",  # Use accuracy to select best
    logging_dir="./logs",
    logging_steps=100,
    fp16=True,                         # Mixed precision training
    push_to_hub=False                  # Set True to push to Hub
)

# 6. Create Trainer and train
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["test"],
    compute_metrics=compute_metrics
)

trainer.train()

# 7. Evaluate
results = trainer.evaluate()
print(results)  # e.g. {'eval_loss': 0.21, 'eval_accuracy': 0.93, ...}

# 8. Save the model
trainer.save_model("./my-finetuned-model")
tokenizer.save_pretrained("./my-finetuned-model")
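Settings like warmup_steps are easier to judge once you know how many optimization steps the run will take. A quick sketch of that arithmetic, assuming the IMDB train split (25,000 examples), a single device, and no gradient accumulation:

```python
import math

# Values taken from the TrainingArguments above; dataset size is IMDB's train split
num_examples = 25_000
per_device_train_batch_size = 16
num_train_epochs = 3
warmup_steps = 500

# Optimization steps per epoch (the last partial batch still counts as a step)
steps_per_epoch = math.ceil(num_examples / per_device_train_batch_size)
total_steps = steps_per_epoch * num_train_epochs

print(steps_per_epoch)                        # 1563
print(total_steps)                            # 4689
print(round(warmup_steps / total_steps, 3))   # 0.107 -> warmup covers ~11% of training
```

If the warmup fraction comes out much larger than ~10%, the learning rate may never reach its peak for long; recomputing this after changing batch size or epochs is a cheap sanity check.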

Custom Datasets

You must know how to load, preprocess, and tokenize custom datasets for fine-tuning. The datasets library provides tools for this.

# Working with custom datasets
from datasets import load_dataset, Dataset, DatasetDict

# Load from Hugging Face Hub
dataset = load_dataset("glue", "mrpc")       # GLUE benchmark
dataset = load_dataset("squad")              # SQuAD QA dataset
dataset = load_dataset("conll2003")          # NER dataset

# Load from local files
dataset = load_dataset("csv", data_files="my_data.csv")
dataset = load_dataset("json", data_files="my_data.json")

# Create from Python dict
data = {"text": ["I love this", "I hate this"], "label": [1, 0]}
dataset = Dataset.from_dict(data)

# Split into train/test
dataset = dataset.train_test_split(test_size=0.2, seed=42)

# Preprocessing with .map()
def preprocess(examples):
    # Tokenize
    tokenized = tokenizer(examples["text"], padding="max_length",
                          truncation=True, max_length=128)
    return tokenized

# Apply to entire dataset (batched for speed)
processed = dataset.map(preprocess, batched=True,
                        remove_columns=["text"])  # Remove raw text

# Filter examples
filtered = dataset.filter(lambda x: len(x["text"]) > 10)

# Select subset for quick testing
small_train = dataset["train"].select(range(1000))

# Key Dataset methods for the exam:
dataset_methods = {
    ".map()": "Apply function to all examples (batched=True for speed)",
    ".filter()": "Keep only examples matching a condition",
    ".select()": "Select specific indices",
    ".train_test_split()": "Split into train and test sets",
    ".rename_column()": "Rename a column",
    ".remove_columns()": "Remove columns not needed by the model",
    ".set_format()": "Set output format (torch, numpy, etc.)"
}

LoRA / PEFT

Parameter-Efficient Fine-Tuning (PEFT) methods like LoRA allow you to fine-tune large models by training only a small number of additional parameters, dramatically reducing memory and compute requirements.

# LoRA fine-tuning with PEFT library
from peft import LoraConfig, get_peft_model, TaskType
from transformers import AutoModelForSequenceClassification

# Load base model
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2
)

# Configure LoRA
lora_config = LoraConfig(
    task_type=TaskType.SEQ_CLS,     # Task type
    r=16,                            # Rank of the update matrices
    lora_alpha=32,                   # Scaling factor
    lora_dropout=0.1,                # Dropout for LoRA layers
    target_modules=["query", "value"],  # Which layers to adapt
    bias="none"                      # Don't train bias parameters
)

# Apply LoRA to the model
peft_model = get_peft_model(model, lora_config)

# Check trainable parameters
peft_model.print_trainable_parameters()
# Example output (exact counts depend on r, target_modules, and the model):
# trainable params: 294,912 || all params: 109,778,434 || trainable%: 0.27%

# Train with Trainer as usual
trainer = Trainer(
    model=peft_model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    compute_metrics=compute_metrics
)
trainer.train()

# Save and load LoRA model
peft_model.save_pretrained("./lora-model")
# Only saves the LoRA weights (small file)

# Load LoRA model back
from peft import PeftModel
base_model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2   # must match the fine-tuned head
)
loaded_model = PeftModel.from_pretrained(base_model, "./lora-model")

# Key PEFT concepts:
peft_concepts = {
    "LoRA": "Low-Rank Adaptation - adds small trainable matrices to attention layers",
    "r (rank)": "Rank of update matrices. Lower = fewer params. Typical: 8-64",
    "lora_alpha": "Scaling factor. Rule of thumb: 2x the rank",
    "target_modules": "Which layers to adapt (usually query and value projections)",
    "QLoRA": "LoRA + 4-bit quantization for even lower memory usage",
    "Advantages": "90%+ memory savings, faster training, small saved files"
}
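The mechanics behind these numbers can be sketched with plain numpy. LoRA freezes a pretrained weight matrix W and learns a low-rank update ΔW = (alpha/r)·B·A, where A is r×d_in and B is d_out×r. A toy version with BERT-like dimensions (all values here are illustrative, not the library's internals):

```python
import numpy as np

d_out, d_in, r, alpha = 768, 768, 16, 32

rng = np.random.default_rng(0)
W = rng.normal(size=(d_out, d_in))       # frozen pretrained weight
A = rng.normal(size=(r, d_in)) * 0.01    # trainable, small random init
B = np.zeros((d_out, r))                 # trainable, zero init

# Effective weight used in the forward pass
W_eff = W + (alpha / r) * (B @ A)

# Because B starts at zero, the adapted model initially equals the base model
assert np.allclose(W_eff, W)

# Trainable parameters per adapted matrix: only A and B
params_per_module = A.size + B.size
print(params_per_module)   # 24576, vs 768*768 = 589824 frozen (~4% of the matrix)
```

This also shows why lowering r shrinks the adapter linearly: A and B each scale with r, so halving r halves the trainable parameters for that module.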

Evaluation

# Evaluation metrics you should know
from evaluate import load

# Common metrics
accuracy = load("accuracy")
f1 = load("f1")
precision = load("precision")
recall = load("recall")
bleu = load("bleu")       # For translation
rouge = load("rouge")     # For summarization
squad = load("squad")     # For QA (exact match + F1)

# Using metrics with Trainer
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)

    acc = accuracy.compute(predictions=predictions, references=labels)
    f1_score = f1.compute(predictions=predictions, references=labels,
                          average="weighted")

    return {**acc, **f1_score}

# Metric selection guide:
metric_guide = {
    "Classification": "accuracy, f1 (weighted for multi-class)",
    "NER": "seqeval (entity-level precision/recall/F1)",
    "QA": "squad (exact_match + F1)",
    "Summarization": "rouge (rouge1, rouge2, rougeL)",
    "Translation": "bleu (corpus-level), sacrebleu"
}
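To see what average="weighted" means for multi-class F1, here is a hand-rolled sketch (pure Python, not the evaluate library's implementation): compute per-class F1, then average with weights proportional to each class's support.

```python
from collections import Counter

def weighted_f1(predictions, references):
    classes = sorted(set(references))
    support = Counter(references)          # examples per true class
    total = len(references)
    score = 0.0
    for c in classes:
        tp = sum(p == c and y == c for p, y in zip(predictions, references))
        fp = sum(p == c and y != c for p, y in zip(predictions, references))
        fn = sum(p != c and y == c for p, y in zip(predictions, references))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1_c = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        score += (support[c] / total) * f1_c   # weight by class frequency
    return score

preds = [0, 1, 1, 2, 2, 2]
refs  = [0, 1, 2, 2, 2, 2]
print(round(weighted_f1(preds, refs), 3))   # 0.849
```

The weighting matters on imbalanced data: a rare class with a terrible F1 barely moves the weighted average, whereas average="macro" would count every class equally.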

Practice Questions

Test your knowledge of fine-tuning:
Q1: What is the purpose of TrainingArguments in the Trainer API?

Answer: TrainingArguments configures all hyperparameters and settings for training, including: learning rate, batch size, number of epochs, evaluation strategy, save strategy, warmup steps, weight decay, mixed precision (fp16), logging, and whether to push to the Hub. It separates training configuration from model and data logic.

Q2: Why use batched=True with dataset.map()?

Answer: Setting batched=True processes multiple examples at once instead of one at a time. This is significantly faster because tokenizers are optimized for batch processing. Without it, the map function processes each example individually, which can be 10-100x slower for large datasets.

Q3: What does LoRA's rank parameter (r) control?

Answer: The rank r controls the dimensionality of the low-rank update matrices. A lower rank means fewer trainable parameters (faster training, less memory) but potentially lower quality. A higher rank allows more expressive updates but requires more resources. Typical values range from 8 to 64. For most tasks, r=16 provides a good balance.

Q4: What metric should you use to evaluate a summarization model?

Answer: ROUGE (Recall-Oriented Understudy for Gisting Evaluation). The three main variants are: rouge1 (unigram overlap), rouge2 (bigram overlap), and rougeL (longest common subsequence). ROUGE measures how much of the reference summary content appears in the generated summary.

Q5: What is the advantage of PEFT/LoRA over full fine-tuning?

Answer: PEFT/LoRA trains only 0.1-1% of the model parameters instead of all parameters. This provides: (1) 90%+ memory savings, (2) faster training, (3) smaller saved model files (only LoRA weights), (4) ability to fine-tune large models on consumer GPUs, and (5) easy switching between multiple task-specific adapters on the same base model.

Key Takeaways

  • The Trainer API handles the complete training loop — configure it with TrainingArguments
  • Always tokenize datasets with .map(batched=True) for efficiency
  • Use load_best_model_at_end=True with evaluation to keep the best checkpoint
  • LoRA/PEFT enables fine-tuning large models with minimal resources
  • Choose evaluation metrics based on your task: accuracy for classification, ROUGE for summarization, BLEU for translation