
Backdoor Removal

Learn proven techniques for removing backdoors from compromised models, including fine-pruning, knowledge distillation, and machine unlearning approaches.

Fine-Pruning

Fine-pruning (Liu et al., 2018) combines neural network pruning with fine-tuning. The key insight is that backdoor behavior relies on neurons that are dormant during clean input processing but activate on triggered inputs.

Python - Fine-Pruning Strategy
import torch
import torch.nn as nn

def fine_pruning(model, layer, clean_loader, prune_rate=0.3, finetune_epochs=10, lr=1e-4):
    """Remove backdoor by pruning dormant neurons of `layer` (typically the
    last convolutional layer, per Liu et al., 2018), then fine-tuning."""
    # Step 1: Measure each neuron's mean activation on clean data
    acts = []
    def record(_module, _inputs, out):
        out = out.detach().relu()
        acts.append(out.flatten(2).mean(dim=(0, 2)) if out.dim() > 2 else out.mean(dim=0))
    handle = layer.register_forward_hook(record)
    model.eval()
    with torch.no_grad():
        for x, _ in clean_loader:
            model(x)
    handle.remove()
    mean_act = torch.stack(acts).mean(dim=0)

    # Steps 2-3: Neurons least active on clean data are backdoor candidates;
    # prune them by zeroing their outgoing weights
    prune_idx = torch.argsort(mean_act)[: int(prune_rate * mean_act.numel())]
    with torch.no_grad():
        layer.weight[prune_idx] = 0
        if layer.bias is not None:
            layer.bias[prune_idx] = 0

    # Step 4: Fine-tune on clean data to recover accuracy,
    # re-applying the pruning mask after each update
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    model.train()
    for _ in range(finetune_epochs):
        for x, y in clean_loader:
            opt.zero_grad()
            nn.functional.cross_entropy(model(x), y).backward()
            opt.step()
            with torch.no_grad(): layer.weight[prune_idx] = 0

    return model

# Fine-pruning reduces attack success rate from >95% to <5%
# while maintaining >95% clean accuracy in most cases
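Claims like the one above are easiest to check with a small evaluation helper. The sketch below assumes a hypothetical `triggered_loader` that yields inputs with the trigger already applied; it measures the fraction of triggered inputs that still land in the attacker's target class:

```python
import torch

def attack_success_rate(model, triggered_loader, target_class):
    """Fraction of trigger-stamped inputs classified as the attacker's target."""
    model.eval()
    hits, total = 0, 0
    with torch.no_grad():
        for x, _ in triggered_loader:
            preds = model(x).argmax(dim=1)
            hits += (preds == target_class).sum().item()
            total += x.size(0)
    return hits / total
```

Run it once before and once after fine-pruning; a successful defense should drive the rate down toward the base rate of the target class.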

Knowledge Distillation

Distillation trains a fresh student model to mimic a potentially backdoored teacher's behavior on clean data. Because the student learns only from the teacher's outputs on clean inputs, where the backdoor is never expressed, the trigger-to-target association has no channel through which to transfer.

  1. Generate Soft Labels

    Run clean data through the potentially backdoored teacher model. Record the output probability distributions (soft labels).

  2. Train Student Model

    Train a new, randomly initialized student model to match the teacher's soft labels on clean data only.

  3. Validate Removal

    Test the student model against known triggers and clean data to confirm the backdoor was not transferred.

💡
Key advantage: Distillation does not require knowing the trigger or the attack method. It works as a general-purpose defense because the student only learns from clean data behavior.
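The three steps above can be sketched in a few lines of PyTorch. This is a minimal illustration, not the only formulation; the temperature `T` and the `T*T` loss scaling follow the standard distillation recipe (Hinton et al.):

```python
import torch
import torch.nn.functional as F

def distill_student(teacher, student, clean_loader, epochs=10, T=4.0, lr=1e-3):
    """Train a fresh student to match the teacher's soft labels on clean data."""
    opt = torch.optim.Adam(student.parameters(), lr=lr)
    teacher.eval()
    student.train()
    for _ in range(epochs):
        for x, _ in clean_loader:  # hard labels unused: soft labels only
            with torch.no_grad():
                soft = F.softmax(teacher(x) / T, dim=1)        # Step 1: soft labels
            log_p = F.log_softmax(student(x) / T, dim=1)
            loss = F.kl_div(log_p, soft, reduction="batchmean") * T * T  # Step 2
            opt.zero_grad()
            loss.backward()
            opt.step()
    return student  # Step 3: validate against known triggers separately
```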

Removal Methods Comparison

| Method | Clean Data Needed | Effectiveness | Clean Accuracy Impact |
|---|---|---|---|
| Fine-Pruning | Small clean set (5-10%) | High for patch triggers | Minimal (1-2% drop) |
| Knowledge Distillation | Moderate clean set | High for most attacks | Moderate (2-5% drop) |
| Neural Attention Distillation | Small clean set | Very high | Minimal |
| Mode Connectivity Repair | Small clean set | High | Minimal |
| Retraining from Scratch | Full clean dataset | Complete | None (if data is clean) |

Machine Unlearning for Backdoors

Machine unlearning techniques specifically target and remove learned associations between triggers and target classes:

  • Gradient ascent on trigger: Once the trigger is identified, perform gradient ascent on triggered samples to unlearn the association.
  • Influence function removal: Use influence functions to identify and undo the effect of poisoned training samples.
  • Elastic weight consolidation: Protect important clean-task weights while allowing backdoor-related weights to be modified freely.
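The first bullet can be sketched as follows, again assuming a hypothetical `triggered_loader` that yields trigger-stamped inputs paired with the attack's target label:

```python
import torch

def unlearn_trigger(model, triggered_loader, epochs=1, lr=1e-4):
    """Unlearn a trigger -> target-class association by gradient ascent."""
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    model.train()
    for _ in range(epochs):
        for x_trig, y_target in triggered_loader:
            # Negating the loss turns gradient descent into ascent,
            # pushing the model away from predicting the target class
            loss = -torch.nn.functional.cross_entropy(model(x_trig), y_target)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return model
```

In practice the ascent steps are interleaved with ordinary descent on clean data (or regularized, as in the elastic weight consolidation bullet) so that clean accuracy does not collapse.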

Verification is critical: After applying any removal technique, verify the result with multiple detection methods. Some techniques reduce but do not fully eliminate the backdoor, and adaptive attacks can survive certain removal methods.