Backdoor Removal
Learn proven techniques for neutralizing backdoors in compromised models, including fine-pruning, knowledge distillation, and machine unlearning approaches.
Fine-Pruning
Fine-pruning (Liu et al., 2018) combines neural network pruning with fine-tuning. The key insight is that backdoor behavior relies on neurons that are dormant during clean input processing but activate on triggered inputs.
```python
def fine_pruning(model, clean_data, prune_rate=0.3, finetune_epochs=10):
    """Remove backdoor by pruning dormant neurons, then fine-tuning."""
    # Step 1: Measure neuron activations on clean data
    activations = measure_activations(model, clean_data)

    # Step 2: Identify neurons with low activation on clean data;
    # these are candidates for backdoor neurons
    thresholds = compute_pruning_threshold(activations, prune_rate)

    # Step 3: Prune low-activation neurons
    for layer_name, threshold in thresholds.items():
        prune_neurons_below(model, layer_name, threshold)

    # Step 4: Fine-tune on clean data to recover accuracy
    fine_tune(model, clean_data, epochs=finetune_epochs)
    return model

# Fine-pruning reduces attack success rate from >95% to <5%
# while maintaining >95% clean accuracy in most cases
```
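The helpers above are pseudocode. As a concrete illustration of steps 1-3, the sketch below identifies the dormant-neuron mask and zeroes pruned neurons' outgoing weights using NumPy; the function names and the row-per-neuron weight layout are assumptions for this example, not part of the original method's API.

```python
import numpy as np

def dormant_neuron_mask(activations: np.ndarray, prune_rate: float) -> np.ndarray:
    """Mark the least-active neurons for pruning.

    activations: shape (n_samples, n_neurons), recorded on clean inputs.
    Returns a boolean mask where True means "prune this neuron".
    """
    mean_act = activations.mean(axis=0)       # average activation per neuron
    k = int(len(mean_act) * prune_rate)       # number of neurons to prune
    threshold = np.partition(mean_act, k)[k]  # k-th smallest mean activation
    return mean_act < threshold

def apply_pruning(weights: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Zero the outgoing weights of pruned neurons (assumes rows = neurons)."""
    pruned = weights.copy()
    pruned[mask, :] = 0.0
    return pruned
```

After pruning, the model is fine-tuned on the clean set to recover any lost clean accuracy, as in step 4 above.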
Knowledge Distillation
Distillation trains a fresh student model to mimic the teacher model's behavior on clean data. Because the backdoor never activates on clean inputs, the teacher's outputs on those inputs do not encode the trigger-to-target association, so the student does not inherit the backdoor behavior.
Generate Soft Labels
Run clean data through the potentially backdoored teacher model. Record the output probability distributions (soft labels).
Train Student Model
Train a new, randomly initialized student model to match the teacher's soft labels on clean data only.
Validate Removal
Test the student model against known triggers and clean data to confirm the backdoor was not transferred.
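The training objective in the steps above can be sketched as a soft-label matching loss. This is a minimal NumPy version using cross-entropy against the teacher's temperature-scaled distribution (equivalent to KL divergence up to the teacher's entropy); the function names, the temperature value, and the Hinton-style T² scaling are illustrative assumptions.

```python
import numpy as np

def softmax(logits: np.ndarray, T: float = 1.0) -> np.ndarray:
    """Temperature-scaled softmax over the last axis."""
    z = logits / T
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits: np.ndarray,
                      teacher_logits: np.ndarray,
                      T: float = 4.0) -> float:
    """Cross-entropy of student soft predictions against teacher soft labels.

    Computed only on clean inputs, so the teacher's trigger-target
    mapping never appears in the training signal.
    """
    p = softmax(teacher_logits, T)                    # teacher soft labels
    log_q = np.log(softmax(student_logits, T) + 1e-12)
    return float(-(p * log_q).sum(axis=-1).mean() * T * T)
```

A student whose logits match the teacher's minimizes this loss, while any student output can be validated against known triggers afterward, as in the final step.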
Removal Methods Comparison
| Method | Clean Data Needed | Effectiveness | Clean Accuracy Impact |
|---|---|---|---|
| Fine-Pruning | Small clean set (5-10%) | High for patch triggers | Minimal (1-2% drop) |
| Knowledge Distillation | Moderate clean set | High for most attacks | Moderate (2-5% drop) |
| Neural Attention Distillation | Small clean set | Very high | Minimal |
| Mode Connectivity Repair | Small clean set | High | Minimal |
| Retraining from Scratch | Full clean dataset | Complete | None (if data is clean) |
Machine Unlearning for Backdoors
Machine unlearning techniques specifically target and remove learned associations between triggers and target classes:
- Gradient ascent on trigger: Once the trigger is identified, perform gradient ascent on triggered samples to unlearn the association.
- Influence function removal: Use influence functions to identify and undo the effect of poisoned training samples.
- Elastic weight consolidation: Protect important clean-task weights while allowing backdoor-related weights to be modified freely.
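The first technique above, gradient ascent on the trigger, can be illustrated on a toy linear softmax classifier. This sketch takes one ascent step on a triggered sample to *increase* the loss of predicting the attacker's target class, weakening the trigger-to-target association; the function name, learning rate, and linear-model setting are assumptions made for this example only.

```python
import numpy as np

def unlearn_step(W: np.ndarray, x_trig: np.ndarray,
                 y_target: int, lr: float = 0.1) -> np.ndarray:
    """One gradient-ascent step on a triggered sample.

    W: (n_classes, n_features) weights of a linear softmax classifier.
    x_trig: (n_features,) a triggered input.
    y_target: the backdoor's target class.
    """
    logits = W @ x_trig
    p = np.exp(logits - logits.max())
    p /= p.sum()
    # Gradient of cross-entropy w.r.t. W: (p - onehot(y_target)) x^T
    grad = np.outer(p, x_trig)
    grad[y_target] -= x_trig
    # Ascent (+lr) rather than descent (-lr): raise the loss on the trigger
    return W + lr * grad
```

In practice this is interleaved with training on clean data (or constrained with a penalty such as elastic weight consolidation) so that clean-task accuracy is not destroyed while the trigger response is unlearned.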
Lilly Tech Systems