Model Fingerprinting
Unlike watermarking, which embeds a signal into the model, fingerprinting is non-invasive: it identifies a model by observing behavior the model already has. A trained model tends to have a distinctive "decision boundary fingerprint" that carries over to exact copies and, to a lesser degree, to derivatives such as fine-tuned versions.
How Model Fingerprinting Works
Models trained on different data, with different hyperparameters, or from different random initializations develop different decision boundaries. Even independently trained models with identical architectures produce slightly different outputs for edge-case inputs. Fingerprinting exploits this: a suspect model that agrees with the original on those edge cases is likely a copy.
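The idea can be illustrated with a toy sketch. The one-dimensional threshold classifiers, the threshold values, and the input lists below are all made up for illustration; real models have far more complex boundaries, but the pattern is the same: models agree on typical inputs and diverge near the boundary.

```python
def make_threshold_classifier(threshold):
    """Hypothetical 1-D classifier: class 1 if x >= threshold, else class 0."""
    return lambda x: int(x >= threshold)


def agreement(f, g, inputs):
    """Fraction of inputs on which two classifiers predict the same class."""
    return sum(f(x) == g(x) for x in inputs) / len(inputs)


original = make_threshold_classifier(0.50)
copy = make_threshold_classifier(0.50)         # stolen copy: same boundary
independent = make_threshold_classifier(0.53)  # trained separately: shifted boundary

typical_inputs = [0.1, 0.2, 0.8, 0.9]      # far from any boundary
edge_inputs = [0.49, 0.51, 0.52, 0.54]     # close to the original's boundary

print(agreement(original, independent, typical_inputs))  # 1.0
print(agreement(original, independent, edge_inputs))     # 0.5
print(agreement(original, copy, edge_inputs))            # 1.0
```

Typical inputs cannot tell the copy from the independent model; only the edge-case inputs do, which is why fingerprint probes are chosen near the boundary.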
Conferrable Adversarial Examples
A key technique in model fingerprinting is the conferrable adversarial example: an adversarial example that transfers from an original model to its copies but not to independently trained models.
```python
import torch
import torch.nn.functional as F


def generate_fingerprint(model, base_inputs, num_probes=100, epsilon=0.05):
    """Generate fingerprint probes from the model's decision boundary.

    Uses a single FGSM-style step on the first `num_probes` base inputs.
    """
    model.eval()
    fingerprint_probes = []
    for x in base_inputs[:num_probes]:
        # Work on a detached copy so the gradient is tracked for this input only
        x = x.clone().detach().requires_grad_(True)
        output = model(x.unsqueeze(0))
        original_pred = output.argmax(dim=1)

        # Loss on the current prediction; stepping along its gradient
        # pushes the input toward (and possibly across) the boundary
        loss = F.cross_entropy(output, original_pred)
        loss.backward()

        # Craft adversarial example near decision boundary
        probe = (x + epsilon * x.grad.sign()).detach()
        with torch.no_grad():
            probe_pred = model(probe.unsqueeze(0)).argmax(dim=1)

        # Keep probes where prediction changes (near boundary)
        if original_pred.item() != probe_pred.item():
            fingerprint_probes.append({
                "input": probe,
                "expected_pred": probe_pred.item(),
            })
    return fingerprint_probes


def verify_fingerprint(suspect_model, fingerprint_probes, threshold=0.7):
    """Check if suspect model matches our fingerprint."""
    if not fingerprint_probes:
        raise ValueError("no fingerprint probes to verify against")
    suspect_model.eval()
    matches = 0
    with torch.no_grad():
        for probe in fingerprint_probes:
            pred = suspect_model(probe["input"].unsqueeze(0)).argmax().item()
            if pred == probe["expected_pred"]:
                matches += 1
    match_rate = matches / len(fingerprint_probes)
    return match_rate > threshold  # Copies share decision boundaries
```
LLM Fingerprinting
For large language models, fingerprinting involves crafting prompts that elicit unique response patterns:
- Style fingerprints: Specific prompts where the model consistently uses particular phrases, sentence structures, or formatting.
- Knowledge fingerprints: Questions about obscure topics where the model's training data produces distinctive (possibly incorrect) answers.
- Behavioral fingerprints: Edge-case prompts where the model exhibits unique refusal patterns, hedging language, or reasoning patterns.
- Token probability fingerprints: The probability distribution over next tokens for specific prompts, where the suspect model exposes it. Copied models produce similar distributions.
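As a rough sketch of the token probability comparison, the snippet below scores how far apart two next-token distributions are using Jensen-Shannon divergence. The choice of this metric, the function name `js_divergence`, and the example probabilities are assumptions made for illustration; the distributions would in practice come from whatever log-probability output the models expose.

```python
import math


def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2) between two token distributions.

    p and q map token -> probability; tokens missing from a dict count as 0.
    The result lies in [0, 1]: 0 for identical distributions, 1 for disjoint ones.
    """
    tokens = set(p) | set(q)
    mixture = {t: 0.5 * (p.get(t, 0.0) + q.get(t, 0.0)) for t in tokens}

    def kl_to_mixture(dist):
        # KL(dist || mixture), skipping zero-probability tokens
        return sum(
            dist[t] * math.log2(dist[t] / mixture[t])
            for t in dist
            if dist[t] > 0
        )

    return 0.5 * kl_to_mixture(p) + 0.5 * kl_to_mixture(q)


# Made-up next-token probabilities for one probe prompt
original = {"Paris": 0.90, "London": 0.06, "Rome": 0.04}
suspect_copy = {"Paris": 0.88, "London": 0.07, "Rome": 0.05}
independent = {"Paris": 0.60, "London": 0.25, "Rome": 0.15}

print(js_divergence(original, suspect_copy) < js_divergence(original, independent))  # True
```

A single prompt is weak evidence; averaging the divergence over many probe prompts gives a more stable score.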
Limitations of Fingerprinting
- Not a positive signal: Fingerprinting detects behavioral similarity. Unlike a watermark, it is not an intentionally embedded proof of ownership.
- Can degrade with modification: Heavy fine-tuning can shift decision boundaries enough to weaken the fingerprint.
- Requires query access: You must be able to run your fingerprint inputs through the suspect model and observe its outputs.
- False positive risk: Models trained on similar data may share some boundary characteristics.
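One way to reason about the false positive risk is to ask how likely an unrelated model would be to match as many probes by chance. The sketch below assumes that probes match independently and that a baseline match rate has been measured on models known to be independently trained; the function name and both assumptions are illustrative, not part of any standard fingerprinting procedure.

```python
from math import comb


def false_positive_probability(matches, num_probes, baseline_rate):
    """P(X >= matches) for X ~ Binomial(num_probes, baseline_rate).

    baseline_rate is the per-probe match rate observed on independently
    trained models (an assumed, separately measured quantity).
    """
    if not 0 <= matches <= num_probes:
        raise ValueError("matches must be between 0 and num_probes")
    if not 0.0 <= baseline_rate <= 1.0:
        raise ValueError("baseline_rate must be a probability")
    return sum(
        comb(num_probes, k)
        * baseline_rate ** k
        * (1.0 - baseline_rate) ** (num_probes - k)
        for k in range(matches, num_probes + 1)
    )


# Two probes, coin-flip baseline: both match by chance with probability 0.25
print(false_positive_probability(2, 2, 0.5))  # 0.25
```

If independent models match around 30% of probes, a suspect matching 80 of 100 gives a vanishingly small tail probability, which is much stronger evidence than a fixed threshold alone. Real probes are usually correlated, so the result is best read as an optimistic estimate.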