Advanced Best Practices for Pretrained Models
Essential guidelines for choosing, using, and deploying pretrained models responsibly and effectively in production systems.
Choosing the Right Model
- Define your task clearly — Classification, generation, detection, translation? The task determines the model family.
- Start small — Try the smallest model that meets your accuracy requirements. DistilBERT before BERT, Whisper-tiny before Whisper-large.
- Check benchmarks — Look at model cards for evaluation metrics on standard benchmarks.
- Test on your data — Benchmark performance doesn't guarantee real-world results. Always evaluate on your specific dataset.
- Consider deployment constraints — Model size, latency requirements, hardware availability, and cost.
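Testing on your own data can be as simple as running each candidate through a shared harness. A minimal sketch (the model callables and sample data below are hypothetical stand-ins for real pipelines):

```python
import time

def evaluate(model_fn, samples):
    """Score a model callable on (input, label) pairs; report accuracy and mean latency."""
    correct, total_time = 0, 0.0
    for x, label in samples:
        start = time.perf_counter()
        pred = model_fn(x)
        total_time += time.perf_counter() - start
        correct += (pred == label)
    n = len(samples)
    return {"accuracy": correct / n, "mean_latency_s": total_time / n}

# Hypothetical stand-ins for a small and a larger model
def small_model(text):
    return "positive" if "good" in text else "negative"

def large_model(text):
    return "positive" if ("good" in text or "great" in text) else "negative"

samples = [("good movie", "positive"), ("great film", "positive"), ("bad plot", "negative")]
for name, fn in [("small", small_model), ("large", large_model)]:
    print(name, evaluate(fn, samples))
```

If the smaller model's accuracy is within your tolerance, its latency and cost advantages usually win.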
Model Size vs Performance
| Size Class | Parameters | Use Case | Hardware |
|---|---|---|---|
| Tiny | <100M | Edge, mobile, real-time | CPU, mobile chips |
| Small | 100M-1B | Production APIs, fast inference | Single GPU |
| Medium | 1B-13B | High-quality generation, complex tasks | 1-2 GPUs |
| Large | 13B-70B+ | Maximum quality, research | Multi-GPU, clusters |
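The size classes above translate into memory requirements. A back-of-the-envelope estimator for the weight-only footprint (bytes-per-parameter figures are the standard dtype sizes; activations and KV cache add further overhead and are ignored here):

```python
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "int4": 0.5}

def model_memory_gb(num_params, dtype="fp16"):
    """Rough weight-only memory footprint in GiB for a given precision."""
    return num_params * BYTES_PER_PARAM[dtype] / 1024**3

# A 7B-parameter model at various precisions:
for dtype in BYTES_PER_PARAM:
    print(f"{dtype}: {model_memory_gb(7e9, dtype):.1f} GB")
```

A 7B model needs roughly 13 GB in fp16 but only about 3.3 GB at 4-bit, which is why quantization often decides whether a model fits on a single GPU.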
License Considerations
| License | Commercial Use | Modify | Notes |
|---|---|---|---|
| Apache 2.0 | Yes | Yes | Most permissive. BERT, DistilBERT, Whisper. |
| MIT | Yes | Yes | Very permissive. Many community models. |
| Llama License | Yes (<700M MAU) | Yes | Meta's Llama models. Must accept terms. |
| CC-BY-NC | No | Yes | Research and non-commercial only. |
| OpenRAIL | Yes (with restrictions) | Yes | Responsible AI license. Stable Diffusion. |
Ethical Use
- Read the model card's limitations and bias sections
- Test for bias on your target demographic before deployment
- Add content filtering for generative models
- Be transparent with users about AI-generated content
- Monitor model outputs in production for harmful content
Version Pinning
```python
from transformers import AutoModel

# Pin the model version with a specific revision (commit hash on the Hub)
model = AutoModel.from_pretrained("bert-base-uncased", revision="a265f77")

# Pin library versions in requirements.txt:
# transformers==4.40.0
# torch==2.2.0
```
Caching Models
```shell
# Set a custom cache directory
$ export HF_HOME=/path/to/cache

# View cached models
$ huggingface-cli scan-cache

# Clean cache
$ huggingface-cli delete-cache
```
Deployment Strategies
| Strategy | Best For | Tools |
|---|---|---|
| REST API | Web applications | FastAPI, Flask, TorchServe |
| Serverless | Low-traffic, variable load | AWS Lambda, GCP Cloud Functions |
| Hugging Face Inference Endpoints | Managed deployment | HF platform |
| Edge / On-device | Mobile, IoT | ONNX Runtime, TFLite, Core ML |
| Batch processing | Large datasets offline | Ray, Spark, custom pipelines |
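For the batch-processing strategy, the core pattern is chunking inputs so per-call overhead is amortized across a batch. A minimal stdlib-only sketch (`fake_model` is a hypothetical stand-in for a real batched inference function):

```python
from typing import Iterable, Iterator, List

def batched(items: Iterable, batch_size: int) -> Iterator[List]:
    """Yield fixed-size batches; the last batch may be smaller."""
    batch = []
    for item in items:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch

def run_batch_job(model_fn, inputs, batch_size=32):
    """Offline batch inference: call the model once per batch, not once per item."""
    results = []
    for batch in batched(inputs, batch_size):
        results.extend(model_fn(batch))
    return results

# Hypothetical model that scores a whole batch in one call
def fake_model(batch):
    return [len(x) for x in batch]

print(run_batch_job(fake_model, ["a", "bb", "ccc"], batch_size=2))
```

Frameworks like Ray or Spark apply the same idea at cluster scale, distributing batches across workers.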
Frequently Asked Questions
How do I choose between multiple models for the same task?
Evaluate each model on a representative sample of your data. Consider accuracy, latency, model size, license, and community support. Start with the most popular model for your task and benchmark against alternatives.
Can I use pretrained models offline?
Yes. Download the model once, then load from a local directory: `model = AutoModel.from_pretrained("./local-model-dir")`. Set `HF_HUB_OFFLINE=1` to prevent any network requests.
How do I reduce inference latency?
Use quantization (4-bit or 8-bit), batch inputs, use GPU acceleration, try ONNX Runtime for optimized inference, or use a smaller model variant (distilled, pruned).
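To make the quantization idea concrete, here is a toy sketch of symmetric 8-bit quantization: weights are stored as int8 values plus a single float scale, shrinking memory roughly 4x versus fp32 at a small precision cost. (Real libraries like bitsandbytes operate per-tensor-block on GPU; this is only an illustration of the principle.)

```python
def quantize_int8(weights):
    """Symmetric 8-bit quantization: int8 values plus one float scale."""
    scale = max(abs(w) for w in weights) / 127
    if scale == 0:          # all-zero weights
        scale = 1.0
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from quantized values."""
    return [v * scale for v in q]

weights = [0.5, -1.27, 0.03]
q, scale = quantize_int8(weights)
recovered = dequantize(q, scale)
print(q, scale, recovered)
```

The largest-magnitude weight maps to ±127; everything else is rounded onto that grid, which is where the (usually tolerable) accuracy loss comes from.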
What if no pretrained model fits my task?
Fine-tune the closest available model on your data. If the domain is very different (e.g., medical, legal), consider continued pretraining followed by fine-tuning.
Course Complete!
You now have comprehensive knowledge of pretrained models — from discovery to deployment. Apply these skills to build powerful AI applications leveraging the work of the global research community.