Advanced Best Practices for Pretrained Models
Essential guidelines for choosing, using, and deploying pretrained models responsibly and effectively in production systems.
Choosing the Right Model
- Define your task clearly — Classification, generation, detection, translation? The task determines the model family.
- Start small — Try the smallest model that meets your accuracy requirements. DistilBERT before BERT, Whisper-tiny before Whisper-large.
- Check benchmarks — Look at model cards for evaluation metrics on standard benchmarks.
- Test on your data — Benchmark performance doesn't guarantee real-world results. Always evaluate on your specific dataset.
- Consider deployment constraints — Model size, latency requirements, hardware availability, and cost.
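Testing on your own data can be as simple as running each candidate through a shared harness. A minimal sketch (the model callables and sample data below are hypothetical stand-ins for real pipelines):

```python
import time

def evaluate(model_fn, samples):
    """Score a model callable on (input, label) pairs; report accuracy and mean latency."""
    correct, total_time = 0, 0.0
    for x, label in samples:
        start = time.perf_counter()
        pred = model_fn(x)
        total_time += time.perf_counter() - start
        correct += (pred == label)
    n = len(samples)
    return {"accuracy": correct / n, "mean_latency_s": total_time / n}

# Hypothetical stand-ins for a small and a larger model
def small_model(text):
    return "positive" if "good" in text else "negative"

def large_model(text):
    return "positive" if ("good" in text or "great" in text) else "negative"

samples = [("good movie", "positive"), ("great film", "positive"), ("bad plot", "negative")]
for name, fn in [("small", small_model), ("large", large_model)]:
    print(name, evaluate(fn, samples))
```

If the smaller model's accuracy is within your tolerance, its latency and cost advantages usually win.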
Model Size vs Performance
| Size Class | Parameters | Use Case | Hardware |
|---|---|---|---|
| Tiny | <100M | Edge, mobile, real-time | CPU, mobile chips |
| Small | 100M-1B | Production APIs, fast inference | Single GPU |
| Medium | 1B-13B | High-quality generation, complex tasks | 1-2 GPUs |
| Large | 13B-70B+ | Maximum quality, research | Multi-GPU, clusters |
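The size classes above translate into memory requirements. A back-of-the-envelope estimator for the weight-only footprint (bytes-per-parameter figures are the standard dtype sizes; activations and KV cache add further overhead and are ignored here):

```python
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "int4": 0.5}

def model_memory_gb(num_params, dtype="fp16"):
    """Rough weight-only memory footprint in GiB for a given precision."""
    return num_params * BYTES_PER_PARAM[dtype] / 1024**3

# A 7B-parameter model at various precisions:
for dtype in BYTES_PER_PARAM:
    print(f"{dtype}: {model_memory_gb(7e9, dtype):.1f} GB")
```

A 7B model needs roughly 13 GB in fp16 but only about 3.3 GB at 4-bit, which is why quantization often decides whether a model fits on a single GPU.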
License Considerations
| License | Commercial Use | Modify | Notes |
|---|---|---|---|
| Apache 2.0 | Yes | Yes | Most permissive. BERT, DistilBERT, Whisper. |
| MIT | Yes | Yes | Very permissive. Many community models. |
| Llama License | Yes (<700M MAU) | Yes | Meta's Llama models. Must accept terms. |
| CC-BY-NC | No | Yes | Research and non-commercial only. |
| OpenRAIL | Yes (with restrictions) | Yes | Responsible AI license. Stable Diffusion. |
Ethical Use
- Read the model card's limitations and bias sections
- Test for bias on your target demographic before deployment
- Add content filtering for generative models
- Be transparent with users about AI-generated content
- Monitor model outputs in production for harmful content
Version Pinning
```python
from transformers import AutoModel

# Pin the model version with a specific revision (commit hash on the Hub)
model = AutoModel.from_pretrained("bert-base-uncased", revision="a265f77")

# Pin library versions in requirements.txt:
# transformers==4.40.0
# torch==2.2.0
```
Caching Models
```shell
# Set a custom cache directory
$ export HF_HOME=/path/to/cache

# View cached models
$ huggingface-cli scan-cache

# Clean cache
$ huggingface-cli delete-cache
```
Deployment Strategies
| Strategy | Best For | Tools |
|---|---|---|
| REST API | Web applications | FastAPI, Flask, TorchServe |
| Serverless | Low-traffic, variable load | AWS Lambda, GCP Cloud Functions |
| Hugging Face Inference Endpoints | Managed deployment | HF platform |
| Edge / On-device | Mobile, IoT | ONNX Runtime, TFLite, Core ML |
| Batch processing | Large datasets offline | Ray, Spark, custom pipelines |
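For the batch-processing strategy, the core pattern is chunking inputs so per-call overhead is amortized across a batch. A minimal stdlib-only sketch (`fake_model` is a hypothetical stand-in for a real batched inference function):

```python
from typing import Iterable, Iterator, List

def batched(items: Iterable, batch_size: int) -> Iterator[List]:
    """Yield fixed-size batches; the last batch may be smaller."""
    batch = []
    for item in items:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch

def run_batch_job(model_fn, inputs, batch_size=32):
    """Offline batch inference: call the model once per batch, not once per item."""
    results = []
    for batch in batched(inputs, batch_size):
        results.extend(model_fn(batch))
    return results

# Hypothetical model that scores a whole batch in one call
def fake_model(batch):
    return [len(x) for x in batch]

print(run_batch_job(fake_model, ["a", "bb", "ccc"], batch_size=2))
```

Frameworks like Ray or Spark apply the same idea at cluster scale, distributing batches across workers.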
Frequently Asked Questions
How do I choose between multiple models for the same task?
Evaluate each model on a representative sample of your data. Consider accuracy, latency, model size, license, and community support. Start with the most popular model for your task and benchmark against alternatives.
Can I use pretrained models offline?
Yes. Download the model once, then load from a local directory: `model = AutoModel.from_pretrained("./local-model-dir")`. Set `HF_HUB_OFFLINE=1` to prevent any network requests.
How do I reduce inference latency?
Use quantization (4-bit or 8-bit), batch inputs, use GPU acceleration, try ONNX Runtime for optimized inference, or use a smaller model variant (distilled, pruned).
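To make the quantization idea concrete, here is a toy sketch of symmetric 8-bit quantization: weights are stored as int8 values plus a single float scale, shrinking memory roughly 4x versus fp32 at a small precision cost. (Real libraries like bitsandbytes operate per-tensor-block on GPU; this is only an illustration of the principle.)

```python
def quantize_int8(weights):
    """Symmetric 8-bit quantization: int8 values plus one float scale."""
    scale = max(abs(w) for w in weights) / 127
    if scale == 0:          # all-zero weights
        scale = 1.0
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from quantized values."""
    return [v * scale for v in q]

weights = [0.5, -1.27, 0.03]
q, scale = quantize_int8(weights)
recovered = dequantize(q, scale)
print(q, scale, recovered)
```

The largest-magnitude weight maps to ±127; everything else is rounded onto that grid, which is where the (usually tolerable) accuracy loss comes from.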
What if no pretrained model fits my task?
Fine-tune the closest available model on your data. If the domain is very different (e.g., medical, legal), consider continued pretraining followed by fine-tuning.
Course Complete!
You now have comprehensive knowledge of pretrained models — from discovery to deployment. Apply these skills to build powerful AI applications leveraging the work of the global research community.