Pretrained Models: Best Practices (Advanced)

Essential guidelines for choosing, using, and deploying pretrained models responsibly and effectively in production systems.

Choosing the Right Model

  1. Define your task clearly — Classification, generation, detection, translation? The task determines the model family.
  2. Start small — Try the smallest model that meets your accuracy requirements. DistilBERT before BERT, Whisper-tiny before Whisper-large.
  3. Check benchmarks — Look at model cards for evaluation metrics on standard benchmarks.
  4. Test on your data — Benchmark performance doesn't guarantee real-world results. Always evaluate on your specific dataset.
  5. Consider deployment constraints — Model size, latency requirements, hardware availability, and cost.
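The checklist above can be turned into a first-pass filter before any accuracy testing: rule out candidates that cannot meet your deployment constraints, then evaluate only the survivors on your data. A minimal sketch — the candidate list and all size/latency numbers here are hypothetical placeholders, not measured figures:

```python
# Hypothetical candidates with rough size and latency figures (placeholders).
CANDIDATES = [
    {"name": "distilbert-base-uncased", "params_m": 66,  "p95_latency_ms": 12},
    {"name": "bert-base-uncased",       "params_m": 110, "p95_latency_ms": 25},
    {"name": "bert-large-uncased",      "params_m": 340, "p95_latency_ms": 70},
]

def shortlist(candidates, max_params_m, max_latency_ms):
    """Keep only models that fit the deployment constraints."""
    return [
        c["name"]
        for c in candidates
        if c["params_m"] <= max_params_m and c["p95_latency_ms"] <= max_latency_ms
    ]

# Example: a real-time API with a tight latency budget.
print(shortlist(CANDIDATES, max_params_m=150, max_latency_ms=30))
```

Only the shortlisted models then need a full evaluation on a representative sample of your dataset.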

Model Size vs Performance

Size Class | Parameters | Use Case                               | Hardware
Tiny       | <100M      | Edge, mobile, real-time                | CPU, mobile chips
Small      | 100M-1B    | Production APIs, fast inference        | Single GPU
Medium     | 1B-13B     | High-quality generation, complex tasks | 1-2 GPUs
Large      | 13B-70B+   | Maximum quality, research              | Multi-GPU, clusters
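The tiers above can be encoded as a small helper for tagging candidate models. The tier boundaries are approximate (the classes shade into each other), so treat this as a sketch rather than a standard taxonomy:

```python
def size_class(params_b: float) -> str:
    """Map a parameter count (in billions) to the size tiers above.
    Boundaries are approximate conventions, not hard rules."""
    if params_b < 0.1:   # <100M
        return "Tiny"
    if params_b < 1:     # 100M-1B
        return "Small"
    if params_b < 13:    # 1B-13B
        return "Medium"
    return "Large"       # 13B and up

print(size_class(0.066))  # DistilBERT, ~66M parameters -> Tiny
print(size_class(7))      # a typical 7B chat model -> Medium
```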

License Considerations

License       | Commercial Use          | Modify | Notes
Apache 2.0    | Yes                     | Yes    | Most permissive. BERT, DistilBERT, Whisper.
MIT           | Yes                     | Yes    | Very permissive. Many community models.
Llama License | Yes (<700M MAU)         | Yes    | Meta's Llama models. Must accept terms.
CC-BY-NC      | No                      | Yes    | Research and non-commercial only.
OpenRAIL      | Yes (with restrictions) | Yes    | Responsible AI license. Stable Diffusion.
Always check the license before using a model in production. Some models that appear open-source have restrictions on commercial use or require attribution.
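This check can be automated as a simple gate in a model-intake script: compare the model's license tag (for example, the one published in its model card metadata on the Hub) against an allowlist. The sets below are illustrative placeholders, not legal advice — your legal or compliance team should define the real ones:

```python
# Hypothetical production gate: only licenses cleared for commercial use pass.
COMMERCIAL_OK = {"apache-2.0", "mit", "bsd-3-clause"}
NEEDS_REVIEW = {"llama2", "llama3", "openrail"}  # usable, but read the terms first

def license_status(license_tag: str) -> str:
    """Classify a model's license tag for a commercial deployment pipeline."""
    tag = license_tag.lower()
    if tag in COMMERCIAL_OK:
        return "allowed"
    if tag in NEEDS_REVIEW:
        return "review"
    return "blocked"  # e.g. cc-by-nc-4.0, or anything unrecognized

print(license_status("apache-2.0"))    # allowed
print(license_status("cc-by-nc-4.0"))  # blocked
```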

Ethical Use

  • Read the model card's limitations and bias sections
  • Test for bias on your target demographic before deployment
  • Add content filtering for generative models
  • Be transparent with users about AI-generated content
  • Monitor model outputs in production for harmful content

Version Pinning

Python
# Pin model version with a specific revision
from transformers import AutoModel

model = AutoModel.from_pretrained("bert-base-uncased", revision="a265f77")

# Pin library versions in requirements.txt
# transformers==4.40.0
# torch==2.2.0

Caching Models

Terminal
# Set custom cache directory
$ export HF_HOME=/path/to/cache

# View cached models
$ huggingface-cli scan-cache

# Clean cache
$ huggingface-cli delete-cache
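The same cache location can be set from Python instead of the shell, as long as it happens before the libraries that read it are imported (huggingface_hub resolves HF_HOME at import time). The path here is a placeholder:

```python
import os

# Must run before importing transformers / huggingface_hub,
# because they read HF_HOME when they are first imported.
os.environ["HF_HOME"] = "/path/to/cache"  # placeholder path

# Downstream imports now use the custom cache, e.g.:
# from transformers import AutoModel
# model = AutoModel.from_pretrained("bert-base-uncased")
print(os.environ["HF_HOME"])
```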

Deployment Strategies

Strategy                         | Best For                   | Tools
REST API                         | Web applications           | FastAPI, Flask, TorchServe
Serverless                       | Low-traffic, variable load | AWS Lambda, GCP Cloud Functions
Hugging Face Inference Endpoints | Managed deployment         | HF platform
Edge / On-device                 | Mobile, IoT                | ONNX Runtime, TFLite, Core ML
Batch processing                 | Large datasets offline     | Ray, Spark, custom pipelines
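The batch-processing strategy boils down to one pattern: chunk the dataset into fixed-size batches and run one model call per batch. A minimal framework-free sketch — `predict_fn` is a stand-in for your real model callable (e.g. a transformers pipeline):

```python
from typing import Iterable, Iterator, List

def batched(items: Iterable, batch_size: int) -> Iterator[List]:
    """Yield fixed-size batches so the model runs one forward pass per batch."""
    batch: List = []
    for item in items:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # final partial batch
        yield batch

def run_batch_job(texts, predict_fn, batch_size=32):
    """Apply a model callable to a dataset batch by batch."""
    results = []
    for batch in batched(texts, batch_size):
        results.extend(predict_fn(batch))
    return results

# Stand-in model: returns the length of each input text.
fake_model = lambda batch: [len(t) for t in batch]
print(run_batch_job(["a", "bb", "ccc"], fake_model, batch_size=2))  # [1, 2, 3]
```

The same loop scales up by swapping `predict_fn` for a real model and distributing the batches with Ray or Spark.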

Frequently Asked Questions

How do I choose between multiple models for the same task?

Evaluate each model on a representative sample of your data. Consider accuracy, latency, model size, license, and community support. Start with the most popular model for your task and benchmark against alternatives.

Can I use pretrained models offline?

Yes. Download the model once, then load from a local directory: model = AutoModel.from_pretrained("./local-model-dir"). Set HF_HUB_OFFLINE=1 to prevent any network requests.

How do I reduce inference latency?

Apply quantization (4-bit or 8-bit), batch inputs together, run on a GPU, try ONNX Runtime for optimized inference, or switch to a smaller model variant (distilled or pruned).
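Batching helps because each model call pays a fixed overhead (kernel launches, tokenization setup) on top of the per-item cost, so fewer calls means less total overhead. A toy cost model makes the effect concrete — both constants below are made-up illustrative numbers, not measurements:

```python
# Toy latency model: total time = calls * fixed overhead + items * marginal cost.
FIXED_OVERHEAD_MS = 5.0  # hypothetical per-call cost
PER_ITEM_MS = 1.0        # hypothetical marginal cost per input

def total_time_ms(n_items: int, batch_size: int) -> float:
    n_calls = -(-n_items // batch_size)  # ceiling division
    return n_calls * FIXED_OVERHEAD_MS + n_items * PER_ITEM_MS

print(total_time_ms(64, 1))   # 64 calls -> 384.0 ms
print(total_time_ms(64, 32))  # 2 calls  -> 74.0 ms
```

Real speedups depend on hardware and memory limits (too large a batch can exhaust GPU memory), so tune the batch size empirically.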

What if no pretrained model fits my task?

Fine-tune the closest available model on your data. If the domain is very different (e.g., medical, legal), consider continued pretraining followed by fine-tuning.

Course Complete!

You now have comprehensive knowledge of pretrained models — from discovery to deployment. Apply these skills to build powerful AI applications leveraging the work of the global research community.
