Using Pretrained Models (Intermediate)
A step-by-step guide to loading pretrained models, running inference, processing data in batches, accelerating with GPUs, quantizing for speed, and serving models as APIs.
Loading with Hugging Face transformers
The most common way to load pretrained models:
```python
from transformers import AutoModel, AutoTokenizer

# Load any model and tokenizer by name
model_name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

# Task-specific loading
from transformers import AutoModelForSequenceClassification
model = AutoModelForSequenceClassification.from_pretrained(model_name)
```
Loading with PyTorch Hub
```python
import torch

# Load a model from PyTorch Hub
model = torch.hub.load("pytorch/vision", "resnet50", weights="IMAGENET1K_V2")
model.eval()

# Load YOLOv5 from Ultralytics
model = torch.hub.load("ultralytics/yolov5", "yolov5s")
```
Loading with TensorFlow Hub
```python
import tensorflow as tf
import tensorflow_hub as hub

# Load a TF Hub model
model = hub.load("https://tfhub.dev/google/imagenet/mobilenet_v2_100_224/classification/5")

# Use as a Keras layer
layer = hub.KerasLayer("https://tfhub.dev/google/nnlm-en-dim128/2")
embeddings = layer(["Hello world"])
```
Inference on Single Inputs
```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Tokenize and predict
inputs = tokenizer("This is an amazing tutorial!", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
prediction = torch.argmax(logits, dim=-1)
print(model.config.id2label[prediction.item()])  # POSITIVE
```
Batch Inference
```python
# Process multiple inputs at once for better throughput
texts = ["I love this", "This is terrible", "Not bad at all", "Absolutely wonderful"]
inputs = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)

with torch.no_grad():
    logits = model(**inputs).logits
predictions = torch.argmax(logits, dim=-1)

for text, pred in zip(texts, predictions):
    print(f"{text} -> {model.config.id2label[pred.item()]}")
```
GPU Acceleration
```python
# Move model and inputs to GPU
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)

text = "This is an amazing tutorial!"
inputs = tokenizer(text, return_tensors="pt").to(device)

# Or load directly to GPU with transformers
model = AutoModel.from_pretrained("bert-base-uncased", device_map="auto")

# For large models, use half precision
model = AutoModel.from_pretrained(
    "model-name",
    torch_dtype=torch.float16,
    device_map="auto",
)
```
Model Quantization
Reduce model size and speed up inference by quantizing weights from float32 to int8 or int4:
```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# 4-bit quantization (reduces memory ~4x vs float16)
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B",
    quantization_config=quantization_config,
    device_map="auto",
)
```
Serving Models with FastAPI
```python
from fastapi import FastAPI
from pydantic import BaseModel
from transformers import pipeline

app = FastAPI()
classifier = pipeline("sentiment-analysis")

class TextInput(BaseModel):
    text: str

@app.post("/predict")
def predict(input: TextInput):
    result = classifier(input.text)
    return {"prediction": result}

# Run with: uvicorn app:app --host 0.0.0.0 --port 8000
```
Complete Example: Image Classification API
```python
import requests
from PIL import Image
from transformers import pipeline

# Load image classifier (device=0 uses the first GPU)
classifier = pipeline("image-classification", model="google/vit-base-patch16-224", device=0)

# Classify an image fetched from a URL
url = "https://example.com/cat.jpg"
image = Image.open(requests.get(url, stream=True).raw)
results = classifier(image, top_k=5)
for r in results:
    print(f"{r['label']}: {r['score']:.4f}")
```
Next Up
Learn how to fine-tune pretrained models on your own data with LoRA and the Trainer API.
Lilly Tech Systems