Using Pretrained Models (Intermediate)
A step-by-step guide to loading pretrained models, running inference, processing data in batches, accelerating with GPUs, quantizing for speed, and serving models as APIs.
Loading with Hugging Face transformers
The most common way to load pretrained models:
```python
from transformers import AutoModel, AutoTokenizer

# Load any model and tokenizer by name
model_name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

# Task-specific loading
from transformers import AutoModelForSequenceClassification
model = AutoModelForSequenceClassification.from_pretrained(model_name)
```
Loading with PyTorch Hub
```python
import torch

# Load a model from PyTorch Hub
model = torch.hub.load("pytorch/vision", "resnet50", weights="IMAGENET1K_V2")
model.eval()

# Load YOLOv5 from Ultralytics
model = torch.hub.load("ultralytics/yolov5", "yolov5s")
```
Loading with TensorFlow Hub
```python
import tensorflow as tf
import tensorflow_hub as hub

# Load a TF Hub model
model = hub.load("https://tfhub.dev/google/imagenet/mobilenet_v2_100_224/classification/5")

# Use as a Keras layer
layer = hub.KerasLayer("https://tfhub.dev/google/nnlm-en-dim128/2")
embeddings = layer(["Hello world"])
```
Inference on Single Inputs
```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Tokenize and predict
inputs = tokenizer("This is an amazing tutorial!", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
prediction = torch.argmax(logits, dim=-1)
print(model.config.id2label[prediction.item()])  # POSITIVE
```
Batch Inference
```python
# Process multiple inputs at once for better throughput
texts = ["I love this", "This is terrible", "Not bad at all", "Absolutely wonderful"]
inputs = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)

with torch.no_grad():
    logits = model(**inputs).logits
predictions = torch.argmax(logits, dim=-1)

for text, pred in zip(texts, predictions):
    print(f"{text} -> {model.config.id2label[pred.item()]}")
```
GPU Acceleration
```python
# Move model and inputs to GPU
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)

text = "This is an amazing tutorial!"
inputs = tokenizer(text, return_tensors="pt").to(device)

# Or load directly to GPU with transformers
model = AutoModel.from_pretrained("bert-base-uncased", device_map="auto")

# For large models, use half precision
model = AutoModel.from_pretrained(
    "model-name",
    torch_dtype=torch.float16,
    device_map="auto",
)
```
Model Quantization
Reduce model size and speed up inference by quantizing weights from float32 to int8 or int4:
```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# 4-bit quantization (reduces memory ~4x vs float16)
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B",
    quantization_config=quantization_config,
    device_map="auto",
)
```
Serving Models with FastAPI
```python
from fastapi import FastAPI
from pydantic import BaseModel
from transformers import pipeline

app = FastAPI()
classifier = pipeline("sentiment-analysis")

class TextInput(BaseModel):
    text: str

@app.post("/predict")
def predict(input: TextInput):
    result = classifier(input.text)
    return {"prediction": result}

# Run with: uvicorn app:app --host 0.0.0.0 --port 8000
```
Complete Example: Image Classification API
```python
import requests
from PIL import Image
from transformers import pipeline

# Load image classifier (device=0 uses the first GPU)
classifier = pipeline("image-classification", model="google/vit-base-patch16-224", device=0)

# Classify an image fetched from a URL
url = "https://example.com/cat.jpg"
image = Image.open(requests.get(url, stream=True).raw)
results = classifier(image, top_k=5)
for r in results:
    print(f"{r['label']}: {r['score']:.4f}")
```
Next Up
Learn how to fine-tune pretrained models on your own data with LoRA and the Trainer API.
Lilly Tech Systems