Project Setup (Beginner)
In this project you will build a complete recommendation engine from the ground up. This first lesson covers the system architecture, introduces the MovieLens dataset, defines our tech stack, and walks through environment setup so you can start coding immediately.
System Architecture
Our recommendation engine follows a layered architecture that separates data ingestion, model training, and serving:
```text
+------------------+     +---------------------+     +-----------------+
|    MovieLens     | --> |    Data Pipeline    | --> |   Model Layer   |
|     Dataset      |     |   (pandas, numpy)   |     |  - CF (sklearn) |
+------------------+     +---------------------+     |  - Content      |
                                                     |  - NCF (torch)  |
                                                     +--------+--------+
                                                              |
                                                     +--------v--------+
                                                     |     FastAPI     |
                                                     |    REST API     |
                                                     +--------+--------+
                                                              |
                                                     +--------v--------+
                                                     |   Redis Cache   |
                                                     +-----------------+
```
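Requests flow top to bottom: the API layer checks Redis first and only falls back to the model layer on a miss (the cache-aside pattern). Here is a minimal sketch of that pattern, with a plain dict standing in for Redis and an illustrative `RecommendationCache` class (both names are ours, not part of any library):

```python
class RecommendationCache:
    """Cache-aside sketch: a dict stands in for Redis here;
    swap the dict for a redis.Redis client in the real service."""

    def __init__(self):
        self._store = {}  # stand-in for Redis

    def get_or_compute(self, user_id, compute_fn):
        key = f"recs:{user_id}"
        if key in self._store:       # cache hit: skip the model entirely
            return self._store[key]
        recs = compute_fn(user_id)   # cache miss: run the model layer
        self._store[key] = recs      # write back for the next request
        return recs


cache = RecommendationCache()
recs = cache.get_or_compute(42, lambda uid: [10, 25, 7])  # computed
again = cache.get_or_compute(42, lambda uid: [])          # served from cache
```

With real Redis you would also set a TTL on the key so stale recommendations expire; the control flow is identical.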
The MovieLens Dataset
We use MovieLens 100K, a widely used benchmark dataset from the GroupLens research lab at the University of Minnesota. It contains:
- 100,000 ratings from 943 users on 1,682 movies
- Ratings on a 1–5 integer scale
- Demographic data (age, gender, occupation, zip code)
- Movie metadata (title, release date, genres, IMDb URL)
Why MovieLens? It is the standard benchmark for recommendation system research. Results on MovieLens are directly comparable across academic papers and industry tutorials. The 100K variant is small enough to iterate quickly on a laptop but large enough to demonstrate real patterns.
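Each line of the ratings file (`u.data`) is a tab-separated record of user ID, item ID, rating, and timestamp. As a quick sketch of the format, here is how to parse it with only the standard library, using a two-row inline sample in place of the real file (the sample values are illustrative):

```python
import csv
import io

# u.data format: user_id \t item_id \t rating \t timestamp
sample = "196\t242\t3\t881250949\n186\t302\t3\t891717742\n"

ratings = []
for user_id, item_id, rating, ts in csv.reader(io.StringIO(sample), delimiter="\t"):
    ratings.append({
        "user_id": int(user_id),
        "item_id": int(item_id),
        "rating": int(rating),      # integer 1-5
        "timestamp": int(ts),       # Unix epoch seconds
    })
```

In the next lesson we load the full file with pandas, but knowing the raw layout makes the column names and dtypes unsurprising.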
Tech Stack
| Component | Technology | Purpose |
|---|---|---|
| Language | Python 3.9+ | Core development language |
| Data | pandas, NumPy | Data loading, manipulation, and matrix operations |
| ML | scikit-learn | TF-IDF, cosine similarity, train/test splitting |
| Deep Learning | PyTorch | Neural collaborative filtering model |
| API | FastAPI + Uvicorn | REST API serving and async request handling |
| Caching | Redis | Low-latency recommendation caching |
| Testing | pytest, httpx | Unit and integration tests |
Environment Setup
Create the project structure and install all dependencies:
```bash
# Create project directory
mkdir recommendation-engine && cd recommendation-engine

# Create and activate a virtual environment
python -m venv venv
source venv/bin/activate  # Windows: venv\Scripts\activate

# Create project structure
mkdir -p data models api tests

# Install dependencies
pip install pandas numpy scikit-learn scipy
pip install torch
pip install fastapi uvicorn redis
pip install pytest httpx
pip install matplotlib seaborn jupyter
```
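Once the installs finish, it is worth sanity-checking that every package actually resolves. This optional sketch uses only the standard library; `importlib.util.find_spec` reports whether a package can be imported without actually importing it (note that scikit-learn's import name is `sklearn`):

```python
import importlib.util
import sys

# Packages this project expects; the names mirror the pip installs above.
required = ["pandas", "numpy", "sklearn", "scipy", "torch", "fastapi", "redis"]

if sys.version_info < (3, 9):
    print("Warning: Python 3.9+ is recommended for this project.")

missing = [name for name in required if importlib.util.find_spec(name) is None]
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("Environment looks good.")
```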
requirements.txt
```text
pandas>=2.0.0
numpy>=1.24.0
scikit-learn>=1.3.0
scipy>=1.11.0
torch>=2.0.0
fastapi>=0.100.0
uvicorn>=0.23.0
redis>=5.0.0
pytest>=7.4.0
httpx>=0.24.0
matplotlib>=3.7.0
seaborn>=0.12.0
```
Download the Dataset
```python
import os
import urllib.request
import zipfile


def download_movielens(data_dir="data"):
    """Download and extract the MovieLens 100K dataset."""
    url = "https://files.grouplens.org/datasets/movielens/ml-100k.zip"
    zip_path = os.path.join(data_dir, "ml-100k.zip")
    extract_path = os.path.join(data_dir, "ml-100k")

    if os.path.exists(extract_path):
        print("Dataset already downloaded.")
        return extract_path

    os.makedirs(data_dir, exist_ok=True)
    print("Downloading MovieLens 100K...")
    urllib.request.urlretrieve(url, zip_path)

    print("Extracting...")
    with zipfile.ZipFile(zip_path, "r") as z:
        z.extractall(data_dir)
    os.remove(zip_path)

    print(f"Dataset ready at {extract_path}")
    return extract_path


# Run it
data_path = download_movielens()
```
Project Directory Structure
```text
recommendation-engine/
    data/
        ml-100k/              # MovieLens dataset files
            u.data            # 100K ratings (user_id, item_id, rating, timestamp)
            u.item            # Movie metadata (id, title, genres...)
            u.user            # User demographics
    models/
        collaborative.py      # User-based and item-based CF
        content_based.py      # TF-IDF content filtering
        ncf_model.py          # Neural collaborative filtering (PyTorch)
        hybrid.py             # Hybrid recommender combining approaches
    api/
        main.py               # FastAPI application
        schemas.py            # Pydantic response models
        cache.py              # Redis caching layer
    tests/
        test_models.py        # Model unit tests
        test_api.py           # API integration tests
    requirements.txt
    README.md
```
Ready to Code: With the environment set up and dataset downloaded, you are ready to begin building. The next lesson covers data loading, exploratory analysis, and building the user-item interaction matrix that all our models will consume.
Next: Data Preparation
Load the MovieLens dataset, explore rating distributions, and build the user-item matrix for our recommendation algorithms.
Lilly Tech Systems